
Resources for Text and Data Mining

Guide to text mining resources available through Emory Libraries and the Emory Center for Digital Scholarship.

What is an API?

An API (Application Programming Interface) is a set of specifications published by a third party that defines how to send request messages, typically over Hypertext Transfer Protocol (HTTP), along with the structure of the response messages, usually in Extensible Markup Language (XML) or JavaScript Object Notation (JSON) format. In other words, APIs serve as a means to extract significant amounts of back-end (raw) data from a database without going through a website's user interface. The output, as noted above, is usually in JSON or XML format and will often need to be converted into another format for analysis (although many statistical programs and text mining tools can now easily read this output).
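
As a minimal sketch of that workflow, the Python example below requests JSON from an API and flattens it into a CSV file for analysis. The endpoint, parameters, and field names are hypothetical placeholders; substitute whatever the API you are using actually documents.

import csv
import requests  # third-party HTTP library (pip install requests)

# Hypothetical endpoint and query parameters -- replace with the values
# documented by the API you are actually using.
url = "https://api.example.org/v1/articles"
params = {"query": "text mining", "format": "json"}

response = requests.get(url, params=params, timeout=30)
response.raise_for_status()           # stop if the request failed
records = response.json()["results"]  # parse the JSON response body (hypothetical structure)

# Convert the nested JSON into a flat CSV file for analysis.
with open("articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "year", "doi"])
    writer.writeheader()
    for record in records:
        writer.writerow({
            "title": record.get("title", ""),
            "year": record.get("year", ""),
            "doi": record.get("doi", ""),
        })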


Using an API does require some basic technical or programming knowledge. A good short course is available from LinkedIn Learning ("Introduction to Web APIs" is the best one to start with).

Where can I find APIs?

A great place to start with publicly available APIs is the Public APIs GitHub site. The categories most likely to be useful are Books, Geocoding, Government, and Health.
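
As one illustration, the sketch below queries the Open Library search API, an example of the kind of resource typically listed under Books. The endpoint and field names reflect Open Library's public documentation (https://openlibrary.org/dev/docs/api/search), but verify the current docs before relying on them.

import requests

# Open Library search API -- a freely available Books API.
url = "https://openlibrary.org/search.json"
response = requests.get(url, params={"q": "digital humanities"}, timeout=30)
response.raise_for_status()

data = response.json()
for doc in data.get("docs", [])[:5]:   # print the first five matches
    title = doc.get("title", "(no title)")
    year = doc.get("first_publish_year", "n.d.")
    print(f"{title} ({year})")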

The best current guide for APIs is MIT's APIs for Scholarly Resources. Each entry provides information on how to access the API, the file format of the results, any limitations on access, and contact information. Since the guide was created by MIT staff, there may be resources on the list that are not available at Emory; check Databases@Emory for availability. Also, please note that Emory has recently renegotiated its contract with Elsevier (which owns the ScienceDirect platform), and the new contract has more restrictive terms regarding text and data mining. Please contact Chris Palazzolo (below) prior to pursuing an API.

Note that in some cases you will need to be connected to Emory's instance of the resource (i.e., access it through the proxy server).

For any general questions or inquiries regarding API allowances, please contact Chris Palazzolo.

What is web scraping?

Web scraping typically refers to the systematic extraction of data (either automated or manual) from a website into a spreadsheet or database for later analysis and retrieval. Because of the licensing agreements Emory has with various publishers, using text scrapers or crawlers is typically prohibited; users must instead employ the publisher's API to access this information, or use other vendor/publisher tools (such as Gale's Digital Scholar Lab). In addition, most websites frown upon web scraping, citing copyright and legal considerations.
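
For the limited cases where scraping is appropriate (a site whose terms explicitly permit it), the sketch below shows the basic pattern. The URL and HTML selectors are hypothetical; always check the site's terms of service and robots.txt before collecting anything.

import csv
import urllib.robotparser

import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

base_url = "https://example.org"      # hypothetical site that permits scraping
page_url = f"{base_url}/articles"

# Respect the site's robots.txt before fetching anything.
robots = urllib.robotparser.RobotFileParser(f"{base_url}/robots.txt")
robots.read()
if not robots.can_fetch("*", page_url):
    raise SystemExit("robots.txt disallows fetching this page")

html = requests.get(page_url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# The CSS selectors below are placeholders for whatever markup the
# target site actually uses.
rows = []
for item in soup.select("div.article"):
    rows.append({
        "title": item.select_one("h2").get_text(strip=True),
        "date": item.select_one("span.date").get_text(strip=True),
    })

with open("scraped_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "date"])
    writer.writeheader()
    writer.writerows(rows)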