
Resources for Text and Data Mining

Guide to text mining resources available through Emory Libraries and the Emory Center for Digital Scholarship.

Things to keep in mind

Appropriate Use of Purchased or Licensed Resources

Most of the library's electronic resources are governed by license agreements that limit use to the Emory community or to individuals who are physically present at Emory University Library facilities.

  • Each user is responsible for ensuring that he or she uses these products solely for noncommercial, educational, scholarly or research use.
  • Systematic downloading, distribution of content to non-authorized users or indefinite retention of substantial portions of information is strictly prohibited. 
  • The use of software such as scripts, agents, or robots is generally prohibited and may result in loss of access to these resources for the entire Emory community.

Adapted from Yale's Resources for Text Mining Guide

See the table below for information about databases that allow text and data mining (TDM). If the resource you want to use is not listed here, please contact your subject librarian.

Emory Databases that Support Text Mining

Vendor Fee? Details Help/Guidelines Examples
Adam Matthew FREE

All databases from Adam Matthew (which digitize unique primary source collections) are available for mining. 

Contact Emory Libraries to initiate the process. 

Data Mining/Text Mining Statement from Adam Matthew

 

Data Mining with Adam Matthew Primary Source Collections from UCLA

Being Human

Trading Consequences

Gale Primary Source Collections

Gale Digital Scholar Lab (DSL) lets you analyze the Gale digital collections that Emory holds, using a suite of digital humanities (DH) tools available on this cloud-based platform. Researchers can:

  • download documents
  • gather, clean, curate, and build a corpus of materials for their long-term research
  • use text-mining tools to mine a corpus and work on topic modeling (based on MALLET)
  • run a variety of analyses, including sentiment, N-grams, clustering, and term frequencies
  • share their research output with others and create visualizations
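As a rough illustration of what N-gram and term-frequency analysis means, here is a minimal Python sketch. It is not the Digital Scholar Lab's own implementation; the function names and the two-sentence sample corpus are invented for this example, and the tokenizer is deliberately simple.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_frequencies(documents, n=1):
    """Count how often each n-gram occurs across all documents."""
    counts = Counter()
    for text in documents:
        counts.update(ngrams(tokenize(text), n))
    return counts


# Hypothetical sample corpus, invented for illustration only.
corpus = [
    "The ship sailed from the port at dawn.",
    "At dawn the port was quiet.",
]

unigrams = term_frequencies(corpus, n=1)
bigrams = term_frequencies(corpus, n=2)

print(unigrams.most_common(3))
print(bigrams[("the", "port")])
```

N-grams are counted within each document, so a phrase is never formed across the boundary between two documents.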

Gale Artemis: Primary Sources, which searches across 22 of our Gale primary source databases covering 1500-2012, has a Term Frequency search option and a Term Clusters viewer (available from the article results list).

  • View results over time by entering a word or phrase.
  • Compare multiple terms.
  • Graph either the frequency of your search term (the number of documents per year) or its popularity (the % of total documents each year).
  • Click on a point on the graph to retrieve search results for that year.
  • Click and drag to select a time period to zoom in on.
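The difference between frequency and popularity can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Gale's implementation: the function name and all of the yearly document counts below are hypothetical.

```python
def term_trend(matches_per_year, totals_per_year):
    """Return {year: (frequency, popularity)} for a search term.

    frequency  = number of documents containing the term that year
    popularity = that number as a percentage of all documents that year
    """
    trend = {}
    for year, total in sorted(totals_per_year.items()):
        matches = matches_per_year.get(year, 0)
        popularity = 100.0 * matches / total if total else 0.0
        trend[year] = (matches, popularity)
    return trend


# Hypothetical counts, invented for illustration only.
matches = {1850: 12, 1851: 30}
totals = {1850: 400, 1851: 1500, 1852: 800}

trend = term_trend(matches, totals)
for year, (frequency, popularity) in trend.items():
    print(year, frequency, f"{popularity:.1f}%")
```

In this made-up data the term's frequency rises from 12 to 30 documents while its popularity falls from 3.0% to 2.0%, because the total number of documents grew faster. This is why the two graphs can tell different stories for the same term.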

To download large datasets, Emory Libraries must request the data on your behalf from our Gale sales representative. Requests can take several weeks to process. Gale will send a hard drive containing the requested data to the libraries for you to use.

Digital Scholar Lab Webinar
JSTOR FREE Data for Research (DfR) provides a self-service system for text mining. By creating a free DfR account, you can download the metadata, word frequencies, citations, key terms, and N-grams of up to 1,000 documents. To get larger datasets (more than 1,000 documents) or a type of data not available through the main site, contact JSTOR directly: support@ithaka.org. Introduction to using DfR from DH @ Washington and Lee University
Nexis Uni FREE (in small amounts) Does not officially support TDM; however, patrons wishing to create a corpus can download up to 500 articles at a time in RTF format.
Oxford English Dictionary (OED) FREE     Opening Up the Oxford English Dictionary
ProQuest Mostly FREE for newspapers that we have purchased

ProQuest TDM Studio

Does not include Alexander Street Press products as of yet.

  • National Newspapers: Chicago Tribune (1849-1933), The Christian Science Monitor (1908-1993), Los Angeles Times (1881-1933), Wall Street Journal (1889-1935), Washington Post (1877-1935), Austin American Statesman (1871-1924), The Baltimore Sun (1837-1930), Detroit Free Press (1831-1999), Nashville Tennessean (1812-2002), New York Tribune/Herald Tribune (1841-1962)
  • International Newspapers: The Guardian and The Observer (1791-1909), Times of India (1838-2007)
  • African American Newspapers: Atlanta Daily World (1931-2003), The Baltimore Afro-American (1893-1988), Cleveland Call & Post (1934-1991), Los Angeles Sentinel (1934-2005), New York Amsterdam News (1922-1993), The Norfolk Journal & Guide (1921-2003), The Philadelphia Tribune (1912-2001), Pittsburgh Courier (1911-2002)

Emory also has perpetual access licenses (PALs) to the following databases:

Historical Chinese Newspapers, Historical Jewish Newspapers, Vogue, Women's Magazine Archive, News, Policy and Politics (Includes Newsweek), American Periodical Series

At this time you may NOT download the datasets themselves (although ProQuest is looking to expand this to owned content).

TDM Libguide with quick onboarding guide, webinars, etc.

List of content in TDM

Robots Reading Vogue
Readex Price available upon request
Collections for which we have perpetual access licenses may be available upon request. These include: 
 
African American Newspapers, African American Periodicals, America's Historical Newspapers
 
For more information, contact Emory Libraries. 
WorldCat FREE Access to the various WorldCat APIs is available on request through Emory Libraries' OCLC Cataloging and Metadata subscription (full cataloging) and its FirstSearch/WorldCat Discovery subscription.

 Adapted from the University of Southern California's Content Mining Guide.