Resources for Text and Data Mining

Guide to text mining resources available through Emory Libraries and the Emory Center for Digital Scholarship.

Freely Available/Open Access Resources

| Vendor | Description | Help/Guidelines | Examples |
|---|---|---|---|
| BYU Corpora | Textual corpora, focusing on languages and dialects. | | Recent shifts with three nonfinite verbal complements in English: data from the 100-million-word Time corpus |
| Caselaw Access Project | Covers 6.4 million cases that represent 360 years of U.S. legal history. | CAP API | California Wordclouds |
| Chronicling America | Historic newspapers. | Bulk Access | Using Big Data to Ask Big Questions, which cites teaching and research projects |
| Digital Public Library of America | Photographs, books, maps, news footage, oral histories, personal letters, museum objects, artwork, government documents, etc. from libraries, archives and museums. | API Key Request | DPLABot, StackLife DPLA |
| Europeana Labs | Cultural heritage items from museums and galleries across Europe. | Curated datasets | Europeana Project List |
| Folger Shakespeare Library | Shakespeare's plays, sonnets, poems. | | |
| Google Books | Books from 1800-2000. | Info page | Quantitative analysis of culture using millions of digitized books; The Two Poverty Enlightenments: Historical Insights from Digitized Books Spanning Three Centuries |
| HathiTrust Digital Library | Printed material predominantly published prior to 1923. | | |
| Internet Archive Open Library | Books, texts and other digital material. | Data Dumps | |
| The Linguist List | Links to corpora in various languages. | | |
| New York Times Developer Network | Provides access to ten public APIs: Archive, Article Search, Books, Community, Geographic, Most Popular, Semantic, Times Newswire, TimesTags, and Top Stories. | API Key Request Form; Terms of Use | |
| Project Gutenberg | Books in various languages. | Terms of Use | |
| Text Creation Partnership | Early English Books Online, Eighteenth Century Collections Online and Evans Early American Imprints. | | |
| University Datasets | Michigan State University Datasets (19th century Sunday School Books, Historic American Cookbooks, farming journals), University of Pennsylvania (books), University of Oxford Text Archive (literary and linguistic texts). | | |