
Resources for Text and Data Mining

Guide to text mining resources available through Emory Libraries and the Emory Center for Digital Scholarship.

What is text mining?

Text and data mining (TDM) uses automated tools to identify, extract, and present data relevant to one's research from large or numerous sources. By processing data in this way, researchers hope to reveal trends or patterns. TDM is used in both the humanities and the sciences and can be applied to a wide variety of data sets.
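As a rough illustration of what "identify, extract, and present" can look like in practice, the sketch below counts how often a few terms of interest appear per year in a tiny sample. The corpus, the terms, and the function name `term_counts_by_year` are hypothetical and chosen only for illustration; a real project would work with data obtained under the relevant database's terms and would likely use more robust tooling.

```python
import re
from collections import Counter

# Hypothetical mini-corpus: (year, text) pairs standing in for a real collection.
corpus = [
    (1901, "The railway reached the town. The railway changed trade."),
    (1902, "Trade grew as the telegraph and the railway spread."),
    (1903, "The telegraph carried news; the telegraph shaped markets."),
]

def term_counts_by_year(docs, terms):
    """Count how often each term of interest appears in each year's texts."""
    wanted = {t.lower() for t in terms}
    counts = {}
    for year, text in docs:
        # Identify: split the text into lowercase word tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        # Extract: keep only the terms of interest, accumulating per year.
        yearly = counts.setdefault(year, Counter())
        yearly.update(tok for tok in tokens if tok in wanted)
    # Present: one row per year, with a count for every term (0 if absent).
    return {year: {t: c[t] for t in sorted(wanted)} for year, c in counts.items()}

trend = term_counts_by_year(corpus, ["railway", "telegraph"])
for year in sorted(trend):
    print(year, trend[year])
```

In this toy sample, mentions of "railway" fall while mentions of "telegraph" rise across the three years, which is the kind of pattern a TDM project might look for at a much larger scale.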

We are currently updating our policies and guidelines on acquiring, accessing, and using e-resources for TDM projects and/or AI large language models. Our current guidelines may still be helpful. In general, it is good to consider:

(1) use of the content (research/academic only)

(2) security of the environment for modeling

(3) use of third party tools (for AI)

(4) data management


As a general rule, check with the relevant subject librarian before beginning any project that involves TDM. The complete list of Emory subject librarians can be found here.

Which databases permit text mining?

Databases often have their own rules and restrictions on what is and is not permissible when it comes to applying TDM methods to their data. In addition, access to these databases comes in a variety of forms, mediated by Emory Libraries.

Broadly, databases fall into four categories:

  1. Purchased Resources. These are resources that Emory Libraries has either purchased outright or holds under a Perpetual Access License (PAL). While there may be some restrictions on how the data can be used, especially when it comes to publication, these databases generally allow TDM in some capacity.
  2. Resources Accessible through Purchase. These are resources that Emory may have access to in some capacity but that do not (presently) provide the kinds of data needed for TDM. If access to these data is needed, it may be possible to purchase it. To request that Emory purchase access to these data, please contact your relevant subject librarian.
  3. Freely Available Resources. These databases are open access and either openly allow TDM or allow TDM broadly, but with specific (generally minor) restrictions.
  4. Restricted Resources. These resources either forbid the use of their data for any and all TDM projects, or Emory does not have sufficient access to these databases to permit TDM usage.