Research Guides: Resources for Text and Data Mining: Home

What is text and data mining?

Text and data mining (TDM) uses automated tools to identify and extract data relevant to one's research from large or numerous sources. By processing data in this way, researchers hope to reveal trends or patterns. TDM is used in both the humanities and the sciences and can be applied to a wide variety of datasets. Researchers can use some AI and machine-learning tools for TDM, but these tools are not required for TDM-based research.
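As a rough illustration of TDM without any AI or machine-learning tools, the sketch below counts how often each word appears across a small set of texts. The two sample documents and the function names (`tokenize`, `term_counts`) are hypothetical and exist only for this example; a real project would load text that you have permission to mine.

```python
import re
from collections import Counter

# Hypothetical mini-corpus for illustration only. In a real project this
# would be text you are permitted to mine under the governing terms.
corpus = {
    "doc1": "The library supports text mining. Text mining reveals patterns.",
    "doc2": "Data mining of licensed text requires permission from the library.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_counts(documents):
    """Count how often each token appears across all documents."""
    counts = Counter()
    for text in documents.values():
        counts.update(tokenize(text))
    return counts


counts = term_counts(corpus)

# Show the three most frequent terms across the corpus.
print(counts.most_common(3))
```

Simple counts like these are often a first step toward spotting trends; larger projects typically add steps such as removing common words or comparing counts over time.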

AI tools and Emory licensed resources

We are currently updating our policies and guidelines on acquiring, using, and accessing e-resources for TDM projects and/or AI large language models. The AI environment is rapidly changing! Where legally and financially possible, Emory Libraries will liaise with vendors to accommodate researchers wishing to use corpora derived from licensed e-resources for computational analysis and AI learning. Before using AI tools (or accessing large amounts of content) with any licensed library resource, researchers should contact the library for more information about governing terms. For general guidance on Emory-sanctioned tools for AI and machine learning, see the AWS and Co-Pilot sites.

Current guidelines are now available. Here are some example terms and conditions:

Use of the content is restricted to research/academic, non-commercial uses only.
The environment for modeling must be secure. This point is particularly important for any allowed licensed use for machine learning. Most vendors will require that this data be in a secure environment for analysis.
Restrictions on use of third-party tools (for AI) may be imposed. Most vendors will not allow use of their content in third-party tools due to training data concerns and copyright restrictions.
Data management restrictions may be imposed (e.g., Where will this data live? Does it need to be removed once analysis has been completed?).

As a general rule, check with the relevant subject librarian before beginning any project that involves use of Emory e-resources for TDM or AI purposes! Here are some questions to consider as you develop a proposal that uses TDM or AI techniques.

Which databases permit text mining?

Databases often have their own rules and restrictions on what is and is not permissible when it comes to applying TDM methods to their data. In addition, access to these databases comes in a variety of forms, mediated by Emory Libraries.

Broadly, databases fall into four categories:

Purchased resources. These are resources that Emory Libraries has either purchased outright or holds under a perpetual access license (PAL). While there may be some restrictions on how data can be used, especially when it comes to publication, these databases generally allow TDM in some capacity. However, one should not assume AI can be used with these resources.
Resources accessible through purchase. These are resources Emory may have access to in some capacity, but that do not (presently) allow access to the kinds of data needed for TDM. However, if access to these data is needed, it may be purchased. To request that Emory purchase access to these data, please contact your relevant subject librarian.
Freely available resources. These databases are open access and either openly allow TDM or allow TDM broadly, but with specific (generally minor) restrictions.
Restricted resources. These resources either forbid the use of their data for any and all TDM projects, or Emory does not have sufficient access to these databases to permit TDM usage.

As a general rule, check with the relevant subject librarian before beginning any project that involves use of Emory e-resources for TDM or AI purposes!

Research Datasets

Major data repositories and organizations are currently establishing best practices around the use of datasets in large language models and other AI tools. For example, please see:

ICPSR's guidelines

Digital Millennium Copyright Act (DMCA) Section 1201 Exemptions

UIPO Brief Explainer: DMCA 1201 Exemptions
In October 2024, the Librarian of Congress published updated exemptions to the Digital Millennium Copyright Act (DMCA)’s prohibition on circumventing digital rights management (DRM) to access copyrighted works. Academic libraries and the audiences they serve rely on some of these exemptions. This PDF explainer comes from the University Information Policy Officers organization, a group of librarians with specialized knowledge of copyright, many of whom hold a law degree and/or a master's degree in librarianship. It gives an overall picture of the exemptions granted following the most recent review, and provides more detail on the expansion of the exemption for conducting text and data mining (TDM) on audiovisual and literary works.

Many thanks to the members of the UIPO Advocacy and Public Policy Committee for allowing us to share this resource they created.