Resources for Text and Data Mining

Guide to text mining resources available through Emory Libraries and the Emory Center for Digital Scholarship.

Using Twitter For Research

Compared with some other social media platforms, Twitter makes it relatively easy to get data. There are, however, some considerations to keep in mind if you are interested in getting and using tweets as a data source:

  • Twitter is not intended to be a source of data for researchers.
  • "Historical" data from Twitter are often not freely available.
  • There are ethical considerations in using Twitter as a data source of which researchers should be mindful.

This is not an exhaustive guide to using Twitter as raw material for research. Instead, it is an introduction to getting data from Twitter. For an excellent introduction to using Twitter as a data source, consider reading the book Twitter as Data, written by Zachary C. Steinert-Threlkeld and published by Cambridge University Press. This title covers many of the ins and outs of working with data from Twitter and is well worth the time to read.

Collecting Twitter Data

Twitter has two APIs (Application Programming Interfaces) that you can use to access and download tweets. One is Twitter's REST (Representational State Transfer) API, which you can use to access past tweets and information from the profiles of Twitter users. With this API, you can search for tweets on particular topics, but only tweets from the past 6-9 days. The REST API also lets you get tweets from a specific account; these can reach further back in time, but only up to a maximum of 3200 tweets. The other is Twitter's "streaming" API, which collects tweets in real time and provides a sample of up to 1% of the total volume of tweets.
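
As a rough illustration of the 3200-tweet ceiling on account timelines, the sketch below pages backwards through a timeline until it reaches the cap or runs out of tweets. It is not tied to any particular library: fetch_page is a hypothetical callable that you would write around your own client (for example, one built with Tweepy or rtweet), and the page size of 200 is an assumption based on the per-request maximum of the older v1.1 timeline endpoint.

```python
from typing import Callable, Dict, List, Optional

# Cap described in this guide: at most 3200 tweets from one account's timeline.
TIMELINE_CAP = 3200
# Assumed page size (the v1.1 timeline endpoint returned at most 200 per request).
PAGE_SIZE = 200


def collect_timeline(
    fetch_page: Callable[[Optional[int], int], List[Dict]],
    cap: int = TIMELINE_CAP,
    page_size: int = PAGE_SIZE,
) -> List[Dict]:
    """Page backwards through a timeline until the cap or the end is reached.

    fetch_page(max_id, count) is a hypothetical callable supplied by the caller.
    It should return up to `count` tweets (dicts with an integer "id"), newest
    first, with IDs no greater than max_id (or the newest tweets if max_id is None).
    """
    tweets: List[Dict] = []
    max_id: Optional[int] = None
    while len(tweets) < cap:
        page = fetch_page(max_id, min(page_size, cap - len(tweets)))
        if not page:
            break  # no older tweets are available
        tweets.extend(page)
        # Next request: only tweets strictly older than the oldest one seen so far.
        max_id = min(t["id"] for t in page) - 1
    return tweets
```

With the assumed page size, reaching the cap takes 16 requests (3200 / 200), so rate limits are rarely the constraint for a single account; the cap itself is.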

To make use of Twitter's API, you need to have a Twitter account. You also need to get the appropriate credentials from Twitter - see https://apps.twitter.com/.

Here are some useful tools for collecting data from Twitter:

  • rtweet - a very powerful R package for collecting Twitter data from either of its APIs
  • TAGS - a template for collecting tweets into Google Sheets
  • Tweepy - a library for Python users for collecting tweets

George Washington University's "Where to get Twitter data for academic research" has other suggestions for tools to use in collecting tweets and is, in general, a good primer on options for how/where to get data from Twitter.

Be mindful of Twitter's terms of use for developers, including the limits it places on how much data you can collect. Also note that Twitter places limits on how much data you can share. If you wish to share the contents of tweets, you can share a maximum of 50,000 tweets/day, and you cannot share them via a database, a GitHub site, or similar means. Instead, you can share them only via what Twitter refers to as "non-automated means." Alternatively, you can share IDs for tweets. These can be shared on websites and in larger amounts: up to 1,500,000 IDs per month, with the option of requesting exemptions from this limit for research purposes.
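
A minimal sketch of preparing a collection for sharing as IDs only might look like the following. The function name export_tweet_ids and the running total ids_already_shared are hypothetical, and the monthly limit is simply the figure quoted in this guide; check Twitter's current developer terms before relying on it.

```python
from pathlib import Path
from typing import Dict, Iterable

# Figure quoted in this guide; Twitter's terms may have changed since.
MAX_IDS_PER_MONTH = 1_500_000


def export_tweet_ids(
    tweets: Iterable[Dict],
    out_path: str,
    ids_already_shared: int = 0,
    monthly_limit: int = MAX_IDS_PER_MONTH,
) -> int:
    """Write de-duplicated tweet IDs to a text file, one ID per line.

    Only the IDs are written, never the tweet text. Raises ValueError if this
    export would push the month's running total past `monthly_limit`.
    Returns the number of IDs written.
    """
    ids = sorted({int(t["id"]) for t in tweets})
    if ids_already_shared + len(ids) > monthly_limit:
        raise ValueError(
            f"Sharing {len(ids)} more IDs would exceed the limit of {monthly_limit}"
        )
    Path(out_path).write_text("".join(f"{i}\n" for i in ids), encoding="utf-8")
    return len(ids)
```

A plain text file with one ID per line is also the form in which many shared tweet-ID datasets are distributed, which makes the output easy for others to hydrate.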

Collections of Twitter Data

You might also consider making use of Twitter datasets that have been collected by various academics and organizations. As noted above, Twitter places various limits on the extent to which you can share tweet data with others. As a result, these datasets generally consist of IDs of tweets rather than the content of the tweets themselves. To get the contents of tweets, you can "hydrate" the IDs via the "Hydrator", a free tool that was developed by Documenting the Now. (Think of Rey in "Star Wars: The Force Awakens" putting flour in a dish of water to make a muffin, and you'll have a visual metaphor for what hydrating a tweet ID does.)
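
For readers curious about what hydration involves under the hood, here is a hedged sketch and not a description of how the Hydrator itself is implemented. It assumes a hypothetical lookup_batch callable that wraps whatever API client you use, and a batch size of 100, which was the per-request maximum of the older v1.1 lookup endpoint. Tweets that have been deleted or protected are simply absent from the results.

```python
from typing import Callable, Dict, Iterator, List, Sequence

BATCH_SIZE = 100  # assumed maximum number of IDs per lookup request


def batches(ids: Sequence[int], size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive slices of `ids`, each at most `size` long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hydrate(
    ids: Sequence[int],
    lookup_batch: Callable[[Sequence[int]], List[Dict]],
    batch_size: int = BATCH_SIZE,
) -> List[Dict]:
    """Turn tweet IDs back into tweet objects, one batch at a time.

    lookup_batch(batch) is a hypothetical callable that returns the tweets
    still available for the given IDs; unavailable IDs are silently omitted.
    """
    hydrated: List[Dict] = []
    for batch in batches(ids, batch_size):
        hydrated.extend(lookup_batch(batch))
    return hydrated
```

Because unavailable tweets drop out, a hydrated dataset is usually smaller than the list of IDs it came from, and two researchers hydrating the same IDs at different times may end up with slightly different data.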

Some sources to consult for locating datasets consisting of tweet IDs:

Of these various collections, Documenting the Now is the most comprehensive and is generally the best place to start.

If neither pre-existing datasets nor collecting tweets yourself is suitable, e.g. for reasons of time coverage, then you will generally need to pay to get the data you need. Twitter now offers historical access to its archive based on a monthly fee. GWU's "Where To Get Twitter Data For Academic Research" guide mentioned above also mentions fee-based options for access to Twitter data.

Twitter and Ethics in Research

As noted above, Twitter is not meant to be a data source for researchers. The ethical considerations that accompany more "traditional" sources of data nonetheless remain relevant for users of Twitter data:

(1) Sensitivity and Harm: Research using data from the likes of Twitter does have the potential to cause harm to would-be research subjects. The content of tweets may, for instance, relate to matters damaging to the people who tweeted them, such as tweets about illegal or illicit behavior. Or, we may be talking about tweets from vulnerable populations, such as children or people with mental illnesses or people living in authoritarian states or in violent contexts. Compiling such data creates the potential for information to be spread beyond what was originally intended, with resulting potential for harm.

(2) Privacy and Consent: Survey data of Twitter users suggest that people who post on Twitter are not always comfortable with their posts being aggregated and used for research purposes. Even though this is social media, keep in mind that Twitter allows users to delete tweets and make accounts private. In addition, Twitter users may intend their posts to be limited to their networks and friends. In short, there is a disconnect between researchers and the public here: the former are using tweets and sharing data from them in ways that the latter may not have intended and with which they may not be comfortable.

(3) Terms of Use: In general, whatever social-media platform you are using for data, you should know the terms of use for that platform and follow them. We have already noted how Twitter places limits on how you can share data and how much you can share. We have also noted that Twitter allows users to change account settings, e.g. to protect accounts or to delete tweets. In such cases, you should respect those choices and remove the affected tweets or accounts from your data. Be mindful of terms of use and of what they imply both for how you can use tweets as data and for questions about harm and privacy.
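
One possible way to act on the last point is to re-check a stored collection from time to time and drop anything that is no longer publicly available. The sketch below assumes you already have a set of IDs that are still available (for instance, the IDs returned when re-hydrating your collection); the function name prune_unavailable is hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def prune_unavailable(
    stored: Iterable[Dict], available_ids: Set[int]
) -> Tuple[List[Dict], List[int]]:
    """Split stored tweets into those still available and those to remove.

    `available_ids` is assumed to hold the IDs of tweets that are still public.
    Returns (kept_tweets, removed_ids); only the IDs of removed tweets are
    returned so that their content is not carried forward.
    """
    kept: List[Dict] = []
    removed: List[int] = []
    for tweet in stored:
        if tweet["id"] in available_ids:
            kept.append(tweet)
        else:
            removed.append(tweet["id"])
    return kept, removed
```

Keeping a log of removed IDs, without their text, can help document why a dataset shrank between analyses.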

For more intensive and detailed discussions of this topic, we have compiled a list of suggested readings on Twitter and research ethics (pdf).