The COVID-19 Twitter Pandemic Archive is a catalog of datasets containing billions of Tweet IDs for COVID-19 related tweets, together with a set of data visualization dashboards that display high-level monthly statistics about the COVID-19 conversations on Twitter. The datasets are offered as-is for archiving and non-commercial research purposes and are free to download and reuse.
The tweets are collected via Twitter’s COVID-19 Streaming Endpoint (API) using a custom script developed by the Social Media Lab. According to Twitter, this new streaming endpoint has no data volume or throughput limitations and offers a real-time, full-fidelity stream of public Tweets containing the full conversation about COVID-19. For more information about which tweets are included in this collection, see Twitter’s filtering rules for this endpoint.
About the datasets
As per Twitter’s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). New datasets are uploaded to the web at the beginning of each month. For each month, we prepare two data files:
– one file with Tweet IDs for all COVID-19 related tweets that we collect via the API, and
– a second file containing a subset of Tweet IDs for COVID-19 related tweets that also contain a vaccine-related word (i.e., words starting with vaccin*, vacin*, or vax*).
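The vaccine-related subset is defined by word prefixes, which can be illustrated with a regular expression. The following is a minimal sketch only: the archive does not describe how the matching is actually implemented, so the pattern, the case-insensitive matching, and the function name `mentions_vaccine` are all assumptions made for illustration.

```python
import re

# Assumed pattern: a word that starts with "vaccin", "vacin", or "vax".
# \b anchors the match at the start of a word; matching is case-insensitive.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)


def mentions_vaccine(text: str) -> bool:
    """Return True if the text contains a word with one of the prefixes."""
    return VACCINE_WORD.search(text) is not None


examples = [
    "Got my #Vaccination appointment today",  # starts with "vaccin"
    "Fully vaxxed and boosted",               # starts with "vax"
    "Emergency evacuation ordered",           # "vac..." is not at a word start
]
flags = [mentions_vaccine(text) for text in examples]
```

Note that a filter like this needs the tweet text, so it can only be applied to tweets that have been collected or rehydrated, not to Tweet IDs alone.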
Creating a random sample dataset from a massive dataset of Tweet IDs
Due to the large number of Tweet IDs (often 100M+) in each dataset in the archive, it is not always practical (or necessary) to recollect and study all of the tweets contained in the datasets. Instead, you can use our new Tweets Sampling Toolkit (available on GitHub) to create a random sample of Tweet IDs from one of the larger datasets available in the archive.
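To illustrate the idea of sampling from a file that is too large to load comfortably, here is a minimal sketch of reservoir sampling (Algorithm R), which draws a uniform random sample in a single pass. This is not the Tweets Sampling Toolkit's implementation, whose method is not described here. The function name `reservoir_sample` and the one-ID-per-line input format are assumptions.

```python
import random
from typing import Iterable, List, Optional


def reservoir_sample(ids: Iterable[str], k: int,
                     seed: Optional[int] = None) -> List[str]:
    """Draw up to k IDs uniformly at random in one pass over the input."""
    rng = random.Random(seed)
    reservoir: List[str] = []
    seen = 0  # number of non-blank IDs read so far
    for line in ids:
        tweet_id = line.strip()
        if not tweet_id:
            continue  # skip blank lines
        seen += 1
        if len(reservoir) < k:
            # Fill the reservoir with the first k IDs.
            reservoir.append(tweet_id)
        else:
            # Keep the new ID with probability k / seen by overwriting
            # a random slot when the drawn index falls inside the reservoir.
            j = rng.randrange(seen)
            if j < k:
                reservoir[j] = tweet_id
    return reservoir


# Small in-memory demonstration; a real run would pass an open file object.
population = [str(n) for n in range(1000)]
sample = reservoir_sample(population, 10, seed=42)
```

Because the function only iterates over its input, an open text file with one Tweet ID per line can be passed in directly, and memory use stays proportional to the sample size rather than to the dataset size.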
Comparing two or more datasets
In addition to creating a random sample, the Tweets Sampling Toolkit can also perform set operations such as union, difference, and intersection to compare or combine two or more datasets. For example, if you have previously collected your own dataset of COVID-19 related tweets using Twitter’s Standard Search or Streaming API, you could compare it with one of the datasets published in the COVID-19 Twitter Pandemic Archive. You can use the “union” function provided in the Tweets Sampling Toolkit to merge two or more datasets of Tweet IDs while excluding duplicates. Alternatively, you can use the “difference” function to identify and recollect only those tweets (based on their Tweet IDs) that are not part of your original dataset. Finally, you can use the “intersection” function to locate Tweet IDs that appear in two or more datasets.
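The three operations correspond directly to standard set operations. The sketch below uses Python's built-in `set` type on two tiny made-up lists of IDs; it is an illustration of the concepts, not the toolkit's own code, and the helper name `load_ids` and the sample IDs are hypothetical.

```python
from typing import Iterable, Set


def load_ids(lines: Iterable[str]) -> Set[str]:
    """Read Tweet IDs (assumed one per line) into a set, skipping blanks."""
    return {line.strip() for line in lines if line.strip()}


# Hypothetical data: your own collection and one archive dataset.
mine = load_ids(["101", "102", "103"])
archive = load_ids(["102", "103", "104", "105"])

merged = mine | archive    # union: all IDs from both, duplicates removed
missing = archive - mine   # difference: archive IDs you have not collected
shared = mine & archive    # intersection: IDs present in both datasets
```

Holding a full archive dataset in an in-memory set can require several gigabytes, so for files with 100M+ IDs a sorted-file or on-disk approach may be more practical.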
The process of recollecting tweets (with their corresponding metadata) based on their unique identifiers (Tweet IDs) is called rehydration. To rehydrate tweets from one of the datasets in the COVID-19 Twitter Pandemic Archive (or from a newly created random sample of Tweet IDs), you can use third-party programs such as Hydrator, the Python library Twarc, or Communalytic Pro (dataset limit of 10M Tweet IDs).
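When a rehydration tool caps the size of a dataset, as with the 10M limit noted above, a larger list of Tweet IDs has to be split into smaller pieces first. The following is a minimal sketch of such a split; the function name `chunked` is hypothetical, and the small chunk size in the demonstration stands in for a real limit.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` IDs from the input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return  # input exhausted
        yield chunk


# Demonstration with a chunk size of 2; a real run might use 10_000_000.
parts = list(chunked(["1", "2", "3", "4", "5"], 2))
```

Each chunk can then be written to its own file and rehydrated separately with the tool of your choice.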