Late last year, we released the Covid-19 Twitter Pandemic Archive, a catalog of downloadable Twitter datasets containing billions of tweet IDs. The data for this archive come from Twitter’s COVID-19 Streaming API and are collected by four data harvesters that run simultaneously and are developed and maintained by the Social Media Lab. The datasets are available for free download as-is for archiving and non-commercial research purposes.
About Datasets in the Pandemic Archive
As per Twitter’s API Terms, each dataset only includes Tweet IDs (as opposed to the actual tweets and associated metadata). New datasets are uploaded to the web at the beginning of each month.
For each month, we prepare two data files:
- one file with Tweet IDs for all COVID-19 related tweets that we collect via the API, and
- a second file containing a subset of Tweet IDs for COVID-19 related tweets that also contain a vaccine-related word (i.e., words starting with vaccin*, vacin*, or vax*).
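The vaccine-related filter described above can be sketched as a simple pattern match. This is a minimal illustration, not the Lab's actual harvester code: the function name `mentions_vaccine` is hypothetical, and it assumes that "words starting with" means a word boundary followed by one of the three stems, matched case-insensitively.

```python
import re

# Assumption: a "vaccine-related word" is any word that begins with one of
# the stems vaccin, vacin, or vax, regardless of letter case.
VACCINE_WORD = re.compile(r"\b(?:vaccin|vacin|vax)\w*", re.IGNORECASE)

def mentions_vaccine(text: str) -> bool:
    """Return True if the tweet text contains a word starting with a vaccine stem."""
    return VACCINE_WORD.search(text) is not None
```

Under these assumptions, a hashtag such as #vaccination matches because the stem follows the "#" character, while a word such as "antivax" does not, because the stem is not at the start of the word.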
About the DataViz Dashboard for the COVID-19 Twitter Pandemic Archive
As part of releasing the Covid-19 Twitter Pandemic Archive, we also introduced a new public-facing data visualization dashboard that displays high-level monthly and daily stats about the COVID-19 discourse on Twitter. Along with some general stats about each dataset, the dashboard shows the hourly volume of data ingested by the harvesters for each month in the form of a time series chart (as shown below).
The dashboard can be used to detect:
- when our data harvester scripts or Twitter’s COVID-19 Streaming API might have been down, and
- when and what pandemic-related hashtags, news stories or events went viral on Twitter.
Using the DataViz Dashboard to Check the Status of the Data Harvesters
For transparency and accuracy, we have made every effort to detect and note hours/days when a potential data loss might have happened. Due to the nature of the technology, on rare occasions, either our data collection script or Twitter’s API might be down for all or part of the day. When this happens, our collection script is designed to detect and log such incidents automatically and whenever possible to reconnect to the Twitter API promptly.
To help us and our users determine whether all four data harvesters are working properly in real time, we have created a dedicated status page. The page is designed to provide users with some general stats related to the data collection, such as the number of tweets collected per second and the number of tweets collected since the collection (re)started.
To detect when a collection script or Twitter’s API was down for all or part of a day, look for the orange arrow down icon in the time series chart. If you don’t see this icon in the time series chart for a particular month, then our system has not detected any potential data losses.
If the arrow down icon is present for one or more of the days in a selected month, it indicates that the number of collected tweets for a 1-hour time block on that day was either zero or noticeably lower than the median for the same 1-hour time block for that whole month. For example, the number of tweets posted between 10-11am on Day 1 is compared to the median number of tweets posted between 10-11am for the whole month. The data is compared in this way because the time series exhibits a predictable daily flow with one or more spikes during the day and a decline at night. If the total number of tweets collected between 10-11am is less than (median – median/2), that is, less than half the median for that time block, it is labeled as a potential data loss incident.
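The rule above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the dashboard's actual code: the function name `flag_data_loss` and the input layout (a dictionary keyed by day and hour) are hypothetical.

```python
from collections import defaultdict
from statistics import median

def flag_data_loss(hourly_counts):
    """Flag 1-hour blocks whose tweet count is below (median - median/2).

    hourly_counts: dict mapping (day, hour) -> tweets collected in that hour
    for one month. Hours with no tweets must be present with a count of 0,
    otherwise they cannot be flagged (assumed input layout).
    """
    # Group counts by hour of day so each block is compared with the same
    # block on every other day of the month.
    by_hour = defaultdict(list)
    for (day, hour), count in hourly_counts.items():
        by_hour[hour].append(count)
    medians = {hour: median(counts) for hour, counts in by_hour.items()}

    flagged = []
    for (day, hour), count in hourly_counts.items():
        m = medians[hour]
        if count < m - m / 2:  # the threshold stated in the text
            flagged.append((day, hour))
    return sorted(flagged)
```

For example, if the 10-11am counts over five days are 100, 110, 90, 40, and 105, the median is 100 and the threshold is 50, so only the day with 40 tweets is flagged.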
For instance, if we visually examine the time series for March of 2021, it appears that there was a potential data loss on March 29 as indicated by the arrow down icon. While 6.39M tweets were collected that day, the time series chart shows a slight drop off in the volume of tweets collected during one of the 1-hour time blocks. This likely happened because one of the four data harvesters temporarily lost connectivity with Twitter’s API.
Since the COVID-19 Twitter Pandemic Archive is designed to detect and study topical trends related to the pandemic on a global scale, smaller data losses like the one that happened on March 29, 2021 are unlikely to change the overall trend for that day. At the same time, if you are using this dataset for your research, it is important to note the potential data loss on that day in the data collection section of your publication. In addition, you could try recollecting potentially missed tweets by merging our dataset with other COVID-19 related datasets shared publicly for the same time period using the Tweets Sampling Toolkit.
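Conceptually, merging two Tweet ID datasets is a set union. The sketch below illustrates only that idea and does not represent how the Tweets Sampling Toolkit works internally; the function name `merge_tweet_ids` is hypothetical, and IDs are assumed to be already loaded as integers.

```python
def merge_tweet_ids(archive_ids, other_ids):
    """Merge two collections of Tweet IDs.

    Returns (merged, recovered): the de-duplicated union of both datasets,
    and the IDs found only in the other dataset, which are candidates for
    tweets missed during a data loss.
    """
    archive = set(archive_ids)
    recovered = set(other_ids) - archive
    merged = sorted(archive | recovered)
    return merged, sorted(recovered)
```

Reporting the recovered IDs separately makes it easier to document in a publication how many tweets came from a supplementary dataset.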
Using the DataViz Dashboard to Detect Viral Content
Following a similar approach to detecting drops in the volume of incoming data, we can also use the time series chart to detect unusual spikes in the number of tweets shared within a short period of time. A spike occurs when the number of tweets is noticeably higher than the median number for the same time period for the whole month.
The green arrow up icon displayed under one or more days in the time series chart indicates that the number of collected tweets for at least one 1-hour time block on that day was noticeably higher than the median for the same 1-hour time block for the whole month. To reduce information overload, the visualization is designed to highlight up to the 6 most viral events in a given month. Since these data spikes are calculated on an hourly basis, it is possible that all or most of the 6 viral events would occur in a single day.
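A spike detector along these lines might look like the following sketch. The text does not state the exact threshold or how the "most viral" events are ranked, so both choices here are assumptions: any hour above its monthly median counts as a candidate, and candidates are ranked by the ratio of the hourly count to that median. The function name `top_spikes` and the input layout are also hypothetical.

```python
from collections import defaultdict
from statistics import median

def top_spikes(hourly_counts, limit=6):
    """Return up to `limit` (day, hour) blocks with the largest count/median ratio.

    hourly_counts: dict mapping (day, hour) -> tweets collected in that hour
    for one month (assumed input layout).
    """
    by_hour = defaultdict(list)
    for (day, hour), count in hourly_counts.items():
        by_hour[hour].append(count)
    medians = {hour: median(counts) for hour, counts in by_hour.items()}

    spikes = []
    for (day, hour), count in hourly_counts.items():
        m = medians[hour]
        # Assumption: an hour is a candidate if it exceeds its monthly median;
        # hours with a zero median are skipped to avoid dividing by zero.
        if m > 0 and count > m:
            spikes.append((count / m, day, hour))

    spikes.sort(reverse=True)  # largest ratio first
    return [(day, hour) for _, day, hour in spikes[:limit]]
```

Because the ranking is done per hour rather than per day, a single very busy day can account for all six highlighted events, which matches the behavior described above.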
For example, in March of 2021, there were two days with significant spikes in the volume of tweets – March 2 and March 30. To investigate why, click on the arrow up icon to see the most shared URL on a given day. The spike on March 2 was caused by a trending news story that the New York Governor would be stripped of Emergency COVID-19 executive powers.
And the spike on March 30 corresponds to another trending news story about a potential side effect of the J & J vaccine.
In addition to examining the most shared URLs, we also suggest exploring the daily top hashtags table. It will help to reveal additional topics causing a higher than usual volume of tweets on any particular day in the dataset. For example, on March 30, one of the trending hashtags was #deathsantis. It was used in tweets criticizing Florida Governor Ron DeSantis’ pandemic response, like in the following sample tweet: