In a recent study posted to the medRxiv* preprint server, researchers evaluated methods that could automatically and quickly detect counterfeit coronavirus disease 2019 (COVID-19) prevention and treatment products using Twitter chatter.
They used natural language processing (NLP) and time-series anomaly detection methods, based on the premise that as a fraudulent product gains popularity among Twitter users, the volume of discussions or mentions of that product increases correspondingly. Notably, these detection methods can quickly spot sudden increases in the frequency of mentions on social media platforms, including Twitter and Facebook.
Study: Early detection of fraudulent COVID-19 products from Twitter conversations. Image Credit: Michele Ursi/Shutterstock
Amid genuine efforts by public health agencies around the world to mitigate the impact of the COVID-19 pandemic, the unscrupulous promotion of fraudulent products claiming to treat, prevent, or cure severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection has been a persistent problem.
The United States Food and Drug Administration (FDA) issues warning letters to curb the spread of these products; however, it does so only after many people have already been exposed to them. Moreover, in the United States, these products may not be sold or advertised on television or in newspapers. Consequently, entities selling such products promote them on social media platforms, causing the spread of misinformation or an infodemic.
It is therefore urgent to design surveillance tools that automatically identify potentially counterfeit COVID-19 products early and generate alerts. Fortunately, real-time monitoring of fraudulent COVID-19 products on social media can be automated.
About the study
In the present study, the researchers used time-series anomaly detection methods to detect anomalous increases in mentions of counterfeit COVID-19-related products on Twitter. They used NLP to systematically organize the Twitter chatter and generate alerts. The team used real-time data from Twitter’s COVID-19 application programming interface (API), which Twitter provided directly to support COVID-19-related research. In total, the team collected 577,872,350 tweets mentioning COVID-19-related keywords, such as "coronavirus" and "covid," between February 19, 2020, and December 31, 2020.
The researchers excluded keywords collected after 2020, as well as 12 keywords that were mentioned fewer than 10 times on Twitter, including their lexical variants. They collected data continuously and stored it in a database hosted on Google Cloud Platform.
Next, the team manually compiled a comprehensive list of counterfeit COVID-19 products from the US FDA website. They also listed the names of the people who owned these products, along with their websites and social media profiles, if any. In addition, the researchers manually reviewed 183 FDA warning letters to create a list of products and entities and their earliest FDA letter issue dates.
Additionally, they used a data-centric tool to detect spelling variants or misspellings of the names of counterfeit COVID-19 products. This variant generator applied semantic and lexical similarity measures to automatically identify such variants for both single keywords and multi-word phrases.
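To make the idea of lexical similarity concrete, here is a minimal Python sketch. It is not the authors' variant generator: it covers only character-level (lexical) similarity, using the standard library's difflib in place of whatever measures the study used, and it ignores semantic similarity entirely. The function name, the 0.8 threshold, and the example strings are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def find_spelling_variants(target, candidates, threshold=0.8):
    """Return candidates that look like misspellings of `target`.

    Assumption: a candidate counts as a variant when its character-level
    similarity ratio to the target is at least `threshold` (an arbitrary
    illustrative cutoff) and it is not an exact match.
    """
    target_lower = target.lower()
    variants = []
    for candidate in candidates:
        candidate_lower = candidate.lower()
        if candidate_lower == target_lower:
            continue  # exact matches are the keyword itself, not a variant
        ratio = SequenceMatcher(None, target_lower, candidate_lower).ratio()
        if ratio >= threshold:
            variants.append(candidate)
    return variants


# Illustrative phrases only; these are not taken from the study's data.
observed_phrases = [
    "coloidal silver",
    "colloidal sliver",
    "hand sanitizer",
    "colloidal silver",
]
variants = find_spelling_variants("colloidal silver", observed_phrases)
```

In this sketch, the two misspelled phrases would be grouped with the original keyword, while the unrelated phrase and the exact match would be left out.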
The team analyzed all products and variant spellings of key phrases with at least 10 mentions in the curated data. Next, they normalized the daily counts by the total number of Twitter posts collected on the same day. The resulting mentions per 1,000 tweets represented the daily relative frequencies of COVID-19-related keywords and phrases.
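The normalization step can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above; the function name, the dictionary layout, and the counts are assumptions and do not come from the study.

```python
def mentions_per_thousand(daily_mentions, daily_totals):
    """Convert raw daily keyword counts into mentions per 1,000 tweets.

    daily_mentions: dict mapping a date string to the keyword's mention count
    daily_totals:   dict mapping a date string to all tweets collected that day
    Days with no collected tweets are skipped to avoid dividing by zero.
    """
    rates = {}
    for day, total in daily_totals.items():
        if total == 0:
            continue
        rates[day] = 1000 * daily_mentions.get(day, 0) / total
    return rates


# Made-up counts for illustration only.
totals = {"2020-03-01": 2_000_000, "2020-03-02": 1_500_000}
mentions = {"2020-03-01": 400}
rates = mentions_per_thousand(mentions, totals)
```

Dividing by the same-day total keeps a keyword from looking more popular simply because more tweets were collected on a given day.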
Finally, any data point that was more than three standard deviations (SD) from the 14-day moving average was considered a potential signal. This allowed the researchers to determine whether the first signal for a COVID-19-related keyword occurred before the date the FDA letter was issued, within a week of it, or later.
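The rule above can be sketched as follows. This is a rough Python illustration, not the authors' code, and it rests on several assumptions the article does not confirm: the moving average and SD are taken over the 14 days preceding each point, only upward deviations count as signals, and a signal on the same day as the FDA letter is grouped with "within a week". All function names are hypothetical.

```python
from datetime import date
from statistics import mean, stdev


def detect_signals(rates, window=14, n_sd=3.0):
    """Return indices of days whose rate exceeds mean + n_sd * SD of the
    preceding `window` days (trailing window; an assumption of this sketch)."""
    signals = []
    for i in range(window, len(rates)):
        history = rates[i - window:i]
        threshold = mean(history) + n_sd * stdev(history)
        # With a perfectly flat history the SD is 0, so any rise is flagged.
        if rates[i] > threshold:
            signals.append(i)
    return signals


def classify_timing(first_signal, letter_date, grace_days=7):
    """Compare the first signal date with the FDA letter issue date."""
    if first_signal is None:
        return "no signal"
    delta = (first_signal - letter_date).days
    if delta < 0:
        return "before letter"
    if delta <= grace_days:  # same-day signals land here (assumption)
        return "within a week"
    return "later"


# Illustrative series: a stable 14-day baseline, one ordinary day, one spike.
example_rates = [0.10, 0.12] * 7 + [0.11, 1.0]
example_signals = detect_signals(example_rates)
example_timing = classify_timing(date(2020, 3, 1), date(2020, 3, 6))
```

Because the approach only compares each day with its own recent history, it needs no labeled examples, which fits the unsupervised design the article describes.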
The FDA warning letters were issued between March 6, 2020, and June 22, 2021. The authors identified 221 potential keywords associated with counterfeit COVID-19 products or the entities selling them. Of these, the researchers evaluated only 56 keywords because they considered only the first mention of a keyword in their early-detection analysis.
A total of 44 keyphrases related to COVID-19 met all of the inclusion criteria, and 43 of the 44 showed anomalous increases in their mentions at some point. Notably, 77.3% (34/44) of the keyphrases were detectable through Twitter chatter before the FDA letter issue dates. An additional 13.6% showed anomalous increases within seven days of the FDA letter issue dates.
According to the authors, the current study is the first to use social media-based surveillance to detect counterfeit COVID-19 products early relative to FDA warning letter issue dates. Specifically, the researchers identified products that gained popularity through promotion on Twitter. The study approach was simple and unsupervised, required no training data, and was economical because it relied on publicly available social media chatter.
medRxiv publishes preliminary scientific reports that are not peer-reviewed and, therefore, should not be considered conclusive, used to guide clinical practice or health-related behavior, or treated as established information.