Battling COVID-19 Falsehoods on Twitter

With the outbreak of COVID-19, social media platforms have seen a surge in false information about public health and the coronavirus itself. Without content moderation and the removal of misinformation, inaccurate posts continue to gain popularity and cause greater confusion. To combat this, my group project members and I decided to build a classification model that flags text that potentially includes COVID-19-related misinformation.

About Data

We used two data sets: 1) a tweet data set that includes text, timestamps, locations, and True/False/Unverified labels, and 2) the OWID (Our World in Data) COVID-19 data set, which has data on daily vaccinations, hospitalizations, available ICU beds, new cases, etc. We added the second data set to examine whether any of these metrics are associated with the trustworthiness of tweets.
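As a rough illustration of how the two data sets could be combined, the sketch below attaches daily metrics to each tweet by calendar date. The join key (date only), the field names `timestamp` and `new_cases`, and the helper `attach_daily_metrics` are all assumptions made for this example, not a description of our actual pipeline.

```python
from datetime import datetime


def attach_daily_metrics(tweets, daily_metrics):
    """Return copies of the tweets with that day's metrics merged in.

    tweets: list of dicts with an ISO-format "timestamp" (hypothetical field name).
    daily_metrics: dict mapping "YYYY-MM-DD" to a dict of metrics for that day.
    """
    enriched = []
    for tweet in tweets:
        # Reduce the full timestamp to its calendar date, e.g. "2021-03-01".
        day = datetime.fromisoformat(tweet["timestamp"]).date().isoformat()
        # Tweets posted on a day with no metrics are kept unchanged.
        metrics = daily_metrics.get(day, {})
        enriched.append({**tweet, **metrics})
    return enriched
```

Each enriched row then carries both the tweet fields and the COVID-19 metrics for the day it was posted, so the metrics can be used as extra features.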

Natural Language Processing

We “cleaned” the text by removing capitalization, punctuation, and stop words, which most likely have no effect on whether a tweet is false. We kept a close eye on the “#” symbol, since hashtags are used to reach as wide an audience as possible. After stemming and tokenizing the text, we ended up with 1775 cleaned tweets described by 234 features (including tokenized words and COVID-19 data). A more detailed description of our preparation at this stage is documented below.
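The cleaning steps can be sketched roughly as follows. This is a minimal illustration rather than our exact code: the stop-word list is a tiny made-up sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as the Porter stemmer).

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "in", "and", "on", "for"}

# Strip all ASCII punctuation except "#", so hashtags stay recognizable.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("#", ""))


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    if word.startswith("#"):
        return word  # leave hashtags untouched
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean_tweet(text):
    """Lowercase, remove punctuation (except '#'), drop stop words, then stem."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    tokens = no_punct.split()
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(clean_tweet("The vaccines are WORKING, says #CDC!"))
# prints ['vaccin', 'work', 'say', '#cdc']
```

The resulting tokens can then be counted per tweet to produce word features like the ones described above.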

Building Models

We built several models to see which one classifies tweets best. To quantify this, we used the ROC curve, which plots the TPR (True Positive Rate, the rate of classifying a tweet as true when it is actually true) against the FPR (False Positive Rate, the rate of classifying a tweet as true when it is actually false), and compared the area under each curve.

The larger the “area” under the curve, the better the model. For most of the models, classification accuracy is around 80%, and in terms of minimizing FPR alone, all of the models seem to perform similarly well.
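To make the “area” concrete, here is a small sketch of how ROC points and the area under the curve can be computed from model scores. It assumes binary labels (1 for a true tweet, 0 for a false one, with Unverified tweets left out) and that both classes are present; the labels and scores shown are made up for illustration and are not our results.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Handle tied scores together so they yield a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up example: 1 = true tweet, 0 = false tweet.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(area_under_curve(roc_points(labels, scores)), 3))
# prints 0.889
```

An area of 1.0 means every true tweet is scored above every false one, while 0.5 is what random scoring would give on average.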

Impact of This Project

 

Information has never been easier to access or share, and the cost of this shift is great when a social media platform is filled with false information that may endanger public health. Though content censorship and moderation may be controversial for a large platform like Twitter, such a model would be a great help in warning the audience of potential falsehoods. We hope that it would help minimize the indirect damage of the coronavirus outbreak.
