Bernard Choa

Research Student for Fundamentals of Data Science and Statistics

Research Topic: “Toxic Detector, a machine learning algorithm that can track negative comments on a social media post”

Project Description:

Social media adoption has grown rapidly over the last few years, and so has the number of posts on each platform, so comments and engagement are bound to follow. Every post attracts both positive and negative engagement. Our aim is to separate the negative comments from the rest, compare their number with the number of positive comments, and measure the rate of negativity in a sample post taken from a social media platform.

This way, we can evaluate how a given post is perceived by the public, whether positively or negatively. By filtering the comments, we hope to determine the overall consensus on that post and to find ways to reduce negativity in future posts.

The idea came to us from the increase in toxic behavior in social media comments. We were inspired by Deepak’s machine learning work, which detects both positive and negative comments.

Dataset and Preprocessing

Obtaining the dataset should be straightforward, as we can use the existing APIs of social media platforms such as Twitter. Preparing the data for processing is also simple. Since social media posts are often written in a casual tone, filler words such as “ok”, “lol”, and “uhh” will be removed from the text. Punctuation and special characters such as emojis will also be removed, as they contribute little to the natural language processing itself. User handles occasionally appear in the text, as with Twitter replies or YouTube comment replies; these will also be removed.
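The cleaning steps described above could be sketched as follows, using only the Python standard library. This is a minimal illustration, not a final implementation: the function name `clean_comment`, the contents of `FILLER_WORDS`, and the choice to keep only ASCII letters and digits are assumptions made for the example. The ASCII-only rule would also strip non-English text.

```python
import re

# Illustrative filler list (assumption); the proposal names these three as examples.
FILLER_WORDS = {"ok", "lol", "uhh"}

# User handles such as "@someone" in Twitter or YouTube replies.
HANDLE_RE = re.compile(r"@\w+")
# Apostrophes are deleted so that "don't" becomes "dont" rather than "don t".
APOSTROPHE_RE = re.compile(r"['’]")
# Everything that is not a lowercase ASCII letter, digit, or whitespace:
# this covers punctuation and emojis (assumption: English-only comments).
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_comment(text: str) -> str:
    """Lowercase a comment and strip handles, punctuation, emojis, and filler words."""
    text = text.lower()
    text = HANDLE_RE.sub(" ", text)
    text = APOSTROPHE_RE.sub("", text)
    text = NON_ALNUM_RE.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in FILLER_WORDS]
    return " ".join(tokens)


print(clean_comment("@bob lol this is SO bad!!! 😡"))
```

Handles are removed before punctuation so that the "@" marker is still present when the handle pattern runs.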

Model and Techniques

Analyzing text falls under the field of Natural Language Processing (NLP); more specifically, our topic is a Sentiment Analysis task. We will use NLTK, one of the most widely used NLP libraries in Python, along with scikit-learn (sklearn) for general data science requirements.
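As a rough sketch of how such a classifier and the negativity rate from the project description might fit together, the example below trains a small scikit-learn pipeline. The proposal does not fix a model, so TF-IDF features with logistic regression are an assumption here, as are the hand-written toy comments, the label convention (1 = negative, 0 = positive), and the helper name `negativity_rate`. NLTK is not used in this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, hand-written training data (hypothetical); 1 = negative, 0 = positive.
train_texts = [
    "this is terrible and stupid",
    "i hate this post",
    "worst thing i have ever seen",
    "you are awful",
    "this is wonderful and kind",
    "i love this post",
    "best thing i have ever seen",
    "you are amazing",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# TF-IDF turns each comment into a weighted bag-of-words vector;
# logistic regression then learns a positive/negative decision boundary.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)


def negativity_rate(comments):
    """Return the fraction of comments that the model predicts as negative."""
    if not comments:
        return 0.0
    predictions = model.predict(comments)
    negatives = sum(int(p) for p in predictions)
    return negatives / len(comments)


sample_comments = ["i hate this", "i love this", "this is awful"]
print(f"Negativity rate: {negativity_rate(sample_comments):.2f}")
```

A real experiment would need a much larger labelled dataset and a held-out test set; eight sentences are only enough to show the shape of the pipeline.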
