pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know

Have you ever wanted to classify news, papers, or tweets based on their topics? Knowing how to do this can help you filter out irrelevant documents, and save time by reading only what you’re interested in.

That’s what text classification is for. It lets you train a model to recognize topics from labeled examples, which makes it a supervised learning technique.

[Figure: text classification]

In real life, you might not have data labels for text classification. You can label each document yourself, or hire somebody else to do it, but that costs a lot of time and money, especially when you have more than 1,000 data points.

Can you find the topics of your documents without training data? Yes, you can use topic modeling to do it.

What is topic modeling?

With topic modeling, you can cluster the words in a set of documents into groups. This is unsupervised learning, because it groups words automatically, without a predefined list of labels.

If you feed the model data, it will give you several sets of words, and each set of words describes one topic.

(0, '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump" + 0.011*"kill" + 0.011*"country" + 0.010*"attack" + 0.009*"state" + 0.009*"immigration"')
(1, '0.020*"student" + 0.020*"work" + 0.019*"great" + 0.017*"learn" + 0.017*"school" + 0.015*"talk" + 0.014*"support" + 0.012*"community" + 0.010*"share" + 0.009*"event"')

When you look at the first set of words, you would guess the topic is military and politics. Looking at the second set of words, you might guess the topic is public events, or school.
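Each entry in the output above pairs a topic number with a string of weight*"word" terms, where the number is the weight the model gives that word within the topic. To make that format concrete, here is a minimal sketch that splits one such string into word and weight pairs using only the standard library. The helper name `parse_topic` is hypothetical, and the string is a shortened copy of topic 0.

```python
import re

# A shortened copy of topic 0 from the output above.
topic = '0.024*"ban" + 0.017*"order" + 0.015*"refugee" + 0.015*"law" + 0.013*"trump"'


def parse_topic(topic_string):
    """Return (word, weight) pairs from a weight*"word" + ... string."""
    # Each term looks like 0.024*"ban": capture the number, then the quoted word.
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]


terms = parse_topic(topic)
for word, weight in terms:
    print(f"{word:10s} {weight:.3f}")
```

Reading the terms this way makes it easier to see that the words at the front of the string carry the most weight, so they are the ones to look at first when guessing a topic.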

This is quite useful. Your texts are automatically categorized, without the need to label them!

Visualize topic modeling with pyLDAvis

Topic modeling is useful, but its output is hard to interpret when all you have is a combination of words and numbers like the one above.

One of the most effective ways to understand data is through visualization. Is there a way to visualize the results of LDA (Latent Dirichlet Allocation), a popular topic modeling algorithm? Yes, we can with pyLDAvis.

pyLDAvis lets us interpret the topics in a topic model through an interactive visualization like the one below:

[Figure: pyLDAvis]

Pretty cool, isn’t it? Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results.

 

Original Article: https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know