{"id":2891,"date":"2022-09-02T10:37:30","date_gmt":"2022-09-02T03:37:30","guid":{"rendered":"http:\/\/international.binus.ac.id\/computer-science\/?p=2891"},"modified":"2022-09-02T10:37:30","modified_gmt":"2022-09-02T03:37:30","slug":"pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know","status":"publish","type":"post","link":"https:\/\/international.binus.ac.id\/computer-science\/2022\/09\/02\/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know\/","title":{"rendered":"pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know"},"content":{"rendered":"<p>Have you ever wanted to classify news, papers, or tweets based on their topics? Knowing how to do this can help you filter out irrelevant documents and save time by reading only what you\u2019re interested in.<\/p>\n<p>That\u2019s what text classification is for: it lets you train a model to recognize topics. This technique allows you to train your model on labeled data, which makes it a form of supervised learning.<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-30389 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/text-classification.png?resize=461%2C200&amp;ssl=1\" alt=\"text classification\" width=\"461\" height=\"200\" data-attachment-id=\"30389\" data-permalink=\"https:\/\/neptune.ai\/text-classification\" data-orig-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/text-classification.png?fit=461%2C200&amp;ssl=1\" data-orig-size=\"461,200\" data-comments-opened=\"1\" 
data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"text-classification\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/text-classification.png?fit=300%2C130&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/text-classification.png?fit=461%2C200&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" \/><\/figure>\n<\/div>\n<p>In real life, you might not have labels for your text data. You can label each document yourself or hire somebody else to do it, but that costs a lot of time and money, especially when you have more than 1,000 data points.<\/p>\n<p>Can you find the topics of your documents without labeled training data? Yes, you can, with topic modeling.<\/p>\n<h2>What is topic modeling?<\/h2>\n<p>With\u00a0<a href=\"https:\/\/towardsdatascience.com\/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">topic modeling<\/a>, you can cluster the words in a set of documents. 
This is unsupervised learning, because it automatically groups words without a predefined list of labels.<\/p>\n<p>If you feed the model data, it will give you different sets of words, and each set of words describes the topic.<\/p>\n<div>(<span class=\"hljs-name\">0<\/span>, <span class=\"hljs-symbol\">&#8216;0.024*<\/span><span class=\"hljs-string\">&#8220;ban&#8221;<\/span> + <span class=\"hljs-number\">0.017<\/span>*<span class=\"hljs-string\">&#8220;order&#8221;<\/span> + <span class=\"hljs-number\">0.015<\/span>*<span class=\"hljs-string\">&#8220;refugee&#8221;<\/span> + <span class=\"hljs-number\">0.015<\/span>*<span class=\"hljs-string\">&#8220;law&#8221;<\/span> + <span class=\"hljs-number\">0.013<\/span>*<span class=\"hljs-string\">&#8220;trump&#8221;<\/span> &#8216; <span class=\"hljs-symbol\">&#8216;+<\/span> <span class=\"hljs-number\">0.011<\/span>*<span class=\"hljs-string\">&#8220;kill&#8221;<\/span> + <span class=\"hljs-number\">0.011<\/span>*<span class=\"hljs-string\">&#8220;country&#8221;<\/span> + <span class=\"hljs-number\">0.010<\/span>*<span class=\"hljs-string\">&#8220;attack&#8221;<\/span> + <span class=\"hljs-number\">0.009<\/span>*<span class=\"hljs-string\">&#8220;state&#8221;<\/span> + &#8216; <span class=\"hljs-symbol\">&#8216;0.009*<\/span><span class=\"hljs-string\">&#8220;immigration&#8221;<\/span>&#8216;) (<span class=\"hljs-name\">1<\/span>, <span class=\"hljs-symbol\">&#8216;0.020*<\/span><span class=\"hljs-string\">&#8220;student&#8221;<\/span> + <span class=\"hljs-number\">0.020<\/span>*<span class=\"hljs-string\">&#8220;work&#8221;<\/span> + <span class=\"hljs-number\">0.019<\/span>*<span class=\"hljs-string\">&#8220;great&#8221;<\/span> + <span class=\"hljs-number\">0.017<\/span>*<span class=\"hljs-string\">&#8220;learn&#8221;<\/span> + &#8216; <span class=\"hljs-symbol\">&#8216;0.017*<\/span><span class=\"hljs-string\">&#8220;school&#8221;<\/span> + <span class=\"hljs-number\">0.015<\/span>*<span 
class=\"hljs-string\">&#8220;talk&#8221;<\/span> + <span class=\"hljs-number\">0.014<\/span>*<span class=\"hljs-string\">&#8220;support&#8221;<\/span> + <span class=\"hljs-number\">0.012<\/span>*<span class=\"hljs-string\">&#8220;community&#8221;<\/span> + &#8216; <span class=\"hljs-symbol\">&#8216;0.010*<\/span><span class=\"hljs-string\">&#8220;share&#8221;<\/span> + <span class=\"hljs-number\">0.009<\/span>*<span class=\"hljs-string\">&#8220;event&#8221;<\/span>&#8216;)<\/div>\n<p>The number in front of each word is that word\u2019s weight in the topic. Looking at the first set of words, you might guess that the topic is military and politics.\u00a0The second set suggests public events or school.<\/p>\n<p>This is quite useful: your texts are grouped by topic automatically, with no need to label them!<\/p>\n<h2>Visualize topic modeling with pyLDAvis<\/h2>\n<p>Topic modeling is useful, but its output is difficult to interpret from a list of words and numbers like the one above.<\/p>\n<p>One of the most effective ways to understand data is through visualization. Is there a way we can visualize the results of LDA (Latent Dirichlet Allocation), a popular topic modeling algorithm? 
Yes, we can, with\u00a0<a href=\"https:\/\/github.com\/bmabey\/pyLDAvis\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">pyLDAvis<\/a>.<\/p>\n<p>pyLDAvis lets us explore the topics in a topic model interactively, as shown below:<\/p>\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-30395 jetpack-lazy-image jetpack-lazy-image--handled\" src=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/PyLDAvis.gif\" alt=\"PyLDAvis \" width=\"1024\" height=\"620\" data-attachment-id=\"30395\" data-permalink=\"https:\/\/neptune.ai\/pyldavis-3\" data-orig-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/PyLDAvis.gif?fit=1231%2C745&amp;ssl=1\" data-orig-size=\"1231,745\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"PyLDAvis\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/PyLDAvis.gif?fit=300%2C182&amp;ssl=1\" data-large-file=\"https:\/\/i0.wp.com\/neptune.ai\/wp-content\/uploads\/PyLDAvis.gif?fit=1024%2C620&amp;ssl=1\" data-recalc-dims=\"1\" data-lazy-loaded=\"1\" \/><\/figure>\n<\/div>\n<p>Pretty cool, isn\u2019t it?\u00a0<strong>Now we will learn how to use topic modeling and pyLDAvis to categorize tweets and visualize the results.<\/strong><\/p>\n<p>Original Article: https:\/\/neptune.ai\/blog\/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Have you ever wanted to classify news, papers, or tweets 
based on their topics? Knowing how to do this can help you filter out irrelevant documents and save time by reading only what you\u2019re interested in. That\u2019s what text classification is for: it lets you train a model to recognize topics. This technique allows [&hellip;]<\/p>\n","protected":false},"author":7,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[112],"tags":[],"class_list":["post-2891","post","type-post","status-publish","format-standard","hentry","category-article"],"_links":{"self":[{"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/posts\/2891"}],"collection":[{"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/users\/7"}],"replies":[{"embeddable":true,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/comments?post=2891"}],"version-history":[{"count":1,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/posts\/2891\/revisions"}],"predecessor-version":[{"id":2892,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/posts\/2891\/revisions\/2892"}],"wp:attachment":[{"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/media?parent=2891"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/categories?post=2891"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/international.binus.ac.id\/computer-science\/wp-json\/wp\/v2\/tags?post=2891"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}