Bare-bone LDA with Tomotopy
In this script, we create a bare-bone LDA model with Tomotopy. The final section of the script visualizes the LDA estimates with pyLDAvis.
Let’s start by importing the libraries we need for pre-processing (spaCy), estimation (Tomotopy), data manipulation (NumPy and Pandas), and visualization (Matplotlib, rich, pyLDAvis):
>>> import os
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> import spacy
>>> import tomotopy as tp
>>> from rich.console import Console
>>> from rich.table import Table
>>> import pyLDAvis
Next, we read the corpus of text, which is available in the GitHub repo under the sampleData directory. The code assumes that the Python session is running from the topicModeling folder of the same repo.
>>> os.chdir("../sampleData/tripadvisorReviews")
>>> in_f = "hotel_reviews.csv"
>>> df = pd.read_csv(in_f)
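Then, we pass the reviews, stored in the ‘Review’ column, through a spaCy pipeline. For each review we keep the lemmas of the tokens that are not stop words, punctuation, or number-like.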
>>> nlp = spacy.load("en_core_web_sm")
>>> docs_tokens = []
>>> for item in df.loc[:, "Review"].to_list():
...     tmp_tokens = [
...         token.lemma_
...         for token in nlp(item)
...         if not token.is_stop and not token.is_punct and not token.like_num
...     ]
...     docs_tokens.append(tmp_tokens)
...
Now, our data look reasonably clean. Yet, there’s another step to carry out before running our LDA: we have to create a Tomotopy ‘corpus’ object, which we populate with the tokenized docs (see previous step).
>>> corpus = tp.utils.Corpus()  # step 1: the empty corpus
>>> for item in docs_tokens:  # step 2: we populate the corpus as we
...     corpus.add_doc(words=item)  # iterate over tokenized docs
...
Next, we can estimate an LDA model with an arbitrary number of topics (k = 10) over our ‘corpus’ object (see step 1). Note that Tomotopy trains an LDA model iteratively via Gibbs sampling, so we have to specify the number of iterations (see step 2). Here, we train in ten batches of 10 iterations each and print the log-likelihood per word after every batch.
>>> lda = tp.LDAModel(k=10, corpus=corpus)  # step 1
>>> for i in range(0, 100, 10):
...     lda.train(10)  # step 2
...     print("Iteration: {}\tLog-likelihood: {}".format(i, lda.ll_per_word))
...
Iteration: 0    Log-likelihood: -8.986793206269024
Iteration: 10   Log-likelihood: -8.683531180601642
Iteration: 20   Log-likelihood: -8.563040296738107
Iteration: 30   Log-likelihood: -8.489663788127347
Iteration: 40   Log-likelihood: -8.44406825673355
Iteration: 50   Log-likelihood: -8.415348844810286
Iteration: 60   Log-likelihood: -8.396785145655292
Iteration: 70   Log-likelihood: -8.384133225237559
Iteration: 80   Log-likelihood: -8.373994052744647
Iteration: 90   Log-likelihood: -8.363973922167325
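The log-likelihood per word keeps rising, but by smaller and smaller amounts. Here is a minimal sketch of how one might quantify that slowdown in plain Python. The values are copied from the output above and rounded to four decimals; the tolerance of 0.02 is an arbitrary assumption chosen for illustration, not a Tomotopy default or a recommended stopping rule.

```python
# Log-likelihood per word, copied from the training output (rounded).
ll_per_word = [
    -8.9868, -8.6835, -8.5630, -8.4897, -8.4441,
    -8.4153, -8.3968, -8.3841, -8.3740, -8.3640,
]

# Gain between consecutive printed values (each one is 10 iterations apart).
gains = [later - earlier for earlier, later in zip(ll_per_word, ll_per_word[1:])]

# Hypothetical tolerance: flag the first gain that falls below it.
TOLERANCE = 0.02
first_small = next(
    (i for i, gain in enumerate(gains, start=1) if gain < TOLERANCE), None
)

for i, gain in enumerate(gains, start=1):
    print(f"gain {i}: {gain:.4f}")
print("first gain below tolerance:", first_small)
```

If the gains were still large at the end of training, that would suggest running more iterations before interpreting the topics.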
A critical part of a topic model’s output is the association between words and latent topics. In the next block of code, we carry out the following tasks:
creation of a rich table (step 1)
retrieval of topic-word probabilities using lda.get_topic_words (step 2)
table display (step 3)
>>> console = Console()  # step 1
>>> table = Table(
...     show_header=True,
...     header_style="cyan",
...     title="[bold] [cyan] Word to topic probabilities (top 10 words)[/cyan]",
...     width=150,
... )
>>> table.add_column("Topic", justify="center", style="cyan", width=10)
>>> for i in range(1, 11):  # one column per top word
...     table.add_column("W {}".format(i), width=12)
...
>>> for k in range(lda.k):  # step 2
...     values = []
...     for word, prob in lda.get_topic_words(k, top_n=10):
...         values.append("{}\n({})\n".format(word, str(np.round(prob, 3))))
...     table.add_row(str(k), *values)
...
>>> console.print(table)  # step 3
The resulting table reports the topic-word probabilities. The larger the probability, the stronger the association between a topic and a word. It’s customary to report the five or ten words most strongly associated with each topic.
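To make the ranking explicit, here is a small sketch in plain Python of what "top words per topic" means. The word probabilities are made-up toy numbers, not estimates from the hotel-review model, and `top_words` is a hypothetical helper rather than part of Tomotopy.

```python
def top_words(word_probs, n=5):
    """Return the n (word, probability) pairs with the largest probability."""
    ranked = sorted(word_probs.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Toy topic-word distribution (hypothetical values that sum to 1).
toy_topic = {
    "room": 0.30,
    "staff": 0.25,
    "breakfast": 0.20,
    "pool": 0.15,
    "noise": 0.10,
}

for word, prob in top_words(toy_topic, n=3):
    print(f"{word}\t{prob:.2f}")
```

In a real model the vocabulary holds thousands of words, so most of the probability mass sits outside the handful of words reported in a table.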
The quality of a topic model can be assessed with multiple metrics, and the coherence score is one of the most popular. Calling the ‘get_score’ method on a ‘Coherence’ object (see step 1 and Tomotopy’s coherence sub-module) returns either the average coherence score across topics (step 2) or the coherence score of an individual topic (step 3).
>>> coh = tp.coherence.Coherence(lda, coherence="u_mass")  # step 1
>>> average_coherence = coh.get_score()  # step 2
>>> coherence_per_topic = [  # step 3
...     coh.get_score(topic_id=k) for k in range(lda.k)
... ]
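For intuition about what a "u_mass" score measures, here is a rough sketch of one common formulation of UMass coherence (Mimno et al., 2011): for every pair of top words, take the log of the smoothed number of documents containing both words divided by the number of documents containing the higher-ranked word, then sum over pairs. This is an illustration on toy documents, not Tomotopy's implementation; libraries differ in smoothing and averaging, so the numbers will not match those returned by `get_score`. The function name `umass_coherence` and the toy data are assumptions.

```python
import math


def umass_coherence(ranked_words, docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over pairs with w_l ranked above w_m."""
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(ranked_words)):
        for l in range(m):
            w_m, w_l = ranked_words[m], ranked_words[l]
            # Assumes every ranked word occurs in at least one document,
            # so the denominator is never zero.
            score += math.log((doc_freq(w_m, w_l) + 1) / doc_freq(w_l))
    return score


# Toy tokenized documents (hypothetical).
toy_docs = [
    ["room", "clean", "staff"],
    ["room", "staff", "friendly"],
    ["room", "clean"],
    ["pool", "breakfast"],
]

coherent = umass_coherence(["room", "staff", "clean"], toy_docs)
incoherent = umass_coherence(["room", "pool", "friendly"], toy_docs)
print(coherent, incoherent)
```

Under this formulation, a higher score indicates that the top words of a topic tend to appear in the same documents.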
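The per-topic coherence scores can then be inspected visually with the following Matplotlib snippet, in which the dashed line marks the average coherence across topics. Repeating the exercise for alternative models, i.e., models retaining a different number of topics, makes it possible to compare them.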
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.bar(range(lda.k), coherence_per_topic)
>>> ax.set_xticks(range(lda.k))
>>> ax.set_xlabel("Topic number")
>>> ax.set_ylabel("Coherence score")
>>> ax.axhline(y=average_coherence, color="orange", linestyle="--")
>>> plt.show()
This snippet comes from the Python script “barebone_tomotopy.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.