Bare-bone LDA with Tomotopy

In this script, we create a bare-bone LDA model with Tomotopy. The final section of the script visualizes the LDA estimates with pyLDAvis.

Let’s start by importing the libraries we need for pre-processing (spaCy), estimation (tomotopy), data manipulation (NumPy and Pandas), and visualization (Matplotlib, rich, pyLDAvis).

Python

>>> import os
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> import spacy
>>> import tomotopy as tp
>>> from rich.console import Console
>>> from rich.table import Table
>>> import pyLDAvis

We read the corpus of text, available in the GitHub repo under the sampleData directory. Here, I’m assuming the Python session is running under the topicModeling folder located in the same repo.

Python

>>> os.chdir("../sampleData/tripadvisorReviews")
>>> in_f = "hotel_reviews.csv"
>>> df = pd.read_csv(in_f)

Then, we pass the reviews, stored in the ‘Review’ column, through a spaCy pipeline. For each review we keep the lemmas and drop stop words, punctuation, and number-like tokens.

Python

>>> nlp = spacy.load("en_core_web_sm")
>>> docs_tokens = []
>>> for item in df.loc[:, "Review"].to_list():
        # keep lemmas only; drop stop words, punctuation, and number-like tokens
        tmp_tokens = [
            token.lemma_
            for token in nlp(item)
            if not token.is_stop and not token.is_punct and not token.like_num
        ]
        docs_tokens.append(tmp_tokens)
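
Calling the pipeline on one review at a time can be slow on a large corpus. One possible alternative, sketched below, streams the texts through spaCy's `nlp.pipe`, which processes them in batches. The helper name `lemmatize_docs` and the `batch_size` value are illustrative assumptions, not part of the original script.

```python
def lemmatize_docs(nlp, texts, batch_size=64):
    """Return one list of lemmas per text.

    `nlp` is expected to be a loaded spaCy pipeline,
    e.g. spacy.load("en_core_web_sm"). Stop words, punctuation,
    and number-like tokens are dropped, as in the loop above.
    """
    docs_tokens = []
    # nlp.pipe yields processed documents in the same order as the input texts
    for doc in nlp.pipe(texts, batch_size=batch_size):
        docs_tokens.append(
            [
                token.lemma_
                for token in doc
                if not token.is_stop
                and not token.is_punct
                and not token.like_num
            ]
        )
    return docs_tokens
```

With the data frame loaded earlier, a call such as `docs_tokens = lemmatize_docs(nlp, df["Review"].to_list())` should give the same token lists as the explicit loop.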

Now, our data look reasonably clean. One more step remains before running LDA: we have to create a Tomotopy ‘Corpus’ object and populate it with the tokenized documents from the previous step.

Python

>>> corpus = tp.utils.Corpus()           # step 1: the empty corpus
>>> for item in docs_tokens:             # step 2: we populate the corpus as we
        corpus.add_doc(words=item)       #         iterate over tokenized docs

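Next, we estimate an LDA model with an arbitrary number of topics (k = 10) over the ‘corpus’ object (step 1). Tomotopy trains LDA with Gibbs sampling, an iterative procedure, so we have to specify the number of iterations (step 2). Here, we train in ten batches of 10 iterations each and print the log-likelihood per word after every batch; the printed counter is the iteration at which the batch started.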

Python

>>> lda = tp.LDAModel(k=10, corpus=corpus)      # step 1
>>> for i in range(0, 100, 10):
        lda.train(10)                           # step 2
        print("Iteration: {}\tLog-likelihood: {}".format(i, lda.ll_per_word))

Shell

Iteration: 0	Log-likelihood: -8.986793206269024
Iteration: 10	Log-likelihood: -8.683531180601642
Iteration: 20	Log-likelihood: -8.563040296738107
Iteration: 30	Log-likelihood: -8.489663788127347
Iteration: 40	Log-likelihood: -8.44406825673355
Iteration: 50	Log-likelihood: -8.415348844810286
Iteration: 60	Log-likelihood: -8.396785145655292
Iteration: 70	Log-likelihood: -8.384133225237559
Iteration: 80	Log-likelihood: -8.373994052744647
Iteration: 90	Log-likelihood: -8.363973922167325
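
The log-likelihood per word is still creeping upward after 100 iterations, so a fixed budget may stop too early. A minimal sketch of one alternative follows: keep training in batches until the gain between two consecutive batches falls below a tolerance. The helper name `train_until_stable` and the default values for `max_iter`, `step`, and `tol` are assumptions made for illustration. Because Gibbs sampling is stochastic, the log-likelihood can fluctuate, so this stopping rule is a heuristic rather than a convergence guarantee.

```python
def train_until_stable(model, max_iter=500, step=10, tol=1e-3):
    """Train `model` in batches of `step` iterations.

    Stops when the change in log-likelihood per word between two
    consecutive batches is smaller than `tol`, or after `max_iter`
    iterations. `model` only needs a `train(n)` method and an
    `ll_per_word` attribute, as tomotopy's LDAModel provides.

    Returns a list of (iterations_done, ll_per_word) pairs.
    """
    history = []
    previous = None
    for done in range(step, max_iter + 1, step):
        model.train(step)
        current = model.ll_per_word
        history.append((done, current))
        if previous is not None and abs(current - previous) < tol:
            break
        previous = current
    return history
```

A call such as `history = train_until_stable(lda)` would replace the fixed loop above; the returned pairs can be printed or plotted to inspect the training curve.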

A critical part of a topic model’s output is the association between words and latent topics. In the next block of code, we carry out the following tasks:

creation of a rich table (step 1)

retrieval of topic-word probabilities using lda.get_topic_words (step 2)

table display (step 3)

Python

>>> console = Console()                     # step 1
>>> table = Table(
        show_header=True,
        header_style="cyan",
        title="[bold] [cyan] Word to topic probabilities (top 10 words)[/cyan]",
        width=150,
    )
>>> table.add_column("Topic", justify="center", style="cyan", width=10)
>>> table.add_column("W 1", width=12)
>>> table.add_column("W 2", width=12)
>>> table.add_column("W 3", width=12)
>>> table.add_column("W 4", width=12)
>>> table.add_column("W 5", width=12)
>>> table.add_column("W 6", width=12)
>>> table.add_column("W 7", width=12)
>>> table.add_column("W 8", width=12)
>>> table.add_column("W 9", width=12)
>>> table.add_column("W 10", width=12)
>>> for k in range(lda.k):                       # step 2 
        values = []
        for word, prob in lda.get_topic_words(k):
            values.append("{}\n({})\n".format(word, str(np.round(prob, 3))))       
        table.add_row(
            str(k),
            values[0],
            values[1],
            values[2],
            values[3],
            values[4],
            values[5],
            values[6],
            values[7],
            values[8],
            values[9],
        )
>>> console.print(table)                         # step 3

Here’s a portion of the table with the topic-word probabilities. Note that the larger the probability, the stronger the association between a topic and a word. It’s customary to report the five or ten words most strongly associated with each topic.
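
When the goal is further processing rather than display, the same topic-word information can be collected in a plain dictionary. The sketch below is one way to do it; the helper name `top_words_by_topic` is hypothetical, and it relies only on the model's `k` attribute and on `get_topic_words` with its `top_n` argument.

```python
def top_words_by_topic(model, top_n=5):
    """Map each topic id to its `top_n` (word, probability) pairs.

    Probabilities are rounded to three decimals, as in the table above.
    """
    return {
        k: [
            (word, round(prob, 3))
            for word, prob in model.get_topic_words(k, top_n=top_n)
        ]
        for k in range(model.k)
    }
```

For example, `top_words_by_topic(lda, top_n=5)[0]` would return the five words most strongly associated with topic 0, along with their probabilities.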

The quality of a topic model can be assessed with multiple metrics, and the coherence score is one of the most popular. Calling the ‘get_score’ method on a ‘Coherence’ object (see step 1 and Tomotopy’s coherence sub-module) returns either the average coherence score across topics (step 2) or the coherence score of an individual topic (step 3).

Python

>>> coh = tp.coherence.Coherence(lda, coherence="u_mass")              # step 1
>>> average_coherence = coh.get_score()                                # step 2
>>> coherence_per_topic = [                                            # step 3
        coh.get_score(topic_id=k) for k in range(lda.k)                
    ]
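
For intuition about what a UMass-style score measures, the sketch below computes it by hand following the definition in Mimno et al. (2011): for every pair of top words, it takes the log of the smoothed number of documents containing both words, divided by the number of documents containing the more probable one. The function name `umass_coherence` and the smoothing constant `eps=1.0` are assumptions of this sketch. Tomotopy's `u_mass` preset may use different smoothing and aggregation, so the numbers are not expected to match `coh.get_score()`.

```python
import math


def umass_coherence(top_words, docs, eps=1.0):
    """UMass coherence of a single topic (illustrative sketch).

    top_words: the topic's words, ordered from most to least probable.
    docs: tokenized documents (lists of strings).
    eps: smoothing constant that avoids log(0) for pairs that never co-occur.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # number of documents that contain all the given words
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # assumes every top word occurs in at least one document
            score += math.log(
                (doc_freq(top_words[m], top_words[l]) + eps)
                / doc_freq(top_words[l])
            )
    return score
```

Scores are typically negative, and values closer to zero indicate that a topic's top words tend to appear in the same documents.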

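The per-topic coherence scores can then be inspected visually using the following Matplotlib snippet. Each bar shows the score of one topic, and the dashed line marks the average coherence across topics. Repeating the exercise for models with different numbers of topics makes it possible to compare alternative specifications.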

Python

>>> fig = plt.figure()
>>> ax = fig.add_subplot(111)
>>> ax.bar(range(lda.k), coherence_per_topic)
>>> ax.set_xticks(range(lda.k))
>>> ax.set_xlabel("Topic number")
>>> ax.set_ylabel("Coherence score")
>>> ax.axhline(y=average_coherence, color="orange", linestyle="--")
>>> plt.show()
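
To compare models that retain different numbers of topics, one option is to fit a model for each candidate k and record its average coherence. The sketch below keeps the fitting and scoring steps as caller-supplied functions, so it does not commit to specific training settings; the names `average_coherence_by_k`, `fit_model`, and `score_model` are hypothetical, and the commented usage (100 iterations, the listed candidate values) is an assumption for illustration.

```python
def average_coherence_by_k(fit_model, score_model, candidate_ks):
    """Return {k: average coherence} for each candidate number of topics.

    fit_model: callable taking k and returning a trained model.
    score_model: callable taking a trained model and returning a float.
    """
    return {k: score_model(fit_model(k)) for k in candidate_ks}


# Possible usage with Tomotopy, assuming `corpus` from the earlier step:
#
# def fit_model(k):
#     model = tp.LDAModel(k=k, corpus=corpus)
#     model.train(100)
#     return model
#
# def score_model(model):
#     return tp.coherence.Coherence(model, coherence="u_mass").get_score()
#
# scores = average_coherence_by_k(fit_model, score_model, [5, 10, 15, 20])
```

The resulting dictionary can be passed to `ax.bar` or `ax.plot` in the same way as the per-topic scores. Coherence is only one criterion, so the interpretability of the topics should be weighed alongside it when choosing k.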

This snippet comes from the Python script “barebone_tomotopy.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.

https://github.com/simoneSantoni/NLP-orgs-markets/blob/master/topicModeling/barebone_tomotopy.py