Text classification with LDA features in scikit-learn

In this script, we use scikit-learn to train a text classifier to discriminate between bad and good Tripadvisor reviews. The features for the text classifier are the topic-to-document probabilities from a pre-trained LDA model.

Let’s start by loading the libraries required to run the Python script. We need NumPy and pandas for minimal data preparation, and Matplotlib and seaborn for the charts. For the NLP tasks, we use spaCy for pre-processing and Tomotopy for LDA modeling. Finally, we load the RidgeClassifier estimator along with the train_test_split utility from the scikit-learn library.

Python

>>> import numpy as np
>>> import pandas as pd
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import spacy
>>> import tomotopy as tp
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import RidgeClassifier

We source the data from a local .csv file and display the key features.

Python

>>> df = pd.read_csv("../sampleData/tripadvisorReviews/hotel_reviews.csv")
>>> df.info()

Plain Text

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20491 entries, 0 to 20490
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Review  20491 non-null  object
 1   Rating  20491 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 320.3+ KB

Time to explore the data: we visualize the distribution of reviews across rating levels. We also check that an observable review attribute such as text length does not vary ‘too much’ across rating levels.

Python

>>> sns.catplot(x="Rating", data=df, kind="count");

Distribution of reviews across rating levels

Python

>>> sns.violinplot(x="Rating", y=np.log(df.loc[:, "Review"].str.len()), data=df);

Distribution of review length across rating levels

The next step is creating review labels, say ‘bad’ and ‘good’, out of review ratings. As per prior literature on sentiment analysis, 1- and 2-star reviews are considered bad reviews, whereas 4- and 5-star reviews are considered good reviews; 3-star reviews are left unlabeled.

Python

>>> df.loc[:, "label"] = np.nan
>>> df.loc[df["Rating"] < 3, "label"] = int(0)
>>> df.loc[df["Rating"] > 3, "label"] = int(1)
>>> sns.catplot(x="label", data=df.loc[df["label"].notnull()], kind="count");
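
As a quick check of the labeling rule, the following minimal sketch applies the same thresholds to a small, made-up DataFrame. The `toy` frame and its ratings are hypothetical and are not taken from the dataset. It illustrates that 3-star reviews keep a missing label, so they drop out of the later steps.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one per star level (not from the Tripadvisor data)
toy = pd.DataFrame({"Rating": [1, 2, 3, 4, 5]})

# Same rule as above: below 3 stars -> 0 (bad), above 3 stars -> 1 (good)
toy["label"] = np.nan
toy.loc[toy["Rating"] < 3, "label"] = 0
toy.loc[toy["Rating"] > 3, "label"] = 1

# 3-star reviews remain unlabeled (NaN)
labels = toy["label"].tolist()
print(labels)
```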

Based on the distribution of the reviews across the ‘bad’ and ‘good’ classes, we create a matched dataset, s, with one ‘good’ review for every ‘bad’ review.

Python

>>> bad = df.loc[df["label"] == 0, ["Review", "label"]]
>>> good = df.loc[df["label"] == 1, ["Review", "label"]]
>>> good = good.sample(n=len(bad), random_state=42)
>>> s = pd.concat([bad, good])
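
The matching step amounts to downsampling the larger class to the size of the smaller one. The sketch below reproduces the idea on a hypothetical seven-row frame (the `toy` data and the `matched` name are illustrative assumptions) and checks that the two classes end up with the same number of rows.

```python
import pandas as pd

# Hypothetical labeled reviews: 2 bad (0) and 5 good (1)
toy = pd.DataFrame({
    "Review": ["r{}".format(i) for i in range(7)],
    "label": [0, 0, 1, 1, 1, 1, 1],
})

bad = toy.loc[toy["label"] == 0]
good = toy.loc[toy["label"] == 1]

# Downsample the majority class to the size of the minority class
good = good.sample(n=len(bad), random_state=42)
matched = pd.concat([bad, good])

# Count the rows per class in the matched sample
counts = matched["label"].value_counts().to_dict()
print(counts)
```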

Before running the LDA model, the sample reviews are pre-processed using a spaCy pipeline.

Python

>>> nlp = spacy.load("en_core_web_sm")
>>> docs = nlp.pipe(
        s.loc[:, "Review"].str.lower(),
        n_process=2, 
        batch_size=500, 
        disable=["tok2vec"],
    )
>>> tkns_docs = []
>>> for doc in docs:
        # keep the lemmas of tokens that are not stop words, punctuation, or numbers
        tmp = [
            token.lemma_
            for token in doc
            if not token.is_stop
            and not token.is_punct
            and not token.like_num
        ]
        tkns_docs.append(tmp)

The following step consists of training and evaluating alternative LDA models, i.e., models retaining a different number of topics. In particular, we train 25 alternative models:

K = \{10, 20, ..., 250\}

Python

>>> corpus = tp.utils.Corpus()
>>> for item in tkns_docs:
        corpus.add_doc(words=item)
>>> mf = {}
>>> for i in range(10, 260, 10):
        print(
            ">>> Working on the model with {} topics >>>\n".format(i),
            flush=True
        )
        mdl = tp.LDAModel(k=i, corpus=corpus, min_df=5, rm_top=5, seed=42)
        mdl.train(0)
        for j in range(0, 1000, 10):
            mdl.train(10)
            print("Iteration: {}\tLog-likelihood: {}".format(j, mdl.ll_per_word))
        coh = tp.coherence.Coherence(mdl, coherence="u_mass")
        mf[i] = coh.get_score()
        mdl.save("k_{}".format(i), True)

We identify the best-fitting model on the basis of the Coherence score. The chart below, which plots the u_mass score multiplied by -1, indicates that the model with 250 topics has the largest value among the sampled models.

Python

>>> fig = plt.figure(figsize=(10, 4.5))
>>> ax = fig.add_subplot(111)
>>> sns.barplot(x=list(mf.keys()), y=[-1*score for score in mf.values()], ax=ax)
>>> ax.set_xlabel("Number of topics retained")
>>> ax.set_ylabel("Coherence Score (-1 * 'umass')")
>>> plt.show()
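
The selection rule read off the chart can also be applied programmatically. This sketch assumes a dictionary shaped like `mf` (number of topics mapped to a score) and picks the key with the largest plotted value. The scores in `mf_toy` are made up for illustration and are not the ones obtained from the models. Whether a larger or a smaller value is preferable depends on the coherence measure, so the sketch only mirrors the rule used in the text.

```python
# Hypothetical u_mass scores keyed by number of topics (illustrative values only)
mf_toy = {10: -1.2, 20: -1.5, 30: -1.9}

# Same transformation as in the chart: multiply each score by -1
plotted = {k: -1 * score for k, score in mf_toy.items()}

# Pick the number of topics with the largest plotted value
best_k = max(plotted, key=plotted.get)
print(best_k)
```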

The text classifier we want to train uses topic-to-document probabilities to predict review labels. Hence, we retrieve the estimates from the best-fitting model (step 1), which we arrange into a pandas DataFrame (step 2).

Python

>>> best_mdl = tp.LDAModel.load("k_250")                              # step 1
>>> td = pd.DataFrame(                                                # step 2
        np.stack([doc.get_topic_dist() for doc in best_mdl.docs]),
        columns=["topic_{}".format(i + 1) for i in range(best_mdl.k)],
    )
>>> td.info(verbose=True, show_counts=True)  # show_counts replaces null_counts (removed in pandas 2.0)

Plain Text

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6428 entries, 0 to 6427
Data columns (total 250 columns):
 #    Column     Non-Null Count  Dtype  
---   ------     --------------  -----  
 0    topic_1    6428 non-null   float32
 1    topic_2    6428 non-null   float32
 2    topic_3    6428 non-null   float32
 3    topic_4    6428 non-null   float32
 4    topic_5    6428 non-null   float32
 5    topic_6    6428 non-null   float32
 6    topic_7    6428 non-null   float32
 7    topic_8    6428 non-null   float32
 8    topic_9    6428 non-null   float32
 9    topic_10   6428 non-null   float32
 10   topic_11   6428 non-null   float32
 11   topic_12   6428 non-null   float32
 12   topic_13   6428 non-null   float32
 13   topic_14   6428 non-null   float32
 14   topic_15   6428 non-null   float32
 15   topic_16   6428 non-null   float32
 16   topic_17   6428 non-null   float32
 17   topic_18   6428 non-null   float32
 18   topic_19   6428 non-null   float32
 19   topic_20   6428 non-null   float32
...
 248  topic_249  6428 non-null   float32
 249  topic_250  6428 non-null   float32
dtypes: float32(250)
memory usage: 6.1 MB
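
To make the layout of the feature table concrete, here is a minimal sketch that stacks a few per-document topic vectors into a DataFrame in the same way. The three vectors in `dists` are hypothetical stand-ins for the output of `get_topic_dist()`, and the sketch assumes each vector is a probability distribution over topics.

```python
import numpy as np
import pandas as pd

# Hypothetical topic distributions for 3 documents and 4 topics
dists = [
    np.array([0.70, 0.10, 0.10, 0.10], dtype=np.float32),
    np.array([0.25, 0.25, 0.25, 0.25], dtype=np.float32),
    np.array([0.05, 0.05, 0.80, 0.10], dtype=np.float32),
]

# One row per document, one column per topic
td_toy = pd.DataFrame(
    np.stack(dists),
    columns=["topic_{}".format(i + 1) for i in range(4)],
)

# Each row is a probability distribution, so it should sum to about 1
row_sums = td_toy.sum(axis=1).to_numpy()
print(td_toy.shape, row_sums)
```

A row-sum check of this kind is a cheap way to confirm that the features were extracted document by document rather than topic by topic.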

We are now in a position to train a text classifier. In this example, scikit-learn’s RidgeClassifier is used. However, there are alternative estimators that could fit the task and the dataset at hand. First, we split the data into training and test sets.

Python

>>> X, y = td.values, s.loc[:, "label"].values
>>> X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size=0.4,
        random_state=0
    )
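
To see what a 40% test share means for a dataset of this size, the sketch below runs the same split on placeholder arrays with 6428 rows and 250 columns. The values in `X_toy` are random and `y_toy` is an artificial balanced label vector; both are assumptions used only to show the resulting shapes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as the data above:
# 6428 documents, 250 topic features (values are random, not real estimates)
rng = np.random.default_rng(0)
X_toy = rng.random((6428, 250))
y_toy = np.repeat([0, 1], 3214)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0
)

# Number of rows that land in the training and test sets
print(X_tr.shape, X_te.shape)
```

Passing `stratify=y_toy` would keep the class balance identical in both subsets; the original code does not do this and relies on random shuffling instead.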

Then, we create, train, and evaluate our model. Such a simple model achieves a satisfactory level of performance: a test-set accuracy of about 0.91.

Python

>>> ridge_c = RidgeClassifier(alpha=0.1, random_state=0, fit_intercept=False)
>>> ridge_c.fit(X_train, y_train)
>>> ridge_c.score(X_test, y_test)

Plain Text

0.9094090202177294
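
The number returned by `score` is the mean accuracy, that is, the share of test reviews whose label is predicted correctly. As a hedged illustration, the sketch below fits a RidgeClassifier on synthetic data from `make_classification` (a stand-in for the topic features, not the review data) and compares `score` with a confusion matrix, which shows how the errors split across the two classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (a stand-in for the topic features)
X_syn, y_syn = make_classification(
    n_samples=500, n_features=20, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.4, random_state=0
)

clf = RidgeClassifier(alpha=0.1)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# score() is the mean accuracy, i.e. the share of correct predictions
acc = clf.score(X_te, y_te)
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, y_pred)
print(acc, cm, sep="\n")
```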

This snippet comes from the Jupyter notebook scikitlearn.ipynb, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.

https://github.com/simoneSantoni/NLP-orgs-markets/blob/36b8a8bbf38d4b4b4d1fa272d99da09043f3040f/textClassification/scikitlearn.ipynb