Fully-fledged LDA project: modeling a corpus of scientific articles

This script shows how to use Gensim and Mallet to train an LDA model. In particular, we aim to:

train and evaluate topic models that retain different numbers of topics

fold unseen documents into the retained topic model (that is, the best-fit model)

Note: the full notebook, leadership_natural_experiment.ipynb, is fairly long, so here I emphasize only some key steps. The training data contain 1,156 abstracts published in the leading leadership journal, The Leadership Quarterly. The unseen documents are 87 abstracts of articles published in economics, sociology, and political science journals.

Having pre-processed the abstracts, we create two Gensim class objects, a dictionary (step 1) and a corpus (step 2).

Python

>>> import gensim
>>> from gensim.corpora import Dictionary
>>> DICT = Dictionary(DOCS_PHRASED)                      # step 1
>>> CORPUS = [DICT.doc2bow(doc) for doc in DOCS_PHRASED] # step 2
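
To make the two objects concrete, the sketch below mimics in plain Python what a dictionary and a bag-of-words corpus contain. It is only an illustration under stated assumptions: the toy documents and the helpers `build_vocab` and `to_bow` are hypothetical, and Gensim's own `Dictionary` may assign token ids in a different order.

```python
from collections import Counter

# Hypothetical toy documents standing in for DOCS_PHRASED (lists of tokens).
toy_docs = [
    ["leader", "team", "performance", "team"],
    ["leader", "charisma", "crisis"],
]

def build_vocab(docs):
    """Map each distinct token to an integer id (first-seen order)."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]
print(toy_corpus)
```

Each document becomes a sparse list of (token id, count) pairs, which is the shape of input the topic model consumes. Tokens that are absent from the vocabulary are dropped, which matters later when unseen documents are mapped with a dictionary built on the training data.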

The existing literature suggests that 29 topics are represented in The Leadership Quarterly. We therefore start by fitting a topic model with 29 topics.

Python

>>> N_TOPICS = 29
>>> LDA_MALLET = gensim.models.wrappers.LdaMallet(
        MALLET_PATH,
        corpus=CORPUS,
        num_topics=N_TOPICS,
        id2word=DICT,
        random_seed=123
    )
>>> LDA_MALLET.print_topics(num_topics=N_TOPICS, num_words=5)

Shell

[(0,
  '0.140*"development" + 0.023*"special" + 0.022*"developmental"
  + 0.021*"multi_level" + 0.017*"primarily"'),
 (1,
  '0.058*"context" + 0.046*"effective" + 0.039*"develop" 
  + 0.031*"career" + 0.025*"set"'),
 
 ...
 
 (27,
  '0.071*"perspective" + 0.044*"dynamic" + 0.036*"system" 
  + 0.034*"emerge" + 0.030*"proposition"'),
 (28,
  '0.208*"group" + 0.062*"identity" + 0.034*"hypothesis" 
  + 0.030*"member" + 0.026*"information"')]
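
When inspecting topics like these, it can help to turn the printed strings into numbers. The following is a minimal sketch, assuming the `weight*"word"` format shown in the output above; the helper `parse_topic` is hypothetical and not part of Gensim.

```python
import re

# One topic string copied from the output above (topic 28).
topic_string = ('0.208*"group" + 0.062*"identity" + 0.034*"hypothesis" '
                '+ 0.030*"member" + 0.026*"information"')

def parse_topic(text):
    """Turn '0.208*"group" + ...' into [(word, weight), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', text)
    return [(word, float(weight)) for weight, word in pairs]

terms = parse_topic(topic_string)

# Share of the topic's probability mass covered by the listed words.
top_mass = sum(weight for _, weight in terms)
print(terms[0], round(top_mass, 3))
```

A topic whose top words carry little mass, or whose top words do not hang together semantically, is a candidate for the "hard to interpret" label.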

A qualitative inspection of the word-to-topic probabilities indicates that some topics are hard to interpret. Hence, we decide to explore models that retain different numbers of topics. To do that, we create a function that trains an LDA model for each candidate number of topics and stores its coherence score.

Python

>>> from gensim.models import CoherenceModel
>>> def compute_coherence_values(dictionary, corpus, texts, limit, start, step):
        """
        Compute c_v coherence for various numbers of topics

        Parameters:
        -----------
        dictionary : Gensim dictionary
        corpus     : Gensim corpus
        texts      : List of tokenized input texts
        limit      : Upper bound on the number of topics (exclusive)
        start      : Smallest number of topics to try
        step       : Increment between consecutive numbers of topics

        Returns:
        --------
        model_list       : List of LDA topic models
        coherence_values : Coherence values corresponding to the LDA models
                           with the respective numbers of topics
        """
        coherence_values = []
        model_list = []
        mallet_path = MALLET_PATH
        for num_topics in range(start, limit, step):
            model = gensim.models.wrappers.LdaMallet(mallet_path,
                                                 corpus=corpus,
                                                 num_topics=num_topics,
                                                 id2word=dictionary,
                                                 random_seed=123)
            model_list.append(model)
            coherencemodel = CoherenceModel(model=model,
                                        texts=texts,
                                        dictionary=dictionary,
                                        coherence='c_v')
            coherence_values.append(coherencemodel.get_coherence())

        return model_list, coherence_values

We can now use the function to search for the best-fit LDA model. We train 20 alternative topic models, retaining 10 to 29 topics (range() excludes the upper limit).

Python

>>> LIMIT, START, STEP = 30, 10, 1
>>> MODEL_LIST, COHER_VALS = compute_coherence_values(dictionary=DICT,
                                                      corpus=CORPUS,
                                                      texts=DOCS_PHRASED,
                                                      start=START,
                                                      limit=LIMIT,
                                                      step=STEP)
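
Once the function returns, the candidate models can be ranked by their coherence scores. The sketch below is one simple way to do that, not necessarily the procedure used in the notebook: the scores are made up for illustration, and the `tolerance` threshold is a hypothetical parameter for preferring a smaller model when its score is close to the best one.

```python
# Hypothetical scores for illustration only; real values come from COHER_VALS.
start, limit, step = 10, 15, 1
coherence_values = [0.41, 0.42, 0.37, 0.36, 0.38]

# Pair each score with the number of topics that produced it.
topic_counts = list(range(start, limit, step))
scores = dict(zip(topic_counts, coherence_values))

# Candidate with the highest coherence score.
best_k = max(scores, key=scores.get)

# Most parsimonious candidate whose score is within `tolerance` of the best.
tolerance = 0.02
parsimonious_k = min(
    k for k, s in scores.items() if scores[best_k] - s <= tolerance
)

print(best_k, parsimonious_k)
```

Coherence is only a guide, so a numerical shortlist of this kind is best followed by a qualitative reading of the topics.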

Let’s plot the coherence scores stored in the COHER_VALS list:

Python

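>>> import matplotlib.pyplot as plt
>>> import numpy as np
>>> X = list(range(START, LIMIT, STEP))  # numbers of topics that were tried
>>> Y = COHER_VALS                       # matching coherence scores
>>> FIG = plt.figure(figsize=(6, 4))
>>> AX = FIG.add_subplot(1, 1, 1)
>>> AX.plot(X, Y, marker='o', color='b', ls='-')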
>>> AX.set_xlabel("Number of topics retained")
>>> AX.set_ylabel("Coherence score")
>>> AX.set_xticks(np.arange(10, 30, 1))
>>> AX.axvline(x=11, ymin=0, ymax=1, color='r')
>>> AX.grid(True, ls='--')
>>> plt.show()

The models with ten and eleven topics show relatively high coherence scores vis-à-vis the alternative models. For the sake of parsimony, we retain the model with ten topics and inspect it qualitatively.

Python

>>> N_TOPICS = 10
>>> LDA_MALLET = gensim.models.wrappers.LdaMallet(MALLET_PATH,
                                              corpus=CORPUS,
                                              num_topics=N_TOPICS,
                                              id2word=DICT,
                                              random_seed=123)
>>> LDA_MALLET.print_topics(num_topics=N_TOPICS, num_words=10)

Shell

[(0,
  '0.042*"context" + 0.030*"woman" + 0.024*"difference" + 0.024*"power" 
  + 0.020*"role" + 0.017*"gender" + 0.017*"practice" + 0.016*"female" 
  + 0.013*"culture" + 0.012*"position"'),
 (1,
  '0.034*"perception" + 0.033*"affect" + 0.030*"role" + 0.028*"emotion" 
  + 0.028*"positive" + 0.026*"emotional" + 0.026*"negative" 
  + 0.017*"network" + 0.016*"influence" + 0.016*"collective"'),
 (2,
  '0.056*"transformational" + 0.043*"subordinate" + 0.026*"rating" 
  + 0.021*"trait" + 0.017*"associate" + 0.016*"experience" 
  + 0.016*"significant" + 0.016*"personality" + 0.014*"high" 
  + 0.014*"analysis"'),
 (3,
  '0.046*"development" + 0.027*"perspective" + 0.017*"develop" 
  + 0.015*"political" + 0.015*"include" + 0.013*"purpose" 
  + 0.013*"interest" + 0.012*"view" + 0.012*"multiple" + 0.010*"year"'),
 (4,
  '0.054*"employee" + 0.048*"work" + 0.034*"lmx" + 0.023*"job" 
  + 0.023*"supervisor" + 0.021*"perceive" + 0.019*"authentic" 
  + 0.018*"mediate" + 0.016*"hypothesis" + 0.015*"satisfaction"'),
 (5,
  '0.024*"understand" + 0.020*"effective" + 0.019*"vision" 
  + 0.017*"problem" + 0.017*"cognitive" + 0.015*"strategy" + 0.014*"lead" 
  + 0.013*"dynamic" + 0.013*"proposition" + 0.013*"identify"'),
 (6,
  '0.106*"performance" + 0.085*"team" + 0.033*"member" + 0.030*"ceo" 
  + 0.021*"management" + 0.019*"skill" + 0.018*"firm" + 0.018*"decision" 
  + 0.017*"share" + 0.015*"strategic"'),
 (7,
  '0.029*"level" + 0.024*"increase" + 0.023*"ethical" + 0.019*"develop" 
  + 0.017*"impact" + 0.016*"moral" + 0.015*"structure" + 0.014*"reveal" 
  + 0.013*"practice" + 0.012*"integrity"'),
 (8,
  '0.051*"charismatic" + 0.041*"change" + 0.035*"manager" + 0.030*"time" 
  + 0.025*"charisma" + 0.018*"style" + 0.014*"content" + 0.013*"crisis" 
  + 0.012*"type" + 0.010*"managerial"'),
 (9,
  '0.073*"group" + 0.040*"effectiveness" + 0.030*"task" + 0.023*"condition" 
  + 0.022*"identity" + 0.021*"individual" + 0.019*"emergence" + 0.017*"show" 
  + 0.015*"response" + 0.014*"characteristic"')]

The ten-topic model seems to provide an accurate representation of the leadership literature’s research themes. So, we can proceed with the next step, which consists of getting the topic representations for the 87 unseen abstracts.

Having pre-processed the unseen documents, we carry out the following steps:

creating a new Gensim corpus with the pre-processed, unseen documents (step 1)

getting the document-to-topic probabilities for the items included in the new corpus (step 2)

storing the document-to-topic probabilities in a Pandas DF for further post-processing (step 3)

Python

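>>> import pandas as pd
>>> # DOCS_PHRASED must hold the pre-processed, unseen abstracts at this point
>>> NEW_CORPUS = [DICT.doc2bow(doc) for doc in DOCS_PHRASED]     # step 1
>>> # Assumption: LDA_MALLET_G is the Mallet model converted to a Gensim LdaModel
>>> LDA_MALLET_G = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(LDA_MALLET)
>>> # minimum_probability=0 keeps low-probability topics in each document's output
>>> TRANSF_CORPUS = LDA_MALLET_G.get_document_topics(
        NEW_CORPUS, minimum_probability=0)                       # step 2
>>> DOC_TOPIC_M = []                                             # step 3
>>> for doc_id, doc in enumerate(TRANSF_CORPUS):
        for topic_n, topic_prob in doc:
            DOC_TOPIC_M.append([doc_id, topic_n, topic_prob])
>>> DF = pd.DataFrame(DOC_TOPIC_M, columns=["doc_id", "topic", "prob"])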
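
As an example of the post-processing mentioned in step 3, the long-format rows can be pivoted to find each document's dominant topic. This is a minimal sketch in plain Python, assuming rows laid out as [document id, topic, probability]; the values are hypothetical and do not come from the notebook.

```python
# Hypothetical rows in the same [doc_id, topic, probability] layout as DOC_TOPIC_M.
rows = [
    [0, 0, 0.10], [0, 1, 0.70], [0, 2, 0.20],
    [1, 0, 0.55], [1, 1, 0.15], [1, 2, 0.30],
]

# Pivot the long format into {doc_id: {topic: probability}}.
doc_topics = {}
for doc_id, topic, prob in rows:
    doc_topics.setdefault(doc_id, {})[topic] = prob

# Dominant topic = the topic with the largest probability in each document.
dominant = {
    doc_id: max(probs, key=probs.get) for doc_id, probs in doc_topics.items()
}
print(dominant)
```

The same pivot can be done on the Pandas DF, but the dictionary version shows the logic without depending on column names.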

This snippet comes from the Jupyter notebook leadership_natural_experiment.ipynb, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.

Teaching materials for a B-school, post-grad module on NLP.

https://github.com/simoneSantoni/NLP-orgs-markets/blob/74e2d04ab571e51a6c6d40ae4938e6e7fd0d0e2d/topicModeling/leadership_natural_experiment.ipynb