Training a doc2vec embedding with Gensim
In 🥸Text, meanings, and maths we saw how to use BoW and TF-IDF to create vector representations for text regions such as sentences, paragraphs, or even entire documents. We can also accomplish this with the doc2vec algorithm. This script presents a Gensim doc2vec application that creates a vector space from a minimal dataset. Then, an unseen document is projected onto the embedding to obtain its vector representation.
We start by importing the necessary libraries.
Python
>>> from nltk.tokenize import wordpunct_tokenize
>>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
Our sample data contains six well-known quotes.
Python
>>> quotes = [
...     "I find television very educating. Every time somebody turns on the set "
...     "I go into the other room and read a book",
...     "Some people never go crazy. What truly horrible lives they must lead",
...     "Be nice to nerds. You may end up working for them. We all could",
...     "I do not want people to be very agreeable, as it saves me the trouble "
...     "of liking them a great deal",
...     "I did not attend his funeral, but I sent a nice letter saying I "
...     "approved of it",
...     "So many books, so little time",
... ]
Now, we can pre-process the data: we tokenize the quotes (step 1) and store the outcome in a list of TaggedDocument objects (step 2), the input format Gensim's doc2vec expects.
Python
>>> tkn_quotes = [wordpunct_tokenize(quote.lower()) for quote in quotes]  # step 1
>>> tgd_quotes = [TaggedDocument(d, [i]) for i, d in enumerate(tkn_quotes)]  # step 2
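Before training, it is worth peeking at the result (this check is not part of the original script): each TaggedDocument pairs a token list with a list of tags, here the quote's position in quotes.
Python
>>> tgd_quotes[0].words[:5]
['i', 'find', 'television', 'very', 'educating']
>>> tgd_quotes[0].tags
[0]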
We are now ready to train a doc2vec representation with Gensim's Doc2Vec. Optionally, we can save the trained model locally to 'quote_embedding.model' and load it back in the current Python session.
Python
>>> model = Doc2Vec(
...     tgd_quotes,
...     vector_size=20,
...     window=2,
...     min_count=1,
...     workers=4,
...     epochs=100,
... )
>>> model.save("quote_embedding.model")
>>> model = Doc2Vec.load("quote_embedding.model")
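After training, the model holds one vector per tagged document. A minimal sanity check, assuming Gensim 4.x (where document vectors live under model.dv):
Python
>>> len(model.dv)  # one vector per training quote
6
>>> model.dv[0].shape  # dimensionality set by vector_size
(20,)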
We can now use the vector space to get the representation of an unseen document.
Python
>>> new_quote = "Time is an illusion. Lunchtime doubly so"
>>> new_tokens = wordpunct_tokenize(new_quote.lower())
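The projection itself happens in infer_vector, which maps the token list to a vector in the trained space. Note that inference is stochastic, so repeated calls return slightly different vectors.
Python
>>> inferred = model.infer_vector(new_tokens)
>>> inferred.shape
(20,)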
Retrieving the documents most similar to 'new_quote' is a one-liner.
Python
>>> model.dv.most_similar(positive=[model.infer_vector(new_tokens)], topn=5)
[(3, 0.904436469078064),
 (4, 0.8571681976318359),
 (1, 0.8404998779296875),
 (0, 0.839330792427063),
 (2, 0.790512204170227)]
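The returned tags are the indices we assigned in step 2, so mapping the best match back to its text is straightforward (scores, and possibly the ranking, will vary across runs because training and inference are both stochastic):
Python
>>> sims = model.dv.most_similar(positive=[model.infer_vector(new_tokens)], topn=1)
>>> best_tag, best_score = sims[0]
>>> quotes[best_tag]  # text of the closest training quote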
This snippet comes from the Python script “barebone_doc2vec.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.