In Text, meanings, and maths, we saw how to use BoW and TF-IDF to create vector representations for text regions such as sentences, paragraphs, or even entire documents. We can also accomplish that with the doc2vec algorithm. This script presents a Gensim doc2vec application that creates a vector space from a minimal dataset; then, an unseen document is projected onto the embedding to get its vector representation.
We start by importing the necessary libraries.
Python
>>> from nltk.tokenize import wordpunct_tokenize
>>> from gensim.models.doc2vec import Doc2Vec, TaggedDocument
Our sample data contains six known quotes.
Python
>>> quotes = [
    "I find television very educating. Every time somebody turns on the set "
    "I go into the other room and read a book",
    "Some people never go crazy. What truly horrible lives they must lead",
    "Be nice to nerds. You may end up working for them. We all could",
    "I do not want people to be very agreeable, as it saves me the trouble "
    "of liking them a great deal",
    "I did not attend his funeral, but I sent a nice letter saying I "
    "approved of it",
    "So many books, so little time",
]
Now we can pre-process the data. Specifically, we tokenize the quotes (step 1) and store the outcome in a list of TaggedDocument objects (step 2), the input format that Gensim's Doc2Vec expects.
Python
>>> tkn_quotes = [wordpunct_tokenize(quote.lower()) for quote in quotes] # step 1
>>> tgd_quotes = [TaggedDocument(d, [i]) for i, d in enumerate(tkn_quotes)] # step 2
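To see what the pre-processed data look like, we can inspect the first TaggedDocument (a quick sanity check we add here, not part of the original script):
Python
>>> tgd_quotes[0].words[:5]  # first five tokens of the first quote
['i', 'find', 'television', 'very', 'educating']
>>> tgd_quotes[0].tags  # the integer tag assigned via enumerate
[0]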
We are now ready to train a doc2vec representation with Gensim's Doc2Vec. Optionally, we save the trained model locally to ‘quote_embedding.model’ and load it back into the current Python session.
Python
>>> model = Doc2Vec(
    tgd_quotes,  # tagged training documents
    vector_size=20,  # dimensionality of the document vectors
    window=2,  # context window size
    min_count=1,  # keep every token, however rare
    workers=4,  # parallel training threads
    epochs=100,  # passes over the corpus
)
>>> model.save("quote_embedding.model")
>>> model = Doc2Vec.load("quote_embedding.model")
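After training, every quote has its own vector in the model's document-vector store (model.dv in Gensim 4.x; older releases called it docvecs). A quick check of the dimensionality:
Python
>>> len(model.dv[0])  # the vector for the first quote has vector_size dimensions
20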
We can now project an unseen document onto the vector space. First, we tokenize it exactly as we did the training data.
Python
>>> new_quote = "Time is an illusion. Lunchtime doubly so"
>>> new_tokens = wordpunct_tokenize(new_quote.lower())  # tokens, not yet a vector
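Note that infer_vector starts inference from a random initialization, so repeated calls can return slightly different vectors; passing a larger epochs argument is a common way to stabilize the result. A minimal sketch (the variable name vec and the epochs value are our choices, not the script's):
Python
>>> vec = model.infer_vector(new_tokens, epochs=100)  # more inference passes, steadier vector
>>> len(vec)  # matches the vector_size chosen at training time
20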
Retrieving the documents that are most similar to ‘new_quote’ is a one-liner.
Python
>>> model.dv.most_similar(positive=[model.infer_vector(new_tokens)], topn=5)
[(3, 0.904436469078064),
(4, 0.8571681976318359),
(1, 0.8404998779296875),
(0, 0.839330792427063),
(2, 0.790512204170227)]
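The tags in the output are the integer indices we assigned with enumerate, so mapping a hit back to its text is straightforward. For example, per the run shown above (exact scores will vary across runs because inference is stochastic):
Python
>>> sims = model.dv.most_similar(positive=[model.infer_vector(new_tokens)], topn=5)
>>> quotes[sims[0][0]]  # tag 3 in the run above
'I do not want people to be very agreeable, as it saves me the trouble of liking them a great deal'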
This snippet comes from the Python script “barebone_doc2vec.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.