This snippet builds on BoW representation – multiple docs case. Here I focus on how to use the spaCy tokenizer instead of NLTK's TreebankWordTokenizer.
Consider the corpus below, which contains one of the opening paragraphs from each of the Wikipedia entries for Adidas, Nike, and Puma (see BoW representation – multiple docs case).
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n',
'\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n',
'\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
First things first: we import the libraries the script needs.
>>> from collections import Counter, OrderedDict
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
We get an iterable with the tokenized documents as follows:
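>>> docs_tkns = []
>>> for doc in docs:
...     # keep alphabetic tokens that are neither stop words nor punctuation;
...     # note that tmp holds spaCy Token objects, not strings
...     tmp = [
...         token
...         for token in nlp(doc)
...         if not token.is_stop and not token.is_punct and token.is_alpha
...     ]
...     docs_tkns.append(tmp)
...     del tmp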
>>> print(docs_tkns)
[[company,
started,
Adolf,
Dassler,
mother,
house,
joined,
elder,
brother,
Rudolf,
Gebrüder,
Dassler,
Schuhfabrik,
Dassler,
Brothers,
Shoe,
Factory,
Dassler,
assisted,
development,
spiked,
running,
shoes,
spikes,
multiple,
...
companies,
currently,
based,
Herzogenaurach,
Germany]]
We can reproduce the remaining part of the code included in BoW representation – multiple docs case as follows:
>>> voc = sorted(set(sum(docs_tkns, [])))
>>> vector_space = []
>>> for doc in docs_tkns:
...     vector = OrderedDict((token, 0) for token in voc)
...     tkns_count = Counter(doc)
...     for k, v in tkns_count.items():
...         vector[k] = v
...     vector_space.append(vector)
>>> print(vector_space)
[OrderedDict([(Puma, 0),
(company, 1),
(company, 0),
(SE, 0),
(branded, 0),
(founded, 0),
(started, 1),
(Puma, 0),
(Adolf, 1),
(January, 0),
(German, 0),
(Dassler, 1),
(multinational, 0),
(mother, 1),
(Blue, 0),
(corporation, 0),
(Ribbon, 0),
(house, 1),
(Sports, 0),
(designs, 0),
(joined, 1),
(Bill, 0),
(Bowerman, 0),
(manufactures, 0),
(elder, 1),
...
(recognized, 0),
(trademarks, 0),
(Swoosh, 0),
(logo, 0)])]
The BoW representation built from the spaCy-tokenized text seems substantially more readable than the one we obtain from the NLTK-tokenized text: the filter applied during tokenization drops stop words, punctuation, and non-alphabetic tokens.
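One caveat about the output above: `company` and `Puma` each show up twice in the vocabulary, and `Dassler` is counted once although it occurs four times in the first document. This is most likely because the lists hold spaCy `Token` objects, which are distinct per occurrence rather than equal by spelling. A minimal sketch of a string-based variant follows. It assumes the tokens have first been converted to strings (for example with `token.text`); the `docs_txt` list below is a small hypothetical stand-in, not the real corpus, so the block runs without spaCy.

```python
from collections import Counter
from itertools import chain

# Hypothetical token strings, standing in for something like
# [[token.text for token in doc] for doc in docs_tkns]
docs_txt = [
    ["company", "started", "Adolf", "Dassler", "Dassler", "Brothers"],
    ["company", "founded", "January", "Blue", "Ribbon", "Sports"],
    ["Puma", "SE", "branded", "Puma", "German", "company"],
]

# Strings with the same spelling compare equal, so set() merges repeats
voc = sorted(set(chain.from_iterable(docs_txt)))

vector_space = []
for doc in docs_txt:
    counts = Counter(doc)
    # Counter returns 0 for missing keys, so every vocabulary word gets a value
    vector_space.append({word: counts[word] for word in voc})

print(voc)
print(vector_space[0]["Dassler"], vector_space[2]["Puma"])
```

With strings, each word gets a single column and repeated words are counted in full. Whether to use the raw text, a lowercased form, or a lemma is a separate design choice that this sketch leaves open.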
This snippet comes from the Python script "bow.py", hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.