BoW representation — multiple docs case with a spaCy tokenizer
This snippet builds on 🗞️BoW representation — multiple docs case; here, I emphasize how to use a spaCy tokenizer instead of NLTK's TreebankWordTokenizer.
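To make the contrast concrete, here is a minimal sketch (the sample sentence is my own): NLTK's TreebankWordTokenizer returns plain strings, whereas spaCy returns Token objects exposing the lexical attributes we rely on below.
Python
>>> from nltk.tokenize import TreebankWordTokenizer
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> sent = "Adidas AG is headquartered in Herzogenaurach."  # made-up example
>>> TreebankWordTokenizer().tokenize(sent)  # NLTK: a list of plain strings
>>> [(t.text, t.is_stop, t.is_punct) for t in nlp(sent)]  # spaCy: Tokens with attributes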
Consider the corpus of text below, which contains one of the first few paragraphs from the Wikipedia entries of Adidas, Nike, and Puma (see 🗞️BoW representation — multiple docs case).
Python
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n',
 '\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n',
 '\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
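If you want to run the snippet standalone, docs just needs to be a list of strings; one hypothetical way to assemble it (the file names are made up) is:
Python
>>> from pathlib import Path
>>> # hypothetical plain-text files holding the three Wikipedia paragraphs
>>> docs = [
...     Path(p).read_text(encoding="utf-8")
...     for p in ["adidas.txt", "nike.txt", "puma.txt"]
... ]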
First things first, we import the necessary libraries and load spaCy's small English pipeline.
Python
>>> from collections import Counter, OrderedDict
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
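The token attributes we filter on below (is_stop, is_punct, is_alpha) are lexical, so the statistical components of the pipeline are not strictly needed. As an optional speed-up, not part of the original script, spaCy lets you disable them at load time:
Python
>>> nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])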
We get an iterable with the tokenized documents as follows. Note that we store each token's text, a plain string, rather than the Token object itself, so that the set, sorted, and Counter operations used later behave as they do in the NLTK-based version:
Python
>>> docs_tkns = []
>>> for doc in docs:
...     tmp = [
...         token.text  # keep the string, not the Token object
...         for token in nlp(doc)
...         if not token.is_stop and not token.is_punct and token.is_alpha
...     ]
...     docs_tkns.append(tmp)
...
>>> del tmp
>>> print(docs_tkns)
[['company', 'started', 'Adolf', 'Dassler', 'mother', 'house', 'joined', 'elder', 'brother', 'Rudolf', 'Gebrüder', 'Dassler', 'Schuhfabrik', 'Dassler', 'Brothers', 'Shoe', 'Factory', 'Dassler', 'assisted', 'development', 'spiked', 'running', 'shoes', 'spikes', 'multiple', ... 'companies', 'currently', 'based', 'Herzogenaurach', 'Germany']]
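Because spaCy tokens carry linguistic attributes, one could go a step further and normalize the vocabulary, e.g. by lemmatizing and lower-casing, which would collapse variants such as 'shoes' and 'shoe' into a single entry. A possible variant, not part of the original script:
Python
>>> docs_lemmas = [
...     [
...         token.lemma_.lower()  # lemmatize and lower-case each kept token
...         for token in nlp(doc)
...         if not token.is_stop and not token.is_punct and token.is_alpha
...     ]
...     for doc in docs
... ]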
We can reproduce the remaining part of the code included in 🗞️BoW representation — multiple docs case as follows. Since the tokens are now plain strings, the vocabulary is properly sorted and each term appears exactly once:
Python
>>> voc = sorted(set(sum(docs_tkns, [])))
>>> vector_space = []
>>> for doc in docs_tkns:
...     vector = OrderedDict((token, 0) for token in voc)
...     tkns_count = Counter(doc)
...     for k, v in tkns_count.items():
...         vector[k] = v
...     vector_space.append(vector)
...
>>> print(vector_space)
[OrderedDict([..., ('Adolf', 1), ..., ('company', 1), ..., ('house', 1), ..., ('joined', 1), ..., ('mother', 1), ..., ('started', 1), ...]), ...]
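For larger corpora, the same document-term counts can be obtained with scikit-learn's CountVectorizer by plugging in a spaCy-based callable as the tokenizer. A minimal sketch, assuming scikit-learn is installed (this is not part of the original script):
Python
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> def spacy_tokenizer(text):
...     return [
...         t.text
...         for t in nlp(text)
...         if not t.is_stop and not t.is_punct and t.is_alpha
...     ]
...
>>> cv = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False)
>>> dtm = cv.fit_transform(docs)  # sparse document-term matrix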
The BoW representation based on the spaCy-tokenized text seems substantially more readable than the one we achieve with the NLTK-tokenized text.
This snippet comes from the Python script "bow.py", hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.