Consider the corpus of text below, which contains one of the opening paragraphs from each of the Wikipedia entries of Adidas, Nike, and Puma (see BoW representation – multiple docs case).
Python
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n',
'\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n',
'\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
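As a side note, a corpus like this could be assembled by reading plain-text files into a list. The sketch below is only illustrative: it assumes three hypothetical files, adidas.txt, nike.txt, and puma.txt, sitting in the working directory, which is not necessarily how the original script loads the data.
Python
>>> paths = ["adidas.txt", "nike.txt", "puma.txt"]  # hypothetical file names
>>> docs = []
>>> for path in paths:
        with open(path, encoding="utf-8") as f:
            docs.append(f.read())  # one string per document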
First things first, we import the libraries necessary for the script.
Python
>>> from typing import Dict, List, Tuple
>>> import collections
>>> from nltk.tokenize import TreebankWordTokenizer
In the BoW representation – multiple docs case and the BoW representation – multiple docs case with a spaCy tokenizer, we adopted a two-step approach: first, we created the vocabulary (once and for all); then, we projected each document onto it. It is possible to avoid such a two-step approach by using a for loop that:
step 1: creates an empty container to populate with the BoW representation of the document
step 2: tests whether the extant vocabulary contains the token at hand
step 3: if not, expands the vocabulary
step 4: updates the container with the BoW representation
Python
>>> def doc2bow(tkns_: List[str], voc_: Dict[str, int]) -> List[Tuple[int, int]]:
        """Project a tokenized document onto a (possibly growing) vocabulary.

        Args:
            tkns_ (List[str]): iterable with document tokens
            voc_ (Dict[str, int]): the vocabulary for the corpus (may be empty)

        Returns:
            List[Tuple[int, int]]: iterable of tuples with token positions
                and token cardinalities
        """
        tkns_count = collections.defaultdict(int)  # step 1: empty container
        for tkn in tkns_:
            if tkn not in voc_:  # step 2: is the token in the vocabulary?
                voc_[tkn] = len(voc_)  # step 3: if not, expand the vocabulary
            tkns_count[voc_[tkn]] += 1  # step 4: update the container
        return list(tkns_count.items())
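For comparison, here is a minimal sketch of the two-step approach mentioned above: build the vocabulary once over the whole (already tokenized) corpus, then project each document onto it. The helper names build_voc and project are illustrative and not part of the original script.
Python
>>> def build_voc(corpus_tkns: List[List[str]]) -> Dict[str, int]:
        """Step A: create the vocabulary once, scanning the whole corpus."""
        voc = {}
        for tkns in corpus_tkns:
            for tkn in tkns:
                if tkn not in voc:
                    voc[tkn] = len(voc)
        return voc
>>> def project(tkns_: List[str], voc_: Dict[str, int]) -> List[Tuple[int, int]]:
        """Step B: project a single document onto the fixed vocabulary."""
        counts = collections.defaultdict(int)
        for tkn in tkns_:
            counts[voc_[tkn]] += 1
        return list(counts.items())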
Let's deploy the doc2bow function. We initialize an empty dictionary to pass as the 'voc_' argument.
Python
>>> voc = {}
Then, we apply the function to the first element of the 'docs' corpus. Each tuple identifies a token's position and cardinality in the document at hand.
Python
>>> print(doc2bow(tkns_=TreebankWordTokenizer().tokenize(docs[0]), voc_=voc))
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 2), (5, 2), (6, 5), (7, 4), (8, 3),
(9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1),
(17, 2), (18, 1), (19, 1), (20, 6), (21, 1), (22, 1), (23, 1), (24, 2),
(25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 2), (31, 1), (32, 1),
(33, 1), (34, 3), (35, 2), (36, 1), (37, 1), (38, 3), (39, 1), (40, 1),
(41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 5), (48, 1),
(49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1),
(57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1),
(65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1),
(73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1),
(81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)]
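To read these tuples back in terms of tokens, one can invert 'voc' (position → token). This quick check is illustrative and not part of the original script.
Python
>>> idx2tkn = {pos: tkn for tkn, pos in voc.items()}  # invert the vocabulary
>>> print(idx2tkn[0])  # the token stored at position 0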
The size of the vocabulary is 87 unique tokens.
Python
>>> print(len(voc))
87
If we pass a second document to doc2bow, 'voc' updates; now it contains 171 unique tokens.
Python
>>> bow_2 = doc2bow(TreebankWordTokenizer().tokenize(docs[1]), voc)
>>> print(len(voc))
171
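By the same logic, the whole corpus could be processed in one pass. A minimal sketch, assuming a fresh vocabulary (the names voc_all and bows are illustrative, not part of the original script):
Python
>>> voc_all = {}
>>> bows = [doc2bow(TreebankWordTokenizer().tokenize(doc), voc_all) for doc in docs]
>>> print(len(bows))  # one BoW representation per document
3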
The snippets in this section come from the Python script 'bow.py', hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.