BoW representation β€” updating a reference vocabulary β€˜live’
Consider the corpus of text below, which contains one of the opening paragraphs from each of the Wikipedia entries of Adidas, Nike, and Puma (see πŸ—žοΈBoW representation β€” multiple docs case).
Python
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n', '\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n', '\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
First things first, we import the libraries needed for the script.
Python
>>> from typing import Dict, List, Tuple
>>> import collections
>>> from nltk.tokenize import TreebankWordTokenizer
In the πŸ—žοΈBoW representation β€” multiple docs case and BoW representation β€” multiple docs case with a spaCy tokenizer we have adopted a multi-pronged approach: first, we create the vocabulary (once for all), then, we project the document onto it. It is possible to avoid such a multi-pronged approach by using a for loop that:
step 1: it creates an empty container to populate with the BoW representation of the document
step 2: it tests whether the extant vocabulary contains the token in the document at hand
step 3: if not, it expands the vocabulary
step 4: it updates the container with the BoW representation
Python
>>> def doc2bow(tkns_: List[str], voc_: Dict[str, int]) -> List[Tuple[int, int]]:
...     """Project a tokenized document onto a vocabulary, expanding it as needed.
...
...     Args:
...         tkns_ (List[str]): iterable with document tokens
...         voc_ (Dict[str, int]): the vocabulary for the corpus (may be empty)
...
...     Returns:
...         List[Tuple[int, int]]: iterable of tuples with token positions and token cardinalities
...     """
...     tkns_count = collections.defaultdict(int)  # step 1
...     for tkn in tkns_:
...         if tkn not in voc_:  # step 2
...             voc_[tkn] = len(voc_)  # step 3
...         tkns_count[voc_[tkn]] += 1  # step 4
...     return list(tkns_count.items())
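Before turning to the corpus, it may help to run doc2bow on a tiny, made-up list of tokens (the token list below is purely illustrative): positions are assigned in order of first appearance, and the counts are accumulated per position.
Python
>>> toy_voc = {}
>>> doc2bow(["to", "be", "or", "not", "to", "be"], toy_voc)
[(0, 2), (1, 2), (2, 1), (3, 1)]
>>> toy_voc
{'to': 0, 'be': 1, 'or': 2, 'not': 3}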
Let’s deploy the doc2bow function. We initialize an empty dictionary to pass as the β€˜voc_’ argument.
Python
>>> voc = {}
Then, we apply the function to the first element of the β€˜docs’ corpus. Each tuple in the output identifies a token’s position in the vocabulary and its cardinality in the document at hand.
Python
>>> bow_1 = doc2bow(tkns_=TreebankWordTokenizer().tokenize(docs[0]), voc_=voc)
>>> print(bow_1)
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 2), (5, 2), (6, 5), (7, 4), (8, 3), (9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 2), (18, 1), (19, 1), (20, 6), (21, 1), (22, 1), (23, 1), (24, 2), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 2), (31, 1), (32, 1), (33, 1), (34, 3), (35, 2), (36, 1), (37, 1), (38, 3), (39, 1), (40, 1), (41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 5), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1), (57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)]
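Positions alone are not very readable. Since β€˜voc’ maps tokens to positions, we can invert it to look up which token each position refers to; the snippet below is a quick sketch of that lookup (the name β€˜id2tkn’ is just an illustrative choice).
Python
>>> id2tkn = {pos: tkn for tkn, pos in voc.items()}
>>> # first three pairs, matching the (0, 1), (1, 1), (2, 2) entries printed above
>>> [(id2tkn[pos], count) for pos, count in bow_1][:3]
[('The', 1), ('company', 1), ('was', 2)]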
The vocabulary now contains 87 unique tokens.
Python
>>> print(len(voc))
87
If we pass a second document to doc2bow, β€˜voc’ updates β€” now, it contains 171 unique tokens.
Python
>>> bow_2 = doc2bow(TreebankWordTokenizer().tokenize(docs[1]), voc)
>>> print(len(voc))
171
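Because the vocabulary grows as a side effect of each call, the whole corpus can be processed in a single pass; the sketch below assumes we start from a fresh vocabulary (the names β€˜corpus_voc’ and β€˜bows’ are illustrative).
Python
>>> corpus_voc = {}
>>> # one BoW representation per document, sharing (and growing) the same vocabulary
>>> bows = [doc2bow(TreebankWordTokenizer().tokenize(doc), corpus_voc) for doc in docs]
>>> print(len(bows))
3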
This snippet comes from the Python script β€œbow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.