Consider the corpus of text below, which contains one of the opening paragraphs from each of the Wikipedia entries of Adidas, Nike, and Puma (see BoW representation – multiple docs case).
Python
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n',
'\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n',
'\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
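As a side note, a corpus like this could be assembled by reading plain-text files into a list. The sketch below is only illustrative: it assumes three hypothetical files, adidas.txt, nike.txt, and puma.txt, sitting in the working directory, which is not necessarily how the original script loads the data.
Python
>>> paths = ["adidas.txt", "nike.txt", "puma.txt"]  # hypothetical file names
>>> docs = []
>>> for path in paths:
        with open(path, encoding="utf-8") as f:
            docs.append(f.read())  # one string per document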
First things first, we import the libraries necessary for the script.
Python
>>> from typing import Dict, List, Tuple
>>> import collections
>>> from nltk.tokenize import TreebankWordTokenizer
In the BoW representation – multiple docs case and the BoW representation – multiple docs case with a spaCy tokenizer, we adopted a two-step approach: first, we created the vocabulary (once and for all); then, we projected each document onto it. It is possible to avoid such a two-step approach by using a for loop that:
step 1: creates an empty container to populate with the BoW representation of the document
step 2: tests whether the extant vocabulary contains the token at hand
step 3: if not, expands the vocabulary
step 4: updates the container with the BoW representation
Python
>>> def doc2bow(tkns_: List[str], voc_: Dict[str, int]) -> List[Tuple[int, int]]:
        """Project a tokenized document onto a (possibly growing) vocabulary.

        Args:
            tkns_ (List[str]): iterable with document tokens
            voc_ (Dict[str, int]): the vocabulary for the corpus (may be empty)

        Returns:
            List[Tuple[int, int]]: iterable of tuples with token positions
                and token cardinalities
        """
        tkns_count = collections.defaultdict(int)  # step 1: empty container
        for tkn in tkns_:
            if tkn not in voc_:  # step 2: is the token in the vocabulary?
                voc_[tkn] = len(voc_)  # step 3: if not, expand the vocabulary
            tkns_count[voc_[tkn]] += 1  # step 4: update the container
        return list(tkns_count.items())
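For comparison, here is a minimal sketch of the two-step approach mentioned above: build the vocabulary once over the whole (already tokenized) corpus, then project each document onto it. The helper names build_voc and project are illustrative and not part of the original script.
Python
>>> def build_voc(corpus_tkns: List[List[str]]) -> Dict[str, int]:
        """Step A: create the vocabulary once, scanning the whole corpus."""
        voc = {}
        for tkns in corpus_tkns:
            for tkn in tkns:
                if tkn not in voc:
                    voc[tkn] = len(voc)
        return voc
>>> def project(tkns_: List[str], voc_: Dict[str, int]) -> List[Tuple[int, int]]:
        """Step B: project a single document onto the fixed vocabulary."""
        counts = collections.defaultdict(int)
        for tkn in tkns_:
            counts[voc_[tkn]] += 1
        return list(counts.items())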
Let's deploy the doc2bow function. We initialize an empty dictionary to pass as the 'voc_' argument.
Python
>>> voc = {}
Then, we apply the function to the first element of the 'docs' corpus. Each tuple identifies a token's position and cardinality in the document at hand.
Python
>>> print(doc2bow(tkns_=TreebankWordTokenizer().tokenize(docs[0]), voc_=voc))
[(0, 1), (1, 1), (2, 2), (3, 1), (4, 2), (5, 2), (6, 5), (7, 4), (8, 3),
(9, 1), (10, 1), (11, 1), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1),
(17, 2), (18, 1), (19, 1), (20, 6), (21, 1), (22, 1), (23, 1), (24, 2),
(25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 2), (31, 1), (32, 1),
(33, 1), (34, 3), (35, 2), (36, 1), (37, 1), (38, 3), (39, 1), (40, 1),
(41, 2), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 5), (48, 1),
(49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 2), (56, 1),
(57, 1), (58, 2), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1),
(65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1),
(73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1),
(81, 1), (82, 1), (83, 1), (84, 1), (85, 1), (86, 1)]
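To read these tuples back in terms of tokens, one can invert 'voc' (position → token). This quick check is illustrative and not part of the original script.
Python
>>> idx2tkn = {pos: tkn for tkn, pos in voc.items()}  # invert the vocabulary
>>> print(idx2tkn[0])  # the token stored at position 0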
The size of the vocabulary is 87 unique tokens.
Python
>>> print(len(voc))
87
If we pass a second document to doc2bow, 'voc' updates; now it contains 171 unique tokens.
Python
>>> bow_2 = doc2bow(TreebankWordTokenizer().tokenize(docs[1]), voc)
>>> print(len(voc))
171
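By the same logic, the whole corpus could be processed in one pass. A minimal sketch, assuming a fresh vocabulary (the names voc_all and bows are illustrative, not part of the original script):
Python
>>> voc_all = {}
>>> bows = [doc2bow(TreebankWordTokenizer().tokenize(doc), voc_all) for doc in docs]
>>> print(len(bows))  # one BoW representation per document
3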
The snippets in this section come from the Python script 'bow.py', hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.