BoW representation — multiple docs case
Let’s assume we have a corpus with three documents — namely, one of the first few paragraphs from the Wikipedia entries of Adidas, Nike, and Puma.
Python
Copy
>>> text_0 = """ The company was started by Adolf Dassler in his mother's house; he was joined by his elder brother Rudolf in 1924 under the name Gebrüder Dassler Schuhfabrik ("Dassler Brothers Shoe Factory"). Dassler assisted in the development of spiked running shoes (spikes) for multiple athletic events. To enhance the quality of spiked athletic footwear, he transitioned from a previous model of heavy metal spikes to utilising canvas and rubber. Dassler persuaded U.S. sprinter Jesse Owens to use his handmade spikes at the 1936 Summer Olympics. In 1949, following a breakdown in the relationship between the brothers, Adolf created Adidas, and Rudolf established Puma, which became Adidas' business rival """ >>> text_1 = """ The company was founded on January 25, 1964, as "Blue Ribbon Sports", by Bill Bowerman and Phil Knight, and officially became Nike, Inc. on May 30, 1971. The company takes its name from Nike, the Greek goddess of victory. Nike markets its products under its own brand, as well as Nike Golf, Nike Pro, Nike+, Air Jordan, Nike Blazers, Air Force 1, Nike Dunk, Air Max, Foamposite, Nike Skateboarding, Nike CR7, and subsidiaries including Jordan Brand and Converse. Nike also owned Bauer Hockey from 1995 to 2008, and previously owned Cole Haan, Umbro, and Hurley International. In addition to manufacturing sportswear and equipment, the company operates retail stores under the Niketown name. Nike sponsors many high-profile athletes and sports teams around the world, with the highly recognized trademarks of "Just Do It" and the Swoosh logo. """ >>> text_2 = """ Puma SE, branded as Puma, is a German multinational corporation that designs and manufactures athletic and casual footwear, apparel and accessories, which is headquartered in Herzogenaurach, Bavaria, Germany. Puma is the third largest sportswear manufacturer in the world. The company was founded in 1948 by Rudolf Dassler. 
In 1924, Rudolf and his brother Adolf "Adi" Dassler had jointly formed the company Gebrüder Dassler Schuhfabrik (Dassler Brothers Shoe Factory). The relationship between the two brothers deteriorated until the two agreed to split in 1948, forming two separate entities, Adidas and Puma. Both companies are currently based in Herzogenaurach, Germany. """ >>> docs = [text_0, text_1, text_2]
First things first, we import the libraries the script needs.
Python
Copy
>>> from collections import Counter, OrderedDict
>>> from nltk.tokenize import TreebankWordTokenizer
Compared with the single-document case (🧮BoW representation — single doc case), the multiple-document case involves a complication: each document may contain unique tokens (i.e., tokens that don't appear in any other document). We therefore first build a vocabulary out of the entire corpus, and then project each individual document onto that vocabulary. Let's create an iterable with the tokenized documents:
Python
Copy
>>> docs_tkns = [sorted(TreebankWordTokenizer().tokenize(doc)) for doc in docs]
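Note that TreebankWordTokenizer emits punctuation marks as standalone tokens, which is why characters such as '(' and ',' show up in the vocabulary below. A rough regex stand-in (an approximation for illustration, not NLTK's actual implementation) makes this behaviour visible:

```python
import re

def simple_tokenize(text):
    # rough approximation of Treebank-style behaviour: runs of word
    # characters, with each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Adidas, Nike, and Puma."))
# ['Adidas', ',', 'Nike', ',', 'and', 'Puma', '.']
```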
The vocabulary is the set of unique tokens included in the concatenation of text_0, text_1, and text_2.
Python
Copy
>>> voc = sorted(set(sum(docs_tkns, [])))
>>> print(voc)
["'", "''", "'s", '(', ')', ',', '.', '1', '1924', '1936', '1948', '1949', '1964', '1971.', '1995', '2008', '25', '30', ';', 'Adi', 'Adidas', 'Adolf', 'Air', 'Bauer', 'Bavaria', ... 'well', 'which', 'with', 'world', 'world.']
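As an aside, `sum(docs_tkns, [])` flattens the list of token lists but re-copies the accumulator on every addition, which is quadratic on large corpora; `itertools.chain.from_iterable` is a linear-time drop-in. A minimal sketch on made-up token lists (the toy names are illustrative):

```python
from itertools import chain

# toy stand-in for docs_tkns: any list of token lists works
toy_tkns = [["the", "cat", "sat", "."], ["the", "dog", "ran", "."]]

# chain.from_iterable flattens the token lists in a single linear pass
toy_voc = sorted(set(chain.from_iterable(toy_tkns)))
print(toy_voc)  # ['.', 'cat', 'dog', 'ran', 'sat', 'the']
```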
We're ready to project the individual docs onto the vocabulary to get proper BoW representations. The collection of these projections is known as a vector space. In terms of logic, the below-displayed code comprises these steps:
step 1: it creates an empty container ‘vector_space’
Then, iterating over the documents in the corpus:
step 2: it creates a temporary object ‘vector’ that maps each unique token in ‘voc’ to a count of 0
step 3: it takes the BoW transformation of the document (see 🧮BoW representation — single doc case)
step 4: for each token in the document-specific BoW ‘tkns_count’, it updates the corresponding entry of ‘vector’ with that token's count
step 5: it stores the projection of the document onto the vocabulary — that is, the real BoW representation of the document, which accounts both for the tokens included in the document and for those that aren't
step 6: it deletes the temporary object (which will be recreated in the next iteration of the loop, if any)
Python
Copy
>>> vector_space = []  # step 1
>>> for doc in docs_tkns:
...     vector = OrderedDict((token, 0) for token in voc)  # step 2
...     tkns_count = Counter(doc)  # step 3
...     for k, v in tkns_count.items():  # step 4
...         vector[k] = v
...     vector_space.append(vector)  # step 5
...     del vector  # step 6
>>> print(vector_space[0])
OrderedDict([("'", 1), ("''", 1), ("'s", 1), ('(', 2), (')', 2), (',', 5), ('.', 1), ('1', 0), ('1924', 1), ('1936', 1), ('1948', 0), ('1949', 1), ('1964', 0), ('1971.', 0), ('1995', 0), ('2008', 0), ('25', 0), ('30', 0), (';', 1), ('Adi', 0), ('Adidas', 2), ('Adolf', 2), ('Air', 0), ('Bauer', 0), ('Bavaria', 0), ... ('well', 0), ('which', 1), ('with', 0), ('world', 0), ('world.', 0)])
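Because every OrderedDict in the vector space shares the same key order (the sorted vocabulary), the counts can be read straight off into a plain document-term matrix for downstream use. A self-contained sketch of the same steps on a made-up two-document corpus (the toy names are illustrative):

```python
from collections import Counter, OrderedDict

toy_tkns = [["the", "cat", "sat"], ["the", "dog"]]         # tokenized toy corpus
toy_voc = sorted(set(t for doc in toy_tkns for t in doc))  # shared vocabulary

vector_space = []
for doc in toy_tkns:
    vector = OrderedDict((token, 0) for token in toy_voc)  # step 2
    for k, v in Counter(doc).items():                      # steps 3-4
        vector[k] = v
    vector_space.append(vector)                            # step 5

# rows = documents, columns = vocabulary tokens, cells = raw counts
matrix = [list(vec.values()) for vec in vector_space]
print(toy_voc)  # ['cat', 'dog', 'sat', 'the']
print(matrix)   # [[1, 0, 1, 1], [0, 1, 0, 1]]
```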
This snippet comes from the Python script “bow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.