BoW representation — multiple docs case
Let’s assume we have a corpus with three documents — namely, one of the first few paragraphs from the Wikipedia entries of Adidas, Nike, and Puma.
Python
Copy
>>> text_0 = """ The company was started by Adolf Dassler in his mother's house; he was joined by his elder brother Rudolf in 1924 under the name Gebrüder Dassler Schuhfabrik ("Dassler Brothers Shoe Factory"). Dassler assisted in the development of spiked running shoes (spikes) for multiple athletic events. To enhance the quality of spiked athletic footwear, he transitioned from a previous model of heavy metal spikes to utilising canvas and rubber. Dassler persuaded U.S. sprinter Jesse Owens to use his handmade spikes at the 1936 Summer Olympics. In 1949, following a breakdown in the relationship between the brothers, Adolf created Adidas, and Rudolf established Puma, which became Adidas' business rival """ >>> text_1 = """ The company was founded on January 25, 1964, as "Blue Ribbon Sports", by Bill Bowerman and Phil Knight, and officially became Nike, Inc. on May 30, 1971. The company takes its name from Nike, the Greek goddess of victory. Nike markets its products under its own brand, as well as Nike Golf, Nike Pro, Nike+, Air Jordan, Nike Blazers, Air Force 1, Nike Dunk, Air Max, Foamposite, Nike Skateboarding, Nike CR7, and subsidiaries including Jordan Brand and Converse. Nike also owned Bauer Hockey from 1995 to 2008, and previously owned Cole Haan, Umbro, and Hurley International. In addition to manufacturing sportswear and equipment, the company operates retail stores under the Niketown name. Nike sponsors many high-profile athletes and sports teams around the world, with the highly recognized trademarks of "Just Do It" and the Swoosh logo. """ >>> text_2 = """ Puma SE, branded as Puma, is a German multinational corporation that designs and manufactures athletic and casual footwear, apparel and accessories, which is headquartered in Herzogenaurach, Bavaria, Germany. Puma is the third largest sportswear manufacturer in the world. The company was founded in 1948 by Rudolf Dassler. 
In 1924, Rudolf and his brother Adolf "Adi" Dassler had jointly formed the company Gebrüder Dassler Schuhfabrik (Dassler Brothers Shoe Factory). The relationship between the two brothers deteriorated until the two agreed to split in 1948, forming two separate entities, Adidas and Puma. Both companies are currently based in Herzogenaurach, Germany. """ >>> docs = [text_0, text_1, text_2]
First things first, we import the libraries the script needs.
Python
Copy
>>> from collections import Counter, OrderedDict
>>> from nltk.tokenize import TreebankWordTokenizer
Compared with the single-document case (🧮BoW representation — single doc case), the multiple-document case involves a complication: each document may contain unique tokens (i.e., tokens that don't appear in any other document). We therefore first build a vocabulary out of the entire corpus, and then project each individual document onto that vocabulary. Let's create an iterable with the tokenized documents:
Python
Copy
>>> docs_tkns = [sorted(TreebankWordTokenizer().tokenize(doc)) for doc in docs]
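Note that TreebankWordTokenizer emits punctuation marks as standalone tokens, which is why characters such as '(' and ',' show up in the vocabulary below. A rough regex stand-in (an approximation for illustration, not NLTK's actual implementation) makes this behaviour visible:

```python
import re

def simple_tokenize(text):
    # rough approximation of Treebank-style behaviour: runs of word
    # characters, with each punctuation mark as its own token
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Adidas, Nike, and Puma."))
# ['Adidas', ',', 'Nike', ',', 'and', 'Puma', '.']
```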
The vocabulary is the set of unique tokens included in the concatenation of text_0, text_1, and text_2.
Python
Copy
>>> voc = sorted(set(sum(docs_tkns, [])))
>>> print(voc)
["'", "''", "'s", '(', ')', ',', '.', '1', '1924', '1936', '1948', '1949', '1964', '1971.', '1995', '2008', '25', '30', ';', 'Adi', 'Adidas', 'Adolf', 'Air', 'Bauer', 'Bavaria', ... 'well', 'which', 'with', 'world', 'world.']
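As an aside, `sum(docs_tkns, [])` flattens the list of token lists but re-copies the accumulator on every addition, which is quadratic on large corpora; `itertools.chain.from_iterable` is a linear-time drop-in. A minimal sketch on made-up token lists (the toy names are illustrative):

```python
from itertools import chain

# toy stand-in for docs_tkns: any list of token lists works
toy_tkns = [["the", "cat", "sat", "."], ["the", "dog", "ran", "."]]

# chain.from_iterable flattens the token lists in a single linear pass
toy_voc = sorted(set(chain.from_iterable(toy_tkns)))
print(toy_voc)  # ['.', 'cat', 'dog', 'ran', 'sat', 'the']
```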
We're ready to project the individual docs onto the vocabulary to get proper BoW representations. The collection of these projections is known as a vector space. In terms of logic, the below-displayed code comprises these steps:
step 1: it creates an empty container ‘vector_space’
Then, iterating over the documents in the corpus:
step 2: it creates a temporary object ‘vector’ that maps each unique token in ‘voc’ to a count of 0
step 3: it takes the BoW transformation of the document (see 🧮BoW representation — single doc case)
step 4: for each token in the document-specific BoW ‘tkns_count’, it updates the corresponding entry of ‘vector’ with that token's count
step 5: it stores the projection of the document onto the vocabulary — that is, the real BoW representation of the document, which accounts both for the tokens included in the document and for those that aren't
step 6: it deletes the temporary object (which will be recreated in the next iteration of the loop, if any)
Python
Copy
>>> vector_space = []  # step 1
>>> for doc in docs_tkns:
...     vector = OrderedDict((token, 0) for token in voc)  # step 2
...     tkns_count = Counter(doc)  # step 3
...     for k, v in tkns_count.items():  # step 4
...         vector[k] = v
...     vector_space.append(vector)  # step 5
...     del vector  # step 6
>>> print(vector_space[0])
OrderedDict([("'", 1), ("''", 1), ("'s", 1), ('(', 2), (')', 2), (',', 5), ('.', 1), ('1', 0), ('1924', 1), ('1936', 1), ('1948', 0), ('1949', 1), ('1964', 0), ('1971.', 0), ('1995', 0), ('2008', 0), ('25', 0), ('30', 0), (';', 1), ('Adi', 0), ('Adidas', 2), ('Adolf', 2), ('Air', 0), ('Bauer', 0), ('Bavaria', 0), ... ('well', 0), ('which', 1), ('with', 0), ('world', 0), ('world.', 0)])
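Because every OrderedDict in the vector space shares the same key order (the sorted vocabulary), the counts can be read straight off into a plain document-term matrix for downstream use. A self-contained sketch of the same steps on a made-up two-document corpus (the toy names are illustrative):

```python
from collections import Counter, OrderedDict

toy_tkns = [["the", "cat", "sat"], ["the", "dog"]]         # tokenized toy corpus
toy_voc = sorted(set(t for doc in toy_tkns for t in doc))  # shared vocabulary

vector_space = []
for doc in toy_tkns:
    vector = OrderedDict((token, 0) for token in toy_voc)  # step 2
    for k, v in Counter(doc).items():                      # steps 3-4
        vector[k] = v
    vector_space.append(vector)                            # step 5

# rows = documents, columns = vocabulary tokens, cells = raw counts
matrix = [list(vec.values()) for vec in vector_space]
print(toy_voc)  # ['cat', 'dog', 'sat', 'the']
print(matrix)   # [[1, 0, 1, 1], [0, 1, 0, 1]]
```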
This snippet comes from the Python script “bow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.