>>> text_0 = """
The company was started by Adolf Dassler in his mother's house; he was joined
by his elder brother Rudolf in 1924 under the name Gebrüder Dassler
Schuhfabrik ("Dassler Brothers Shoe Factory"). Dassler assisted in the
development of spiked running shoes (spikes) for multiple athletic events. To
enhance the quality of spiked athletic footwear, he transitioned from a
previous model of heavy metal spikes to utilising canvas and rubber. Dassler
persuaded U.S. sprinter Jesse Owens to use his handmade spikes at the 1936
Summer Olympics. In 1949, following a breakdown in the relationship between
the brothers, Adolf created Adidas, and Rudolf established Puma, which became
Adidas' business rival.
"""
>>> text_1 = """
The company was founded on January 25, 1964, as "Blue Ribbon Sports", by Bill
Bowerman and Phil Knight, and officially became Nike, Inc. on May 30, 1971.
The company takes its name from Nike, the Greek goddess of victory. Nike
markets its products under its own brand, as well as Nike Golf, Nike Pro,
Nike+, Air Jordan, Nike Blazers, Air Force 1, Nike Dunk, Air Max, Foamposite,
Nike Skateboarding, Nike CR7, and subsidiaries including Jordan Brand and
Converse. Nike also owned Bauer Hockey from 1995 to 2008, and previously owned
Cole Haan, Umbro, and Hurley International. In addition to manufacturing
sportswear and equipment, the company operates retail stores under the
Niketown name. Nike sponsors many high-profile athletes and sports teams
around the world, with the highly recognized trademarks of "Just Do It" and
the Swoosh logo.
"""
>>> text_2 = """
Puma SE, branded as Puma, is a German multinational corporation that designs
and manufactures athletic and casual footwear, apparel and accessories, which
is headquartered in Herzogenaurach, Bavaria, Germany. Puma is the third largest
sportswear manufacturer in the world. The company was founded in 1948 by
Rudolf Dassler. In 1924, Rudolf and his brother Adolf "Adi" Dassler had
jointly formed the company Gebrüder Dassler Schuhfabrik (Dassler Brothers Shoe
Factory). The relationship between the two brothers deteriorated until the two
agreed to split in 1948, forming two separate entities, Adidas and Puma. Both
companies are currently based in Herzogenaurach, Germany.
"""
>>> docs = [text_0, text_1, text_2]
First things first: we import the libraries the script needs.
>>> from collections import Counter, OrderedDict, defaultdict
>>> from nltk.tokenize import TreebankWordTokenizer
Compared with the single-document case (see BoW representation — single doc case), the multiple-document case adds a complication: each document may contain tokens that the other documents lack. Hence, we first have to build a vocabulary out of the entire corpus, and then project each individual document onto that vocabulary. Let's create an iterable with the tokenized documents:
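To see the complication concretely, here is a minimal sketch with two made-up, toy documents (not drawn from the corpus above). Per-document counts alone are not comparable because their keys differ; a shared vocabulary aligns them:

```python
from collections import Counter

# Toy documents with partially overlapping vocabularies
doc_a = ["shoe", "factory", "shoe"]
doc_b = ["brand", "factory"]

# Raw counts are not directly comparable: the key sets differ
print(Counter(doc_a))  # Counter({'shoe': 2, 'factory': 1})
print(Counter(doc_b))  # Counter({'brand': 1, 'factory': 1})

# A shared vocabulary makes the two representations comparable
voc = sorted(set(doc_a) | set(doc_b))
vec_a = [Counter(doc_a)[t] for t in voc]
vec_b = [Counter(doc_b)[t] for t in voc]
print(voc)    # ['brand', 'factory', 'shoe']
print(vec_a)  # [0, 1, 2]
print(vec_b)  # [1, 1, 0]
```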
>>> docs_tkns = [sorted(TreebankWordTokenizer().tokenize(doc)) for doc in docs]
The vocabulary is the set of unique tokens included in the concatenation of text_0, text_1, and text_2.
>>> voc = sorted(set(sum(docs_tkns, [])))
>>> print(voc)
["'",
"''",
"'s",
'(',
')',
',',
'.',
'1',
'1924',
'1936',
'1948',
'1949',
'1964',
'1971.',
'1995',
'2008',
'25',
'30',
';',
'Adi',
'Adidas',
'Adolf',
'Air',
'Bauer',
'Bavaria',
...
'well',
'which',
'with',
'world',
'world.']
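As a side note, `sum(docs_tkns, [])` concatenates lists correctly but runs in quadratic time in the total number of tokens; `itertools.chain.from_iterable` is an equivalent, linear-time alternative. A sketch with hypothetical token lists standing in for `docs_tkns`:

```python
from itertools import chain

# Hypothetical tokenized documents standing in for docs_tkns
docs_tkns = [["a", "b"], ["b", "c"], ["c", "d"]]

# Equivalent to sorted(set(sum(docs_tkns, []))), but linear in total tokens
voc = sorted(set(chain.from_iterable(docs_tkns)))
print(voc)  # ['a', 'b', 'c', 'd']
```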
We're ready to project the individual docs onto the vocabulary to get a proper BoW representation. The collection of these projections is known as a vector space. Logically, the code displayed below comprises the following steps:
step 1: it creates an empty container 'vector_space'
Then, iterating over the documents in the corpus:
step 2: it creates a temporary object 'vector' that maps each unique token in 'voc' to a count of 0
step 3: it takes the BoW transformation of the document (see BoW representation — single doc case)
step 4: for each token in the document-specific BoW 'tkns_count', it updates the corresponding count in 'vector'
step 5: it stores the projection of the document onto the vocabulary — that is, the full BoW representation of the document, covering both the tokens that appear in the document and those that don't
step 6: it deletes the temporary object (a fresh one is created in the next iteration of the loop, if any)
>>> vector_space = []  # step 1
>>> for doc in docs_tkns:
...     vector = OrderedDict((token, 0) for token in voc)  # step 2
...     tkns_count = Counter(doc)  # step 3
...     for k, v in tkns_count.items():  # step 4
...         vector[k] = v
...     vector_space.append(vector)  # step 5
...     del vector  # step 6
>>> print(vector_space[0])
OrderedDict([("'", 1),
("''", 1),
("'s", 1),
('(', 2),
(')', 2),
(',', 5),
('.', 1),
('1', 0),
('1924', 1),
('1936', 1),
('1948', 0),
('1949', 1),
('1964', 0),
('1971.', 0),
('1995', 0),
('2008', 0),
('25', 0),
('30', 0),
(';', 1),
('Adi', 0),
('Adidas', 2),
('Adolf', 2),
('Air', 0),
('Bauer', 0),
('Bavaria', 0),
...
('well', 0),
('which', 1),
('with', 0),
('world', 0),
('world.', 0)])
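Because every vector is built over the same `voc`, the OrderedDicts can be unpacked into plain count lists of equal length — the document-term matrix that most downstream routines (distances, similarities) expect. A self-contained sketch with toy documents (not the corpus above), using the same loop as the snippet:

```python
from collections import Counter, OrderedDict

# Toy tokenized documents standing in for docs_tkns
docs_tkns = [["adidas", "puma", "puma"], ["nike", "puma"]]
voc = sorted(set(t for doc in docs_tkns for t in doc))

vector_space = []
for doc in docs_tkns:
    vector = OrderedDict((token, 0) for token in voc)
    for k, v in Counter(doc).items():
        vector[k] = v
    vector_space.append(vector)

# Every projection has one entry per vocabulary token...
assert all(len(v) == len(voc) for v in vector_space)

# ...so the values line up column-by-column as a document-term matrix
matrix = [list(v.values()) for v in vector_space]
print(voc)     # ['adidas', 'nike', 'puma']
print(matrix)  # [[1, 0, 2], [0, 1, 1]]
```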
This snippet comes from the Python script “bow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.