BoW representation — one-hot encoding with Pandas
Consider the corpus below, which contains one of the first few paragraphs from each of the Wikipedia entries for Adidas, Nike, and Puma (see 🗞️BoW representation — multiple docs case).
In this script, we use Pandas to build one-hot encoded vectors, one per document.
```python
>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n', '\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n', '\nPuma SE, branded as Puma, is a German multinational corporation ... \n']
```
First things first: we import the libraries the script needs.
```python
>>> from nltk.tokenize import TreebankWordTokenizer
>>> import pandas as pd
```
As a first step, we create an empty DataFrame (DF) that will act as the container for the one-hot vectors.
```python
>>> oh = pd.DataFrame()
```
Then, we iterate over the documents in the corpus and, for each one, go through the following steps:
step 1: document tokenization
step 2: vocabulary creation
step 3: creation of a DF whose columns are the unique tokens included in ‘voc’ (these become the keys of the dictionary passed to the DF constructor). This DF is populated with 1’s, denoting that each token occurs in the list ‘tkns’ (recall that we are not interested in how often a token occurs in the text; we only want to register its presence).
step 4: concatenation of the container ‘oh’ and the temporary DF ‘corpus’
step 5: NA replacement. Since the individual ‘corpus’ DFs have different columns (different documents may have different ‘voc’ sets!), the concatenation in step 4 produces NaNs wherever a token does not occur in a document; we replace them with 0’s.
```python
>>> for i, doc in enumerate(docs):
...     tkns = TreebankWordTokenizer().tokenize(doc)           # step 1
...     voc = sorted(set(tkns))                                # step 2
...     corpus = pd.DataFrame({k: 1 for k in voc}, index=[i])  # step 3
...     oh = pd.concat([oh, corpus], axis=0)                   # step 4
...     oh.fillna(0, inplace=True)                             # step 5
```
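To get a feel for the resulting matrix, the same steps can be run on a toy corpus. The sketch below is only an illustration, not part of “bow.py”. The three short sentences in `toy_docs` are made up, whitespace splitting stands in for the TreebankWordTokenizer so that only Pandas is needed, and the names `toy_docs`, `frames`, and `toy_oh` are hypothetical. It also collects the per-document DFs in a list and concatenates them once at the end, a small variation on the loop above.

```python
import pandas as pd

# Hypothetical toy corpus (not the Wikipedia paragraphs used above).
toy_docs = [
    "adidas makes shoes",
    "nike makes shoes too",
    "puma makes apparel",
]

frames = []
for i, doc in enumerate(toy_docs):
    # Assumption: plain whitespace splitting replaces the NLTK tokenizer.
    tkns = doc.split()                                          # step 1
    voc = sorted(set(tkns))                                     # step 2
    frames.append(pd.DataFrame({k: 1 for k in voc}, index=[i])) # step 3

# Steps 4 and 5: a single concatenation, then NaN -> 0 and integer dtype.
toy_oh = pd.concat(frames, axis=0).fillna(0).astype(int)

print(toy_oh)
```

Each row of `toy_oh` is the one-hot vector of one document, and each column is a token from the union of the per-document vocabularies. Concatenating once, rather than growing ‘oh’ inside the loop, avoids copying the container at every iteration; for a three-document corpus the difference is negligible.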
This snippet comes from the Python script “bow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.