BoW representation — multiple docs case with a spaCy tokenizer

This snippet builds on BoW representation — multiple docs case. Here, I focus on how to use a spaCy tokenizer instead of NLTK’s TreebankWordTokenizer.

Consider the corpus below, which contains one of the opening paragraphs from each of the Wikipedia entries for Adidas, Nike, and Puma (see BoW representation — multiple docs case).

>>> print(docs)
['\nThe company was started by Adolf Dassler in his mother\'s house; ... \n',
 '\nThe company was founded on January 25, 1964, as "Blue Ribbon ... \n',
 '\nPuma SE, branded as Puma, is a German multinational corporation ... \n']

First things first: we import the libraries the script needs and load spaCy's small English pipeline.

>>> from collections import Counter, OrderedDict
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm") 

We build a list with the tokenized documents, keeping only alphabetic tokens that are neither stop words nor punctuation:

>>> docs_tkns = []
>>> for doc in docs:
        # keep alphabetic tokens that are neither stop words nor punctuation
        tmp = [
            token
            for token in nlp(doc)
            if not token.is_stop and not token.is_punct and token.is_alpha
        ]
        docs_tkns.append(tmp)
>>> print(docs_tkns)
[[company,
  started,
  Adolf,
  Dassler,
  mother,
  house,
  joined,
  elder,
  brother,
  Rudolf,
  Gebrüder,
  Dassler,
  Schuhfabrik,
  Dassler,
  Brothers,
  Shoe,
  Factory,
  Dassler,
  assisted,
  development,
  spiked,
  running,
  shoes,
  spikes,
  multiple,
...
  companies,
  currently,
  based,
  Herzogenaurach,
  Germany]]

We can reproduce the remaining part of the code included in BoW representation — multiple docs case as follows:

>>> voc = sorted(set(sum(docs_tkns, [])))
>>> vector_space = []
>>> for doc in docs_tkns:
        vector = OrderedDict((token, 0) for token in voc)
        tkns_count = Counter(doc)
        for k, v in tkns_count.items():
            vector[k] = v
        vector_space.append(vector)
>>> print(vector_space)
[OrderedDict([(Puma, 0),
              (company, 1),
              (company, 0),
              (SE, 0),
              (branded, 0),
              (founded, 0),
              (started, 1),
              (Puma, 0),
              (Adolf, 1),
              (January, 0),
              (German, 0),
              (Dassler, 1),
              (multinational, 0),
              (mother, 1),
              (Blue, 0),
              (corporation, 0),
              (Ribbon, 0),
              (house, 1),
              (Sports, 0),
              (designs, 0),
              (joined, 1),
              (Bill, 0),
              (Bowerman, 0),
              (manufactures, 0),
              (elder, 1),
...
              (recognized, 0),
              (trademarks, 0),
              (Swoosh, 0),
              (logo, 0)])]

The BoW representation built from the spaCy-tokenized text seems considerably more readable than the one we obtain from the NLTK-tokenized text.
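One caveat is visible in the output above: company and Puma show up more than once in the vocabulary, and Dassler has a count of 1 although it occurs four times in the first document. This appears to happen because docs_tkns holds spaCy Token objects rather than strings, so each occurrence of a word is treated as a distinct set member and dictionary key. A minimal sketch of a string-based variant follows. It assumes the tokens have first been converted to plain text (for example with token.text), and the toy documents below are hypothetical stand-ins, not the actual corpus.

```python
from collections import Counter

# Hypothetical toy input: each document is a list of plain strings, e.g. what
# [[token.text for token in doc] for doc in docs_tkns] would produce.
docs_str = [
    ["Dassler", "Brothers", "Shoe", "Factory", "Dassler"],
    ["company", "founded", "Blue", "Ribbon", "Sports"],
    ["Puma", "German", "company"],
]

# Strings compare by value, so the set keeps one entry per distinct word.
voc = sorted({tkn for doc in docs_str for tkn in doc})

vector_space = []
for doc in docs_str:
    counts = Counter(doc)
    # A Counter returns 0 for words that do not occur in the document.
    vector_space.append({tkn: counts[tkn] for tkn in voc})

print(voc)
print(vector_space[0])
```

With strings as keys, every word occupies a single dimension and repeated words accumulate a count above 1. Swapping token.text for token.lemma_ or token.lower_ would additionally merge inflected or differently cased forms, at the cost of a coarser vocabulary.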

This snippet comes from the Python script “bow.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.

This folder contains the Python scripts/notebooks regarding the second and third building blocks of "NLP, Organizations, and Markets," covering the topic of NLP foundations.

https://github.com/simoneSantoni/NLP-orgs-markets/tree/master/vectorSemantics