Let’s assume that we have a string object called ‘text’ loaded in our current Python session:
Shell
>>> print(text)
'Or first American serial? It was released as five short subjects in successive
weeks because it was believed that the American film-goer would not sit still
for a fifty-minute movie. In that sense it is an interesting curiosity.How is it
as a movie or series of movies? For 1909, very good indeed. The acting is a bit
overwrought, and I have some issues with the costuming, since all the Egyptians
wear beards like Hittites However the the story of the life of Moses is a good
story and the script and actor do a good job at showing the character of Moses:
his flash temper as a young man that he never completely mastered. Patrick
Hartigan is also good in the scene where he first sees Zipporah and is caught
between great suspicion and love at first sight.It is also worth discussing some
of the special effects: the Angel of Death that appears and disappears, the
triptych composition when the Israelites are crossing the Red Sea and the nice
double exposure when the Egyptians drown. Yes, they may seem a bit obvious almost
a century later..... but they are still startling and for their era, highly
innovative.'
A bare-bones tokenizer may create a list out of “text” by splitting the string on whitespace:
Shell
>>> tokens = text.split(" ")
>>> print(tokens)
['Or',
'first',
'American',
'serial?',
'It',
'was',
'released',
'as',
'five',
'short',
'subjects',
'in',
'successive',
'weeks',
'because',
'it',
'was',
'believed',
'that',
'the',
'American',
'film-goer',
'would',
'not',
'sit',
...
'for',
'their',
'era,',
'highly',
'innovative.']
A visual inspection of “tokens” indicates that:
The tokenizer fails to separate punctuation from words. For example, the token in position 3 is “serial?”.
In certain circumstances, the tokenizer seems to behave smartly and to capture bi-grams such as “film-goer”. In fact, it is not smart: our bare-bones tokenizer cannot distinguish genuine bi-grams (e.g., “film-goer”) from two tokens joined by a hyphen by mistake or for reasons that matter only to the writer (e.g., “bus-python” might make sense for somebody out there, but it is not a socially shared bi-gram). A minimal regex-based alternative is sketched below.
This snippet comes from the Python script “nlpPipelines/tokenization.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.
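Addressing the punctuation issue does not require much extra machinery. As a sketch, using only Python’s built-in re module (this is not the approach taken in “nlpPipelines/tokenization.py”), one possible pattern keeps runs of word characters joined by hyphens together while splitting off every other non-whitespace character as its own token:
Shell
>>> import re
>>> # sketch: hyphenated compounds stay whole; punctuation becomes separate tokens
>>> better_tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)
>>> print(better_tokens[:5])
['Or', 'first', 'American', 'serial', '?']
>>> print([t for t in better_tokens if "-" in t])
['film-goer', 'fifty-minute']
Note that this pattern still treats “film-goer” and a made-up compound such as “bus-python” identically; telling socially shared bi-grams from accidental ones requires a vocabulary or a statistical model, not a smarter regular expression.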