Text tokenization with NLTK
Let’s assume that we have a string object called ‘text’ loaded in our current Python session:
Python
>>> print(text)
'Or first American serial? It was released as five short subjects in successive weeks because it was believed that the American film-goer would not sit still for a fifty-minute movie. In that sense it is an interesting curiosity.How is it as a movie or series of movies? For 1909, very good indeed. The acting is a bit overwrought, and I have some issues with the costuming, since all the Egyptians wear beards like Hittites However the the story of the life of Moses is a good story and the script and actor do a good job at showing the character of Moses: his flash temper as a young man that he never completely mastered. Patrick Hartigan is also good in the scene where he first sees Zipporah and is caught between great suspicion and love at first sight.It is also worth discussing some of the special effects: the Angel of Death that appears and disappears, the triptych composition when the Israelites are crossing the Red Sea and the nice double exposure when the Egyptians drown. Yes, they may seem a bit obvious almost a century later..... but they are still startling and for their era, highly innovative.'
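In case ‘text’ is not already in memory, one hypothetical way to load it is to read it from a plain text file (the file name review.txt below is an assumption, not part of the original material):
Python
>>> # read a movie review from a plain text file (hypothetical file name)
>>> with open('review.txt', 'r', encoding='utf-8') as stream:
...     text = stream.read()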
Let’s load some of NLTK’s most popular tokenizers. For a comprehensive list of tokenizers, refer to the NLTK API documentation for nltk.tokenize.
Python
>>> from nltk.tokenize import wordpunct_tokenize
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tokenize import sent_tokenize
wordpunct_tokenize is a tokenizer based on regular expressions: it splits the text mainly on whitespace and punctuation. The following lines of code create and print a list with the tokenizer’s output:
Python
>>> tkns_wp = wordpunct_tokenize(text)
>>> print(tkns_wp)
['Or', 'first', 'American', 'serial', '?', 'It', 'was', 'released', 'as', 'five', 'short', 'subjects', 'in', 'successive', 'weeks', 'because', 'it', 'was', 'believed', 'that', 'the', 'American', 'film', '-', 'goer', ... 'era', ',', 'highly', 'innovative', '.']
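To make the ‘based on regular expressions’ point concrete, wordpunct_tokenize is a thin wrapper around NLTK’s RegexpTokenizer; a sketch of an equivalent call is shown below (the pattern mirrors the one documented for WordPunctTokenizer, worth double-checking against the installed NLTK version):
Python
>>> from nltk.tokenize import RegexpTokenizer
>>> # \w+ captures runs of word characters, [^\w\s]+ captures runs of punctuation
>>> RegexpTokenizer(r'\w+|[^\w\s]+').tokenize(text)[:5]
['Or', 'first', 'American', 'serial', '?']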
Punctuation symbols are successfully separated from the ‘other’ tokens. That is an improvement over A bare bone tokenizer in Python. Not surprisingly, tkns_wp is longer than the counterpart list achieved with A bare bone tokenizer in Python (N = 197).
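A quick check of the count (the figure of 221 is also reported below and obviously depends on the text loaded above):
Python
>>> # more tokens than the bare-bone, whitespace-only approach (N = 197)
>>> len(tkns_wp)
221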
word_tokenize relies on punkt, which divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words (e.g., “let’s”), collocations (e.g., “fast eater”), and words that start sentences (e.g., “Therefore”); each sentence is then split into word tokens. Note that the punkt model is not shipped with NLTK and must be downloaded separately, e.g., via nltk.download('punkt'). The following lines of code create and print a list with the tokenizer’s output:
Python
>>> import nltk
>>> nltk.download('punkt')
>>> tkns_pu = word_tokenize(text)
>>> print(tkns_pu)
['Or', 'first', 'American', 'serial', '?', 'It', 'was', 'released', 'as', 'five', 'short', 'subjects', 'in', 'successive', 'weeks', 'because', 'it', 'was', 'believed', 'that', 'the', 'American', 'film-goer', 'would', 'not', ... 'era', ',', 'highly', 'innovative', '.']
word_tokenize yields fewer tokens than wordpunct_tokenize (213 vs. 221). Unlike the purely regular-expression-based wordpunct_tokenize, word_tokenize successfully recognizes the hyphenated word “film-goer” as a single token.
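The difference can be verified directly on the two token lists created above:
Python
>>> # wordpunct_tokenize splits the hyphenated word, word_tokenize keeps it whole
>>> 'film-goer' in tkns_wp, 'film-goer' in tkns_pu
(False, True)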
sent_tokenize splits a corpus of text into sentences, relying on the punkt model described above.
Python
>>> snt_tkns = [[sentence] for sentence in sent_tokenize(text)]
>>> print(snt_tkns)
[['The Battle of Trafalgar, arresting as it did the power of Napoleon, ...'], ['A film maker, who undertakes to reproduce an event of such importance ...'], ['It may be added at the outset that the effort of the Edison Company ...'], ... ['- The Moving Picture World, September 9, 1911']]
The result is a list of single-element lists, one per sentence.
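If a flat list of sentence strings is preferred over the nested structure, a minor variant (not part of the original script) simply drops the inner brackets:
Python
>>> # each element is now a sentence string rather than a single-element list
>>> flat_snt = [sentence for sentence in sent_tokenize(text)]
>>> # or, equivalently, call the tokenizer directly
>>> flat_snt = sent_tokenize(text)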
This snippet comes from the Python script “nlpPipelines/tokenization.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.