Text tokenization with NLTK
Let’s assume that we have a string object called ‘text’ loaded in our current Python session:
Python
>>> print(text)
'Or first American serial? It was released as five short subjects in successive weeks because it was believed that the American film-goer would not sit still for a fifty-minute movie. In that sense it is an interesting curiosity.How is it as a movie or series of movies? For 1909, very good indeed. The acting is a bit overwrought, and I have some issues with the costuming, since all the Egyptians wear beards like Hittites However the the story of the life of Moses is a good story and the script and actor do a good job at showing the character of Moses: his flash temper as a young man that he never completely mastered. Patrick Hartigan is also good in the scene where he first sees Zipporah and is caught between great suspicion and love at first sight.It is also worth discussing some of the special effects: the Angel of Death that appears and disappears, the triptych composition when the Israelites are crossing the Red Sea and the nice double exposure when the Egyptians drown. Yes, they may seem a bit obvious almost a century later..... but they are still startling and for their era, highly innovative.'
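In case ‘text’ is not already in memory, one hypothetical way to load it is to read it from a plain text file (the file name review.txt below is an assumption, not part of the original material):
Python
>>> # read a movie review from a plain text file (hypothetical file name)
>>> with open('review.txt', 'r', encoding='utf-8') as stream:
...     text = stream.read()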
Let’s load some of NLTK’s most popular tokenizers. For a comprehensive list of tokenizers, refer to the NLTK API documentation for nltk.tokenize.
Python
>>> from nltk.tokenize import wordpunct_tokenize
>>> from nltk.tokenize import word_tokenize
>>> from nltk.tokenize import sent_tokenize
wordpunct_tokenize is a tokenizer based on regular expressions: it splits the text mainly on whitespace and punctuation. The following lines of code create and print a list with the tokenizer’s output:
Python
>>> tkns_wp = wordpunct_tokenize(text)
>>> print(tkns_wp)
['Or', 'first', 'American', 'serial', '?', 'It', 'was', 'released', 'as', 'five', 'short', 'subjects', 'in', 'successive', 'weeks', 'because', 'it', 'was', 'believed', 'that', 'the', 'American', 'film', '-', 'goer', ... 'era', ',', 'highly', 'innovative', '.']
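To make the ‘based on regular expressions’ point concrete, wordpunct_tokenize is a thin wrapper around NLTK’s RegexpTokenizer; a sketch of an equivalent call is shown below (the pattern mirrors the one documented for WordPunctTokenizer, worth double-checking against the installed NLTK version):
Python
>>> from nltk.tokenize import RegexpTokenizer
>>> # \w+ captures runs of word characters, [^\w\s]+ captures runs of punctuation
>>> RegexpTokenizer(r'\w+|[^\w\s]+').tokenize(text)[:5]
['Or', 'first', 'American', 'serial', '?']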
Punctuation symbols are successfully separated from the ‘other’ tokens. That is an improvement over A bare bone tokenizer in Python. Not surprisingly, tkns_wp is longer than the counterpart list achieved with A bare bone tokenizer in Python (N = 197).
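A quick check of the count (the figure of 221 is also reported below and obviously depends on the text loaded above):
Python
>>> # more tokens than the bare-bone, whitespace-only approach (N = 197)
>>> len(tkns_wp)
221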
word_tokenize relies on punkt, which divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words (e.g., “let’s”), collocations (e.g., “fast eater”), and words that start sentences (e.g., “Therefore”); each sentence is then split into word tokens. Note that the punkt model is not shipped with NLTK and must be downloaded separately, e.g., via nltk.download('punkt'). The following lines of code create and print a list with the tokenizer’s output:
Python
>>> import nltk
>>> nltk.download('punkt')
>>> tkns_pu = word_tokenize(text)
>>> print(tkns_pu)
['Or', 'first', 'American', 'serial', '?', 'It', 'was', 'released', 'as', 'five', 'short', 'subjects', 'in', 'successive', 'weeks', 'because', 'it', 'was', 'believed', 'that', 'the', 'American', 'film-goer', 'would', 'not', ... 'era', ',', 'highly', 'innovative', '.']
word_tokenize yields fewer tokens than wordpunct_tokenize (213 vs. 221). Unlike the purely regular-expression-based wordpunct_tokenize, word_tokenize successfully recognizes the hyphenated word “film-goer” as a single token.
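The difference can be verified directly on the two token lists created above:
Python
>>> # wordpunct_tokenize splits the hyphenated word, word_tokenize keeps it whole
>>> 'film-goer' in tkns_wp, 'film-goer' in tkns_pu
(False, True)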
sent_tokenize splits a corpus of text into sentences, relying on the punkt model described above.
Python
>>> snt_tkns = [[sentence] for sentence in sent_tokenize(text)]
>>> print(snt_tkns)
[['The Battle of Trafalgar, arresting as it did the power of Napoleon, ...'], ['A film maker, who undertakes to reproduce an event of such importance ...'], ['It may be added at the outset that the effort of the Edison Company ...'], ... ['- The Moving Picture World, September 9, 1911']]
The result is a list of single-element lists, one per sentence.
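If a flat list of sentence strings is preferred over the nested structure, a minor variant (not part of the original script) simply drops the inner brackets:
Python
>>> # each element is now a sentence string rather than a single-element list
>>> flat_snt = [sentence for sentence in sent_tokenize(text)]
>>> # or, equivalently, call the tokenizer directly
>>> flat_snt = sent_tokenize(text)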
This snippet comes from the Python script “nlpPipelines/tokenization.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.