Text tokenization with spaCy
Let’s assume that we have a string object called ‘text’ loaded in our current Python session:
>>> print(text)
Or first American serial? It was released as five short subjects in successive weeks because it was believed that the American film-goer would not sit still for a fifty-minute movie. In that sense it is an interesting curiosity.How is it as a movie or series of movies? For 1909, very good indeed. The acting is a bit overwrought, and I have some issues with the costuming, since all the Egyptians wear beards like Hittites However the the story of the life of Moses is a good story and the script and actor do a good job at showing the character of Moses: his flash temper as a young man that he never completely mastered. Patrick Hartigan is also good in the scene where he first sees Zipporah and is caught between great suspicion and love at first sight.It is also worth discussing some of the special effects: the Angel of Death that appears and disappears, the triptych composition when the Israelites are crossing the Red Sea and the nice double exposure when the Egyptians drown. Yes, they may seem a bit obvious almost a century later..... but they are still startling and for their era, highly innovative.
Let’s load the spaCy library along with one of the pre-trained language models released by Explosion, the company behind spaCy. The language model contains the data, algorithms, and rules needed to tokenize our text (and to tag the individual tokens).
>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> tkns_sp = [token.text for token in nlp(text)]
>>> print(tkns_sp)
['Or', 'first', 'American', 'serial', '?', 'It', 'was', 'released', 'as', 'five', 'short', 'subjects', 'in', 'successive', 'weeks', 'because', 'it', 'was', 'believed', 'that', 'the', 'American', 'film', '-', 'goer', ... 'era', ',', 'highly', 'innovative', '.']
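The list comprehension reads one of the many attributes that spaCy’s pipeline attaches to each token; here it is token.text, the token’s surface string. Other attributes are accessed the same way: for example, token.pos_ returns the part-of-speech tag assigned to the token as a string (token.pos returns the same tag as an integer ID).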
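To make the idea of per-token attributes concrete, here is a minimal sketch that reads a few of them in one pass. It is not taken from the original script. As an assumption made only to keep the example free of a model download, it uses a blank English pipeline (spacy.blank("en")), which provides the tokenizer and lexical attributes but no tagger, so token.pos_ would be empty here. With the en_core_web_sm pipeline loaded earlier, the same loop could read token.pos_ as well. The names sample, rows, and words are illustrative.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components),
# chosen so the sketch runs without downloading a model.
nlp = spacy.blank("en")

# A short sentence borrowed from the review text.
sample = "For 1909, very good indeed."
doc = nlp(sample)

# One tuple per token: surface string, character offset in the input,
# whether the token is punctuation, and whether it looks like a number.
rows = [(token.text, token.idx, token.is_punct, token.like_num) for token in doc]
for row in rows:
    print(row)

# Token attributes can also drive filtering, e.g. dropping punctuation.
words = [token.text for token in doc if not token.is_punct]
print(words)
```

Filtering on attributes such as token.is_punct is often the next step after tokenization, since it keeps the word-like tokens and discards the rest without any extra string handling.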
This snippet comes from the Python script “nlpPipelines/tokenization.py”, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.