What is Tokenization?

Tokenization is the process of dividing a large body of text into smaller units called tokens, typically words or sentences.
Tokens are useful for finding patterns in text and serve as the starting point for later steps such as stemming and lemmatization.
In data security, the term has a separate meaning: replacing sensitive data elements with non-sensitive substitutes.

Natural language processing is used to build applications such as text classification, intelligent chatbots, sentiment analysis, and language translation.
All of these applications depend on understanding the patterns in the text, which is why tokenization matters.
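
As a rough illustration of how tokens expose patterns, the sketch below counts word frequencies in a short text. It uses only the Python standard library, with a simple regular expression standing in for a real tokenizer; the function name simple_tokenize and the sample text are made up for this example.

```python
import re
from collections import Counter


def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase the text and
    return every run of word characters as a token."""
    return re.findall(r"\w+", text.lower())


text = "Think and wonder, wonder and think."
tokens = simple_tokenize(text)

# Count how often each token occurs; repeated words stand out immediately
counts = Counter(tokens)
print(counts.most_common(3))
```

Once the text is a list of tokens, ordinary tools such as Counter can be applied to it directly, which is much harder to do with a raw string.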

Tokenization of words

We use the word_tokenize() function from NLTK to split a sentence into words.
The output of word tokenization can be converted to a DataFrame for easier handling in machine learning applications.
It can also be passed as input to further text cleaning steps such as punctuation removal, numeric character removal, or stemming.

Code example:

from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))

Output: ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']

As the example above shows, punctuation marks are included as tokens. Sometimes we want to exclude them.
There are two ways to do this:

USE nltk.RegexpTokenizer() TO REMOVE ALL PUNCTUATION MARKS

Call nltk.RegexpTokenizer(pattern) with pattern set to r"\w+" to create a tokenizer that extracts every run of word characters (letters, digits, and underscores) from a string.
Call tokenize(text) on that tokenizer, with text as a string representing a sentence, to return the text as a list of words with punctuation removed.

import nltk

sentence = "Think and wonder, wonder and think."

# Keep only runs of word characters; punctuation is never matched
tokenizer = nltk.RegexpTokenizer(r"\w+")
new_words = tokenizer.tokenize(sentence)

print(new_words)

Output: ['Think', 'and', 'wonder', 'wonder', 'and', 'think']
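
One caveat worth knowing: because r"\w+" matches only word characters, it also breaks contractions apart at the apostrophe. The sketch below shows this along with one possible workaround. The alternative pattern and the sample sentence are assumptions made for illustration, not part of the recipe above.

```python
from nltk.tokenize import RegexpTokenizer

sentence = "Don't stop, it's fine."

# Plain \w+ treats the apostrophe as a separator
plain = RegexpTokenizer(r"\w+")
print(plain.tokenize(sentence))

# Assumed alternative pattern: allow apostrophes inside a word.
# The group is non-capturing (?:...) so whole matches are returned.
keep_apostrophes = RegexpTokenizer(r"\w+(?:'\w+)*")
print(keep_apostrophes.tokenize(sentence))
```

The first call splits "Don't" into 'Don' and 't', while the second keeps "Don't" and "it's" intact. Which behavior is preferable depends on what the later processing steps expect.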

USE nltk.word_tokenize() AND LIST COMPREHENSION TO REMOVE ALL PUNCTUATION MARKS

Call nltk.word_tokenize(text) with text as a string representing a sentence to return the text as a list of tokens. Then use the list comprehension [word for word in words if word.isalnum()], with words as the previous result, to keep only the tokens that consist entirely of alphanumeric characters.

import nltk

sentence = "Think and wonder, wonder and think."

words = nltk.word_tokenize(sentence)
# Keep only tokens made up entirely of letters and digits
new_words = [word for word in words if word.isalnum()]

print(new_words)

Output: ['Think', 'and', 'wonder', 'wonder', 'and', 'think']
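
The other cleaning steps mentioned earlier, numeric character removal and stemming, can be chained onto the token list in the same way. The following is a minimal sketch, assuming the Porter stemmer is an acceptable choice; the sample sentence is invented for this example.

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Sample sentence invented for this example
sentence = "In 2021 the cats were running 3 races."

# Tokenize without punctuation, as in the RegexpTokenizer approach
tokens = RegexpTokenizer(r"\w+").tokenize(sentence)

# Numeric character removal: drop tokens made up entirely of digits
no_numbers = [tok for tok in tokens if not tok.isdigit()]

# Stemming: reduce each remaining token to its stem
stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in no_numbers]

print(no_numbers)
print(stems)
```

Here "2021" and "3" are dropped, and words such as "cats" and "running" are reduced to "cat" and "run". Other stemmers may produce different stems for the same input.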

Tokenization of Sentences

Sometimes you first need to split a text into sentences.

Code example:

from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))

Output: ['God is Great!', 'I won a lottery.']
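
To see why a dedicated sentence tokenizer is worth using, compare it with a naive rule that splits after every sentence-ending punctuation mark. The function below is a hypothetical stand-in written with the standard library only; it is not how sent_tokenize works internally.

```python
import re


def naive_sent_split(text):
    """Hypothetical simplified splitter: break after '.', '!' or '?'
    when followed by whitespace. Not the algorithm NLTK uses."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


print(naive_sent_split("God is Great! I won a lottery."))

# An abbreviation fools the naive rule into splitting after "Dr."
print(naive_sent_split("Dr. Smith won a lottery. He is happy."))
```

The naive rule handles the first text correctly but cuts the second one into three pieces, starting with a stray 'Dr.'. sent_tokenize relies on a pretrained Punkt model that is designed to cope with abbreviations of this kind, which is the main reason to prefer it over a hand-written rule.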

robot learner

https://datasciencebyexample.github.io/2021/06/09/2021-06-09-1/