Convert short texts to numeric vectors with a character n-gram TF-IDF vectorizer using scikit-learn in Python


When applying machine learning algorithms to words or short texts, we usually need to get their numeric embedding vectors first.
Powerful methods include using a pre-trained deep learning model such as BERT to obtain more semantic embeddings. If computational resources are limited, or we want a simpler embedding method, we can try TF-IDF metrics.

Here we introduce a very simple way to combine character-level n-grams and TF-IDF to convert short texts, such as a few words, into numeric vectors. With these numeric vectors, we can then apply classification methods such as a Gradient Boosted Machine for downstream tasks.

First, let’s review what an n-gram is:

Quoting the definition from the Wikipedia N-gram page:

…an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application… An n-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram”; size 3 is a “trigram”. Larger sizes are sometimes referred to by the value of n, e.g., “four-gram”, “five-gram”, and so on.

The two most common types of N-Grams, by far, are (1) character-level, where the items consist of one or more characters and (2) word-level, where the items consist of one or more words. The size of the item (or token as it’s often called) is defined by n…
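
For example, here is a quick way to list the character bigrams of a single word (a minimal sketch using plain Python, no libraries needed):

# character bigrams (n = 2) of a single word
word = 'happy'
bigrams = [word[i:i+2] for i in range(len(word) - 1)]
print(bigrams)
['ha', 'ap', 'pp', 'py']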

Second, what is the TF-IDF metric?

TF-IDF (Term Frequency - Inverse Document Frequency) encoding is an improvement over bag-of-words (BOW) encoding, which uses raw term frequency alone. TF-IDF treats a term that appears frequently across many documents as less important.

TF (Term Frequency): counts how many times a term appears in a document
IDF (Inverse Document Frequency): the inverse of the number of documents that contain the term

So TF-IDF is basically the product of the TF and IDF metrics.
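
To make this concrete, here is a minimal sketch of the plain (unsmoothed) computation. Note that scikit-learn’s TfidfVectorizer actually uses a smoothed IDF, log((1 + n) / (1 + df)) + 1, and L2-normalizes each row by default, so its numbers will differ:

import math

# toy corpus: 4 documents, each a list of words
docs = [['great', 'people'], ['feel', 'happy'], ['nice', 'work'], ['warm', 'weather']]
n = len(docs)

term, doc = 'happy', docs[1]
tf = doc.count(term)                    # term frequency within the document
df = sum(1 for d in docs if term in d)  # number of documents containing the term
idf = math.log(n / df)                  # plain inverse document frequency
print(tf * idf)                         # TF-IDF weight of 'happy' in document 1
1.3862943611198906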

It turns out to be very simple to implement character-level n-gram TF-IDF encoding of short texts using the scikit-learn package.
This means we can easily incorporate this step into our data processing and feature engineering pipeline.

Step 1: Fit a character n-gram TF-IDF vectorizer on the training data

# using character n-grams of length 2 (bigrams) for example
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['great people', 'feel happy', 'nice work', 'warm weather']
vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 2)).fit(corpus)

# inspect the learned bigram vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
print(list(vectorizer.get_feature_names_out()))
[' h', ' p', ' w', 'ap', 'ar', 'at', 'ce', 'e ', 'ea', 'ee', 'el', 'eo', 'er', 'fe', 'gr', 'ha', 'he', 'ic', 'l ', 'le', 'm ', 'ni', 'op', 'or', 'pe', 'pl', 'pp', 'py', 're', 'rk', 'rm', 't ', 'th', 'wa', 'we', 'wo']
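
Note that ngram_range=(2, 2) keeps bigrams only; passing, say, ngram_range=(2, 3) would add character trigrams to the vocabulary as well, at the cost of a larger feature space.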

Step 2: Transform new text into a TF-IDF weighted vector

# transform an unseen text; bigrams outside the training vocabulary are simply ignored
new_text = ['pineapple milk']
tfidf_vector = vectorizer.transform(new_text).toarray()
print(tfidf_vector)
[[0.         0.         0.         0.42176478 0.         0.
  0.         0.42176478 0.3325242  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.42176478 0.         0.         0.         0.
  0.         0.42176478 0.42176478 0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]]
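
With the vectorizer in place, we can chain it with a classifier in a scikit-learn Pipeline, as mentioned earlier. Below is a minimal sketch, where the labels for the toy corpus are made up purely for illustration:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['great people', 'feel happy', 'nice work', 'warm weather']
labels = [1, 1, 0, 0]  # hypothetical labels, just for illustration

# character bigram TF-IDF features feeding a gradient boosted machine
model = Pipeline([
    ('tfidf', TfidfVectorizer(analyzer='char', ngram_range=(2, 2))),
    ('gbm', GradientBoostingClassifier()),
])
model.fit(corpus, labels)
print(model.predict(['pineapple milk']))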

Author: robot learner