A transformer example to maintain same feature order and add missing features back for feature engineering

data science

Publish Date: 2022-01-27

In data science projects, one important step in feature engineering is to make sure the feature columns are in the same order at training time
and at prediction/test time. Otherwise, the model may receive values in the wrong columns and we will not get the results we expect.

This is usually not a problem during train/test split or cross-validation, where the training and test data are generally split from the
same dataframe. However, once the model is put online, the transformer needs to handle single events, which usually arrive as
JSON data and are then converted to a dataframe. During this process, the original column order may not be preserved.
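To make the issue concrete, here is a minimal sketch of how a single JSON event can end up with a different column order than the training data. The event payload and the feature names are made up for illustration.

```python
import json

import pandas as pd

# Column order seen at training time (hypothetical feature names).
train_columns = ["cat1", "cat2", "cat3"]

# A single online event: the keys arrive in a different order and "cat3" is absent.
event_json = '{"cat2": "h", "cat1": "j"}'
event = json.loads(event_json)

# One event becomes a one-row dataframe; its columns follow the key order of the event.
df_event = pd.DataFrame([event])

print(list(df_event.columns))
print(list(df_event.columns) == train_columns)
```

The dataframe built from the event has only two columns, in the order the keys arrived, so it cannot be passed to the model as is.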

To ensure the same feature order is used, we can build a transformer for the pipeline. During the fit stage, it remembers the original column order;
during the transform stage, it enforces that order. In addition, if any column is missing, it adds that column filled with null
values.
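Before wrapping the idea in a transformer, the fit/transform logic can be sketched in a few lines. This sketch uses pandas `DataFrame.reindex` as a compact stand-in, which is not the approach taken in the transformer in this post; the dataframes and variable names are illustrative only.

```python
import pandas as pd

# Illustrative data: three features at fit time, two reordered features later.
df_fit = pd.DataFrame({"cat1": ["a", "c"], "cat2": ["b", "d"], "cat3": ["f", "e"]})
df_new = pd.DataFrame({"cat2": ["h"], "cat1": ["j"]})

# "fit": remember the column order of the training data.
remembered_columns = list(df_fit.columns)

# "transform": reindex on the remembered columns. This reorders the existing
# columns and adds any missing column filled with NaN.
df_aligned = df_new.reindex(columns=remembered_columns)

print(df_aligned)
```

Note that `reindex` also drops any extra column that was not seen at fit time, which is usually what a fitted model expects.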

set up some example dataframes

import pandas as pd

# training example, where we have 3 features
df_train = pd.DataFrame(data=[['a','b','f'],['c','d','e']])
df_train.columns =  ['cat1','cat2','cat3']
display(df_train)

# test example where the 3rd feature is missing and the column order differs
df_test = pd.DataFrame(data=[['h','j']])
df_test.columns =  ['cat2','cat1']
display(df_test)

	cat1	cat2	cat3
0	a	b	f
1	c	d	e

	cat2	cat1
0	h	j

a transformer which can be added to a full pipeline

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class orderMaitain_Transformer(BaseEstimator, TransformerMixin):
    """Remember the training column order and dtypes, and enforce them at transform time."""

    def __init__(self):
        # maps column name -> dtype name; filled in by fit()
        self.dtype_dict = {}
        print('initialized')

    def fit(self, X, y=None):
        # dicts preserve insertion order, so this also records the column order
        self.dtype_dict = X.dtypes.apply(lambda x: x.name).to_dict()
        return self

    def transform(self, X_, y=None):
        # work on a copy so the caller's dataframe is not modified
        X = X_.copy()
        train_columns = []

        # add missing columns, if any
        for col in self.dtype_dict:
            train_columns.append(col)
            if col not in X.columns:
                # a missing boolean is treated as False; other strategies are possible
                if self.dtype_dict[col].startswith('bool'):
                    X[col] = False
                else:
                    # an empty Series aligns to X's index, giving an all-null column
                    X[col] = pd.Series(dtype=self.dtype_dict[col])

        # apply the training column order
        print(train_columns)
        X = X[train_columns]
        return X


orderMaitain_transformer = orderMaitain_Transformer()

initialized

apply the transformer during training and test stages

during training stage

orderMaitain_transformer.fit_transform(df_train)

['cat1', 'cat2', 'cat3']

	cat1	cat2	cat3
0	a	b	f
1	c	d	e

during test and prediction stage

# check that an empty column has been added and the column order matches training
orderMaitain_transformer.transform(df_test)

['cat1', 'cat2', 'cat3']

	cat1	cat2	cat3
0	j	h	NaN
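As a rough sketch of how such a transformer might sit at the front of a full scikit-learn pipeline, the example below chains a column-order step with an imputer and a model. Everything here is an assumption for illustration: the class name `ColumnOrderKeeper`, the numeric features `x1`, `x2`, `x3`, and the target are made up, and the class is a condensed stand-in built on `DataFrame.reindex` rather than the transformer defined above.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnOrderKeeper(BaseEstimator, TransformerMixin):
    """Hypothetical condensed stand-in: remember column order, enforce it later."""

    def fit(self, X, y=None):
        # remember the training column order
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        # reorder columns; columns missing from X are added as NaN
        return X.reindex(columns=self.columns_)


# Made-up numeric training data; the target is exactly x1 + 2*x2 + 3*x3.
df_train_num = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0],
    "x3": [0.0, 1.0, 0.0, 2.0, 2.0],
})
y_train = df_train_num["x1"] + 2 * df_train_num["x2"] + 3 * df_train_num["x3"]

pipe = Pipeline([
    ("order", ColumnOrderKeeper()),             # same columns, same order
    ("impute", SimpleImputer(strategy="mean")), # fill added NaN columns with the training mean
    ("model", LinearRegression()),
])
pipe.fit(df_train_num, y_train)

# A single event with reordered columns and "x3" missing.
df_event_num = pd.DataFrame([{"x2": 2.0, "x1": 1.0}])
prediction = pipe.predict(df_event_num)
print(prediction)
```

Placing the ordering step first means every later step sees the same columns in the same order, and the imputer decides what value a missing feature should take.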

robot learner

https://datasciencebyexample.github.io/2022/01/27/2022-01-27-1/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.

pipeline transformer scikit-learn
