A transformer example that maintains feature order and adds missing features back during feature engineering


In data science projects, one important step in feature engineering is to make sure that the feature columns appear in the same order at
training time and at prediction/test time. Otherwise, the model may read values from the wrong columns, and we will not get the results we expect.

This is usually not a problem in the train/test split or cross-validation stages, where the training and test data are generally split from the
same dataframe. However, once the model is put online, the transformer needs to handle each single event, which usually arrives as
JSON data and is then converted to a dataframe. During this process, the original order may not hold.
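As a small illustration of how the order can drift, the sketch below parses a made-up JSON event and compares its columns with a made-up training order. The names `train_columns` and `event_json` are hypothetical and exist only for this example.

```python
import json

import pandas as pd

# Hypothetical column order that the model was trained on.
train_columns = ["cat1", "cat2", "cat3"]

# Hypothetical online event: the keys arrive in a different order
# and "cat3" is not sent at all.
event_json = '{"cat2": "h", "cat1": "j"}'

# json.loads keeps the key order of the payload, and pandas keeps
# that order when building the one-row dataframe.
event = json.loads(event_json)
df_event = pd.DataFrame([event])

event_columns = list(df_event.columns)
print(event_columns)                   # ['cat2', 'cat1']
print(event_columns == train_columns)  # False
```

Nothing in this conversion knows about the training order, so it has to be enforced explicitly somewhere downstream.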

To ensure the same feature order is used, we can build a transformer for the pipeline. During the fit stage, the original order is
remembered, and during the transform stage, the same order is enforced. Meanwhile, if any column is missing, we add it back as a column
of null values.
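The remember-then-enforce idea can be sketched in a few lines before wrapping it in a transformer. This sketch uses `DataFrame.reindex`, which is a different technique from the column-by-column approach in this post, and the frames `df_fit` and `df_new` are made up for illustration.

```python
import pandas as pd

# Hypothetical frames, used only for illustration.
df_fit = pd.DataFrame({"cat1": ["a", "c"], "cat2": ["b", "d"], "cat3": ["f", "e"]})
df_new = pd.DataFrame({"cat2": ["h"], "cat1": ["j"]})

# "Fit": remember the column order seen at training time.
remembered_columns = list(df_fit.columns)

# "Transform": reorder to the remembered order; columns that are absent
# from df_new are created and filled with missing values.
df_aligned = df_new.reindex(columns=remembered_columns)

print(list(df_aligned.columns))         # ['cat1', 'cat2', 'cat3']
print(df_aligned["cat3"].isna().all())  # True
```

`reindex` is compact, but it gives no control over how each missing column is filled, which is one reason a dedicated transformer that remembers dtypes can be useful.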

Set up some example dataframes

import pandas as pd

# training example, where we have 3 features
df_train = pd.DataFrame(data=[['a','b','f'],['c','d','e']])
df_train.columns = ['cat1','cat2','cat3']
display(df_train)

# test example, where the 3rd feature is missing and the column order differs
df_test = pd.DataFrame(data=[['h','j']])
df_test.columns = ['cat2','cat1']
display(df_test)

  cat1 cat2 cat3
0    a    b    f
1    c    d    e

  cat2 cat1
0    h    j

A transformer that can be added to a full pipeline

from sklearn.base import BaseEstimator, TransformerMixin


class orderMaitain_Transformer(BaseEstimator, TransformerMixin):

    # Class constructor
    def __init__(self):
        # maps each training column name to its dtype name; filled in by fit()
        self.dtype_dict = {}
        print('initialized')

    # Remember the training columns (in order) and their dtypes, then return self
    def fit(self, X, y=None):
        self.dtype_dict = X.dtypes.apply(lambda x: x.name).to_dict()
        return self

    def transform(self, X_, y=None):
        # work on a copy so the caller's dataframe is not modified
        X = X_.copy()
        #print(self.dtype_dict)
        train_columns = []

        # add missing columns, if any
        for col in self.dtype_dict:
            train_columns.append(col)
            if col not in X.columns:
                # missing boolean columns are filled with False; other strategies are possible
                if self.dtype_dict[col].startswith('bool'):
                    X[col] = False
                else:
                    # an empty Series aligns on the index, so every row gets a null value
                    X[col] = pd.Series(dtype=self.dtype_dict[col])

        # apply the training column order
        print(train_columns)
        X = X[train_columns]
        return X

orderMaitain_transformer = orderMaitain_Transformer()

initialized
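To make the dtype bookkeeping more concrete, the sketch below applies the same `dtypes` expression and the same fill rule outside the class. The frames `df_mixed` and `df_event` and their column names are made up for illustration.

```python
import pandas as pd

# Hypothetical training frame with one boolean and one float feature.
df_mixed = pd.DataFrame({"is_new": [True, False], "amount": [1.5, 2.5]})

# Same expression as in fit(): column name -> dtype name, in column order.
dtype_dict = df_mixed.dtypes.apply(lambda x: x.name).to_dict()
print(dtype_dict)  # {'is_new': 'bool', 'amount': 'float64'}

# Hypothetical event that lacks the boolean column.
df_event = pd.DataFrame({"amount": [3.0]})

# Same fill rule as in transform(): False for booleans, nulls otherwise.
for col, dtype_name in dtype_dict.items():
    if col not in df_event.columns:
        if dtype_name.startswith("bool"):
            df_event[col] = False
        else:
            df_event[col] = pd.Series(dtype=dtype_name)

# Enforce the remembered order.
df_event = df_event[list(dtype_dict)]
print(df_event.to_dict("records"))  # [{'is_new': False, 'amount': 3.0}]
```

Because the dictionary preserves insertion order, its keys double as the remembered column order.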

Apply the transformer during the training and test stages

During the training stage

orderMaitain_transformer.fit_transform(df_train)
['cat1', 'cat2', 'cat3']

  cat1 cat2 cat3
0    a    b    f
1    c    d    e

During the test and prediction stage

# check that the result has an empty column added and that the order is the same as in training
orderMaitain_transformer.transform(df_test)
['cat1', 'cat2', 'cat3']

  cat1 cat2 cat3
0    j    h  NaN
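As a rough sketch of how such a step might sit in a full pipeline, the example below puts an order-keeping transformer in front of an imputer and a model. `ColumnOrderKeeper` is a hypothetical compact variant based on `reindex`, not the class defined above, and the numeric data, the zero fill value, and the linear model are assumptions chosen only to keep the example small.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class ColumnOrderKeeper(TransformerMixin, BaseEstimator):
    """Hypothetical compact variant: remember column order, re-apply it later."""

    def fit(self, X, y=None):
        self.columns_ = list(X.columns)
        return self

    def transform(self, X):
        # Reorder columns; absent ones are added and filled with NaN.
        return X.reindex(columns=self.columns_)


# Made-up numeric data where y = 1*f1 + 2*f2 + 3*f3 exactly.
X_train = pd.DataFrame(
    {
        "f1": [1.0, 0.0, 0.0, 1.0, 2.0],
        "f2": [0.0, 1.0, 0.0, 1.0, 1.0],
        "f3": [0.0, 0.0, 1.0, 1.0, 3.0],
    }
)
y_train = X_train["f1"] + 2 * X_train["f2"] + 3 * X_train["f3"]

pipe = Pipeline(
    steps=[
        ("order", ColumnOrderKeeper()),
        # Assumption: a missing numeric feature is treated as 0.
        ("impute", SimpleImputer(strategy="constant", fill_value=0.0)),
        ("model", LinearRegression()),
    ]
)
pipe.fit(X_train, y_train)

# One event with shuffled columns and f3 missing.
X_event = pd.DataFrame({"f2": [1.0], "f1": [2.0]})
prediction = pipe.predict(X_event)
print(round(float(prediction[0]), 6))  # 4.0
```

With the order step in place, the event's `f1` and `f2` values reach the model in the positions it was trained on, and the missing `f3` is imputed instead of causing a shape error.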

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.