Rare event encoding for categorical features in machine learning with a pandas DataFrame


If a categorical feature has too many distinct values, an encoding such as one-hot encoding will generate too many features.
We can set a threshold: if a certain value accounts for a smaller share of the rows than the threshold, we replace it with ‘rare event’ or
something similar. This keeps the number of levels of a categorical feature manageable.
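To make the idea concrete, here is a minimal sketch on a single column. The toy values and the 0.2 threshold are made up for illustration, and the variable names are not from the original post.

```python
import pandas as pd

# Hypothetical toy column: 'a' appears 6 times, 'b' 3 times, 'c' once
s = pd.Series(["a"] * 6 + ["b"] * 3 + ["c"])

# Share of rows taken by each level
freq = s.value_counts(normalize=True)

threshold = 0.2  # illustrative value, chosen so that only 'c' counts as rare
rare_levels = freq[freq < threshold].index

# Keep frequent levels, replace every rare level with one placeholder
encoded = s.where(~s.isin(rare_levels), "rare_value")
print(encoded.value_counts())
```

Here 'c' covers 10% of the rows, which is below the threshold, so it is the only level that gets replaced.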

The following function can be applied to a dataframe:

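from pandas.api.types import is_object_dtype, is_string_dtype

def cat_rare_event(df, threshold=0.005):
    # Note: modifies df in place and also returns it
    for col in df.columns:
        # Only text-like columns (object dtype or a pandas string dtype) are treated as categorical
        if is_object_dtype(df[col]) or is_string_dtype(df[col]):
            # Relative frequency of each level, mapped back onto the rows
            freq = df[col].map(df[col].value_counts(normalize=True))
            # Missing values compare as False, so they are left untouched
            df.loc[freq < threshold, col] = "rare_value"
    return df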
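A short usage sketch may help here. The dataframe below is invented for illustration, and the function is restated in a compact, frequency-based form (an assumption of this sketch) so that the snippet runs on its own.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype

def cat_rare_event(df, threshold=0.005):
    # Compact restatement of the function so this snippet is self-contained
    for col in df.columns:
        if is_object_dtype(df[col]) or is_string_dtype(df[col]):
            freq = df[col].map(df[col].value_counts(normalize=True))
            df.loc[freq < threshold, col] = "rare_value"
    return df

# Hypothetical toy data: 'city' has one dominant level and two rare ones
df = pd.DataFrame({
    "city": ["NYC"] * 8 + ["Oslo", "Lima"],
    "visits": range(10),
})

# Pass a copy because the function writes into the dataframe it receives
out = cat_rare_event(df.copy(), threshold=0.15)
print(out["city"].value_counts())
```

With this threshold, 'Oslo' and 'Lima' (10% of the rows each) are merged into "rare_value", while the numeric column is left alone.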



Alternatively, we can wrap this step in a custom scikit-learn transformer so that it can be used as a pipeline step:

from sklearn.base import BaseEstimator, TransformerMixin

class cat_rare_event_Transformer(BaseEstimator, TransformerMixin):

    # Class constructor
    def __init__(self, threshold=0.005):
        self.threshold = threshold

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    def transform(self, X_, y=None):
        # Work on a copy so the caller's dataframe is not modified
        X = X_.copy()
        X = cat_rare_event(X, self.threshold)
        return X
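One thing to keep in mind: because fit does nothing, the transformer above decides which levels are rare from whatever data is passed to transform, so training and test sets may be encoded differently. A possible variant, sketched below, learns the frequent levels in fit and reuses them in transform. The class name RareLevelEncoder, the rare_label parameter, and the toy data are hypothetical and not part of the original post.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype
from sklearn.base import BaseEstimator, TransformerMixin

class RareLevelEncoder(BaseEstimator, TransformerMixin):
    # Hypothetical variant: remembers the frequent levels seen during fit
    def __init__(self, threshold=0.005, rare_label="rare_value"):
        self.threshold = threshold
        self.rare_label = rare_label

    def fit(self, X, y=None):
        self.frequent_levels_ = {}
        for col in X.columns:
            if is_object_dtype(X[col]) or is_string_dtype(X[col]):
                freq = X[col].value_counts(normalize=True)
                # Keep only levels whose share reaches the threshold
                self.frequent_levels_[col] = list(freq[freq >= self.threshold].index)
        return self

    def transform(self, X, y=None):
        X = X.copy()
        for col, keep in self.frequent_levels_.items():
            # Levels that were rare or unseen during fit get the placeholder;
            # missing values are left as they are
            mask = X[col].notna() & ~X[col].isin(keep)
            X.loc[mask, col] = self.rare_label
        return X

# Invented data for illustration
train = pd.DataFrame({"city": ["NYC"] * 8 + ["Oslo", "Lima"]})
test = pd.DataFrame({"city": ["NYC", "Oslo", "Paris"]})

enc = RareLevelEncoder(threshold=0.15).fit(train)
result = enc.transform(test)
print(result["city"].tolist())
```

In this sketch 'Oslo' is replaced because it was rare in the training data, and 'Paris' is replaced because it was never seen during fit.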



Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.