Rare event encoding for categorical features in machine learning with a pandas DataFrame

If a categorical feature has too many distinct values, encoding it (for example with one-hot encoding) generates too many features.
We can set a threshold: if a value accounts for a smaller share of the rows than the threshold, we replace it with a placeholder such as "rare_value".
This keeps the number of levels of each categorical feature under control.
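As a minimal sketch of the idea on a single column, the snippet below computes the share of each value and replaces the ones under the threshold. The toy data, the variable names, and the threshold of 0.1 are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy column: "red" and "blue" are common, "teal" and "plum" each appear once.
colors = pd.Series(["red"] * 6 + ["blue"] * 4 + ["teal", "plum"])
threshold = 0.1  # illustrative value only

# Share of rows taken by each distinct value.
shares = colors.value_counts(normalize=True)

# Values whose share falls below the threshold.
rare_values = shares[shares < threshold].index

# Keep the common values and replace the rare ones with a single placeholder level.
encoded = colors.where(~colors.isin(rare_values), "rare_value")

print(colors.nunique(), "levels before,", encoded.nunique(), "levels after")
```

Here "teal" and "plum" each cover 1 of 12 rows (about 0.083), so both fall under the threshold and collapse into one level.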

The following code can be applied on a dataframe:

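def cat_rare_event(df, threshold=0.005):
    # Note: this modifies df in place and also returns it.
    for col in df.columns:
        if df[col].dtype == 'object':
            # Share of rows taken by each row's value; missing values map to NaN,
            # never compare as rare, and are left untouched.
            freq = df[col].map(df[col].value_counts()) / len(df)
            df.loc[freq < threshold, col] = "rare_value"
    return df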

Alternatively, we can wrap this step in a custom scikit-learn transformer so that it can be used as a pipeline step:

from sklearn.base import BaseEstimator, TransformerMixin

class cat_rare_event_Transformer(BaseEstimator, TransformerMixin):

    # Class constructor
    def __init__(self, threshold=0.005):
        self.threshold = threshold

    # Return self, nothing else to do here
    def fit(self, X, y=None):
        return self

    def transform(self, X_, y=None):
        # Work on a copy so the caller's dataframe is not modified
        X = X_.copy()
        X = cat_rare_event(X, self.threshold)
        return X
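Because fit does nothing in the transformer above, the rare values are recomputed from whatever dataframe is passed to transform, so training and test data can end up with different levels. One possible variation, sketched below under assumptions, learns the frequent values in fit and reuses them in transform. The class name FittedRareEventTransformer, the frequent_values_ attribute, and the toy data are hypothetical and not part of the original code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FittedRareEventTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical variant that remembers the frequent values seen during fit."""

    def __init__(self, threshold=0.005):
        self.threshold = threshold

    def fit(self, X, y=None):
        # For each object column, keep the values whose share of the
        # training rows is at least the threshold.
        self.frequent_values_ = {}
        for col in X.columns:
            if X[col].dtype == "object":
                shares = X[col].value_counts() / len(X)
                self.frequent_values_[col] = set(shares[shares >= self.threshold].index)
        return self

    def transform(self, X, y=None):
        X = X.copy()  # leave the caller's dataframe untouched
        for col, frequent in self.frequent_values_.items():
            # Non-missing values that were not frequent in training
            # (including values never seen before) become rare.
            is_rare = X[col].notna() & ~X[col].isin(frequent)
            X.loc[is_rare, col] = "rare_value"
        return X


# Toy data; dtype=object is set explicitly so the columns pass the object check.
train = pd.DataFrame({"city": ["Paris"] * 5 + ["Rome"] * 4 + ["Oslo"]}, dtype=object)
test = pd.DataFrame({"city": ["Paris", "Oslo", "Lima"]}, dtype=object)

transformer = FittedRareEventTransformer(threshold=0.2).fit(train)
print(transformer.transform(test)["city"].tolist())
```

In this toy run "Oslo" covers only 1 of 10 training rows and "Lima" never appears in training, so both are mapped to "rare_value" at transform time while "Paris" is kept.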

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source, robot learner.