Transforming One or More Columns of a Pandas DataFrame using ColumnTransformer


When working with tabular data, it’s common to transform one or more columns to make them more amenable to analysis or modeling. In many cases, these transformations can be done directly with the pandas library. However, when building machine learning pipelines, it is often more convenient to use scikit-learn’s ColumnTransformer class, which applies transformations to specific columns of the data and can be fitted once and then reused on new data.

In this blog post, we’ll demonstrate how to use a custom transformer with scikit-learn’s ColumnTransformer to transform one or more columns of a Pandas DataFrame.

Example 1: Transforming NumPy arrays

Let’s start with a simple example where we have a NumPy array with three columns, and we want to transform the first two columns into two new columns.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X):
        # Here, X is a 2D NumPy array holding only the selected columns
        # Transform columns 0 and 1: square the first, take the square root of the second
        transformed_cols = np.column_stack([X[:, 0]**2, np.sqrt(X[:, 1])])
        # Return the transformed columns as a 2D NumPy array
        return transformed_cols

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

# Example usage
X = np.array([[1, 4, 7], [2, 9, 8], [3, 16, 9]])
# remainder='passthrough' preserves any columns not listed in a transformer
transformer = ColumnTransformer(
    transformers=[('custom', CustomTransformer(), [0, 1])],
    remainder='passthrough')
transformed_X = transformer.fit_transform(X)
print(transformed_X)

In this example, the CustomTransformer class takes two input columns and returns two output columns: the square of the first and the square root of the second. The ColumnTransformer applies this transformer to columns 0 and 1 of the input data and, because the “passthrough” option is set for the remainder, keeps column 2 in its original form.
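One detail that is easy to miss: ColumnTransformer places outputs in the order the transformers are listed, and passthrough columns are appended on the right, so the original column order is not necessarily kept. The following is a minimal sketch of that behavior. It uses FunctionTransformer with np.square as a stand-in for a custom class (an assumption made here for brevity, not part of the example above).

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1, 4, 7], [2, 9, 8], [3, 16, 9]])

# Transform only the LAST column; columns 0 and 1 fall into the remainder.
ct = ColumnTransformer(
    transformers=[("square", FunctionTransformer(np.square), [2])],
    remainder="passthrough",
)
out = ct.fit_transform(X)

# The squared column comes first, followed by the passthrough columns 0 and 1.
print(out)
```

If the downstream code relies on column positions, it is worth checking the output layout once like this rather than assuming it mirrors the input.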

Example 2: Transforming Pandas DataFrames

Now, let’s modify the previous example to work with a Pandas DataFrame instead of a NumPy array.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

class CustomTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X):
        # Here, X is a pandas DataFrame holding only the selected columns
        # Transform columns 'A' and 'B' into two new columns
        transformed_cols = pd.DataFrame({'A_squared': X['A']**2,
                                         'B_sqrt': X['B']**0.5})
        # Return the transformed columns as a pandas DataFrame
        return transformed_cols

    def fit(self, X, y=None):
        # Nothing to learn from the data, so just return self
        return self

# Example usage
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 9, 16], 'C': [7, 8, 9]})
# remainder='passthrough' preserves any columns not listed in a transformer
transformer = ColumnTransformer(
    transformers=[('custom', CustomTransformer(), ['A', 'B'])],
    remainder='passthrough')
# Note: by default, fit_transform returns a NumPy array, not a DataFrame
transformed_df = transformer.fit_transform(df)
print(transformed_df)

In this example, the CustomTransformer class takes two input columns (‘A’ and ‘B’) and returns a pandas DataFrame with two output columns (‘A_squared’ and ‘B_sqrt’). The ColumnTransformer applies this transformer to columns ‘A’ and ‘B’ of the input data and, because the “passthrough” option is set for the remainder, keeps column ‘C’ in its original form. Note that with default settings the ColumnTransformer stacks all results into a single NumPy array, so the column names are not carried over into the final output.
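If a DataFrame is wanted at the end, one simple option is to wrap the returned array yourself. The sketch below repeats the CustomTransformer from this example so that it runs on its own. The column names passed to pd.DataFrame are written out by hand, on the assumption that the output holds the two transformed columns followed by the passthrough column.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

class CustomTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame({'A_squared': X['A']**2, 'B_sqrt': X['B']**0.5})

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 9, 16], 'C': [7, 8, 9]})
transformer = ColumnTransformer(
    transformers=[('custom', CustomTransformer(), ['A', 'B'])],
    remainder='passthrough')
values = transformer.fit_transform(df)  # plain NumPy array

# Assumed layout: transformed columns first, then the remainder column 'C'
result = pd.DataFrame(values, columns=['A_squared', 'B_sqrt', 'C'], index=df.index)
print(result)
```

Hard-coding the names is fine for a small, fixed set of columns. It becomes fragile if the list of transformers changes, since the names then have to be updated by hand to match.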


Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please credit the source: robot learner.