Difference between fit, transform, fit_transform, predict, and predict_proba in a sklearn pipeline

data science

Publish Date: 2023-02-02

What is a pipeline in scikit-learn

Pipeline in scikit-learn is a utility class that assembles several steps of an ML workflow into a single scikit-learn estimator. A pipeline consists of a sequence of transformations or preprocessing steps, usually followed by an estimator that makes predictions based on the transformed data. It simplifies the ML process by automating the steps involved in transforming the data and training the model. It also ensures that the data is processed consistently throughout the workflow and helps prevent data leakage, because the preprocessing steps are fitted only on the data passed to fit(). Finally, encapsulating the whole ML process in one object makes the code easier to manage and share, and reduces the risk of implementation errors.
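As a minimal sketch of the idea, the snippet below chains a scaler and a classifier into one estimator. The toy data and the step names "scale" and "clf" are made up for illustration; any transformer and model could take their place.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data (invented for this example): 6 samples, 2 features, binary labels
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
              [4.0, 40.0], [5.0, 50.0], [6.0, 60.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# One object that holds the preprocessing step and the model
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

pipe.fit(X, y)           # fits the scaler, then the classifier on scaled data
preds = pipe.predict(X)  # scales X with the fitted scaler, then predicts
step_names = list(pipe.named_steps)
```

The pipeline is used exactly like a single model: one fit() call, one predict() call.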

Pipeline with or without estimator

A pipeline doesn’t necessarily need a machine learning model as the estimator in its final step.
For example, we may just want a data pipeline for preprocessing, to separate the preprocessing work from the modeling work.

In both cases, the methods we are going to talk about below work the same way.
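A preprocessing-only pipeline could look like the sketch below. The choice of SimpleImputer followed by StandardScaler, and the toy column with a missing value, are assumptions made for illustration.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (invented for this example)
X = np.array([[1.0], [2.0], [np.nan], [3.0]])

# No model at the end: every step is a transformer
prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

X_prep = prep.fit_transform(X)

# With no model in the last step, the pipeline does not expose predict()
can_predict = hasattr(prep, "predict")
```

The output of such a pipeline can then be handed to a separate modeling stage.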

What is a transformer in sklearn

A transformer is an estimator that implements the fit(), transform() and/or fit_transform() methods. TransformerMixin is a mixin class that provides a default fit_transform() implementation, giving sklearn transformers a consistent interface.

In fit(), we take the input data and compute whatever the later transform needs. For example, it can calculate the mean and standard deviation of the input data and store them for later use.

In transform(), we transform the input data into a new representation. The output is usually an array or a sparse matrix with the same number of samples (n_samples) as the input data. The parameters obtained from fit() are used in this step.
For example, if we want to standardize the input data, we subtract the mean from every data point and divide by the standard deviation obtained in the fit() step.
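StandardScaler is one built-in transformer that follows this pattern. The sketch below, on made-up data, shows that fit() stores the mean and standard deviation (as mean_ and scale_) and that transform() applies them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (invented for this example): one feature, five samples
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X)  # computes and stores the statistics; X is not changed

# Statistics learned in fit(): mean is 3.0, population std is sqrt(2)
learned_mean = scaler.mean_
learned_std = scaler.scale_

# transform() uses the stored statistics
X_scaled = scaler.transform(X)

# The same computation written out by hand
X_manual = (X - learned_mean) / learned_std
```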

fit_transform() is a convenient, and sometimes more efficient, way to call fit() and transform() together on the same data. TransformerMixin implements it by default.
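To illustrate the default implementation, here is a small sketch with a hypothetical CenterTransformer (not part of sklearn) that defines only fit() and transform(). It still gets fit_transform() from TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

# Hypothetical transformer for illustration: subtracts the column mean
class CenterTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)  # learned in fit()
        return self

    def transform(self, X):
        return X - self.mean_            # applied in transform()

X = np.array([[1.0], [2.0], [3.0]])

# fit_transform() is inherited from TransformerMixin
one_call = CenterTransformer().fit_transform(X)

# Equivalent two-step version
two_calls = CenterTransformer().fit(X).transform(X)
```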

Difference between fit(), transform(), fit_transform(), predict(), and predict_proba() in a pipeline

In a pipeline, we have multiple transformers, and each transformer has its own fit() and transform() methods,
so there is often confusion about the exact differences among these similar pipeline methods, and when to use each.
Here we first discuss the differences, then show some examples to demonstrate them.

fit()

Fits the transformers one after the other, using each fitted transformer to transform the data before passing it to the next step. Finally, fits the final estimator on the transformed data. Notice that it will not call the transform() method of the last step. This makes sense because in a typical pipeline the last step is a model estimator, for which transform() is probably not the right concept; a predict() operation makes more sense.

The return value of fit() is the pipeline object itself with all steps fitted.
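A quick way to check this behaviour is sketched below, with made-up data and an assumed scaler-plus-classifier pipeline: the object returned by fit() is the pipeline itself, and each step now carries its fitted attributes.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data (invented for this example)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
fitted = pipe.fit(X, y)

# fit() returns the same pipeline object, not a copy
same_object = fitted is pipe

# Every step has been fitted along the way
scaler_mean = pipe.named_steps["scale"].mean_
has_coef = hasattr(pipe.named_steps["clf"], "coef_")
```

Because fit() returns the pipeline, calls can be chained, for example pipe.fit(X, y).predict(X).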

transform()

transform() of a pipeline calls only the transform() method of each step, one by one, including the last one. It is therefore only available when the last step itself has a transform() method.
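The sketch below contrasts two assumed pipelines on made-up data: one made only of transformers, and one that ends in a classifier. Only the first exposes transform().

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression

# Last step is a transformer, so the pipeline has transform()
prep_only = Pipeline([("scale", StandardScaler()), ("minmax", MinMaxScaler())])

# Last step is a model without transform(), so the pipeline has none either
with_model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

prep_has_transform = hasattr(prep_only, "transform")
model_has_transform = hasattr(with_model, "transform")

# Toy data (invented for this example)
X = np.array([[1.0], [2.0], [3.0]])
out = prep_only.fit(X).transform(X)  # both steps' transform() are applied
```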

fit_transform()

fit_transform() of a pipeline fits and transforms with every step, one by one, including the last one, and returns the transformed data.

predict()

predict() of a pipeline only works when the last step of the pipeline has a predict() method defined, which is usually true if the last step is a model estimator. predict() calls transform() of each transformer in the pipeline before the last step. The transformed data are then passed to the final estimator, which calls its own predict() method.
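To make the sequence concrete, the sketch below reproduces predict() by hand on an assumed two-step pipeline with made-up data: transform with the fitted scaler, then predict with the fitted classifier.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Toy data (invented for this example)
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_new = np.array([[0.5], [3.5], [7.0]])

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

# What the pipeline does in one call
auto = pipe.predict(X_new)

# The same steps written out: transform() of the scaler, predict() of the model
scaled = pipe.named_steps["scale"].transform(X_new)
manual = pipe.named_steps["clf"].predict(scaled)
```

Note that no fit() is called at prediction time; the steps reuse what they learned from the training data.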

predict_proba()

Sometimes we want a predicted probability instead of a class label in a classification task, so we call predict_proba()
instead of predict(). It only works if the estimator in the last step has predict_proba() defined. Otherwise, the procedure is the same as for predict().
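As a sketch on made-up data, LogisticRegression is one classifier that defines predict_proba(), while LinearSVC is one that does not, so only the first pipeline below exposes it.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Toy data (invented for this example)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])

proba_pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
proba_pipe.fit(X, y)

proba = proba_pipe.predict_proba(X)  # one column per class, rows sum to 1
labels = proba_pipe.predict(X)       # class labels

# LinearSVC has no predict_proba(), so neither does a pipeline ending in it
svc_pipe = Pipeline([("scale", StandardScaler()), ("clf", LinearSVC())])
svc_has_proba = hasattr(svc_pipe, "predict_proba")
```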

Demo 1

First, define two custom transformers and put them in a pipeline:

import numpy as np
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.pipeline import Pipeline

# Custom transformer 1: Log Transformer
class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, log_base=np.e):
        self.flag = 'N'
        self.log_base = log_base

    def fit(self, X, y=None):
        self.flag = 'Y'
        print('hello from transformer 1 fit')
        return self

    def transform(self, X, y=None):
        print('hello from transformer 1 transform')
        print(f'check fit value from transformer 1:{self.flag}')
        return np.log(X) / np.log(self.log_base)

# Custom transformer 2: Square Root Transformer
class SqrtTransformer(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        self.flag = 'N'
        
    
    def fit(self, X, y=None):
        self.flag = 'Y'
        print('hello from transformer 2 fit')
        return self

    def transform(self, X, y=None):
        print('hello from transformer 2 transform')
        print(f'check fit value from transformer 2:{self.flag}')
        return np.sqrt(X)

# Creating the pipeline
pipe = Pipeline([
    ('log', LogTransformer()),
    ('sqrt', SqrtTransformer()),
])

With the fit() method, the transform() method of the last transformer is not called:

import numpy as np

X = np.array([1,2,3,4,5])

pipe_new = pipe.fit(X)

print(pipe_new)

hello from transformer 1 fit
hello from transformer 1 transform
check fit value from transformer 1:Y
hello from transformer 2 fit
Pipeline(steps=[('log', LogTransformer()), ('sqrt', SqrtTransformer())])

With the transform() method, all transform() methods are called, and the state set by the previous fit() is kept:

X = np.array([1,2,3,4,5])

X_transformed = pipe.transform(X)

print(X_transformed)

hello from transformer 1 transform
check fit value from transformer 1:Y
hello from transformer 2 transform
check fit value from transformer 2:Y
[0.         0.83255461 1.04814707 1.17741002 1.26863624]

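If no previous fit() was applied, transform() still calls every transform() method, but the state set by fit() is missing (the flag is still 'N'). This only runs because these custom transformers do not check whether they were fitted; built-in sklearn transformers typically raise NotFittedError in this situation.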

# redefine a new pipe 

pipe = Pipeline([
    ('log', LogTransformer()),
    ('sqrt', SqrtTransformer()),
])

X = np.array([1,2,3,4,5])

X_transformed = pipe.transform(X)

print(X_transformed)

hello from transformer 1 transform
check fit value from transformer 1:N
hello from transformer 2 transform
check fit value from transformer 2:N
[0.         0.83255461 1.04814707 1.17741002 1.26863624]

With fit_transform(), all fit() and transform() methods are called, and the return value is the transformed data, not the pipeline:

X = np.array([1,2,3,4,5])

X_transformed = pipe.fit_transform(X)

print(X_transformed)

hello from transformer 1 fit
hello from transformer 1 transform
check fit value from transformer 1:Y
hello from transformer 2 fit
hello from transformer 2 transform
check fit value from transformer 2:Y
[0.         0.83255461 1.04814707 1.17741002 1.26863624]

robot learner

https://datasciencebyexample.github.io/2023/02/02/operators-in-sklearn-pipeline/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please credit the source: robot learner.

pipeline scikit-learn
