Time series feature engineering using tsfresh: training vs. test

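During the test stage, i.e., once the model is in production, tsfresh feature extraction for new data
does not depend on the training data: features are computed separately for each time series. So one can apply the same feature engineering process as for the training data
without worrying about storing information from the training stage. (If features were also selected during training, only the list of selected feature names has to be kept.)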

On the other hand, one can also use the following example, which leverages the scikit-learn pipeline style to handle feature generation
for both the training and test stages.

Feature Selection in a sklearn pipeline

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from tsfresh.examples import load_robot_execution_failures
from tsfresh.transformers import RelevantFeatureAugmenter
from tsfresh.utilities.dataframe_functions import impute

Load and Prepare the Data

Check out the first example notebook to learn more about the data and format.

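from tsfresh.examples.robot_execution_failures import download_robot_execution_failures
# Fetch the data set first; load_robot_execution_failures() reads the downloaded file
download_robot_execution_failures()
df_ts, y = load_robot_execution_failures()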

We want to use the extracted features to predict, for each robot execution, whether it was a failure or not.
Therefore our basic “entity” is a single robot execution given by a distinct id.

A dataframe with these identifiers as index needs to be prepared for the pipeline.

X = pd.DataFrame(index=y.index)

# Split data into train and test set
X_train, X_test, y_train, y_test = train_test_split(X, y)
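
The split above is random, so the results change from run to run. As a sketch of a reproducible variant, the following uses made-up labels (y_toy is hypothetical) and the train_test_split options random_state and stratify; the index-only dataframe is split in the same way as X above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels for ten entities, indexed by their ids.
y_toy = pd.Series([True, False] * 5, index=pd.RangeIndex(1, 11, name="id"))

# Index-only dataframe: one (empty) row per id.
X_toy = pd.DataFrame(index=y_toy.index)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,   # fixed seed, so the split is reproducible
    stratify=y_toy,    # keep the class ratio in both parts
)

# Every id ends up in exactly one of the two parts.
print(sorted(X_tr.index), sorted(X_te.index))
```

Whether stratification is appropriate depends on the data; it is shown here only because failure labels are often imbalanced.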

Build the pipeline

We build a sklearn pipeline that consists of a feature extraction step (RelevantFeatureAugmenter) with a subsequent RandomForestClassifier.

The RelevantFeatureAugmenter takes roughly the same arguments as extract_features and select_features do.

ppl = Pipeline([
    ('augmenter', RelevantFeatureAugmenter(column_id='id', column_sort='time')),
    ('classifier', RandomForestClassifier())
])

Here comes the tricky part!

The input to the pipeline will be our dataframe X, which has one row per identifier.
It is currently empty.
But which time series data should the RelevantFeatureAugmenter actually extract the features from?

We need to pass the time series data (stored in df_ts) to the transformer.

In this case, df_ts contains the time series of both the train and the test set. If you have different dataframes for
the train and test sets, you have to call set_params twice
(see further below on how to deal with two independent data sets).

ppl.set_params(augmenter__timeseries_container=df_ts);
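
set_params on a pipeline addresses a step's parameter as "<step name>__<parameter name>". For the pipeline in this article that is augmenter__timeseries_container, since the step is named 'augmenter' and the transformer's parameter is timeseries_container. The sketch below shows the same convention on a plain scikit-learn pipeline; the steps and values are chosen only for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" routes each value to that step's parameter.
pipe.set_params(classifier__C=0.5, scaler__with_mean=False)

print(pipe.named_steps["classifier"].C)         # 0.5
print(pipe.get_params()["scaler__with_mean"])   # False
```

Because set_params only changes parameters, it can be called on an already fitted pipeline without refitting it.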


We are now ready to fit the pipeline

ppl.fit(X_train, y_train)

The augmenter has used the input time series data to extract time series features for each of the identifiers in X_train and selected only the relevant ones using the passed y_train as target.
These features have been added to X_train as new columns.
The classifier can now use these features during training.


During inference, the augmenter extracts only the relevant features it found during the training phase, and the classifier predicts the target using these features.

y_pred = ppl.predict(X_test)

So, finally we inspect the performance:

print(classification_report(y_test, y_pred))

You can also find out which columns the augmenter has selected:

ppl.named_steps["augmenter"].feature_selector.relevant_features


In this example we passed an X_train or X_test that was empty (except for the index) into the pipeline.
However, you can also fill the input with other features you have (e.g. features extracted from the metadata)
or even put other pipeline components before the augmenter.

Separating the time series data containers

In the example above we passed a single df_ts into the RelevantFeatureAugmenter, which was used both for training and predicting.
During training, features were extracted only for the ids in X_train, and during prediction for the rest.

However, it is perfectly fine to call set_params twice: once before training and once before prediction.
This can be handy if, for example, you dump the trained pipeline to disk and re-use it only later for prediction.
You only need to make sure that the ids of the entities you use during training/prediction are actually present in the passed time series data.
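
That check is easy to automate before calling fit or predict. A minimal sketch, assuming the id column is named "id" as in this article; missing_ids is a hypothetical helper, not part of tsfresh.

```python
import pandas as pd

def missing_ids(X, ts, column_id="id"):
    """Return the ids in X.index that have no rows in the time series data."""
    return X.index.difference(pd.Index(ts[column_id].unique()))

# Toy time series data for ids 1 and 2 only.
ts_toy = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "time": [0, 1, 0, 1],
    "value": [0.1, 0.2, 0.3, 0.4],
})

X_ok = pd.DataFrame(index=[1, 2])    # both ids are covered
X_bad = pd.DataFrame(index=[2, 3])   # id 3 has no time series

print(missing_ids(X_ok, ts_toy).tolist())    # []
print(missing_ids(X_bad, ts_toy).tolist())   # [3]
```

Raising an error when the result is non-empty makes a mismatch between X and the time series data visible before any features are extracted.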

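df_ts_train = df_ts[df_ts["id"].isin(y_train.index)]
df_ts_test = df_ts[df_ts["id"].isin(y_test.index)]
# Use only the training time series while fitting
ppl.set_params(augmenter__timeseries_container=df_ts_train);
ppl.fit(X_train, y_train);
import pickle
with open("pipeline.pkl", "wb") as f:
    pickle.dump(ppl, f)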

Later: load the fitted model and do predictions on new, unseen data

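import pickle
with open("pipeline.pkl", "rb") as f:
    ppl = pickle.load(f)
# Point the augmenter at the time series of the data to predict on
ppl.set_params(augmenter__timeseries_container=df_ts_test);
y_pred = ppl.predict(X_test)
print(classification_report(y_test, y_pred))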

Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.