LightGBM Regression Example with Cross-Validation and Early Stopping


In this blog post, we will walk through a complete example of using LightGBM, a gradient boosting framework, for regression tasks. We will generate a random dataset, split it into training and testing sets, train a LightGBM regression model with cross-validation and early stopping, and evaluate its performance using mean squared error (MSE) and a scatter plot of predicted vs. expected values.

What is LightGBM?

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be efficient and scalable, making it suitable for large datasets and high-performance tasks. LightGBM is particularly popular for its training speed and memory efficiency, and it performs competitively with other gradient boosting implementations in many benchmarks.

Generating a Random Dataset

For this example, we will generate a random dataset using the make_regression function from scikit-learn. This function creates a dataset with a specified number of samples, features, and noise level. We will generate a dataset with 1000 samples, 10 features, and a noise level of 0.1.

from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)

Next, we will convert the generated data to a pandas DataFrame for easier manipulation.

import pandas as pd

df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["target"] = y

Preparing the Data

Before training the model, we need to split the data into training and testing sets. We will use 80% of the data for training and 20% for testing.

from sklearn.model_selection import train_test_split

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Training the LightGBM Regression Model

Now that we have prepared the data, we can train the LightGBM regression model. First, we need to create a LightGBM dataset from our training data.

import lightgbm as lgb

train_data = lgb.Dataset(X_train, label=y_train)

Next, we will set up the parameters for the LightGBM model. Rather than tuning them, we will use a handful of common settings for a regression task: the l2 (MSE) metric, 31 leaves per tree, a learning rate of 0.05, and feature subsampling of 90% per tree.

params = {
    "objective": "regression",
    "metric": "mse",
    "boosting_type": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
}

We will train the model using cross-validation with early stopping to prevent overfitting. The lgb.cv function performs cross-validation and returns the evaluation results for each boosting round; early stopping halts training once the validation metric has stopped improving for a given number of rounds. We will then use the best number of rounds to train the final model. Note that we set stratified=False, since stratified folds only apply to classification.

num_round = 1000
cv_results = lgb.cv(
    params,
    train_data,
    num_boost_round=num_round,
    nfold=5,
    stratified=False,  # stratified folds only make sense for classification
    # LightGBM >= 4.0 uses a callback for early stopping;
    # older versions accept early_stopping_rounds=10 instead.
    callbacks=[lgb.early_stopping(stopping_rounds=10)],
)

# Early stopping truncates the result lists at the best iteration, so their
# length is the best number of rounds. With "metric": "mse", results are
# stored under the canonical metric name "l2"; the key is "valid l2-mean"
# in LightGBM >= 4.0 and "l2-mean" in older versions.
best_round = len(cv_results["valid l2-mean"])
model = lgb.train(params, train_data, num_boost_round=best_round)
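
Before relying on the final model, it can be useful to print what cross-validation selected. This is a small optional addition; the key name below again assumes LightGBM >= 4.0 ("l2-mean" on older versions).

# The result lists end at the best iteration, so the last entry is the
# cross-validated MSE for the chosen number of rounds.
best_cv_mse = cv_results["valid l2-mean"][-1]
print(f"Best number of rounds: {best_round}")
print(f"Cross-validated MSE at best round: {best_cv_mse:.4f}")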

Evaluating the Model

Now that we have trained the model, we can evaluate its performance on the test set. We will use the mean squared error (MSE) as our evaluation metric.

from sklearn.metrics import mean_squared_error

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Finally, we will plot a scatter plot of the predicted vs expected values to visualize the model’s performance.

import matplotlib.pyplot as plt

plt.scatter(y_test, y_pred)
plt.xlabel("Expected")
plt.ylabel("Predicted")
plt.title("Predicted vs Expected")
plt.show()
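
As an optional refinement, adding a y = x reference line to the scatter plot makes it easier to judge how closely the predictions track the expected values, since points on the line correspond to perfect predictions:

# Diagonal reference line: points on this line are predicted exactly right.
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
plt.scatter(y_test, y_pred, alpha=0.5)
plt.plot(lims, lims, color="red", linestyle="--")
plt.xlabel("Expected")
plt.ylabel("Predicted")
plt.title("Predicted vs Expected")
plt.show()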

Conclusion

In this blog post, we have demonstrated a complete example of using LightGBM for regression on a randomly generated dataset. We have shown how to prepare the data, select the number of boosting rounds with cross-validation and early stopping, train the final model, and evaluate its performance using mean squared error and a scatter plot. LightGBM is a powerful and efficient gradient boosting framework that can be used for various machine learning tasks, including regression, classification, and ranking.


