Sampling Rows from a Pandas DataFrame by Group

data engineering

Publish Date: 2023-03-28

Pandas is a popular data analysis library in Python that provides powerful tools for manipulating and analyzing data. One common task in data analysis is sampling rows from a dataframe within each group defined by some column. In this post, we’ll explore how to use Pandas to sample rows from a dataframe by group.

Suppose we have a dataframe with a column called ‘vertical’ and we want to sample up to 100 random rows for each unique value in the ‘vertical’ column. Here’s how we can achieve this:

import pandas as pd
import numpy as np

# create a sample dataframe
data = {'vertical': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'], 
        'value': np.random.randint(1, 101, size=10)}
df = pd.DataFrame(data)

# group by the 'vertical' column and get up to 100 random rows for each group
sampled_df = df.groupby('vertical').apply(lambda x: x.sample(min(len(x), 100)))

# print the sampled dataframe
print(sampled_df.reset_index(drop=True))

In this example, we first create a sample dataframe with a ‘vertical’ column and a ‘value’ column. We then group the dataframe by the ‘vertical’ column using the groupby() function, and use apply() to run a lambda function on each group that samples up to 100 random rows with the sample() function. apply() combines the sampled groups back into a single dataframe, and reset_index(drop=True) replaces the resulting group-keyed index with a plain integer index before printing.
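The same per-group cap can also be reached without apply(), which may be worth considering because recent pandas versions warn when apply() operates on the grouping column. The following is a minimal sketch rather than the method used above: it shuffles every row once and then keeps the first few rows of each group. The variable name cap, its small value of 3 (chosen so the effect is visible on ten rows), and random_state=0 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same shape as the example above, with deterministic values for clarity.
df = pd.DataFrame({
    "vertical": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "value": np.arange(10),
})

cap = 3  # hypothetical cap; the article uses 100

# Shuffle all rows once, then keep at most `cap` rows from each group.
# Groups smaller than `cap` are returned whole, so no min() is needed.
sampled = (
    df.sample(frac=1, random_state=0)
      .groupby("vertical")
      .head(cap)
      .reset_index(drop=True)
)

# Count how many rows were kept per group.
counts = sampled["vertical"].value_counts().to_dict()
print(counts)
```

Because the rows are shuffled uniformly before head() is applied, each group still receives a random subset without replacement. Dropping random_state gives a different sample on every run.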

Note that the min(len(x), 100) argument passed to sample() ensures that we don’t request more rows than are available in a given group. Without it, sample() raises a ValueError for any group with fewer than 100 rows, because by default it samples without replacement.
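To see why the cap matters, here is a small sketch that calls sample() on a two-row group with and without min(). The dataframe small_group and the flag error_raised are hypothetical names used only for this illustration.

```python
import pandas as pd

# A group with only two rows, like group 'B' in the example above.
small_group = pd.DataFrame({"vertical": ["B", "B"], "value": [10, 20]})

# Asking for 100 rows without replacement fails on a two-row dataframe.
try:
    small_group.sample(100)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print(f"sample(100) failed: {exc}")

# Capping the request at the group size returns every available row.
capped = small_group.sample(min(len(small_group), 100))
print(len(capped))
```

If oversampling small groups is acceptable, passing replace=True to sample() is another option, at the cost of duplicated rows.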

robot learner

https://datasciencebyexample.github.io/2023/03/28/sample-rows-by-sample-in-dataframe/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.
