Sampling Rows from a Pandas DataFrame by Group


Pandas is a popular Python data analysis library that provides powerful tools for manipulating and analyzing data. One common task is sampling rows from a dataframe separately within each group. In this post, we’ll explore how to use Pandas to sample rows from a dataframe by group.

Suppose we have a dataframe with a column called ‘vertical’ and we want to sample up to 100 random rows for each unique value in the ‘vertical’ column. Here’s how we can achieve this:

import pandas as pd
import numpy as np

# create a sample dataframe
data = {'vertical': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
        'value': np.random.randint(1, 101, size=10)}
df = pd.DataFrame(data)

# group by the 'vertical' column and get up to 100 random rows for each group
sampled_df = df.groupby('vertical').apply(lambda x: x.sample(min(len(x), 100)))

# print the sampled dataframe
print(sampled_df.reset_index(drop=True))

In this example, we first create a sample dataframe with a ‘vertical’ column and a ‘value’ column. We then group the dataframe by the ‘vertical’ column using the groupby() function. We apply a lambda function to each group that samples up to 100 random rows using the sample() function, and apply() combines the sampled groups back into a single dataframe. Finally, reset_index(drop=True) replaces the group-level index of the result with a plain 0-based index before printing.
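The steps above can be sketched as a small reusable helper. This is a minimal sketch, not the only way to do it: the function name sample_per_group, its cap and seed parameters, and the cap of 2 (chosen so the limit is visible on ten rows) are assumptions for illustration. It loops over the groups explicitly instead of calling apply(), which sidesteps differences between pandas versions in how apply() treats the grouping column.

```python
import numpy as np
import pandas as pd

# Same shape as the example dataframe; the seeded generator is an assumption
# added so the values are reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "vertical": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "value": rng.integers(1, 101, size=10),
})


def sample_per_group(frame, column, cap, seed=None):
    """Return up to `cap` random rows for each unique value in `column`.

    Hypothetical helper: the name and parameters are illustrative.
    """
    parts = [
        # Never ask for more rows than the group actually has.
        group.sample(n=min(len(group), cap), random_state=seed)
        for _, group in frame.groupby(column)
    ]
    # Stitch the per-group samples together and renumber the rows.
    return pd.concat(parts).reset_index(drop=True)


sampled = sample_per_group(df, "vertical", cap=2, seed=42)
print(sampled["vertical"].value_counts().sort_index())
```

Passing a seed makes the sample repeatable across runs; leaving it as None gives a fresh random sample each time.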

Note that the min(len(x), 100) argument passed to sample() ensures that we don’t sample more rows than are available in a given group. Without it, sample() raises a ValueError for any group with fewer than 100 rows, because by default it samples without replacement.
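To see why the cap matters, the sketch below first triggers the error on a small group. It then shows a different technique that needs no min() call: shuffle the whole dataframe once, then keep the first rows of each group with head(). The cap of 4 and the variable names are assumptions chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "vertical": ["A", "A", "A", "B", "B", "C", "C", "C", "C", "C"],
    "value": range(10),
})
cap = 4  # illustrative cap, small enough to exceed some group sizes

# Group B has only 2 rows, so asking for 4 without replacement fails.
group_b = df[df["vertical"] == "B"]
try:
    group_b.sample(n=cap)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Alternative: shuffle all rows, then take up to `cap` rows per group.
# head() returns the whole group when it has fewer than `cap` rows.
shuffled = df.sample(frac=1, random_state=0)
capped = shuffled.groupby("vertical").head(cap)
counts = capped["vertical"].value_counts().sort_index()
print(counts)  # A: 3, B: 2, C: 4
```

The shuffle-then-head variant keeps the original row labels in shuffled order, so call reset_index(drop=True) on the result if a clean index is needed.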


Author: robot learner
Reprint policy: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.