How to groupby one column and join another column by comma in Pandas dataframe


In this post, we will explore how to use the Pandas groupby method to group a DataFrame by one column and then join the values of another column into a comma-separated string.

Let’s start by creating an example DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

Our example DataFrame has two columns: Name and Fruit. We want to group the DataFrame by the Name column and then join the values in the Fruit column with commas for each group. To accomplish this, we can combine the groupby method with apply and an anonymous lambda function.

grouped = df.groupby('Name').apply(lambda x: ','.join(x['Fruit']))

In this code, we use the groupby method to group the DataFrame by the Name column. We then use the apply method to call a lambda function on each group of rows. The lambda function receives each group as a DataFrame, selects its Fruit column using x['Fruit'], and joins the values in that column with a comma using ','.join(). The result is a new Series, grouped, whose index holds the unique values of the Name column. Each value in the Series is the joined string of Fruit values for the corresponding group of rows in the original DataFrame. Note that newer pandas releases may emit a deprecation warning for this form, because apply is also handed the grouping column; selecting the Fruit column before calling apply avoids this.

print(grouped)

Output:

Name
Alice              Apple,Orange
Bob        Orange,Banana,Orange
Charlie                   Apple
dtype: object

As we can see from the output, the groupby method has grouped the DataFrame by the unique values in the Name column and joined the corresponding values in the Fruit column by comma.
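To make the per-group step more concrete, here is a minimal sketch that builds the same mapping by iterating over the groups by hand. The name `manual` is only illustrative and is not part of the original example; the DataFrame is repeated so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

# Iterate over the groups by hand. Each `group` is the sub-DataFrame
# holding the rows that share one Name value.
manual = {}  # illustrative name, not from the original example
for name, group in df.groupby('Name'):
    # Join that group's Fruit values into one comma-separated string.
    manual[name] = ','.join(group['Fruit'])

print(manual)
```

This is roughly what apply does for us: it calls the function once per group and collects the results, keyed by the group label.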

The result above is a Series with Name as its index. If you want a DataFrame with two regular columns instead, you can select the Fruit column before applying the function and then call the reset_index() method.

grouped = df.groupby('Name')['Fruit'].apply(lambda x: ','.join(x)).reset_index()

In this updated code, we have added ['Fruit'] after the groupby call. This is a column selection on the grouped object, not a parameter of groupby: it tells pandas to group by Name but pass only the Fruit column of each group to the lambda function, so x is now a Series and can be joined directly. We then use the reset_index() method to turn the Name index back into a regular column, which converts the resulting Series into a DataFrame with two columns: Name and Fruit.

print(grouped)

Output:

      Name                 Fruit
0    Alice          Apple,Orange
1      Bob  Orange,Banana,Orange
2  Charlie                 Apple

As we can see from the output, the resulting DataFrame now has two columns: Name and Fruit, with the joined values of the Fruit column grouped by the corresponding values in the Name column.
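As a possible variation, the lambda can be dropped by passing the bound method ','.join straight to agg. This is a sketch of an alternative spelling, not the approach used above; the name `result` is only illustrative, and the DataFrame is repeated so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

# Select the Fruit column, aggregate each group with str.join,
# then move Name from the index back into a regular column.
result = df.groupby('Name')['Fruit'].agg(','.join).reset_index()

print(result)
```

Both spellings rely on every value in Fruit being a string. If the column held numbers or missing values, ','.join would raise a TypeError, so converting the column first (for example with astype(str)) is one option to consider.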


Author: robot learner
Reprint policy: Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If you reproduce this article, please credit the source: robot learner.