How to groupby one column and join another column by comma in Pandas dataframe

data engineering

Publish Date: 2023-04-16

In this post, we will explore how to use the Pandas groupby method to group a DataFrame by one column and then join the values of another column into a single comma-separated string per group.

Let’s start by creating an example DataFrame:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

Our example DataFrame has two columns: Name and Fruit. We want to group the DataFrame by the Name column and then join the values in the Fruit column by comma for each group. To accomplish this, we can combine the groupby method with apply and a lambda function.

grouped = df.groupby('Name').apply(lambda x: ','.join(x['Fruit']))

In this code, we use the groupby method to group the DataFrame by the Name column. We then use the apply method to call a lambda function on each group of rows. The lambda function takes each group, selects the Fruit column using x['Fruit'], and joins the values in that column with a comma using ','.join(). The result is a new Series, grouped, indexed by the unique values of the Name column. Each entry holds the joined Fruit values for the corresponding group of rows in the original DataFrame, in their original row order.
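To make it clearer what the lambda receives, the same strings can be built with an explicit loop over the groups. This is only an illustrative sketch: the dictionary name manual is hypothetical and not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

# Iterating over a groupby object yields (group key, sub-DataFrame) pairs.
manual = {}  # hypothetical name, used only for this illustration
for name, group in df.groupby('Name'):
    # group holds only the rows whose Name equals `name`
    manual[name] = ','.join(group['Fruit'])

print(manual)
```

Each sub-DataFrame in the loop plays the role of x in the lambda above.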

print(grouped)

Output:

Name
Alice              Apple,Orange
Bob        Orange,Banana,Orange
Charlie                   Apple
dtype: object

As we can see from the output, the groupby method has grouped the DataFrame by the unique values in the Name column and joined the corresponding values in the Fruit column by comma.
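As an alternative sketch, the Fruit column can be selected before aggregating and ','.join passed directly to agg, which removes the need for a lambda. The variable name joined is an assumption made for this example. Recent pandas releases may also emit a deprecation warning when apply operates on the grouping column, which selecting the column first avoids.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

# Select the Fruit column of each group, then join its values with a comma.
# ','.join is a bound string method, so it can be passed as the aggregation function.
joined = df.groupby('Name')['Fruit'].agg(','.join)

print(joined)
```

The result is again a Series indexed by Name, with the same joined strings as before.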

If you want a DataFrame with two columns instead of a Series indexed by Name, you can add a call to the reset_index() method.

grouped = df.groupby('Name')['Fruit'].apply(lambda x: ','.join(x)).reset_index()

In this updated code, we select the Fruit column with ['Fruit'] right after groupby('Name'), so the lambda receives only the Fruit values of each group and can join them directly. We then use the reset_index() method to convert the resulting Series into a DataFrame with two columns: Name and Fruit.

print(grouped)

Output:

      Name                 Fruit
0    Alice          Apple,Orange
1      Bob  Orange,Banana,Orange
2  Charlie                 Apple

As we can see from the output, the resulting DataFrame now has two columns: Name and Fruit, with the joined values of the Fruit column grouped by the corresponding values in the Name column.
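Another way to get a two-column DataFrame, sketched here as an alternative rather than as the method used above, is to pass as_index=False to groupby and describe the aggregation with a dictionary. The variable name result is an assumption made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob', 'Bob'],
    'Fruit': ['Apple', 'Orange', 'Apple', 'Orange', 'Banana', 'Orange']
})

# as_index=False keeps Name as a regular column instead of the index,
# so no reset_index() call is needed afterwards.
result = df.groupby('Name', as_index=False).agg({'Fruit': ','.join})

print(result)
```

The dictionary form also extends naturally to several columns, each with its own aggregation function.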

robot learner

https://datasciencebyexample.github.io/2023/04/16/dataframe-groupby-and-join-rows-by-comma/

Unless otherwise stated, all articles in this blog are licensed under CC BY 4.0. If reproduced, please credit the source: robot learner.
