Spark SQL - Collecting Columns into Lists after Groupby


In Spark SQL, you may want to collect the values of one or more columns into lists after grouping the data by one or more columns. This can be accomplished using the collect_list aggregate function in Spark SQL.

Collecting a Single Column into a List

The following code shows an example of how to collect the values of a single column column3 into a list named list_column3 after grouping the data by columns column1 and column2:

from pyspark.sql.functions import collect_list

df.groupBy("column1", "column2").agg(
collect_list("column3").alias("list_column3")
).show()

The result of this code will be a dataframe with three columns: column1, column2, and list_column3. The values of column column3 will be collected into a list named list_column3 for each unique combination of values in columns column1 and column2.

Collecting Multiple Columns into Lists

You can collect the values of multiple columns into multiple lists after grouping the data by one or more columns, as shown in the following code:

from pyspark.sql.functions import collect_list

df.groupBy("column1", "column2").agg(
collect_list("columnA").alias("listA"),
collect_list("columnB").alias("listB")
).show()

The result of this code will be a dataframe with four columns: column1, column2, listA, and listB. The values of columns columnA and columnB will be collected into lists named listA and listB respectively, for each unique combination of values in columns column1 and column2.

Collecting Columns into Lists using Spark SQL

You can also collect the values of one or more columns into lists after grouping the data by one or more columns using Spark’s spark.sql method. The following code shows an example of how to do this:

df.createOrReplaceTempView("df_view")

spark.sql("SELECT column1, column2, collect_list(column3) AS list_column3 FROM df_view GROUP BY column1, column2").show()

This code first creates a temporary view df_view from the df dataframe, and then executes a Spark SQL query that groups the data by columns column1 and column2, and collects the values of column column3 into a list named list_column3. The result of the query will be displayed using the show method.

Order of Elements in the Lists

It’s important to note that the order of elements in the lists collected by the collect_list function will be the same, as they are collected based on the same grouping conditions. For example, if you collect two columns columnA and columnB into two lists listA and listB respectively, the order of elements in listA and listB will correspond to each other for each group defined by the groupBy clause.


Author: robot learner
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source robot learner !
  TOC