Explode in PySpark
explode and split are SQL functions; both operate on SQL Columns. split takes a Java regular expression as its second argument. If you want to separate data on arbitrary whitespace you'll need something like this:
from pyspark.sql.functions import col, explode, split

df = spark.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)
df.select(explode(split(col("word"), r"\s+")).alias("word")).show()
## +--------+
## | word|
## +--------+
## | cat|
## |elephant|
## | rat|
## | rat|
## | cat|
## +--------+
How to use explode in PySpark?
You can explode the all_skills array, then group by and pivot on dates with a count aggregation. Finally, apply coalesce to fill the null values with 0.
from pyspark.sql import functions as F
data = [(['A', 'B'], "2020-11-01",),
(['B', 'I', 'R'], "2020-11-01",),
(['S', 'H'], "2020-11-02",),
(['A', 'H', 'S'], "2020-11-02",), ]
df = spark.createDataFrame(data, ("all_skills", "dates",))
pivoted_df = (df.withColumn("all_skills", F.explode("all_skills"))
.groupBy("all_skills")
.pivot("dates")
.agg(F.count("all_skills"))
)
final_df = pivoted_df.select(
    [F.col("all_skills") if col_name == "all_skills"
     else F.coalesce(F.col(col_name), F.lit(0)).alias(col_name)
     for col_name in pivoted_df.columns])
final_df.show()
"""
+----------+----------+----------+
|all_skills|2020-11-01|2020-11-02|
+----------+----------+----------+
| B| 2| 0|
| A| 1| 1|
| S| 0| 2|
| R| 1| 0|
| I| 1| 0|
| H| 0| 2|
+----------+----------+----------+
"""
Explode DataFrame list columns into multiple rows
# This is not part of the solution, just creation of the data sample
# df = spark.sql("select stack(1, array(1, 2, 3, 4, 5, 6) ,array('x1', 'x2', 'x3', 'x4', 'x5', 'x6') ,array('y1', 'y2', 'y3', 'y4', 'y5', 'y6') ,array('v1', 'v2', 'v3', 'v4', 'v5', 'v6')) as (Country_1, Country_2,Country_3,Country_4)")
df.selectExpr('inline(arrays_zip(*))').show()
+---------+---------+---------+---------+
|Country_1|Country_2|Country_3|Country_4|
+---------+---------+---------+---------+
| 1| x1| y1| v1|
| 2| x2| y2| v2|
| 3| x3| y3| v3|
| 4| x4| y4| v4|
| 5| x5| y5| v5|
| 6| x6| y6| v6|
+---------+---------+---------+---------+
Explode multiple columns to rows in PySpark
I did this by passing the columns as a list to a for loop and exploding the DataFrame for every element in the list.
Conditional explode in pyspark
Use the array_except function, available from Spark version 2.4 onwards: get the element difference between the two columns after splitting them, and use explode_outer on that column.
from pyspark.sql.functions import col,explode_outer,array_except,split
split_col_df = df.withColumn('interest_array',split(col('interest'),',')) \
.withColumn('branch_array',split(col('branch'),','))
#Get the elements in branch not in interest
tmp_df = split_col_df.withColumn('elem_diff',array_except(col('branch_array'),col('interest_array')))
res = tmp_df.withColumn('interest_expl',explode_outer(col('interest_array'))) \
.withColumn('branch_expl',explode_outer(col('elem_diff')))
res.select('athl_id','interest_expl','branch_expl').show()
If there can be duplicates in the branch column and you only want to subtract an equal number of occurrences of a common value, you might have to write a UDF to solve the problem.
Looking to get counts of items within ArrayType column without using Explode
Here's a solution using a UDF that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
| arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
| [2, 2, 2]|
| [3, 3]|
+---------------+
from collections import Counter
@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
return dict(Counter(array))
df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+