Explode in PySpark

explode and split are SQL functions; both operate on a SQL Column. split takes a Java regular expression as its second argument. If you want to separate data on arbitrary whitespace, you'll need something like this (note the raw string, so the backslash in \s+ reaches Spark intact):

from pyspark.sql.functions import col, explode, split

df = spark.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)

df.select(explode(split(col("word"), r"\s+")).alias("word")).show()

## +--------+
## |    word|
## +--------+
## |     cat|
## |elephant|
## |     rat|
## |     rat|
## |     cat|
## +--------+

How to use explode in PySpark?

You can explode the all_skills array, then group by, pivot, and apply a count aggregation. Finally, apply coalesce to fill the null values with 0.

from pyspark.sql import functions as F

data = [(['A', 'B'], "2020-11-01",),
        (['B', 'I', 'R'], "2020-11-01",),
        (['S', 'H'], "2020-11-02",),
        (['A', 'H', 'S'], "2020-11-02",), ]

df = spark.createDataFrame(data, ("all_skills", "dates",))

pivoted_df = (df.withColumn("all_skills", F.explode("all_skills"))
              .groupBy("all_skills")
              .pivot("dates")
              .agg(F.count("all_skills")))

# Replace the nulls introduced by the pivot with 0
final_df = pivoted_df.select(
    [F.col("all_skills") if col_name == "all_skills"
     else F.coalesce(F.col(col_name), F.lit(0)).alias(col_name)
     for col_name in pivoted_df.columns]
)

final_df.show()

"""
+----------+----------+----------+
|all_skills|2020-11-01|2020-11-02|
+----------+----------+----------+
|         B|         2|         0|
|         A|         1|         1|
|         S|         0|         2|
|         R|         1|         0|
|         I|         1|         0|
|         H|         0|         2|
+----------+----------+----------+
"""

DataFrame: explode list columns into multiple rows

# This is not part of the solution, just creation of the data sample
# df = spark.sql("select stack(1, array(1, 2, 3, 4, 5, 6) ,array('x1', 'x2', 'x3', 'x4', 'x5', 'x6') ,array('y1', 'y2', 'y3', 'y4', 'y5', 'y6') ,array('v1', 'v2', 'v3', 'v4', 'v5', 'v6')) as (Country_1, Country_2,Country_3,Country_4)")

# arrays_zip combines the array columns element-wise into an array of
# structs; inline then explodes that array into one row per struct
df.selectExpr('inline(arrays_zip(*))').show()


+---------+---------+---------+---------+
|Country_1|Country_2|Country_3|Country_4|
+---------+---------+---------+---------+
|        1|       x1|       y1|       v1|
|        2|       x2|       y2|       v2|
|        3|       x3|       y3|       v3|
|        4|       x4|       y4|       v4|
|        5|       x5|       y5|       v5|
|        6|       x6|       y6|       v6|
+---------+---------+---------+---------+
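
If you prefer the Python API over a SQL expression string, the same zip-then-inline trick can be written with F.arrays_zip and F.inline; note that F.inline is only available as a Python function from PySpark 3.4, so keep the selectExpr form on older versions:

from pyspark.sql import functions as F

# Same result via the Python API (F.inline requires PySpark 3.4+)
df.select(F.inline(F.arrays_zip(*[F.col(c) for c in df.columns]))).show()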

Explode multiple columns to rows in PySpark

I did this by passing the columns as a list to a for loop and exploding the DataFrame for every element in the list, as in the sketch below.
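
A minimal sketch of that approach, assuming a DataFrame df with hypothetical list columns col_a and col_b. Be aware that exploding several columns one after another yields the cross product of their elements; zip the arrays first (e.g. with arrays_zip, as in the previous section) if you want them paired row-wise:

from pyspark.sql import functions as F

cols_to_explode = ["col_a", "col_b"]  # hypothetical column names

# Explode each list column in turn; each pass multiplies the rows,
# producing the cross product of the exploded columns' elements
for c in cols_to_explode:
    df = df.withColumn(c, F.explode(c))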

Conditional explode in PySpark

Use the array_except function, available from Spark version 2.4.

Get the element difference between the two columns after splitting them, then use explode_outer on that column.

from pyspark.sql.functions import col, explode_outer, array_except, split

split_col_df = df.withColumn('interest_array', split(col('interest'), ',')) \
                 .withColumn('branch_array', split(col('branch'), ','))

# Get the elements in branch that are not in interest
tmp_df = split_col_df.withColumn(
    'elem_diff', array_except(col('branch_array'), col('interest_array')))

res = tmp_df.withColumn('interest_expl', explode_outer(col('interest_array'))) \
            .withColumn('branch_expl', explode_outer(col('elem_diff')))

res.select('athl_id', 'interest_expl', 'branch_expl').show()

If there can be duplicates in the branch column and you only want to subtract an equal number of occurrences of a common value, you might have to write a UDF to solve the problem; a sketch follows.
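
A minimal sketch of such a UDF, assuming string arrays. It treats the two arrays as multisets and removes only as many occurrences from branch_array as appear in interest_array; the subtract_occurrences name and the use of collections.Counter are illustrative, not from the original answer:

from collections import Counter

from pyspark.sql import functions as F
from pyspark.sql import types as T

@F.udf(returnType=T.ArrayType(T.StringType()))
def subtract_occurrences(branch, interest):
    # Multiset difference: keeps a value only as many times as its count
    # in branch exceeds its count in interest. Order is not preserved.
    return list((Counter(branch) - Counter(interest)).elements())

# Hypothetical usage, replacing the array_except step above:
# tmp_df = split_col_df.withColumn(
#     'elem_diff', subtract_occurrences('branch_array', 'interest_array'))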

Looking to get counts of items within an ArrayType column without using explode

Here's a solution using a udf that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = sc.parallelize([([1, 2, 3, 3, 1],),
                     ([4, 5, 6, 4, 5],),
                     ([2, 2, 2],),
                     ([3, 3],)]).toDF(['arrays'])

df.show()

+---------------+
|         arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
|      [2, 2, 2]|
|         [3, 3]|
+---------------+

from collections import Counter

@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    return dict(Counter(array))

df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)

+---------------+------------------------+
|arrays         |counts                  |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2]      |[2 -> 3]                |
|[3, 3]         |[3 -> 2]                |
+---------------+------------------------+
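
If you later need the count of one particular element, you can index the resulting map column directly; this follow-up is a hypothetical illustration, not part of the original answer:

# Count of the value 3 in each array (null where 3 is absent)
df.withColumn('counts', count_elements(F.col('arrays'))) \
  .withColumn('threes', F.col('counts')[3]) \
  .show(truncate=False)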

