Explode in PySpark
explode and split are SQL functions; both operate on SQL Columns. split takes a Java regular expression as its second argument. If you want to separate data on arbitrary whitespace you'll need something like this:
from pyspark.sql.functions import col, explode, split

df = spark.createDataFrame(
    [('cat \n\n elephant rat \n rat cat', )], ['word']
)
df.select(explode(split(col("word"), r"\s+")).alias("word")).show()
## +--------+
## | word|
## +--------+
## | cat|
## |elephant|
## | rat|
## | rat|
## | cat|
## +--------+
How to use explode in PySpark?
You can explode the all_skills array, then group by and pivot on dates with a count aggregation. Finally, apply coalesce to fill the null values with 0.
from pyspark.sql import functions as F
data = [(['A', 'B'], "2020-11-01",),
(['B', 'I', 'R'], "2020-11-01",),
(['S', 'H'], "2020-11-02",),
(['A', 'H', 'S'], "2020-11-02",), ]
df = spark.createDataFrame(data, ("all_skills", "dates",))
pivoted_df = (df.withColumn("all_skills", F.explode("all_skills"))
.groupBy("all_skills")
.pivot("dates")
.agg(F.count("all_skills"))
)
final_df = pivoted_df.select(
    [F.col("all_skills") if col_name == "all_skills"
     else F.coalesce(F.col(col_name), F.lit(0)).alias(col_name)
     for col_name in pivoted_df.columns])
final_df.show()
"""
+----------+----------+----------+
|all_skills|2020-11-01|2020-11-02|
+----------+----------+----------+
| B| 2| 0|
| A| 1| 1|
| S| 0| 2|
| R| 1| 0|
| I| 1| 0|
| H| 0| 2|
+----------+----------+----------+
"""
Explode DataFrame list columns into multiple rows
# This is not part of the solution, just creation of the data sample
# df = spark.sql("select stack(1, array(1, 2, 3, 4, 5, 6) ,array('x1', 'x2', 'x3', 'x4', 'x5', 'x6') ,array('y1', 'y2', 'y3', 'y4', 'y5', 'y6') ,array('v1', 'v2', 'v3', 'v4', 'v5', 'v6')) as (Country_1, Country_2,Country_3,Country_4)")
df.selectExpr('inline(arrays_zip(*))').show()
+---------+---------+---------+---------+
|Country_1|Country_2|Country_3|Country_4|
+---------+---------+---------+---------+
| 1| x1| y1| v1|
| 2| x2| y2| v2|
| 3| x3| y3| v3|
| 4| x4| y4| v4|
| 5| x5| y5| v5|
| 6| x6| y6| v6|
+---------+---------+---------+---------+
Explode multiple columns to rows in PySpark
I did this by passing the columns as a list to a for loop and exploding the DataFrame for every element in the list.
Conditional explode in pyspark
Use the array_except function, available from Spark version 2.4 onwards: get the element difference between the two columns after splitting them, and use explode_outer on that column.
from pyspark.sql.functions import col,explode_outer,array_except,split
split_col_df = df.withColumn('interest_array',split(col('interest'),',')) \
.withColumn('branch_array',split(col('branch'),','))
#Get the elements in branch not in interest
tmp_df = split_col_df.withColumn('elem_diff',array_except(col('branch_array'),col('interest_array')))
res = tmp_df.withColumn('interest_expl',explode_outer(col('interest_array'))) \
.withColumn('branch_expl',explode_outer(col('elem_diff')))
res.select('athl_id','interest_expl','branch_expl').show()
If there can be duplicates in the branch column and you only want to subtract an equal number of occurrences of a common value, you might have to write a UDF to solve the problem.
Looking to get counts of items within ArrayType column without using Explode
Here's a solution using a UDF that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
| arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
| [2, 2, 2]|
| [3, 3]|
+---------------+
from collections import Counter
@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
return dict(Counter(array))
df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+