Splitting a Row into Multiple Rows in PySpark

Split column into multiple rows based on value in PySpark

You can use rlike to check whether ' and ' is present in col3 in order to add the IsSplit flag:

import pyspark.sql.functions as F

df2 = df.withColumn(
    'IsSplit',
    F.when(F.col('col3').rlike(' and '), 'Y').otherwise('N')
).withColumn(
    'col3',
    F.explode(F.split('col3', ' and '))
)

df2.show()
+----+----+----+----+-------+
|col1|col2|col3|col4|IsSplit|
+----+----+----+----+-------+
| a| b|jack| d| Y|
| a| b|jill| d| Y|
| 1| 2| 3| 4| N|
| z| x| c| v| N|
| t| y| mom| p| Y|
| t| y| dad| p| Y|
+----+----+----+----+-------+

Split row into multiple rows to limit length of array in column (Spark / Scala)

Using the higher-order functions transform and filter along with slice, you can split the array into sub-arrays of size 20 and then explode them:

import org.apache.spark.sql.functions.{explode, expr}

val l = 20

val df1 = df.withColumn(
  "items",
  explode(
    expr(
      s"filter(transform(items, (x, i) -> IF(i % $l = 0, slice(items, i + 1, $l), null)), x -> x is not null)"
    )
  )
)
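
For reference, roughly the same chunk-then-explode expression can be run from PySpark via expr (a minimal sketch, assuming Spark 2.4+ and an array column named items):

import pyspark.sql.functions as F

l = 20

df1 = df.withColumn(
    "items",
    F.explode(
        # transform tags every l-th position with the next chunk of l items,
        # filter drops the null placeholders, explode emits one row per chunk
        F.expr(
            f"filter(transform(items, (x, i) -> IF(i % {l} = 0, slice(items, i + 1, {l}), null)), x -> x is not null)"
        )
    )
)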

PySpark DataFrame: Split column with multiple values into rows

You can use explode, but first you'll have to convert the string representation of the array into an actual array.

One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".

from pyspark.sql.functions import col, explode, regexp_replace, split

df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| z1| a1| foo|
#| z1| b2| foo|
#| z1| c3| foo|
#+----+----+----+

PySpark: How to split a pipe-separated column into multiple rows?

Use the split function, which returns an array, and then call explode on that array.

Example:

df.show(10,False)
#+-------+---------+-----------------------+
#|movieid|moviename|genre |
#+-------+---------+-----------------------+
#|1 |example1 |action|thriller|romance|
#+-------+---------+-----------------------+

from pyspark.sql.functions import *

df.withColumnRenamed("genre", "genre1") \
    .withColumn("genre", explode(split(col("genre1"), '\\|'))) \
    .drop("genre1") \
    .show()
#+-------+---------+--------+
#|movieid|moviename| genre|
#+-------+---------+--------+
#| 1| example1| action|
#| 1| example1|thriller|
#| 1| example1| romance|
#+-------+---------+--------+

Spark DF: Split array into multiple rows

Here is one proposed solution. You can organize your sal field into arrays using $concatArrays in MongoDB before exporting it to Spark. Then run something like this:

#df
#+---+-----+------------------+
#| id|empno| sal|
#+---+-----+------------------+
#| 1| 101|[1000, 2000, 1500]|
#| 2| 102| [1000, 1500]|
#| 3| 103| [2000, 3000]|
#+---+-----+------------------+

import pyspark.sql.functions as F

df_new = df.select('id','empno',F.explode('sal').alias('sal'))

#df_new.show()
#+---+-----+----+
#| id|empno| sal|
#+---+-----+----+
#| 1| 101|1000|
#| 1| 101|2000|
#| 1| 101|1500|
#| 2| 102|1000|
#| 2| 102|1500|
#| 3| 103|2000|
#| 3| 103|3000|
#+---+-----+----+

PySpark RDD: expand a row to multiple rows

Use flatMap:

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])

Example:
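
For context, the output below corresponds to an input RDD of (id, sentence) pairs along these lines (a reconstruction, not shown in the original answer):

rdd = sc.parallelize([(10, 'sentence number one'), (17, 'longer sentence number two')])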

rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]

Explode multiple columns to rows in PySpark

I did this by passing the columns as a list to a for loop and exploding the DataFrame for every element in the list, as sketched below.
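
A minimal sketch of that loop, assuming the DataFrame has array columns named col_a and col_b (hypothetical names):

import pyspark.sql.functions as F

cols_to_explode = ["col_a", "col_b"]  # hypothetical array columns
df_exploded = df
for c in cols_to_explode:
    # each pass replaces the array column with one element per output row
    df_exploded = df_exploded.withColumn(c, F.explode(c))

Note that exploding the columns one after another yields the cross product of their elements for each original row; if the arrays should instead be paired element-wise, arrays_zip followed by a single explode is the usual alternative.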


