Split column to multiple rows based on value pyspark
You can use rlike to check whether ' and ' is present in col3 in order to add the IsSplit flag:
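import pyspark.sql.functions as F

# Flag the row first, while col3 still holds the unsplit string,
# then split col3 on ' and ' and explode the pieces into separate rows.
df2 = df.withColumn(
    'IsSplit',
    F.when(F.col('col3').rlike(' and '), 'Y').otherwise('N')
).withColumn(
    'col3',
    F.explode(F.split('col3', ' and '))
)
df2.show()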
+----+----+----+----+-------+
|col1|col2|col3|col4|IsSplit|
+----+----+----+----+-------+
|   a|   b|jack|   d|      Y|
|   a|   b|jill|   d|      Y|
|   1|   2|   3|   4|      N|
|   z|   x|   c|   v|      N|
|   t|   y| mom|   p|      Y|
|   t|   y| dad|   p|      Y|
+----+----+----+----+-------+
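To see what the two steps compute for a single value, here is a plain-Python model with no Spark involved. The helper name split_with_flag is hypothetical, and the sample values are assumptions inferred from the output above. It sets the flag when ' and ' matches anywhere in the value, then produces one row per piece, which is roughly what explode does with the array returned by split.

```python
import re

def split_with_flag(value, sep=" and "):
    """Model the flag + split + explode steps for one col3 value."""
    # rlike behaves like an unanchored regex search on the string.
    flag = "Y" if re.search(sep, value) else "N"
    # One (piece, flag) pair per element of the split result.
    return [(piece, flag) for piece in re.split(sep, value)]

# Sample values assumed from the output shown above.
for value in ["jack and jill", "3", "c", "mom and dad"]:
    print(split_with_flag(value))
```

The flag is computed from the unsplit value, which is why the PySpark version adds IsSplit before it overwrites col3.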
Split row into multiple rows to limit length of array in column (spark / scala)
Using the higher-order functions transform and filter together with slice, you can split the array into sub-arrays of size 20 and then explode the result:
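import org.apache.spark.sql.functions.{explode, expr}

val l = 20  // maximum number of elements per sub-array
val df1 = df.withColumn(
  "items",
  explode(
    expr(
      // transform: at every index i that is a multiple of l, take the next l elements (slice is 1-based)
      // filter: drop the nulls produced for all other indices
      s"filter(transform(items, (x, i) -> IF(i % $l = 0, slice(items, i + 1, $l), null)), x -> x is not null)"
    )
  )
)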
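The chunking logic inside that expression can be hard to read as a single SQL string. Here is a plain-Python model of the same idea (the function name chunk is hypothetical, and this is only a sketch of the logic, not Spark code). Python lists are 0-based, so the slice starts at i rather than i + 1.

```python
def chunk(items, size=20):
    """Plain-Python model of the transform/filter/slice expression."""
    # transform: at every index that is a multiple of `size`, take the next
    # `size` elements; every other index maps to None.
    mapped = [items[i:i + size] if i % size == 0 else None
              for i in range(len(items))]
    # filter: drop the None placeholders, leaving only the sub-arrays.
    return [sub for sub in mapped if sub is not None]

chunks = chunk(list(range(45)))
print([len(c) for c in chunks])  # [20, 20, 5]
```

Each sub-array then becomes its own row once explode is applied.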
Pyspark DataFrame: Split column with multiple values into rows
You can use explode, but first you'll have to convert the string representation of the array into an array.
One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".
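from pyspark.sql.functions import col, explode, regexp_replace, split

# Strip the surrounding brackets, split on ", ", then explode into rows.
# The pattern is a raw string so that \[ and \] reach the regex engine unchanged.
df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#|  z1|  a1| foo|
#|  z1|  b2| foo|
#|  z1|  c3| foo|
#+----+----+----+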
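To check what the pattern does before running it on a DataFrame, the same regular expression can be tried with Python's re module. This is a small sketch; the input string is an assumed example of what col2 holds, inferred from the output above.

```python
import re

raw = "[a1, b2, c3]"  # assumed shape of the string stored in col2

# Remove one leading "[" and one trailing "]", then split on ", ".
stripped = re.sub(r"(^\[)|(\]$)", "", raw)
parts = stripped.split(", ")
print(parts)  # ['a1', 'b2', 'c3']
```

Anchoring with ^ and $ means only the outer brackets are removed, so brackets inside an element would be left alone.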
Pyspark : How to split pipe-separated column into multiple rows?
Use the split function, which returns an array, then apply the explode function to that array.
Example:
df.show(10, False)
#+-------+---------+-----------------------+
#|movieid|moviename|genre                  |
#+-------+---------+-----------------------+
#|1      |example1 |action|thriller|romance|
#+-------+---------+-----------------------+
from pyspark.sql.functions import col, explode, split

# split takes a regex, so the pipe must be escaped to match a literal "|"
df.withColumnRenamed("genre", "genre1").\
    withColumn("genre", explode(split(col("genre1"), '\\|'))).\
    drop("genre1").\
    show()
#+-------+---------+--------+
#|movieid|moviename|   genre|
#+-------+---------+--------+
#|      1| example1|  action|
#|      1| example1|thriller|
#|      1| example1| romance|
#+-------+---------+--------+
Spark DF: Split array to multiple rows
Here is one proposed solution. You can organize your sal field into arrays using $concatArrays in MongoDB before exporting it to Spark. Then run something like this:
#df
#+---+-----+------------------+
#| id|empno|               sal|
#+---+-----+------------------+
#|  1|  101|[1000, 2000, 1500]|
#|  2|  102|      [1000, 1500]|
#|  3|  103|      [2000, 3000]|
#+---+-----+------------------+
import pyspark.sql.functions as F
df_new = df.select('id', 'empno', F.explode('sal').alias('sal'))
#df_new.show()
#+---+-----+----+
#| id|empno| sal|
#+---+-----+----+
#|  1|  101|1000|
#|  1|  101|2000|
#|  1|  101|1500|
#|  2|  102|1000|
#|  2|  102|1500|
#|  3|  103|2000|
#|  3|  103|3000|
#+---+-----+----+
pyspark RDD expand a row to multiple rows
Use flatMap:
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()])
Example:
rdd.flatMap(lambda x: [(w, x[0]) for w in x[1].split()]).collect()
# [('sentence', 10), ('number', 10), ('one', 10), ('longer', 17), ('sentence', 17), ('number', 17), ('two', 17)]
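For readers unfamiliar with flatMap, it maps each row to a list and then flattens those lists into one sequence. A plain-Python equivalent using itertools is sketched below. The input rows are an assumption inferred from the output above, since the original RDD contents are not shown.

```python
from itertools import chain

# Assumed input, inferred from the output above: (number, sentence) pairs.
rows = [(10, "sentence number one"), (17, "longer sentence number two")]

# Map each row to a list of (word, number) pairs, then flatten the lists.
pairs = list(chain.from_iterable(
    [(w, x[0]) for w in x[1].split()] for x in rows
))
print(pairs)
```

With plain map instead of flatMap, the result would be a list of two lists rather than seven flat pairs.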
Explode multiple columns to rows in pyspark
I did this by passing the columns as a list to a for loop and exploding the DataFrame once for each column in the list.
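The answer gives no code, so the following is only a plain-Python model of that loop, with rows as dictionaries and hypothetical helper names (explode, explode_columns). In PySpark the loop body would presumably be something like df = df.withColumn(c, F.explode(c)). The model shows that exploding columns one after another produces every combination of their elements.

```python
def explode(rows, column):
    """Model of exploding one array column: one output row per element."""
    return [{**row, column: element}
            for row in rows
            for element in row[column]]

def explode_columns(rows, columns):
    """Explode each listed column in turn, as in the loop described above."""
    for column in columns:
        rows = explode(rows, column)
    return rows

# Hypothetical sample row with two array columns.
rows = [{"id": 1, "a": [1, 2], "b": ["x", "y"]}]
for row in explode_columns(rows, ["a", "b"]):
    print(row)
```

Two arrays of length 2 yield 4 rows here. If the arrays are meant to be paired element by element instead, zipping them first (for example with arrays_zip in PySpark) and exploding once is a common alternative.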