PySpark DataFrame: Converting False and True to 0 and 1

Using CASE ... WHEN (when(...).otherwise(...)) is unnecessarily verbose. Instead, you can simply cast the boolean columns to integer:

from pyspark.sql.functions import col

df.select([col(c).cast("integer") for c in ["test1", "test2"]])
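
For reference, here's a minimal, self-contained sketch of this approach; the sample DataFrame and its values are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Assumed toy data: two boolean columns
df = spark.createDataFrame([(True, False), (False, True)], ["test1", "test2"])

# Casting a boolean column to integer maps True -> 1 and False -> 0
df.select([col(c).cast("integer") for c in ["test1", "test2"]]).show()
#+-----+-----+
#|test1|test2|
#+-----+-----+
#|    1|    0|
#|    0|    1|
#+-----+-----+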

Using when and otherwise while converting boolean values to strings in PySpark

As I mentioned in the comments, the issue is a type mismatch. You need to convert the boolean column to a string before doing the comparison. Finally, you need to cast the column to a string in the otherwise() as well (you can't have mixed types in a column).

Your code is easy to modify to get the correct output:

import pyspark.sql.functions as f

cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col,
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+

However, there are some alternative approaches as well. For instance, this is a good place to use pyspark.sql.Column.isin():

from functools import reduce

df = reduce(
    lambda df, col: df.withColumn(
        col,
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+

(Here I used reduce, imported from functools, to eliminate the for loop, but you could have kept it: reduce simply threads the DataFrame through one withColumn call per column in cols.)

You could also use pyspark.sql.DataFrame.replace() but you'd have to first convert the column active to a string:

df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above

Or using replace just once:

df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
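
For reference, a minimal end-to-end sketch of the replace approach; the toy rows are assumptions shaped like the question's data (testing holds 'Y'/'N' strings, active holds booleans):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Assumed toy data
df = spark.createDataFrame([(1, 'Y', True), (2, 'N', False)], ['id', 'testing', 'active'])
cols = ['testing', 'active']

# Cast active to string first (replace can't mix types), then map all
# four spellings to 'True'/'False' in one call
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
df.show()
#+--+-------+------+
#|id|testing|active|
#+--+-------+------+
#| 1|   True|  True|
#| 2|  False| False|
#+--+-------+------+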

Create a boolean column and fill it if another column contains a particular string in PySpark

You don't need to use filter to scan each row of col1. You can just use the column's value inside when and match it against the pattern "%+", where % matches any prefix and the trailing + is the literal character you are looking for at the very end of the string.

from pyspark.sql.functions import col, when

DF = DF.withColumn("col2", when(col("col1").like("%+"), True).otherwise(False))

This will result in the following DataFrame:

+----+-----+
|col1| col2|
+----+-----+
| a+| true|
| b+| true|
| a-|false|
| d-|false|
+----+-----+
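
For completeness, a runnable version of the same idea; the DataFrame construction below is an assumption that mirrors the rows shown above:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

# Assumed toy data matching the table above
DF = spark.createDataFrame([('a+',), ('b+',), ('a-',), ('d-',)], ['col1'])

# like('%+') is true only for strings whose last character is '+'
DF.withColumn('col2', when(col('col1').like('%+'), True).otherwise(False)).show()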

You can study more about the when/otherwise functionality in the PySpark documentation for pyspark.sql.functions.when and pyspark.sql.Column.otherwise.

Logical with count in PySpark

Group by customer + PersonId and use a when expression to check whether all the values in the is_online_store column are true, all false, or a mix of the two, using, for example, the bool_and function (available in Spark 3.0+):

from pyspark.sql import functions as F

df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")
    .when(F.expr("bool_and(!is_online_store)"), "Offline")
    .otherwise("Hybrid").alias("New_Column")
)

df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2| 2| Offline|
#|afabd2d2| 8| Online|
#|afabd2d2| 4| Hybrid|
#|afabd2d2| 3| Online|
#+--------+--------+----------+
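
If you're on a Spark version older than 3.0, where bool_and isn't available, here is a sketch of an equivalent that leans on plain counts instead (same DataFrame and column names as above):

from pyspark.sql import functions as F

# All rows online -> the count of online rows equals the group size
# No rows online  -> the count of online rows is zero
# Anything else   -> a mix of the two
df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.sum(F.col("is_online_store").cast("int")) == F.count("*"), "Online")
    .when(F.sum(F.col("is_online_store").cast("int")) == 0, "Offline")
    .otherwise("Hybrid").alias("New_Column")
)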

