Pyspark data frame Converting false and true to 0 and 1
Using CASE ... WHEN (when(...).otherwise(...)) is unnecessarily verbose. Instead, you can just cast to integer:
from pyspark.sql.functions import col
df.select([col(c).cast("integer") for c in ["test1", "test2"]])
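As a sanity check on the cast's semantics, here is a minimal plain-Python sketch (the helper name bool_to_int is made up for illustration), assuming Spark's usual rule that a boolean casts to 1/0 and a null stays null:

```python
# Plain-Python analogue of Spark's boolean -> integer cast:
# True -> 1, False -> 0, null (None) stays None.
def bool_to_int(value):
    return None if value is None else int(value)

print([bool_to_int(v) for v in [True, False, None]])  # [1, 0, None]
```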
Using when and otherwise while converting boolean values to strings in Pyspark
As I mentioned in the comments, the issue is a type mismatch: you need to convert the boolean column to a string before doing the comparison. You also need to cast the column to a string in the otherwise() clause, since a column can't hold mixed types.
Your code is easy to modify to get the correct output:
import pyspark.sql.functions as f

cols = ["testing", "active"]
for col in cols:
    df = df.withColumn(
        col,
        f.when(
            f.col(col) == 'N',
            'False'
        ).when(
            f.col(col) == 'Y',
            'True'
        ).when(
            f.col(col).cast('string') == 'true',
            'True'
        ).when(
            f.col(col).cast('string') == 'false',
            'False'
        ).otherwise(f.col(col).cast('string'))
    )
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
However, there are some alternative approaches as well. For instance, this is a good place to use pyspark.sql.Column.isin():
from functools import reduce

df = reduce(
    lambda df, col: df.withColumn(
        col,
        f.when(
            f.col(col).cast('string').isin(['N', 'false']),
            'False'
        ).when(
            f.col(col).cast('string').isin(['Y', 'true']),
            'True'
        ).otherwise(f.col(col).cast('string'))
    ),
    cols,
    df
)
df.show()
#+---+----+-------+----------+-----+------+
#| id|name|testing|avg_result|score|active|
#+---+----+-------+----------+-----+------+
#| 1| sam| null| null| null| True|
#| 2| Ram| True| 0.05| 10| False|
#| 3| Ian| False| 0.01| 1| False|
#| 4| Jim| False| 1.2| 3| True|
#+---+----+-------+----------+-----+------+
(Here I used reduce to eliminate the for loop, but you could have kept it.)
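The isin() chain above boils down to a simple per-value mapping. A plain-Python sketch of that rule (the normalize name is made up; string inputs stand in for the already-cast column values, and None mirrors a null cell falling through to otherwise()):

```python
# Per-value sketch of the isin()-based normalization above.
def normalize(value):
    s = None if value is None else str(value)
    if s in ('N', 'false'):
        return 'False'
    if s in ('Y', 'true'):
        return 'True'
    return s  # anything else (including null) passes through

print([normalize(v) for v in ['Y', 'false', None, 'other']])
# ['True', 'False', None, 'other']
```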
You could also use pyspark.sql.DataFrame.replace(), but you'd have to first convert the column active to a string:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true'], 'True', subset=cols)\
    .replace(['N', 'false'], 'False', subset=cols)
df.show()
# results omitted, but it's the same as above
Or using replace just once:
df = df.withColumn('active', f.col('active').cast('string'))\
    .replace(['Y', 'true', 'N', 'false'], ['True', 'True', 'False', 'False'], subset=cols)
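That single replace() call is just a value-for-value lookup. A plain-Python sketch of the equivalent mapping (replace_once is a hypothetical helper name):

```python
# The to-replace and replacement lists zip into one lookup table;
# values not in the table pass through unchanged.
mapping = dict(zip(['Y', 'true', 'N', 'false'],
                   ['True', 'True', 'False', 'False']))

def replace_once(value):
    return mapping.get(value, value)

print([replace_once(v) for v in ['Y', 'false', 'x']])  # ['True', 'False', 'x']
```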
Create a boolean column and fill it if other column contains a particular string in Pyspark
You don't need to use filter to scan each row of col1. You can just use the column's value inside when and match it against the pattern "%+": the % wildcard matches any prefix, so the pattern means the string must end with a + character.
from pyspark.sql.functions import col, when

DF = DF.withColumn("col2", when(col("col1").like("%+"), True).otherwise(False))
This will result in the following DataFrame:
+----+-----+
|col1| col2|
+----+-----+
| a+| true|
| b+| true|
| a-|false|
| d-|false|
+----+-----+
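The like("%+") test itself is easy to mirror in plain Python, assuming SQL LIKE semantics where % matches any (possibly empty) prefix (the flag_plus name is made up for illustration):

```python
# SQL LIKE '%+' means "ends with '+'"; this mirrors the
# when(...).otherwise(...) expression as a per-value predicate.
def flag_plus(s):
    return s.endswith('+')

print([flag_plus(v) for v in ['a+', 'b+', 'a-', 'd-']])  # [True, True, False, False]
```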
You can read more about the when/otherwise functionality in the PySpark documentation.
Logical with count in Pyspark
Group by customer + PersonId and use a when expression to check whether all values in the is_online_store column are true, all false, or a mix of the two, using for example the bool_and aggregate function (available since Spark 3.0):
from pyspark.sql import functions as F

df1 = df.groupBy("customer", "PersonId").agg(
    F.when(F.expr("bool_and(is_online_store)"), "Online")
    .when(F.expr("bool_and(!is_online_store)"), "Offline")
    .otherwise("Hybrid").alias("New_Column")
)
df1.show()
#+--------+--------+----------+
#|customer|PersonId|New_Column|
#+--------+--------+----------+
#|afabd2d2| 2| Offline|
#|afabd2d2| 8| Online|
#|afabd2d2| 4| Hybrid|
#|afabd2d2| 3| Online|
#+--------+--------+----------+
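Per group, the three-way when chain reduces to a simple all/any rule. A plain-Python sketch of that per-group logic (the classify name is made up; flags stands in for one group's is_online_store values):

```python
# bool_and(col) is true only when every value in the group is true,
# so the chain above classifies each group like this:
def classify(flags):
    if all(flags):
        return 'Online'   # every purchase was online
    if not any(flags):
        return 'Offline'  # every purchase was offline
    return 'Hybrid'       # a mix of the two

print(classify([True, True]))    # Online
print(classify([False, False]))  # Offline
print(classify([True, False]))   # Hybrid
```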