Comparison operator in PySpark (not equal/ !=)
To filter for null values, use isNull():
foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )
https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull
Filter values not equal in pyspark
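The problem is that the values are null. In Spark SQL, comparing null with anything (including with !=) yields null rather than true, and filter only keeps rows where the condition is true. To keep the rows where Buy is not 'Y', nulls included, add an explicit null check: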
df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()))
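To see why the extra isNull() check is needed, here is a minimal plain-Python model of SQL's three-valued logic. It is only a sketch and does not use Spark: None stands in for null, and the helper names sql_ne, sql_or and keep are made up for illustration.

```python
# Plain-Python model of SQL three-valued logic (None stands in for NULL).
# sql_ne, sql_or and keep are hypothetical helpers, not Spark APIs.

def sql_ne(a, b):
    """a != b under SQL semantics: NULL if either operand is NULL."""
    if a is None or b is None:
        return None
    return a != b

def sql_or(p, q):
    """p OR q under SQL semantics: True wins, otherwise NULL is contagious."""
    if p is True or q is True:
        return True
    if p is None or q is None:
        return None
    return False

def keep(p):
    """A filter keeps a row only when the predicate is exactly True."""
    return p is True

buy_values = ["Y", "N", None]

# Buy != 'Y' alone silently drops the null row.
plain = [v for v in buy_values if keep(sql_ne(v, "Y"))]

# (Buy != 'Y') OR Buy IS NULL keeps it.
with_null_check = [v for v in buy_values if keep(sql_or(sql_ne(v, "Y"), v is None))]

print(plain)            # ['N']
print(with_null_check)  # ['N', None]
```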
So if you want to filter where both Buy and Sell are not 'Y', as your attempt suggests, wrap each OR pair in its own parentheses, because & binds more tightly than | in Python:
df.filter(((df["Buy"] != "Y") | (df["Buy"].isNull())) & ((df["Sell"] != "Y") | (df["Sell"].isNull())))
Quick example:
Input
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| B| Y| Y|
| C| Y|null|
| D| Y|null|
| E|null|null|
+---+----+----+
Output
>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| C| Y|null|
| D| Y|null|
| E|null|null|
+---+----+----+
>>> df.filter(((df["Buy"] != "Y") | (df["Buy"].isNull())) & ((df["Sell"] != "Y") | (df["Sell"].isNull()))).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
| A|null|null|
| E|null|null|
+---+----+----+
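When combining several conditions like this, keep in mind that Python's & binds more tightly than |, so each OR pair needs its own parentheses. The sketch below uses plain Python booleans rather than Spark columns (the variable names are made up), but PySpark's Column overloads the same operators, so the grouping is the same. The values model a hypothetical row with Buy = 'N' and Sell = 'Y', which the sample data above does not contain.

```python
# Hypothetical row: Buy = 'N', Sell = 'Y'
buy_not_y, buy_is_null = True, False
sell_not_y, sell_is_null = False, False

# Parsed as: buy_not_y | (buy_is_null & sell_not_y) | sell_is_null
unparenthesized = buy_not_y | buy_is_null & sell_not_y | sell_is_null

# Intended meaning: (Buy condition) AND (Sell condition)
grouped = (buy_not_y | buy_is_null) & (sell_not_y | sell_is_null)

print(unparenthesized)  # True  (row would be kept by mistake)
print(grouped)          # False (row is correctly dropped, since Sell is 'Y')
```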
Smaller or equal comparison syntax error
It looks like you have mixed a udf with built-in Spark functions; you need to use only one of them. When possible, it is preferable not to use a udf, since udfs cannot be optimized (and are thus generally slower). Without a udf, it could be done as follows:
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
  .withColumn("expired", when(abs(datediff($"end", $"BCED")) lt 6, 0).otherwise(1))
I introduced a temporary column to make the code a bit more readable.
Using a udf, it could, for example, be done as follows:
val isExpired = udf((a: Date, b: Date) => {
  if ((math.abs(a.getTime() - b.getTime()) / (1000 * 3600 * 24)) < 6) {
    0
  } else {
    1
  }
})
df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
  .withColumn("expired", isExpired($"end", $"BCED"))
Here, I again used a temporary column, but this logic could be moved into the udf if preferred.
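For reference, the per-row logic that both versions compute can be sketched in plain Python. This is only an illustration of the rule, not Spark code; the function name is_expired mirrors the udf above, and the max_days parameter is an assumption added for readability.

```python
from datetime import date

def is_expired(term_end, agr_end, bced, max_days=6):
    """Return 0 when the effective end date is within max_days of bced, else 1."""
    # Same fallback as the temporary "end" column: use agr_end when term_end is missing.
    end = term_end if term_end is not None else agr_end
    return 0 if abs((end - bced).days) < max_days else 1

# term_end missing, so agr_end is used: 3 days apart
print(is_expired(None, date(2020, 1, 10), date(2020, 1, 7)))              # 0
# term_end present: 54 days apart
print(is_expired(date(2020, 3, 1), date(2020, 1, 10), date(2020, 1, 7)))  # 1
```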
Is there a not equal operator in Python?
Use !=. See comparison operators. For comparing object identities, you can use the keyword is and its negation is not.
e.g.
1 == 1                  # -> True
1 != 1                  # -> False
[] is []                # -> False (distinct objects)
a = b = []; a is b      # -> True (same object)
Create rows for 0 values when aggregating all combinations of several columns
I agree that crossJoin is the correct approach here. But afterwards it may be more versatile to use a join instead of a union and groupBy, especially if there are more aggregations than a single count.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('foo', 1),
('foo', 2),
('bar', 2),
('bar', 2)],
['x', 'y'])
df_cartesian = df.select('x').distinct().crossJoin(df.select("y").distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()
# +----+----+-----+
# | x| y|count|
# +----+----+-----+
# |null|null| 4|
# |null| 1| 1|
# |null| 2| 3|
# | bar|null| 2|
# | bar| 1| 0|
# | bar| 2| 2|
# | foo|null| 2|
# | foo| 1| 1|
# | foo| 2| 1|
# +----+----+-----+
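The core idea (build every combination first, then attach the counts and fill the gaps with 0) can also be sketched without Spark. The plain-Python analogue below uses the same four rows; it covers only the (x, y) combinations, not the subtotal rows that cube adds.

```python
from collections import Counter
from itertools import product

rows = [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)]

counts = Counter(rows)               # observed (x, y) combinations only
xs = sorted({x for x, _ in rows})    # distinct x values
ys = sorted({y for _, y in rows})    # distinct y values

# Cartesian product of the distinct values, defaulting missing counts to 0
full = {(x, y): counts.get((x, y), 0) for x, y in product(xs, ys)}

print(full)
# {('bar', 1): 0, ('bar', 2): 2, ('foo', 1): 1, ('foo', 2): 1}
```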
Split a string in words and check if a word matches a list item and return that word as value for a new column
You can use regexp_extract to get the relevant string:
import pyspark.sql.functions as F
pattern = '|'.join([rf'{word}' for word in word_list if len(word) >= 6 and len(word) <= 11])
df2 = df.withColumn(
'match',
F.regexp_extract(
'text',
rf"\b({pattern})\b",
1
)
).withColumn(
'match',
F.when(F.col('match') != '', F.col('match')) # replace no match with null
)
df2.show(truncate=False)
+----------------------------------+------------+
|text |match |
+----------------------------------+------------+
|This is line one |Null |
|This is line two |Null |
|bla coroner foo bar |coroner |
|This is line three |Null |
|foo bar shakespeare |shakespeare |
|Null |Null |
+----------------------------------+------------+
The pattern is something like \b(word1|word2|word3)\b, where \b matches a word boundary (the position between a word character and a non-word character, or the start/end of the string) and | means "or".
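The pattern can be tried out locally with Python's re module before running it in Spark. This is a sketch under a few assumptions: the contents of word_list are made up, first_match is a hypothetical helper, and re.escape is added here so that words containing regex metacharacters are matched literally. Spark uses Java regular expressions, but \b and | behave the same way for a simple pattern like this.

```python
import re

word_list = ["coroner", "shakespeare", "foo", "bar"]  # hypothetical word list

# Keep only words of 6 to 11 characters and join them into an alternation
pattern = "|".join(re.escape(w) for w in word_list if 6 <= len(w) <= 11)
regex = re.compile(rf"\b({pattern})\b")

def first_match(text):
    """Return the first listed word found as a whole word, or None."""
    m = regex.search(text)
    return m.group(1) if m else None

print(pattern)                             # coroner|shakespeare
print(first_match("bla coroner foo bar"))  # coroner
print(first_match("This is line one"))     # None
print(first_match("coroners at work"))     # None, \b rejects partial-word matches
```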