Comparison Operator in PySpark (Not Equal / !=)

To filter null values try:

foo_df = df.filter( (df.foo==1) & (df.bar.isNull()) )

https://spark.apache.org/docs/1.6.2/api/python/pyspark.sql.html#pyspark.sql.Column.isNull

Filter values not equal in pyspark

I think the problem is that the values are null. Null values are special in Spark SQL: a comparison such as null != 'Y' evaluates to null rather than true, and filter drops rows whose condition is null.

Try this to filter for rows where Buy is not 'Y':

df.filter((df["Buy"] != "Y") | (df["Buy"].isNull()))

So if you want to filter rows where neither Buy nor Sell is 'Y', as your attempt suggests, you need to group the conditions explicitly, because & binds more tightly than |:

df.filter(((df["Buy"] != "Y") | (df["Buy"].isNull())) & ((df["Sell"] != "Y") | (df["Sell"].isNull())))

Quick example:

Input

+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  B|   Y|   Y|
|  C|   Y|null|
|  D|   Y|null|
|  E|null|null|
+---+----+----+

Output

>>> df.filter((df["Buy"] != "Y") | (df["Buy"].isNull())).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  C|   Y|null|
|  D|   Y|null|
|  E|null|null|
+---+----+----+

>>> df.filter(((df["Buy"] != "Y") | (df["Buy"].isNull())) & ((df["Sell"] != "Y") | (df["Sell"].isNull()))).show(10)
+---+----+----+
| id|Sell| Buy|
+---+----+----+
|  A|null|null|
|  E|null|null|
+---+----+----+

Smaller or equal comparison syntax error

It looks like you have mixed a udf with built-in Spark functions; you need to use one or the other. When possible it's always preferable not to use a udf, since udfs cannot be optimized by Spark (and are thus generally slower). Without a udf it could be done as follows:

df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
  .withColumn("expired", when(abs(datediff($"end", $"BCED")) < 6, 0).otherwise(1))

I introduced a temporary column to make the code a bit more readable.


Using a udf, it could, for example, be done as follows:

import java.sql.Date

val isExpired = udf((a: Date, b: Date) => {
  if ((math.abs(a.getTime() - b.getTime()) / (1000 * 3600 * 24)) < 6) {
    0
  } else {
    1
  }
})

df.withColumn("end", when($"termEnd".isNull, $"agrEnd").otherwise($"termEnd"))
  .withColumn("expired", isExpired($"end", $"BCED"))

Here, I again made use of a temporary column, but this logic could be moved into the udf if preferred.

Is there a not equal operator in Python?

Use !=. See comparison operators. For comparing object identities, you can use the keyword is and its negation is not.

e.g.

1 == 1              # -> True
1 != 1              # -> False
[] is []            # -> False (two distinct list objects)
a = b = []; a is b  # -> True (same object)
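A practical note to add on is not: None checks should use identity rather than !=, because == (and hence !=) can be overridden by a class. A contrived sketch (AlwaysEqual is a made-up class for illustration):

```python
class AlwaysEqual:
    # a contrived class whose == always answers True
    def __eq__(self, other):
        return True

a = AlwaysEqual()
print(a == None)   # -> True, misleading result from the overridden __eq__
print(a != None)   # -> False, since != is derived from __eq__ by default
print(a is None)   # -> False, identity comparison cannot be overridden
```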

Create rows for 0 values when aggregating all combinations of several columns

I agree that crossJoin here is the correct approach. But I think it may be a bit more versatile to follow it with a join instead of a union and groupBy, especially if there is more than one aggregation.

from pyspark.sql import functions as F

df = spark.createDataFrame(
    [('foo', 1),
     ('foo', 2),
     ('bar', 2),
     ('bar', 2)],
    ['x', 'y'])

df_cartesian = df.select('x').distinct().crossJoin(df.select('y').distinct())
df_cubed = df.cube('x', 'y').count()
df_cubed.join(df_cartesian, ['x', 'y'], 'full').fillna(0, ['count']).show()

# +----+----+-----+
# |   x|   y|count|
# +----+----+-----+
# |null|null|    4|
# |null|   1|    1|
# |null|   2|    3|
# | bar|null|    2|
# | bar|   1|    0|
# | bar|   2|    2|
# | foo|null|    2|
# | foo|   1|    1|
# | foo|   2|    1|
# +----+----+-----+

Split a string in words and check if a word matches a list item and return that word as value for a new column

You can use regexp_extract to get the relevant string:

import pyspark.sql.functions as F

pattern = '|'.join(word for word in word_list if 6 <= len(word) <= 11)

df2 = df.withColumn(
    'match',
    F.regexp_extract('text', rf"\b({pattern})\b", 1)
).withColumn(
    'match',
    F.when(F.col('match') != '', F.col('match'))  # replace no match with null
)

df2.show(truncate=False)
+----------------------------------+-----------+
|text                              |match      |
+----------------------------------+-----------+
|This is line one                  |null       |
|This is line two                  |null       |
|bla coroner foo bar               |coroner    |
|This is line three                |null       |
|foo bar shakespeare               |shakespeare|
|null                              |null       |
+----------------------------------+-----------+

The pattern is something like \b(word1|word2|word3)\b, where \b matches a word boundary (the position between a word character and a non-word character, including the start and end of the string), and | means "or".
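One caveat worth adding: if word_list can contain regex metacharacters, escape each word with re.escape before joining. A sketch using Python's re module, which behaves the same as the Java regex engine for this pattern (the word list below is a made-up example):

```python
import re

word_list = ['coroner', 'shakespeare', 'foo.bar']  # hypothetical input
pattern = '|'.join(
    re.escape(word) for word in word_list if 6 <= len(word) <= 11)

# The resulting pattern string can be passed to F.regexp_extract as-is;
# the literal dot in 'foo.bar' is now escaped and matches only a dot.
m = re.search(rf"\b({pattern})\b", 'bla coroner foo bar')
print(m.group(1))  # -> coroner
```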


