Filter DataFrame When Values Match Part of a String in PySpark

Filter df when values match part of a string in PySpark

Spark 2.2 onwards

df.filter(df.location.contains('google.com'))

See the Spark 2.2 documentation.


Spark 2.1 and before

You can use plain SQL in filter

df.filter("location like '%google.com%'")

or with DataFrame column methods

df.filter(df.location.like('%google.com%'))

See the Spark 2.1 documentation.
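
For context, a minimal end-to-end sketch (the location column name comes from the answer above; the sample values and the SparkSession setup are hypothetical additions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with a 'location' column
df = spark.createDataFrame(
    [('https://google.com/search',), ('https://example.org',)],
    ['location'])

# Spark 2.2+: Column.contains
df.filter(df.location.contains('google.com')).show(truncate=False)

# Any version: SQL LIKE, either as a string expression or via Column.like
df.filter("location like '%google.com%'").show(truncate=False)
df.filter(df.location.like('%google.com%')).show(truncate=False)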

Pyspark: Filter data frame if column contains string from another column (SQL LIKE statement)

You can use the contains function.

from pyspark.sql.functions import col

df1 = spark.createDataFrame(
    [("hahaha the 3 is good", 3), ("i dont know about 3", 2), ("what is 5 doing?", 5),
     ("ajajaj 123", 2), ("7 dwarfs", 1)],
    ["long_text", "number"])

df1.filter(col("long_text").contains(col("number"))).show()

The long_text column should contain the number in the number column.

Output:

+--------------------+------+
|           long_text|number|
+--------------------+------+
|hahaha the 3 is good|     3|
|    what is 5 doing?|     5|
|          ajajaj 123|     2|
+--------------------+------+
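
Note that contains implicitly compares long_text against the string form of number here. If you prefer a SQL expression, a hedged alternative (not from the original answer) using instr performs the same check:

from pyspark.sql.functions import expr

# instr returns the 1-based position of the substring, or 0 if absent;
# number is cast to string so it can be searched inside long_text
df1.filter(expr("instr(long_text, cast(number as string)) > 0")).show()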

Filter PySpark dataframe if it contains any of a list of strings

IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.

One way is to use functools.reduce:

from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Essentially, you loop over all of the possible strings in list_a, build a like condition for each, and "OR" the results together. Here is the execution plan:

df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]

Another option is to use pyspark.sql.Column.rlike instead of like.

df2 = df.where(
    df['column_a'].rlike("|".join(["(" + pat + ")" for pat in list_a]))
)

df2.show()
#+------------+-----+
#|    column_a|count|
#+------------+-----+
#| some_string|   10|
#|third_string|   30|
#+------------+-----+

Which has the corresponding execution plan:

df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
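
If any pattern in list_a could contain regex meta-characters, rlike would interpret them rather than match them literally. A small sketch that escapes the patterns first with Python's re.escape (the escaping step and the names safe_pattern/df3 are additions, not part of the original answer):

import re

# Escape each pattern so characters such as '.', '+' or '(' match literally,
# then OR the escaped patterns together for rlike
safe_pattern = "|".join(re.escape(pat) for pat in list_a)
df3 = df.where(df['column_a'].rlike(safe_pattern))
df3.show()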

Filter if string contains sub-string in PySpark

There are various ways to join the two dataframes:

(1) Find the position of the string column_dataset_1_normalized inside column_dataset_2_normalized using a SQL function such as locate, instr or position, each of which returns a 1-based position when a match exists:

from pyspark.sql.functions import expr

cond1 = expr('locate(column_dataset_1_normalized, column_dataset_2_normalized) > 0')
cond2 = expr('instr(column_dataset_2_normalized, column_dataset_1_normalized) > 0')
cond3 = expr('position(column_dataset_1_normalized IN column_dataset_2_normalized) > 0')

(2) Use the regex operator rlike to search for column_dataset_1_normalized inside column_dataset_2_normalized; this is only valid when column_dataset_1_normalized contains no regex meta-characters:

cond4 = expr('column_dataset_2_normalized rlike column_dataset_1_normalized')
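
The question's df1 and df2 are not reproduced here; a minimal, hypothetical setup consistent with the columns used above (the *_normalized columns keep only the alphanumeric characters of the originals) would be:

df1 = spark.createDataFrame(
    [('W-B.7120RP1605794', 'WB7120RP1605794'),
     ('125858G_022BR/P070751', '125858G022BRP070751')],
    ['column_dataset_1', 'column_dataset_1_normalized'])

df2 = spark.createDataFrame(
    [('Por ejemplo, si W-B.7120RP-1605794se trata de un archivo de texto,',
      'PorejemplosiWB7120RP1605794setratadeunarchivodetexto'),
     ('utilizados 125858G_022BR/P-070751 frecuentemente (por ejemplo, un texto que describe',
      'utilizados125858G022BRP070751frecuentementeporejemplountextoquedescribe')],
    ['column_dataset_2', 'column_dataset_2_normalized'])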

Join the two dataframes using one of the above conditions, for example:

df1.join(df2, cond1).select('column_dataset_1').show(truncate=False)
+---------------------+
|column_dataset_1     |
+---------------------+
|W-B.7120RP1605794    |
|125858G_022BR/P070751|
+---------------------+

Edit: Per the comments, the matched sub-string might not be identical to df1.column_dataset_1, so we need to reverse-engineer the sub-string from the normalized string. Based on how the normalization is conducted, the following udf might help (note that it will not cover any leading/trailing non-alphanumeric characters that might belong to the match). Basically, we iterate through the original string character by character, find the start/end indices of the normalized string within it, and then take the corresponding sub-string:

from pyspark.sql.functions import udf

@udf('string')
def find_matched(orig, normalized):
    # n collects the alphanumeric characters of `orig`,
    # d records each one's position in the original string
    n, d = [], []
    for i in range(len(orig)):
        if orig[i].isalnum():
            n.append(orig[i])
            d.append(i)
    # locate the normalized string inside the alnum-only version of `orig`
    idx = ''.join(n).find(normalized)
    if idx < 0:
        return None
    # map the match back to the original string, from the first matched
    # alphanumeric character to the last one (inclusive)
    return orig[d[idx]:d[idx + len(normalized) - 1] + 1]

df1.join(df2, cond3) \
    .withColumn('matched', find_matched('column_dataset_2', 'column_dataset_1_normalized')) \
    .select('column_dataset_2', 'matched', 'column_dataset_1_normalized') \
    .show(truncate=False)

+------------------------------------------------------------------------------------+----------------------+---------------------------+
|column_dataset_2                                                                    |matched               |column_dataset_1_normalized|
+------------------------------------------------------------------------------------+----------------------+---------------------------+
|Por ejemplo, si W-B.7120RP-1605794se trata de un archivo de texto,                  |W-B.7120RP-1605794    |WB7120RP1605794            |
|utilizados 125858G_022BR/P-070751 frecuentemente (por ejemplo, un texto que describe|125858G_022BR/P-070751|125858G022BRP070751        |
+------------------------------------------------------------------------------------+----------------------+---------------------------+

Filter PySpark DataFrame by string match

The most efficient approach here is to loop; you can use set intersection (note that this answer operates on a pandas DataFrame):

df['match'] = [set(c.split()).intersection(k.split(',')) > set()
               for c, k in zip(df['comments'], df['keywords'])]

Output:

   name               comments                keywords  match
0  paul      account is active  active,activated,activ   True
1  john   account is activated  active,activated,activ   True
2   max  account is activateds  active,activated,activ  False

Used input:

import pandas as pd

df = pd.DataFrame({'name': ['paul', 'john', 'max'],
                   'comments': ['account is active', 'account is activated', 'account is activateds'],
                   'keywords': ['active,activated,activ', 'active,activated,activ', 'active,activated,activ']})

With a minor variation you could check for substring match ("activ" would match "activateds"):

df['substring'] = [any(w in c for w in k.split(','))
                   for c, k in zip(df['comments'], df['keywords'])]

Output:

   name               comments                keywords  substring
0  paul      account is active  active,activated,activ       True
1  john   account is activated  active,activated,activ       True
2   max  account is activateds  active,activated,activ       True
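
Since the question asks about PySpark but the answer above uses pandas, here is a rough PySpark (2.4+) equivalent of the exact-word check, a sketch assuming the same column names; arrays_overlap returns true when the two split lists share at least one element:

from pyspark.sql import functions as F

sdf = spark.createDataFrame(
    [('paul', 'account is active', 'active,activated,activ'),
     ('john', 'account is activated', 'active,activated,activ'),
     ('max', 'account is activateds', 'active,activated,activ')],
    ['name', 'comments', 'keywords'])

# Word-level match: split comments on spaces and keywords on commas,
# then check whether the two arrays share any element
sdf.withColumn(
    'match',
    F.arrays_overlap(F.split('comments', ' '), F.split('keywords', ','))
).show(truncate=False)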

PySpark: filter dataframe if column does not contain string

Use ~ as the NOT operator (Python's bitwise NOT, which PySpark overloads to negate a column condition):

df2.where(~F.col('Key').contains('sd')).show()
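
For context, a minimal sketch with hypothetical data (only the Key column name and the 'sd' substring come from the answer above):

from pyspark.sql import functions as F

df2 = spark.createDataFrame([('sdcard',), ('disk',), ('usb-sd',)], ['Key'])

# Keep only rows whose Key does NOT contain the substring 'sd';
# here only the 'disk' row would remain
df2.where(~F.col('Key').contains('sd')).show()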

Filter PySpark DataFrame by checking if string appears in column

You can use the pyspark.sql.functions.array_contains method:

df.filter(array_contains(df['authors'], 'Some Author'))

from pyspark.sql.types import *
from pyspark.sql.functions import array_contains

lst = [(["author 1", "author 2"],), (["author 2"],), (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)
df.show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 2]|
|          [author 1]|
+--------------------+

df.printSchema()
root
 |-- authors: array (nullable = true)
 |    |-- element: string (containsNull = true)

df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
|             authors|
+--------------------+
|[author 1, author 2]|
|          [author 1]|
+--------------------+

