Filter df when values match part of a string in PySpark
Spark 2.2 onwards
df.filter(df.location.contains('google.com'))
Spark 2.2 documentation link
Spark 2.1 and before
You can use plain SQL in filter:
df.filter("location like '%google.com%'")
or with DataFrame column methods:
df.filter(df.location.like('%google.com%'))
Spark 2.1 documentation link
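A minimal end-to-end sketch tying the three variants together (the SparkSession setup and sample location values are assumptions for illustration, not from the original answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data.
df = spark.createDataFrame(
    [('https://google.com/maps',), ('https://example.org',)],
    ['location'])

df.filter(df.location.contains('google.com')).show()  # Spark 2.2+
df.filter("location like '%google.com%'").show()      # plain SQL
df.filter(df.location.like('%google.com%')).show()    # Column.like
All three filters keep only the google.com row.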
Pyspark: Filter data frame if column contains string from another column (SQL LIKE statement)
You can use the contains function.
from pyspark.sql.functions import col

df1 = spark.createDataFrame(
    [("hahaha the 3 is good", 3), ("i dont know about 3", 2),
     ("what is 5 doing?", 5), ("ajajaj 123", 2), ("7 dwarfs", 1)],
    ["long_text", "number"])
df1.filter(col("long_text").contains(col("number"))).show()
The long_text column should contain the number in the number column.
Output:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+
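Note that contains implicitly casts the integer number column to string before matching; if you prefer to make that explicit, an equivalent form of the same filter is:
df1.filter(col("long_text").contains(col("number").cast("string"))).show()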
Filter pyspark dataframe if contains a list of strings
IIUC, you want to return the rows in which column_a is "like" (in the SQL sense) any of the values in list_a.
One way is to use functools.reduce:
from functools import reduce

list_a = ['string', 'third']

df1 = df.where(
    reduce(lambda a, b: a | b, (df['column_a'].like('%' + pat + '%') for pat in list_a))
)
df1.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Essentially you loop over all of the possible strings in list_a to compare against in like, and "OR" the results. Here is the execution plan:
df1.explain()
#== Physical Plan ==
#*(1) Filter (Contains(column_a#0, string) || Contains(column_a#0, third))
#+- Scan ExistingRDD[column_a#0,count#1]
Another option is to use pyspark.sql.Column.rlike instead of like:
df2 = df.where(
    df['column_a'].rlike("|".join(["(" + pat + ")" for pat in list_a]))
)
df2.show()
#+------------+-----+
#| column_a|count|
#+------------+-----+
#| some_string| 10|
#|third_string| 30|
#+------------+-----+
Which has the corresponding execution plan:
df2.explain()
#== Physical Plan ==
#*(1) Filter (isnotnull(column_a#0) && column_a#0 RLIKE (string)|(third))
#+- Scan ExistingRDD[column_a#0,count#1]
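One caveat with the rlike approach: the values in list_a are interpreted as regular expressions, so if they may contain regex meta-characters it is safer to escape them first. A small sketch using Python's re.escape (an addition, not part of the original answer):
import re

# Escape each pattern so it is matched literally rather than as a regex.
df3 = df.where(
    df['column_a'].rlike("|".join(re.escape(pat) for pat in list_a))
)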
Filter if string contains sub-string in PySpark
There are various ways to join the two dataframes:
(1) Find the position of the string column_dataset_1_normalized inside column_dataset_2_normalized using a SQL function such as locate, instr, or position, each of which returns a 1-based position when a match exists:
from pyspark.sql.functions import expr
cond1 = expr('locate(column_dataset_1_normalized,column_dataset_2_normalized)>0')
cond2 = expr('instr(column_dataset_2_normalized,column_dataset_1_normalized)>0')
cond3 = expr('position(column_dataset_1_normalized IN column_dataset_2_normalized)>0')
(2) Use the regex operator rlike to find column_dataset_1_normalized inside column_dataset_2_normalized; this is only valid when column_dataset_1_normalized contains no regex meta-characters (see the escaping sketch just below):
cond4 = expr('column_dataset_2_normalized rlike column_dataset_1_normalized')
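If column_dataset_1_normalized may contain regex meta-characters, one possible workaround is to escape the column first and match against the escaped copy; a sketch, where the regex_escape helper is hypothetical:
import re
from pyspark.sql.functions import udf, expr

# Hypothetical helper: escape regex meta-characters for a literal rlike match.
regex_escape = udf(lambda s: re.escape(s) if s is not None else None, 'string')

df1_escaped = df1.withColumn('column_dataset_1_escaped',
                             regex_escape('column_dataset_1_normalized'))
cond4_safe = expr('column_dataset_2_normalized rlike column_dataset_1_escaped')
# then join as before: df1_escaped.join(df2, cond4_safe)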
Run the join with any one of the above conditions, for example:
df1.join(df2, cond1).select('column_dataset_1').show(truncate=False)
+---------------------+
|column_dataset_1 |
+---------------------+
|W-B.7120RP1605794 |
|125858G_022BR/P070751|
+---------------------+
Edit: Per comments, the matched sub-string might not be the same as df1.column_dataset_1, so we need to reverse-engineer the sub-string from the normalized string. Based on how the normalization is done, the following udf might help (note that it will not recover any leading/trailing non-alphanumeric characters that might surround the match). Basically, we iterate through the original string character by character, find the start/end indexes of the normalized string within it, and then take the corresponding sub-string:
from pyspark.sql.functions import udf

@udf('string')
def find_matched(orig, normalized):
    # Collect the alphanumeric characters of `orig` and their original indexes.
    n, d = [], []
    for i in range(len(orig)):
        if orig[i].isalnum():
            n.append(orig[i])
            d.append(i)
    # Locate the normalized string inside the stripped-down original.
    idx = ''.join(n).find(normalized)
    if idx < 0:
        return None
    # Map the match back to original indexes; end the slice on the last
    # matched character to avoid an IndexError when the match ends the string.
    return orig[d[idx]:d[idx + len(normalized) - 1] + 1]
df1.join(df2, cond3) \
   .withColumn('matched', find_matched('column_dataset_2', 'column_dataset_1_normalized')) \
   .select('column_dataset_2', 'matched', 'column_dataset_1_normalized') \
   .show(truncate=False)
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
|column_dataset_2 |matched |column_dataset_1_normalized|
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
|Por ejemplo, si W-B.7120RP-1605794se trata de un archivo de texto, |W-B.7120RP-1605794 |WB7120RP1605794 |
|utilizados 125858G_022BR/P-070751 frecuentemente (por ejemplo, un texto que describe|125858G_022BR/P-070751 |125858G022BRP070751 |
+------------------------------------------------------------------------------------+-----------------------+---------------------------+
Filter pyspark DataFrame by string match
The most efficient approach here is to loop; you can use set intersection (this answer works on a pandas DataFrame):
df['match'] = [set(c.split()).intersection(k.split(',')) > set()
               for c, k in zip(df['comments'], df['keywords'])]
Output:
name comments keywords match
0 paul account is active active,activated,activ True
1 john account is activated active,activated,activ True
2 max account is activateds active,activated,activ False
Used input:
import pandas as pd

df = pd.DataFrame({'name': ['paul', 'john', 'max'],
                   'comments': ['account is active', 'account is activated', 'account is activateds'],
                   'keywords': ['active,activated,activ', 'active,activated,activ', 'active,activated,activ']})
With a minor variation you could check for substring match ("activ" would match "activateds"):
df['substring'] = [any(w in c for w in k.split(','))
                   for c, k in zip(df['comments'], df['keywords'])]
Output:
name comments keywords substring
0 paul account is active active,activated,activ True
1 john account is activated active,activated,activ True
2 max account is activateds active,activated,activ True
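Since the question asks about PySpark while the answer above uses pandas, here is a rough PySpark equivalent of the whole-word check (a sketch, assuming Spark 2.4+ for arrays_overlap and an active SparkSession named spark):
from pyspark.sql import functions as F

sdf = spark.createDataFrame(df)  # reuse the pandas frame above
sdf = sdf.withColumn(
    'match',
    F.arrays_overlap(F.split('comments', ' '), F.split('keywords', ',')))
sdf.show(truncate=False)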
Pyspark filter dataframe if column does not contain string
Use ~ as the bitwise NOT operator:
df2.where(~F.col('Key').contains('sd')).show()
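A self-contained sketch (the sample Key values are assumptions):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
# Hypothetical sample data; rows whose Key contains 'sd' are dropped.
df2 = spark.createDataFrame([('sdA',), ('xyz',), ('bsd',)], ['Key'])
df2.where(~F.col('Key').contains('sd')).show()
# +---+
# |Key|
# +---+
# |xyz|
# +---+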
Filter PySpark DataFrame by checking if string appears in column
You can use the pyspark.sql.functions.array_contains function:
df.filter(array_contains(df['authors'], 'Some Author'))
from pyspark.sql.types import *
from pyspark.sql.functions import array_contains
lst = [(["author 1", "author 2"],), (["author 2"],) , (["author 1"],)]
schema = StructType([StructField("authors", ArrayType(StringType()), True)])
df = spark.createDataFrame(lst, schema)
df.show()
+--------------------+
| authors|
+--------------------+
|[author 1, author 2]|
| [author 2]|
| [author 1]|
+--------------------+
df.printSchema()
root
|-- authors: array (nullable = true)
| |-- element: string (containsNull = true)
df.filter(array_contains(df.authors, "author 1")).show()
+--------------------+
| authors|
+--------------------+
|[author 1, author 2]|
| [author 1]|
+--------------------+
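To match any of several authors, you can OR together multiple array_contains conditions, e.g. with functools.reduce as in the earlier answer (a sketch):
from functools import reduce
from pyspark.sql.functions import array_contains

# Hypothetical list of authors to match against.
wanted = ["author 1", "author 2"]
cond = reduce(lambda a, b: a | b, [array_contains(df.authors, w) for w in wanted])
df.filter(cond).show()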