Filtering a Pyspark DataFrame with SQL-like IN clause
String you pass to SQLContext
it evaluated in the scope of the SQL environment. It doesn't capture the closure. If you want to pass a variable you'll have to do it explicitly using string formatting:
df = sc.parallelize([(1, "foo"), (2, "x"), (3, "bar")]).toDF(("k", "v"))
df.registerTempTable("df")
sqlContext.sql("SELECT * FROM df WHERE v IN {0}".format(("foo", "bar"))).count()
## 2
Obviously this is not something you would use in a "real" SQL environment due to security considerations but it shouldn't matter here.In practice DataFrame
DSL is a much better choice when you want to create dynamic queries:
from pyspark.sql.functions import col
df.where(col("v").isin({"foo", "bar"})).count()
## 2
It is easy to build and compose and handles all details of HiveQL / Spark SQL for you. SQL like NOT IN clause for PySpark data frames
I just had the same issue and found solution. If you want to negate any condition (in pySpark represented as Column
class) there is negation operator ~
, for example:
df.where(~df.flag.isin(1, 2, 3)) # records with flag NOT IN (1, 2, 3)
Pyspark: Filter data frame if column contains string from another column (SQL LIKE statement)
You can use the contains
function.
from pyspark.sql.functions import *
df1 = spark.createDataFrame([("hahaha the 3 is good",3),("i dont know about 3",2),("what is 5 doing?",5),\
("ajajaj 123",2),("7 dwarfs",1)], ["long_text","number"])
df1.filter(col("long_text").contains(col("number"))).show()
The long_text column should contain the number in the number column. Output:
+--------------------+------+
| long_text|number|
+--------------------+------+
|hahaha the 3 is good| 3|
| what is 5 doing?| 5|
| ajajaj 123| 2|
+--------------------+------+
Pyspark: Filter dataframe based on multiple conditions
Your logic condition is wrong. IIUC, what you want is:import pyspark.sql.functions as f
df.filter((f.col('d')<5))\
.filter(
((f.col('col1') != f.col('col3')) |
(f.col('col2') != f.col('col4')) & (f.col('col1') == f.col('col3')))
)\
.show()
I broke the filter()
step into 2 calls for readability, but you could equivalently do it in one line.Output:
+----+----+----+----+---+
|col1|col2|col3|col4| d|
+----+----+----+----+---+
| A| xx| D| vv| 4|
| A| x| A| xx| 3|
| E| xxx| B| vv| 3|
| F|xxxx| F| vvv| 4|
| G| xxx| G| xx| 4|
+----+----+----+----+---+
Related Topics
How to Create Module-Wide Variables in Python
How to Get Tweets Older Than a Week (Using Tweepy or Other Python Libraries)
Continuing in Python's Unittest When an Assertion Fails
Python Pandas Dataframe, Is It Pass-By-Value or Pass-By-Reference
Reading Tar File Contents Without Untarring It, in Python Script
How to Combine Multiple Rows into a Single Row with Pandas
Suppressing Scientific Notation in Pandas
Plotting Multiple Lines, in Different Colors, with Pandas Dataframe
Typeerror: Expected a Character Buffer Object - While Trying to Save Integer to Textfile
Destructuring-Bind Dictionary Contents
Wrapping Long Y Labels in Matplotlib Tight Layout Using Setp
Importerror: Cannot Import Name Numpy_Mkl
Using Only the Db Part of Django
Django: Adding "Nulls Last" to Query