PySpark: How to check if a column contains a number using isnan
isnan only returns true when the value is NaN, the floating-point "not a number" value (for example float('nan')). In any other case, including strings, it returns false. If you want to check whether a column contains a numerical value, one option is to define your own UDF, for example as shown below:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('33004', ''), ('Muxia', None), ('Fuensanta', None)], ("Postal code", "PostalCode"))

def is_digit(value):
    # None and '' are falsy, so they count as non-numeric
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())
# Keep the value in PostalCode when it is all digits, otherwise in Municipality
df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()
This gives the following output:
+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
|      33004|     33004|        null|
|      Muxia|      null|       Muxia|
|  Fuensanta|      null|   Fuensanta|
+-----------+----------+------------+
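Because the UDF only wraps an ordinary Python function, its behaviour can be checked without Spark. The sketch below repeats is_digit from the answer and runs it on a few sample values; everything except '33004' and 'Muxia' is made up for illustration. It suggests that str.isdigit accepts only unsigned runs of digits, so decimals and negative numbers would end up in the Municipality column.

```python
def is_digit(value):
    """Return True only for non-empty strings made up entirely of digits."""
    if value:
        return value.isdigit()
    # None and the empty string both land here
    return False

# Hypothetical sample values, not taken from a real dataset
samples = ["33004", "Muxia", "", None, "12.5", "-3"]
results = {repr(s): is_digit(s) for s in samples}

for key, flag in results.items():
    print(key, flag)
```

If decimals or signed numbers should count as numeric, the function body would need a different test than isdigit.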
How to find if a specific column of a pyspark dataframe contains numeric value
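You can use regexp_extract:
from pyspark.sql import Row
from pyspark.sql.functions import col, regexp_extract

# spark_session is assumed to be an existing SparkSession
df = spark_session.createDataFrame([
    Row(Part1="1 HKY TBT TPP 190326 115346 5 C"),
    Row(Part1="51 HKK ABB TYR B 190326 000526 13 C")
])
# Group 1: leading number, groups 2 and 3: the next two numbers
regex = r'^(\d+)\s[^\d]*(\d+)\s[^\d]*(\d+)'
df.withColumn("Part2", regexp_extract(col("Part1"), regex, 2))\
    .withColumn("Part3", regexp_extract(col("Part1"), regex, 3))\
    .show()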
Output:
+--------------------+------+------+
|               Part1| Part2| Part3|
+--------------------+------+------+
|1 HKY TBT TPP 190...|190326|115346|
|51 HKK ABB TYR B ...|190326|000526|
+--------------------+------+------+
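The pattern can be tried out locally before running it on a cluster. The sketch below uses Python's re module on the two sample strings; Spark evaluates the pattern with Java's regex engine, so this is only a stand-in, on the assumption that both engines agree for a simple pattern like this one. The helper extract_group is a hypothetical name that mimics how regexp_extract returns an empty string when nothing matches.

```python
import re

# Same pattern as in the answer above
regex = r'^(\d+)\s[^\d]*(\d+)\s[^\d]*(\d+)'

rows = [
    "1 HKY TBT TPP 190326 115346 5 C",
    "51 HKK ABB TYR B 190326 000526 13 C",
]

def extract_group(text, pattern, idx):
    """Return capture group idx, or '' when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(idx) if match else ""

# One (Part2, Part3) pair per input row
parts = [(extract_group(r, regex, 2), extract_group(r, regex, 3)) for r in rows]
print(parts)
```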
Get first numeric values from pyspark dataframe string column into new column
You can use regexp_extract:
from pyspark.sql import functions as F
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)",1)).show()
+------+-------------------+------------+
|    id|        productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE|            |
|419836|  BLUE KOSHER SAUCE|            |
|350022|         GUAVA (1G)|           1|
|123213|           GUAVA G5|           5|
|125513|           3GULA G5|           3|
|127143|          GUAVA G50|          50|
|124513|         LAAVA C2L5|           2|
+------+-------------------+------------+
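Note that rows without any digit come back as an empty string, and that only the first run of digits is returned (LAAVA C2L5 gives 2, not 5). A plain-Python sketch of the same extraction is shown below, using a few product names from the table. The helper first_number is a hypothetical name, and returning None instead of an empty string is a choice made here for illustration, not something the Spark call does.

```python
import re

def first_number(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"([0-9]+)", text)
    return int(match.group(1)) if match else None

names = ["EXTREME BERRY SAUCE", "GUAVA (1G)", "3GULA G5", "GUAVA G50", "LAAVA C2L5"]
extracted = {name: first_number(name) for name in names}
print(extracted)
```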
How can I check if a dataframe contains a column according to a list of column names in Pyspark?
Here's an example of how to do that. It casts the columns named in a list to double and leaves the others unchanged:
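from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
data = [{"a": "12.1", "b": "23.2", "c": "33.2"}]
columns = ["a", "c"]
df = spark.createDataFrame(data)
# Cast only the columns named in the list; pass the rest through as-is
df = df.select(
    [F.col(c).cast(DoubleType()) if c in columns else F.col(c) for c in df.columns]
)
df.printSchema()
df.show(truncate=False)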
Result:
root
 |-- a: double (nullable = true)
 |-- b: string (nullable = true)
 |-- c: double (nullable = true)

+----+----+----+
|a   |b   |c   |
+----+----+----+
|12.1|23.2|33.2|
+----+----+----+
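The membership test at the heart of that list comprehension (c in columns) also works in the other direction, to find out which requested names a dataframe actually has. The sketch below uses a plain list as a stand-in for df.columns, which PySpark exposes as a list of strings; the names in wanted are hypothetical.

```python
# Stand-in for df.columns, which PySpark exposes as a plain list of strings
df_columns = ["a", "b", "c"]

# Hypothetical list of names to look for; "d" is deliberately absent
wanted = ["a", "c", "d"]

present = [name for name in wanted if name in df_columns]
missing = [name for name in wanted if name not in df_columns]
all_present = not missing

print(present, missing, all_present)
```

Checking for missing names first can avoid an analysis error when a later select or cast refers to a column that does not exist.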
Test if spark dataframe column contains 5 digit number
Check the code below:
# Raw string so that \d reaches the regex engine unchanged
df.withColumn("contains_5digit",
              F.when(F.col('code').rlike(r"\d{5}"), 1).otherwise(0)).show()
+-----+---------------+
| code|contains_5digit|
+-----+---------------+
|95110|              1|
+-----+---------------+
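The pattern \d{5} matches any five consecutive digits, so a longer number such as 1234567 also counts as a hit. The sketch below compares three variants with Python's re module on hypothetical sample values. Spark's rlike uses Java regular expressions, so treat this as a local approximation that is assumed to carry over for ASCII digits.

```python
import re

# Hypothetical sample values
codes = ["95110", "9511", "1234567", "zip 95110 CA"]

# Any run of five digits anywhere in the string
contains_run = [bool(re.search(r"\d{5}", c)) for c in codes]

# The whole string is exactly five digits
exactly_five = [bool(re.fullmatch(r"\d{5}", c)) for c in codes]

# Five digits that are not part of a longer digit run
standalone = [bool(re.search(r"(?<!\d)\d{5}(?!\d)", c)) for c in codes]

print(contains_run, exactly_five, standalone)
```

With rlike, the "exactly five digits" variant would be written with anchors, as in ^\d{5}$, because rlike looks for a match anywhere in the value.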
Selecting only numeric/string columns names from a Spark DF in pyspark
dtypes is a list of tuples (columnName, type), so you can use a simple filter:
columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
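The same filter can be extended to numeric columns. The sketch below runs on a hand-written list that stands in for df.dtypes; the column names are hypothetical, and the set of numeric type strings is an assumption based on Spark's short type names, so extend it if your schema uses others.

```python
# Hypothetical df.dtypes value: a list of (column name, type string) tuples
dtypes = [
    ("id", "bigint"),
    ("name", "string"),
    ("price", "double"),
    ("amount", "decimal(10,2)"),
    ("created", "timestamp"),
]

# Assumed short names of Spark's numeric types; decimal carries precision and scale
numeric_exact = {"tinyint", "smallint", "int", "bigint", "float", "double"}

def is_numeric(dtype):
    """Return True for a numeric type string such as 'int' or 'decimal(10,2)'."""
    return dtype in numeric_exact or dtype.startswith("decimal")

string_cols = [name for name, dtype in dtypes if dtype.startswith("string")]
numeric_cols = [name for name, dtype in dtypes if is_numeric(dtype)]

print(string_cols, numeric_cols)
```

Exact matching is used for the integer and floating-point names because a prefix test on "int" would also match interval types.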