How to Check If a String Column in a PySpark DataFrame Is All Numeric

PySpark: How to check if a column contains a number using isnan

isnan only returns true if the column contains a mathematically invalid number (NaN), for example the result of 0/0 in floating-point arithmetic. In any other case, including strings, it returns false. If you want to check whether a column contains a numerical value, you need to define your own udf, for example as shown below:

from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

df = spark.createDataFrame(
    [('33004', ''), ('Muxia', None), ('Fuensanta', None)],
    ("Postal code", "PostalCode"))

def is_digit(value):
    # None and empty strings count as non-numeric
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

# All-digit values go to PostalCode, everything else to Municipality
df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()

This gives the following output:

+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
| 33004| 33004| null|
| Muxia| null| Muxia|
| Fuensanta| null| Fuensanta|
+-----------+----------+------------+
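
If you'd rather avoid a Python udf (row-at-a-time udfs are comparatively slow), here is a minimal sketch of the same check using the built-in rlike, reusing the dataframe from above:

from pyspark.sql.functions import when, col

# '^[0-9]+$' matches only when the whole string consists of ASCII digits;
# note that rlike yields null (not False) for null inputs, which when()
# treats the same as False here
df = df.withColumn('PostalCode', when(col('Postal code').rlike('^[0-9]+$'), col('Postal code')))
df = df.withColumn('Municipality', when(~col('Postal code').rlike('^[0-9]+$'), col('Postal code')))
df.show()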

How to find if a specific column of a PySpark dataframe contains a numeric value

You can use regexp_extract:

from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col

df = spark_session.createDataFrame([
    Row(Part1="1 HKY TBT TPP 190326 115346 5 C"),
    Row(Part1="51 HKK ABB TYR B 190326 000526 13 C")
])

# Groups: 1 = the leading number, 2 and 3 = the next two numeric runs
regex = r'^(\d+)\s[^\d]*(\d+)\s[^\d]*(\d+)'
df.withColumn("Part2", regexp_extract(col("Part1"), regex, 2))\
  .withColumn("Part3", regexp_extract(col("Part1"), regex, 3))\
  .show()

Output:

+--------------------+------+------+
| Part1| Part2| Part3|
+--------------------+------+------+
|1 HKY TBT TPP 190...|190326|115346|
|51 HKK ABB TYR B ...|190326|000526|
+--------------------+------+------+
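
If the goal is just to verify that a column is all numeric rather than to extract the parts, here is a short sketch against the same df (Part1 here also contains letters, so the check fails):

from pyspark.sql import functions as F

# Count rows whose value is not purely digits; zero means the column is all numeric
non_numeric = df.filter(~F.col("Part1").rlike(r"^\d+$")).count()
print("all numeric" if non_numeric == 0 else "contains non-numeric values")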

Get the first numeric value from a PySpark dataframe string column into a new column

You can use regexp_extract:

from pyspark.sql import functions as F

# Group 1 captures the first run of digits; rows without digits yield ''
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)", 1)).show()

+------+-------------------+------------+
| id| productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE| |
|419836| BLUE KOSHER SAUCE| |
|350022| GUAVA (1G)| 1|
|123213| GUAVA G5| 5|
|125513| 3GULA G5| 3|
|127143| GUAVA G50| 50|
|124513| LAAVA C2L5| 2|
+------+-------------------+------------+
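
If you need the extracted value as a number rather than a string, you can cast it afterwards; a sketch (rows without digits produce an empty string, which casts to null):

from pyspark.sql import functions as F

# Empty strings from non-matching rows become null after the cast
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)", 1).cast("int")).show()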

How can I check if a dataframe contains a column according to a list of column names in PySpark?

Here's an example of how to do that:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
data = [{"a": "12.1", "b": "23.2", "c": "33.2"}]
columns = ["a", "c"]
df = spark.createDataFrame(data)
# Cast only the columns whose names appear in the list, keep the rest as-is
df = df.select(
    [F.col(c).cast(DoubleType()) if c in columns else F.col(c) for c in df.columns]
)

Result:

root
|-- a: double (nullable = true)
|-- b: string (nullable = true)
|-- c: double (nullable = true)

+----+----+----+
|a |b |c |
+----+----+----+
|12.1|23.2|33.2|
+----+----+----+
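
If all you need is the membership test itself, df.columns is a plain Python list, so a minimal sketch is:

# Columns from the list that are missing from the dataframe; [] means all exist
missing = [c for c in columns if c not in df.columns]
print(missing)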

Test if a Spark dataframe column contains a 5-digit number

Check the code below:

df.withColumn("contains_5digit", 
F.when(F.col('code').rlike("\d{5}"),1).otherwise(0)).show()

+-----+---------------+
| code|contains_5digit|
+-----+---------------+
|95110| 1|
+-----+---------------+
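
Note that rlike(r"\d{5}") is true whenever the value contains five consecutive digits anywhere in the string, so '1234567' would also match. To require exactly five digits and nothing else, anchor the pattern, as in this sketch:

from pyspark.sql import functions as F

# Anchored pattern: true only when the whole value is exactly 5 digits
df.withColumn("is_5digit",
    F.when(F.col('code').rlike(r"^\d{5}$"), 1).otherwise(0)).show()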

Selecting only numeric/string column names from a Spark DF in PySpark

df.dtypes is a list of (columnName, type) tuples, so you can use a simple filter:

columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
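
The same dtypes filter works for numeric columns; here is a sketch that keeps the common numeric type names (adjust the tuple to the types you care about):

# Numeric dtypes appear as 'int', 'bigint', 'double', 'decimal(10,0)', ...
numeric_types = ('int', 'bigint', 'smallint', 'tinyint', 'float', 'double', 'decimal')
numericList = [name for name, dtype in df.dtypes if dtype.startswith(numeric_types)]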

