How to Check If a String Column in a PySpark DataFrame Is All Numeric

PySpark: How to check if a column contains a number using isnan

isnan only returns true if the column contains a mathematically invalid number (NaN), for example the result of 0/0 in floating-point arithmetic. In any other case, including strings, it returns false. If you want to check whether a column contains a numerical value, you need to define your own udf, for example as shown below:

from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

df = spark.createDataFrame(
    [('33004', ''), ('Muxia', None), ('Fuensanta', None)],
    ("Postal code", "PostalCode"))

def is_digit(value):
    # None and empty strings count as non-numeric
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

# All-digit values go to PostalCode, everything else to Municipality
df = df.withColumn('PostalCode', when(is_digit_udf(df['Postal code']), df['Postal code']))
df = df.withColumn('Municipality', when(~is_digit_udf(df['Postal code']), df['Postal code']))
df.show()

This gives the following output:

+-----------+----------+------------+
|Postal code|PostalCode|Municipality|
+-----------+----------+------------+
| 33004| 33004| null|
| Muxia| null| Muxia|
| Fuensanta| null| Fuensanta|
+-----------+----------+------------+
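
If you'd rather avoid a Python udf (row-at-a-time udfs are comparatively slow), here is a minimal sketch of the same check using the built-in rlike, reusing the dataframe from above:

from pyspark.sql.functions import when, col

# '^[0-9]+$' matches only when the whole string consists of ASCII digits;
# note that rlike yields null (not False) for null inputs, which when()
# treats the same as False here
df = df.withColumn('PostalCode', when(col('Postal code').rlike('^[0-9]+$'), col('Postal code')))
df = df.withColumn('Municipality', when(~col('Postal code').rlike('^[0-9]+$'), col('Postal code')))
df.show()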

How to find if a specific column of a PySpark dataframe contains a numeric value

You can use regexp_extract:

from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col

df = spark_session.createDataFrame([
    Row(Part1="1 HKY TBT TPP 190326 115346 5 C"),
    Row(Part1="51 HKK ABB TYR B 190326 000526 13 C")
])

# Groups: 1 = the leading number, 2 and 3 = the next two numeric runs
regex = r'^(\d+)\s[^\d]*(\d+)\s[^\d]*(\d+)'
df.withColumn("Part2", regexp_extract(col("Part1"), regex, 2))\
  .withColumn("Part3", regexp_extract(col("Part1"), regex, 3))\
  .show()

Output:

+--------------------+------+------+
| Part1| Part2| Part3|
+--------------------+------+------+
|1 HKY TBT TPP 190...|190326|115346|
|51 HKK ABB TYR B ...|190326|000526|
+--------------------+------+------+
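
If the goal is just to verify that a column is all numeric rather than to extract the parts, here is a short sketch against the same df (Part1 here also contains letters, so the check fails):

from pyspark.sql import functions as F

# Count rows whose value is not purely digits; zero means the column is all numeric
non_numeric = df.filter(~F.col("Part1").rlike(r"^\d+$")).count()
print("all numeric" if non_numeric == 0 else "contains non-numeric values")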

Get the first numeric value from a PySpark dataframe string column into a new column

You can use regexp_extract:

from pyspark.sql import functions as F

# Group 1 captures the first run of digits; rows without digits yield ''
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)", 1)).show()

+------+-------------------+------------+
| id| productname|product1_num|
+------+-------------------+------------+
|234832|EXTREME BERRY SAUCE| |
|419836| BLUE KOSHER SAUCE| |
|350022| GUAVA (1G)| 1|
|123213| GUAVA G5| 5|
|125513| 3GULA G5| 3|
|127143| GUAVA G50| 50|
|124513| LAAVA C2L5| 2|
+------+-------------------+------------+
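
If you need the extracted value as a number rather than a string, you can cast it afterwards; a sketch (rows without digits produce an empty string, which casts to null):

from pyspark.sql import functions as F

# Empty strings from non-matching rows become null after the cast
df.withColumn("product1_num", F.regexp_extract("productname", "([0-9]+)", 1).cast("int")).show()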

How can I check if a dataframe contains a column according to a list of column names in PySpark?

Here's an example of how to do that:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
data = [{"a": "12.1", "b": "23.2", "c": "33.2"}]
columns = ["a", "c"]
df = spark.createDataFrame(data)
# Cast only the columns whose names appear in the list, keep the rest as-is
df = df.select(
    [F.col(c).cast(DoubleType()) if c in columns else F.col(c) for c in df.columns]
)

Result:

root
|-- a: double (nullable = true)
|-- b: string (nullable = true)
|-- c: double (nullable = true)

+----+----+----+
|a |b |c |
+----+----+----+
|12.1|23.2|33.2|
+----+----+----+
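
If all you need is the membership test itself, df.columns is a plain Python list, so a minimal sketch is:

# Columns from the list that are missing from the dataframe; [] means all exist
missing = [c for c in columns if c not in df.columns]
print(missing)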

Test if a Spark dataframe column contains a 5-digit number

Check the code below:

df.withColumn("contains_5digit", 
F.when(F.col('code').rlike("\d{5}"),1).otherwise(0)).show()

+-----+---------------+
| code|contains_5digit|
+-----+---------------+
|95110| 1|
+-----+---------------+
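
Note that rlike(r"\d{5}") is true whenever the value contains five consecutive digits anywhere in the string, so '1234567' would also match. To require exactly five digits and nothing else, anchor the pattern, as in this sketch:

from pyspark.sql import functions as F

# Anchored pattern: true only when the whole value is exactly 5 digits
df.withColumn("is_5digit",
    F.when(F.col('code').rlike(r"^\d{5}$"), 1).otherwise(0)).show()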

Selecting only numeric/string column names from a Spark DF in PySpark

df.dtypes is a list of (columnName, type) tuples, so you can use a simple filter:

columnList = [item[0] for item in df.dtypes if item[1].startswith('string')]
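
The same dtypes filter works for numeric columns; here is a sketch that keeps the common numeric type names (adjust the tuple to the types you care about):

# Numeric dtypes appear as 'int', 'bigint', 'double', 'decimal(10,0)', ...
numeric_types = ('int', 'bigint', 'smallint', 'tinyint', 'float', 'double', 'decimal')
numericList = [name for name, dtype in df.dtypes if dtype.startswith(numeric_types)]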

