Spark Data Frames - Check If Column Is of Type Integer

Spark DataFrame contains specific integer value in column

No. filter is not restricted to integers; it can filter on other data types as well.

dataDictionary = [
(1,"APPLE"),
(2,"KIWI"),
(3,"PEAR")
]

df = spark.createDataFrame(data=dataDictionary, schema = ["ID","Tags"])
df.printSchema()
df.show(truncate=False)
df.filter("ID==2").rdd.isEmpty() #Will return Boolean.


Does the dataframe know the type of column?

It depends on what type of file you are reading.

If it is a CSV file without a header, then you need to provide the column names and data types explicitly using a schema.

If it is a CSV file with a header, then you need to pass "inferSchema"->"true" as an option while reading the file. This option automatically infers the schema and data types. However, the data types are derived automatically from only the first few records of actual data.

val df = spark.read.options(Map("inferSchema"->"true","delimiter"->"|","header"->"true")).csv(filePath)

If, for any reason, the first few records of a column hold integer values and later records hold strings, you will run into issues; hence it is always a best practice to provide the schema explicitly.
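
For illustration, a minimal PySpark sketch of supplying an explicit schema (the column names here are placeholders, not from the original question):

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# explicit schema: no inference pass, every column gets a fixed type
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

df = spark.read \
    .option("delimiter", "|") \
    .option("header", "true") \
    .schema(schema) \
    .csv(filePath)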

Your code is working as expected.

The statement below automatically infers the data type of age as Int, based on the data Seq(("aa",1),("bb",2)):

val df = rdd.toDF("name","age")

However, when you convert to a Dataset:

val ds = rdd.map(line =>{Person(line._1,line._2)}).toDS()

Here, you are converting to Person, which declares the "age" field as Long; hence you see it as Long, as expected. Note that the automatic conversion from Int to Long is an upcast performed by Scala, not by Spark.

Hope this clarifies!

The link below is a good read on how to provide a complex schema; hope it gives you more ideas.

https://medium.com/@mrpowers/adding-structtype-columns-to-spark-dataframes-b44125409803

Thanks

Check column datatype and execute SQL only on Integer and Decimal in Spark SQL

This is how you can filter the columns with integer and double types:

import org.apache.spark.sql.types.{DoubleType, IntegerType}
import org.apache.spark.sql.functions.col

// filter the columns whose data type is Integer or Double
val columns = df.schema.fields.filter(x => x.dataType == IntegerType || x.dataType == DoubleType)

// use these filtered fields with select
df.select(columns.map(x => col(x.name)): _*)
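
A rough PySpark equivalent of the same idea (a sketch, not part of the original answer):

from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, DoubleType

# keep only the columns whose schema type is integer or double
numeric_cols = [f.name for f in df.schema.fields
                if isinstance(f.dataType, (IntegerType, DoubleType))]
df.select([col(c) for c in numeric_cols])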

I hope this helps!

PySpark: How to judge column type of dataframe

TL;DR Use external data types (plain Python types) to test values, internal data types (DataType subclasses) to test schema.


First and foremost, you should never use

type(123) == int

The correct way to check types in Python, which handles inheritance, is

isinstance(123, int)

Having done that, let's talk about

Basically I want to know the way to directly get the object of the class like IntegerType, StringType from the dataframe and then judge it.

This is not how it works. DataTypes describe the schema (the internal representation), not the values. An external type is a plain Python object, so if the internal type is IntegerType, then the external type is int, and so on, according to the rules defined in the Spark SQL Programming Guide.

The only place where an IntegerType (or other DataType) instance exists is your schema:

from pyspark.sql.types import *

df = spark.createDataFrame([(1, "foo")])

isinstance(df.schema["_1"].dataType, LongType)
# True
isinstance(df.schema["_2"].dataType, StringType)
# True

_1, _2 = df.first()

isinstance(_1, int)
# True
isinstance(_2, str)
# True
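
If you only need the type as a string rather than a DataType instance, df.dtypes gives the simple type names; a small sketch (not part of the original answer):

dict(df.dtypes)["_1"] == "bigint"
# True
dict(df.dtypes)["_2"] == "string"
# True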

Get datatype of column using PySpark

import pandas as pd

# prevent truncation of wide cells when displayed in Jupyter
pd.set_option('display.max_colwidth', None)

def count_column_types(spark_df):
    """Count the number of columns per Spark data type."""
    return (pd.DataFrame(spark_df.dtypes, columns=["name", "type"])
              .groupby("type", as_index=False)
              .agg(count=("name", "count"),
                   names=("name", lambda x: " | ".join(sorted(set(x))))))

Example output in a Jupyter notebook for a Spark DataFrame with 4 columns:

count_column_types(my_spark_df)


Validating the data type of a column in pyspark dataframe

In one line:

df.withColumn("c2", df["c2"].cast("integer")).na.drop(subset=["c2"])

If c2 is not a valid integer, it will be NULL and dropped in the subsequent step.
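
For instance, a tiny sketch with made-up data (column names c1 and c2 as in the question):

# "x" is not a valid integer, so the cast yields NULL and na.drop removes that row
sample_df = spark.createDataFrame([("a", "2"), ("b", "x")], ["c1", "c2"])
sample_df.withColumn("c2", sample_df["c2"].cast("integer")).na.drop(subset=["c2"]).show()
# only the row with c2 = 2 remains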

Without changing the type:

valid = df.where(df["c2"].cast("integer").isNotNull())
invalid = df.where(df["c2"].cast("integer").isNull())

