Best Way to Get the Max Value in a Spark Dataframe Column

>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
+-----+--------------------+--------+----------+-----------+

>row1 = df1.agg({"x": "max"}).collect()[0]
>print(row1)
Row(max(x)=110.33613)
>print(row1["max(x)"])
110.33613

This answer is almost the same as method 3 in the question, but it seems the asDict() call in method 3 can be removed.
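
As a quick, self-contained illustration (a minimal sketch assuming an existing SparkSession named spark; the toy values below are not the original data), the Row returned by agg can be indexed directly, with no asDict() needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data standing in for df1
df1 = spark.createDataFrame([(103.79211,), (110.33613,), (110.066315,)], ["x"])

row1 = df1.agg({"x": "max"}).collect()[0]
print(row1["max(x)"])  # access the aggregated value by field name
print(row1[0])         # positional access works as well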

Is there any way to get max value from a column in Pyspark other than collect()?

No need to sort, you can just select the maximum:

from pyspark.sql.functions import col, max  # note: this import shadows Python's built-in max

res = df.select(max(col('col1')).alias('max_col1')).first().max_col1

Or you can use selectExpr:

res = df1.selectExpr('max(diff) as max_col1').first().max_col1
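
Here is a runnable sketch of both forms on throwaway data (the column name col1 and the SparkSession named spark are assumptions, not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (3,)], ['col1'])

# select + first() brings back a single Row instead of collecting the whole DataFrame
print(df.select(F.max(F.col('col1')).alias('max_col1')).first().max_col1)  # 5
print(df.selectExpr('max(col1) as max_col1').first().max_col1)             # 5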

How to get the rows with Max value in Spark DataFrame

You can do this easily by extracting the maximum High value and then applying a filter against that value on the entire DataFrame.

Data Preparation

import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2021-02-09', '2009-09-19'],
    'High': [89, 90, 96],
    'Low': [43, 54, 50]
})

# 'sql' here is an existing SparkSession (or SQLContext)
sparkDF = sql.createDataFrame(df)

sparkDF.show()

+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|
+----------+----+---+

Filter

max_high = sparkDF.select(F.max(F.col('High')).alias('High')).collect()[0]['High']

>>> 96

sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()

+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2009-09-19| 96| 50|
+----------+----+---+
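
If you'd rather avoid the separate collect() step, a window-based variant of the same idea (a sketch, not part of the original answer) ranks rows by High and keeps the top one:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Ties on High are broken by the most recent Date, matching the orderBy above
w = Window.orderBy(F.col('High').desc(), F.col('Date').desc())

# Note: a window without partitionBy moves all rows to one partition,
# which is fine for small data but can be a bottleneck at scale
sparkDF.withColumn('rn', F.row_number().over(w)) \
       .filter(F.col('rn') == 1) \
       .drop('rn') \
       .show()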

Add column to Spark dataframe with the max value that is less than the current record's value

You may try the following, which uses max as a window function together with when (a case expression), restricting the window frame to the preceding rows:

from pyspark.sql import functions as F
from pyspark.sql import Window


df = df.withColumn('previous_service_date', F.max(
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None)
).over(
    Window.partitionBy("product")
          .orderBy("service_date")  # order the frame so "preceding" means earlier service dates
          .rowsBetween(Window.unboundedPreceding, -1)
))

df.orderBy('service_date').show(truncate=False)
+---+--------------------+-------------------+------+-------+---------------------+
|id |claim_id            |service_date       |status|product|previous_service_date|
+---+--------------------+-------------------+------+-------+---------------------+
|123|10606134411906233408|2018-09-17 00:00:00|PD    |blue   |null                 |
|123|10606147900401009928|2019-01-24 00:00:00|PD    |yellow |null                 |
|123|10606160940704723994|2019-05-23 00:00:00|RV    |yellow |2019-01-24 00:00:00  |
|123|10606171648203079553|2019-08-29 00:00:00|RJ    |blue   |2018-09-17 00:00:00  |
|123|10606186611407311724|2020-01-13 00:00:00|PD    |blue   |2018-09-17 00:00:00  |
+---+--------------------+-------------------+------+-------+---------------------+

Edit 1

You may also use last with ignorenulls=True, as shown below:

df = df.withColumn('previous_service_date', F.last(
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None),
    ignorenulls=True
).over(
    Window.partitionBy("product")
          .orderBy('service_date')
          .rowsBetween(Window.unboundedPreceding, -1)
))

Let me know if this works for you.
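
For reference, here is a self-contained sketch of the first approach, with toy rows reconstructed from the output above (the SparkSession name spark is an assumption):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy rows reconstructed from the table shown above
data = [
    (123, '10606134411906233408', '2018-09-17 00:00:00', 'PD', 'blue'),
    (123, '10606147900401009928', '2019-01-24 00:00:00', 'PD', 'yellow'),
    (123, '10606160940704723994', '2019-05-23 00:00:00', 'RV', 'yellow'),
    (123, '10606171648203079553', '2019-08-29 00:00:00', 'RJ', 'blue'),
    (123, '10606186611407311724', '2020-01-13 00:00:00', 'PD', 'blue'),
]
df = spark.createDataFrame(data, ['id', 'claim_id', 'service_date', 'status', 'product'])

w = (Window.partitionBy('product')
           .orderBy('service_date')
           .rowsBetween(Window.unboundedPreceding, -1))

# For each row: the latest earlier service_date with status 'PD' within the same product
df = df.withColumn('previous_service_date',
                   F.max(F.when(F.col('status') == 'PD', F.col('service_date'))).over(w))
df.orderBy('service_date').show(truncate=False)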

get min and max from a specific column scala spark dataframe

How about getting the column name from the metadata:

import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)-th column from the columns array (0-based index)
df.agg(min(selectedColumnName), max(selectedColumnName))

Getting the maximum value of a column in an Apache Spark Dataframe (Scala)

In your code, Spark's max function was being mistaken for Scala's own max, so I fully qualified it to make clear that it comes from Spark.

This will give you the max of the id column as an integer value:

%scala
import org.apache.spark.sql.functions.col

val max = df.agg(org.apache.spark.sql.functions.max(col("id"))).collect()(0)(0).asInstanceOf[Int]

Output: max: Int = 6

OR

If you want to create a column that stores the max value of id:

%scala
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $-column syntax and the Int encoder

df.withColumn("max", lit(df.agg(org.apache.spark.sql.functions.max($"id")).as[Int].first))

How to get the max value of date column in pyspark

I don't understand why you used try/except; the if-statement should be enough. Also, you need to use the Spark SQL min/max functions instead of Python's built-ins, and you should avoid naming your variables min/max, which shadows the built-in functions.

import pyspark.sql.functions as F

for col in df.columns:
    if dict(df.dtypes)[col] == 'string':
        minval, maxval = df.select(F.min(col), F.max(col)).first()
        print(maxval)
    else:
        print(col, 'NA')
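
If the column already has a date or timestamp type, a single aggregation is enough; a minimal sketch (date_col is a hypothetical column name):

import pyspark.sql.functions as F

# 'date_col' is a hypothetical column name; apply F.to_date first if the dates are stored as strings
max_date = df.select(F.max('date_col').alias('max_date')).first()['max_date']
print(max_date)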

how to find the max value of all columns in a spark dataframe

The code below will work irrespective of how many columns there are or what mix of datatypes they have.

Note: the OP suggested in her comments that for string columns, the first non-null value should be taken while grouping.

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 10| 5| null| 50|
| Bob| 15| 15|Simon| 10|
| Jack| 5| 1| Timo| 3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
['col1', 'col4']

# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
['col2', 'col3', 'col5']

See the documentation of first() and its ignorenulls option for details.

# Aggregate non-string columns with max and string columns with first(ignorenulls=True)
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])

# Restore the original column order
df = df[[seq_of_columns]]
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 15| 15|Simon| 50|
+-----+----+----+-----+----+

