How to Get Max(Date) from a Given Set of Data Grouped by Some Fields Using PySpark

How to get max(date) from a given set of data grouped by some fields using PySpark?

For non-numeric but orderable types, you can use agg with max directly:

# Spark's max is imported under an alias so it does not shadow Python's builtin max
from pyspark.sql.functions import col, max as max_

df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))

## +------+--------+--------------------+
## |userId|memberId| max(datetime)|
## +------+--------+--------------------+
## | 1234| 111|2016-04-06 17:35:...|
## +------+--------+--------------------+
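The resulting column is named max(datetime) by default. If a friendlier name is wanted, an alias can be attached to the aggregate; this is a small variation on the snippet above:

(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime").alias("max_datetime")))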

PySpark window min(date) and max(date) of group

Here is one way to approach the problem:

  • Create a helper group column that identifies consecutive runs of the same loc per user
  • Group the DataFrame by user, loc and group, aggregating the date column with min and max
  • Drop the group column and sort the result by startdate
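For illustration, a minimal input DataFrame consistent with the expected output shown further down might look like this (the exact rows are assumed here, not taken from the original question, and an existing spark session is assumed):

df = spark.createDataFrame([
    ('a', 1, '2021-01-01'),
    ('a', 1, '2021-01-02'),
    ('a', 2, '2021-01-03'),
    ('a', 2, '2021-01-04'),
    ('a', 1, '2021-01-05'),
    ('a', 1, '2021-01-06'),
], ['user', 'loc', 'date'])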
from pyspark.sql import Window
from pyspark.sql import functions as F

# rows where loc differs from the previous row of the same user start a new run
w = Window.partitionBy('user').orderBy('date')
b = F.lag('loc').over(w) != F.col('loc')

(
    df.withColumn('group', b.cast('int'))
      .fillna(0, 'group')                           # the first row per user has no previous row
      .withColumn('group', F.sum('group').over(w))  # running sum assigns an id to each run
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'),
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
)


+----+---+----------+----------+
|user|loc| startdate| enddate|
+----+---+----------+----------+
| a| 1|2021-01-01|2021-01-02|
| a| 2|2021-01-03|2021-01-04|
| a| 1|2021-01-05|2021-01-06|
+----+---+----------+----------+

Retrieval of max date grouped by another column in Spark SQL with Scala

Applying max to a string column compares the values lexicographically, which will not give you the maximum date for a format such as dd/MM/yyyy. Convert it to a date type column first:

val maxDateDF = spark.sql("SELECT name, max(to_date(birthDate, 'dd/MM/yyyy')) maxDate FROM people group by name")

If you wish to retain the original date formatting, you can convert it back to a string using date_format:

val maxDateDF = spark.sql("SELECT name, date_format(max(to_date(birthDate, 'dd/MM/yyyy')), 'dd/MM/yyyy') maxDate FROM people group by name")
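The same approach carries over to the PySpark DataFrame API. A rough equivalent, assuming a people DataFrame with name and birthDate string columns:

from pyspark.sql import functions as F

max_date_df = (people
    .groupBy("name")
    .agg(F.date_format(F.max(F.to_date("birthDate", "dd/MM/yyyy")),
                       "dd/MM/yyyy").alias("maxDate")))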

Spark 2.0 groupBy column and then get max(date) on a DateType column

RelationalGroupedDataset.max is for numeric values only.

You could try agg() with the accompanying max function. In Scala:

import org.apache.spark.sql.functions._
old_df.groupBy($"ID").agg(max("date"))

So in Java it should be:

import static org.apache.spark.sql.functions.*;
old_df.groupBy("ID").agg(max("date"));

GroupBy column and filter rows with maximum value in Pyspark

You can do this without a UDF using a Window.

Consider the following example:

import pyspark.sql.functions as f

data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
    ('b', 3)
]
df = sqlCtx.createDataFrame(data, ["A", "B"])
df.show()
df.show()
#+---+---+
#| A| B|
#+---+---+
#| a| 5|
#| a| 8|
#| a| 7|
#| b| 1|
#| b| 3|
#+---+---+

Create a Window partitioned by column A and use it to compute the maximum of each group. Then keep only the rows where the value in column B equals that maximum.

from pyspark.sql import Window

w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+
#| A| B|
#+---+---+
#| a| 8|
#| b| 3|
#+---+---+

Or, equivalently, using Spark SQL:

df.registerTempTable('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
sqlCtx.sql(q).show()
#+---+---+
#| A| B|
#+---+---+
#| b| 3|
#| a| 8|
#+---+---+
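Note that both versions keep every row that ties for the maximum within its group. If you want exactly one row per group even in the presence of ties, a row_number-based variant is a common alternative (a sketch, not part of the original answer):

from pyspark.sql import Window

w2 = Window.partitionBy('A').orderBy(f.col('B').desc())
(df.withColumn('rn', f.row_number().over(w2))
    .where(f.col('rn') == 1)
    .drop('rn')
    .show())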

How to calculate Max(Date) and Min(Date) for DateType in a PySpark DataFrame?

Aggregate with min and max:

from pyspark.sql.functions import min, max

df = spark.createDataFrame(
    ["2017-01-01", "2018-02-08", "2019-01-03"], "string"
).selectExpr("CAST(value AS date) AS date")

min_date, max_date = df.select(min("date"), max("date")).first()
min_date, max_date
# (datetime.date(2017, 1, 1), datetime.date(2019, 1, 3))
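If you need the earliest and latest date per group rather than for the whole DataFrame, the same aggregates can follow a groupBy. A minimal sketch, where df_with_id stands for a hypothetical DataFrame that also carries an id column:

# df_with_id is assumed to have columns ["id", "date"]
(df_with_id
    .groupBy("id")
    .agg(min("date").alias("min_date"),
         max("date").alias("max_date")))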

