How to Sum Multiple Columns in a Spark Dataframe in Pyspark

Summing multiple columns in Spark

org.apache.spark.sql.functions.sum(Column e)

Aggregate function: returns the sum of all values in the expression.

As you can see, sum takes just one column as input, so sum(df$waiting, df$eruptions) won't work. Since you want to sum up the numeric fields, you can do sum(df("waiting") + df("eruptions")). If you want to sum up the values for the individual columns, you can do df.agg(sum(df$waiting), sum(df$eruptions)).show
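
In PySpark, a minimal sketch of the same two options might look like this (the waiting/eruptions column names are taken from the example above; the values are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy dataframe standing in for the waiting/eruptions data
df = spark.createDataFrame([(79, 3.6), (54, 1.8)], ["waiting", "eruptions"])

# Aggregate sum of the row-wise addition of the two columns
df.agg(F.sum(F.col("waiting") + F.col("eruptions"))).show()

# Separate aggregate sums, one per column
df.agg(F.sum("waiting"), F.sum("eruptions")).show()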

How can I sum multiple columns in a spark dataframe in pyspark?

Try this:

df = df.withColumn('result', sum(df[col] for col in df.columns))

df.columns will be the list of column names from df.
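
For example, a minimal sketch with toy data (note that sum here must be Python's builtin sum, which folds the Column objects with +; if you have done from pyspark.sql.functions import *, the aggregate sum shadows the builtin and this line will fail):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3), (4, 5, 6)], ["a", "b", "c"])

# Builtin sum adds the Column objects together, giving a row-wise total
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.show()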

How to sum several columns conditionally with pyspark?

Johanrex,

Here's a piece of code:

from pyspark.sql.functions import *

df.groupBy("order_id").agg(
sum(col("nr_of_items")*col("price")).alias("total_order_amount"),
sum(when(col("is_black") == lit(1), col("price")*col("nr_of_items")).otherwise(lit(0))).alias("black_order_amount"),
sum(when(col("is_fabric") == lit(1), col("price")*col("nr_of_items")).otherwise(lit(0))).alias("fabric_order_amount")
).limit(100).toPandas()

Output :

order_id    total_order_amount    black_order_amount    fabric_order_amount
1           50                    20                    20
2           90                    50                    90

Group by then sum of multiple columns in Scala Spark

You can try:

import org.apache.spark.sql.functions._

val df = Seq(
  ("a", 9, 1),
  ("a", 4, 2),
  ("b", 1, 3),
  ("a", 1, 4),
  ("b", 2, 5)
).toDF("name", "x", "y")

df.groupBy(col("name"))
  .agg(
    sum(col("x")).as("xsum"),
    sum(col("y")).as("ysum")
  )
  .show(false)

If you want to make it dynamic:

import org.apache.spark.sql.Column

// Build the aggregation expressions dynamically, one sum per column name
var exprs: List[Column] = List()

for (colName <- List[String]("x", "y")) {
  exprs :+= expr(s"sum($colName) as sum_$colName")
}

df.groupBy(col("name"))
  .agg(
    exprs.head, exprs.tail: _*
  )
  .show(false)
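
For reference, a rough PySpark equivalent of the dynamic version (a sketch assuming the same name/x/y dataframe) builds the list of sum expressions with a comprehension:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 9, 1), ("a", 4, 2), ("b", 1, 3), ("a", 1, 4), ("b", 2, 5)],
    ["name", "x", "y"],
)

# One sum expression per column name, unpacked into agg
exprs = [F.sum(c).alias(f"sum_{c}") for c in ["x", "y"]]
df.groupBy("name").agg(*exprs).show(truncate=False)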

Summing multiple columns that have different column names by pattern matching on the column names, using PySpark/Pandas

Try this approach with Pandas (df is your data frame):

df['plain_sum'] = df.filter(regex='^plain-prod.*').sum(axis=1)
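Since the question also mentions PySpark, a hedged equivalent there would be to filter df.columns with the same pattern and add the matching columns row-wise (this assumes df is a Spark dataframe with columns named like plain-prod-1, plain-prod-2, and so on):

import re
from functools import reduce
import pyspark.sql.functions as F

# Column names matching the pattern, e.g. plain-prod-1, plain-prod-2, ...
plain_cols = [c for c in df.columns if re.match(r"^plain-prod.*", c)]

# Row-wise sum of the matching columns (assumes at least one column matches)
df = df.withColumn("plain_sum", reduce(lambda a, b: a + b, [F.col(c) for c in plain_cols]))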

PySpark sum of columns after union of dataframes

After union of df1 and df2, you can group by userid and sum all columns except date for which you get the max.

Note that for the union part, you can actually use DataFrame.unionByName when the data types match and only the set of columns differs:

df = df1.unionByName(df2, allowMissingColumns=True)

Then group by and agg:

import pyspark.sql.functions as F

result = df.groupBy("userid").agg(
F.max("date").alias("date"),
*[F.sum(c).alias(c) for c in df.columns if c not in ("date", "userid")]
)

result.show()

#+------+----------+------+------+------+------+
#|userid|      date|value1|value2|value3|value4|
#+------+----------+------+------+------+------+
#|     a|2022-01-10|     3|     2|    47|    35|
#|     b|2022-01-10|     3|     4|    47|    59|
#|     c|2022-01-10|     1|     3|    23|    47|
#+------+----------+------+------+------+------+

This assumes the second dataframe contains only dates prior to the date in the first one. Otherwise, you'll need to filter df2 before the union.
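
If that assumption does not hold, the pre-filter could look something like this (the choice of the max date in df1 as the cutoff is just an illustrative assumption):

import pyspark.sql.functions as F

# Keep only rows of df2 strictly before the reference date, then union as before
cutoff = df1.agg(F.max("date")).collect()[0][0]   # assumed reference date taken from df1
df2_filtered = df2.filter(F.col("date") < F.lit(cutoff))
df = df1.unionByName(df2_filtered, allowMissingColumns=True)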

How to Sum Many Columns in PySpark Dataframe

I am not an expert in Python, but in your loop you are comparing a DataFrame (DataFrame[sum(a): bigint]) with 5, and for some reason the answer is True.

df.agg(sum(working_cols[x])).collect()[0][0] should give you what you want. I actually collect the dataframe to the driver, select the first row (there is only one) and select the first column (only one as well).

Note that your approach is not optimal in terms of perf. You could compute all the sums with only one pass of the dataframe like this:

import pyspark.sql.functions as F

sums = [F.sum(x).alias(str(x)) for x in df.columns]
d = df.select(sums).collect()[0].asDict()

With this code, you get a dictionary that associates each column name with its sum, and on which you can apply any logic that's of interest to you.
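
For instance, once you have that dictionary you can apply whatever logic you need in plain Python (the threshold below is just an illustrative placeholder, and this assumes every column is numeric so each sum is defined):

# d maps each column name to its sum, e.g. {'a': 12, 'b': 8, 'c': 9}
cols_to_keep = [c for c, s in d.items() if s > 5]   # hypothetical threshold
df_filtered = df.select(cols_to_keep)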

Add column sum as new column in PySpark dataframe

This was not obvious. I see no row-based sum of the columns defined in the Spark DataFrames API.

Version 2

This can be done in a fairly simple way:

newdf = df.withColumn('total', sum(df[col] for col in df.columns))

df.columns is supplied by pyspark as a list of strings giving all of the column names in the Spark Dataframe. For a different sum, you can supply any other list of column names instead.

I did not try this as my first solution because I wasn't certain how it would behave. But it works.
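
As a hedged sketch of that last point, summing only a subset of the columns could look like this (the column names are assumed for illustration):

# Sum only the columns you care about, e.g. 'a' and 'b'
cols_to_sum = ['a', 'b']
newdf = df.withColumn('partial_total', sum(df[c] for c in cols_to_sum))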

Version 1

This is overly complicated, but works as well.

You can do this:

  1. use df.columns to get a list of the names of the columns
  2. use that names list to make a list of the columns
  3. pass that list to something that will invoke the column's overloaded add function in a fold-type functional manner

With Python's reduce, some knowledge of how operator overloading works, and the pyspark code for Column, that becomes:

from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

def column_add(a, b):
    return a.__add__(b)

newdf = df.withColumn('total_col',
                      reduce(column_add, (df[col] for col in df.columns)))

Note this is a Python reduce, not a Spark RDD reduce, and the parenthesized term in the second argument to reduce requires the parentheses because it is a generator expression.

Tested, Works!

$ pyspark
>>> df = sc.parallelize([{'a': 1, 'b':2, 'c':3}, {'a':8, 'b':5, 'c':6}, {'a':3, 'b':1, 'c':0}]).toDF().cache()
>>> df
DataFrame[a: bigint, b: bigint, c: bigint]
>>> df.columns
['a', 'b', 'c']
>>> def column_add(a,b):
...     return a.__add__(b)
...
>>> df.withColumn('total', reduce(column_add, ( df[col] for col in df.columns ) )).collect()
[Row(a=1, b=2, c=3, total=6), Row(a=8, b=5, c=6, total=19), Row(a=3, b=1, c=0, total=4)]

