Pyspark - Sum a Column in Dataframe and Return Results as Int

PySpark - Sum a column in dataframe and return results as int

The simplest way, really:

df.groupBy().sum().collect()
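
For instance, to get the result back as a plain Python int rather than a list of Rows, you can index into the collected Row (a minimal sketch, assuming df has a single numeric column):

# collect() returns a list with one Row of per-column sums, e.g. [Row(sum(A)=6)]
total = df.groupBy().sum().collect()[0][0]
print(total, type(total))  # e.g. 6 <class 'int'> for an integer column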

But it is a very slow operation. Avoid groupByKey; you should use the RDD API and reduceByKey instead:

df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]  # sums the second column (index 1)

I tried this on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

GroupByKey: 30.5 s
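
If you prefer to refer to the column by name rather than by position, roughly the same reduceByKey pattern can be written like this (a sketch assuming a hypothetical numeric column B):

total = (df.rdd
           .map(lambda row: (1, row["B"]))   # key every row to a single dummy key
           .reduceByKey(lambda x, y: x + y)  # sum the values for that key
           .collect()[0][1])                 # take the sum out of the single (key, sum) pair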

PySpark sum of column recomputed?

The obvious approach would be to use a window function like this:

from pyspark.sql import Window, functions as f

win = Window.orderBy(f.lit(1))
df.withColumn("SUM(B)", f.sum("B").over(win)).show()

Yet, you would obtain the following warning:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

This means that using a window function without partitioning causes the entire DataFrame to be sent to a single executor. That is exactly what we want to avoid with Spark, for obvious performance reasons.

A better solution, which does not involve sending all the data to one executor, is to compute the sum first and then add it to the DataFrame with lit, like this:

sum_b = df.select(f.sum("B")).first()[0]
df.withColumn("SUM(B)", f.lit(sum_b)).show()

How to sum values of an entire column in pyspark

Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.

For example:

sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])

from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()

# Out: [Row(sum(a)=3, sum(b)=7)]

Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.

sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()

Notice how you need to unpack the arguments from the list using the * operator.
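
If you then want the totals keyed by column name, you can zip the collected Row back with the column names (a small sketch building on sums from the snippet above):

totals = dict(zip(sdf.columns, sums[0]))  # the Row's values follow the order of sdf.columns

# Out: {'a': 3, 'b': 7}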

Concatenate PySpark Dataframe Column Names by Value and Sum

I don't see anything wrong with writing a for loop here:

from pyspark.sql import functions as F

cols = ['a', 'b', 'c', 'd', 'e']

temp = (df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols])))

+---+---+---+---+---+---+------------+----+
| id|  a|  b|  c|  d|  e|extra_column| key|
+---+---+---+---+---+---+------------+----+
|  1|  0|  1|  1|  1|  1|   something|bcde|
|  2|  0|  1|  1|  1|  0|   something| bcd|
|  3|  1|  0|  0|  0|  0|   something|   a|
|  4|  0|  1|  0|  0|  0|   something|   b|
|  5|  1|  0|  0|  0|  0|   something|   a|
|  6|  0|  0|  0|  0|  0|   something|    |
+---+---+---+---+---+---+------------+----+

(temp
.groupBy('key')
.agg(F.count('*').alias('value'))
.where(F.col('key') != '')
.show()
)

+----+-----+
| key|value|
+----+-----+
|bcde|    1|
|   b|    1|
|   a|    2|
| bcd|    1|
+----+-----+
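
For completeness, a self-contained version of the same idea, rebuilding the toy data shown in the output above (the SparkSession setup is an assumption):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 0, 1, 1, 1, 1, 'something'),
     (2, 0, 1, 1, 1, 0, 'something'),
     (3, 1, 0, 0, 0, 0, 'something'),
     (4, 0, 1, 0, 0, 0, 'something'),
     (5, 1, 0, 0, 0, 0, 'something'),
     (6, 0, 0, 0, 0, 0, 'something')],
    ['id', 'a', 'b', 'c', 'd', 'e', 'extra_column'])

cols = ['a', 'b', 'c', 'd', 'e']

(df
 .withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols]))
 .groupBy('key')
 .agg(F.count('*').alias('value'))
 .where(F.col('key') != '')
 .show())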

Python, PySpark: get sum of a PySpark dataframe column values

Spark SQL has a dedicated module for column functions, pyspark.sql.functions.

So the way it works is:

from pyspark.sql import functions as F

data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")], ["name", "age", "city"])

res = data.unionAll(
    data.select([
        F.lit('All').alias('name'),   # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'), # get the sum of 'age'
        F.lit('All').alias('city')    # create a column named 'city' filled with 'All'
    ]))
res.show()

Prints:

+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20|   A|
| def| 30|   B|
| All| 50| All|
+----+---+----+
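
Note that on recent Spark versions unionAll is simply an alias of union (both append rows by column position without removing duplicates), so the same result can be written as:

res = data.union(
    data.select([
        F.lit('All').alias('name'),
        F.sum(data.age).alias('age'),
        F.lit('All').alias('city')
    ]))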

