Pyspark - Sum a Column in Dataframe and Return Results as Int


The simplest way really:

df.groupBy().sum().collect()

But it is a very slow operation: avoid groupByKey; use an RDD and reduceByKey instead:

df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]

I tried this on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

groupByKey: 30.5 s

PySpark sum of column recomputed?

The obvious approach would be to use a window function like this:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

win = Window.orderBy(f.lit(1))
df.withColumn("SUM(B)", f.sum("B").over(win)).show()

Yet, you would obtain the following warning:

WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

Using a window function without partitioning causes the entire dataframe to be sent to a single executor, which is exactly what we want to avoid with Spark, for obvious performance reasons.

A better solution, one that does not involve sending all the data to a single executor, is to compute the sum first and then add it to the dataframe with lit, like this:

sum_b = df.select(f.sum("B")).first()[0]
df.withColumn("SUM(B)", f.lit(sum_b)).show()

How to sum values of an entire column in pyspark

Assuming you already have the data in a Spark DataFrame, you can use the sum SQL function, together with DataFrame.agg.

For example:

sdf = spark.createDataFrame([[1, 3], [2, 4]], schema=['a','b'])

from pyspark.sql import functions as F
sdf.agg(F.sum(sdf.a), F.sum(sdf.b)).collect()

# Out: [Row(sum(a)=3, sum(b)=7)]

Since in your case you have quite a few columns, you can use a list comprehension to avoid naming columns explicitly.

sums = sdf.agg(*[F.sum(sdf[c_name]) for c_name in sdf.columns]).collect()

Notice how you need to unpack the arguments from the list using the * operator.

Concatenate PySpark Dataframe Column Names by Value and Sum

I don't see anything wrong with writing a for loop here:

from pyspark.sql import functions as F

cols = ['a', 'b', 'c', 'd', 'e']

temp = (df.withColumn('key', F.concat(*[F.when(F.col(c) == 1, c).otherwise('') for c in cols])))

| id| a| b| c| d| e|extra_column| key|
| 1| 0| 1| 1| 1| 1| something|bcde|
| 2| 0| 1| 1| 1| 0| something| bcd|
| 3| 1| 0| 0| 0| 0| something| a|
| 4| 0| 1| 0| 0| 0| something| b|
| 5| 1| 0| 0| 0| 0| something| a|
| 6| 0| 0| 0| 0| 0| something| |

temp.where(F.col('key') != '').groupBy('key').count().withColumnRenamed('count', 'value')

| key|value|
|bcde| 1|
| b| 1|
| a| 2|
| bcd| 1|

python, pyspark : get sum of a pyspark dataframe column values

Spark SQL has a dedicated module for column functions pyspark.sql.functions.

So the way it works is:

from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])

res = data.unionAll(
    data.select(
        F.lit('All').alias('name'),    # create a column named 'name' filled with 'All'
        F.sum(data.age).alias('age'),  # get the sum of 'age'
        F.lit('All').alias('city')     # create a column named 'city' filled with 'All'
    )
)
res.show()


|name|age|city|
| abc| 20| A|
| def| 30| B|
| All| 50| All|
