How to Measure the Execution Time of a Query on Spark

Update:
No, using the time package is not the best way to measure the execution time of Spark jobs. The most convenient and accurate way I know of is to use the Spark History Server.

On Bluemix, in your notebooks go to the "Palette" on the right side, choose the "Environment" panel, and you will see a link to the Spark History Server, where you can inspect the completed Spark jobs, including their computation times.
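If you prefer to pull the timings programmatically, the History Server also exposes Spark's monitoring REST API under /api/v1. A minimal sketch, assuming the server is reachable at http://localhost:18080 and the application ID is already known (both are placeholders you would replace):

import requests

# Assumed History Server address and application ID -- substitute your own values
HISTORY_SERVER = "http://localhost:18080"
APP_ID = "app-20240101120000-0000"

# Each job entry carries its submission and completion timestamps
jobs = requests.get(f"{HISTORY_SERVER}/api/v1/applications/{APP_ID}/jobs").json()

for job in jobs:
    print(job["jobId"], job["status"], job["submissionTime"], job.get("completionTime"))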

In PySpark groupBy, how do I calculate execution time by group?

If you don't want to print the execution time to stdout, you could return it as an extra column from the Pandas UDF instead, e.g.

from datetime import datetime
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("my_col long, execution_time long", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    start = datetime.now()
    # Some business logic
    # The schema declares execution_time as long, so store the elapsed milliseconds
    return pdf.assign(execution_time=int((datetime.now() - start).total_seconds() * 1000))
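For context, a minimal usage sketch, assuming a toy DataFrame whose only column matches the my_col field declared in the UDF schema (the data and session setup are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy input; grouping by my_col keeps the returned columns aligned with the UDF schema
df = spark.createDataFrame([(10,), (10,), (20,)], ["my_col"])

# Each group gains an execution_time column produced by the UDF above
df.groupBy("my_col").apply(my_pandas_udf).show()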

Alternatively, to compute the average execution time in the driver application, you could accumulate the total execution time and the number of UDF calls inside the UDF with two Accumulators, e.g.

from datetime import datetime
from pyspark.sql.functions import pandas_udf, PandasUDFType

udf_count = sc.accumulator(0)
total_udf_execution_time = sc.accumulator(0.0)

@pandas_udf("my_col long", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    start = datetime.now()
    # Some business logic
    udf_count.add(1)
    # Accumulators take numeric values, so add the elapsed time in seconds
    total_udf_execution_time.add((datetime.now() - start).total_seconds())
    return pdf

# Some Spark action to run business logic

mean_udf_execution_time = total_udf_execution_time.value / udf_count.value
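A short sketch of how that action could be driven, assuming sc comes from an existing SparkSession and the same toy single-column DataFrame as above (both placeholders); the accumulator values are only meaningful in the driver once an action has forced the UDF to run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# ... define the accumulators and my_pandas_udf as above ...

df = spark.createDataFrame([(10,), (10,), (20,)], ["my_col"])

# Any action works; count() pushes every group through the UDF and updates the accumulators
df.groupBy("my_col").apply(my_pandas_udf).count()

print(total_udf_execution_time.value / udf_count.value)

Keep in mind that accumulator updates made inside transformations can be applied more than once if tasks are retried, so treat the resulting average as an estimate rather than an exact measurement.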

