How to Update a Broadcast Variable in Spark Streaming

Spark Streaming broadcast variable daily update

Refer to the last answer in the thread you linked. In summary: instead of broadcasting the data itself, broadcast the caching code that updates the data at the required interval (a sketch follows the list below).

  1. Create a CacheLookup object that refreshes its data daily at 12 am
  2. Wrap that object in a broadcast variable
  3. Use the CacheLookup as part of the streaming logic
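
A minimal sketch of that pattern in Scala, assuming the reference data is a Map[String, Double]; the CacheLookup class, the loadReferenceData() helper and the commented-out driver usage are illustrative placeholders, not part of the original answer:

import java.time.LocalDate

class CacheLookup extends Serializable {

  // @transient: the cached data itself is never shipped with the broadcast;
  // each executor loads it on first use and refreshes it once per day
  @transient private var loadedFor: LocalDate = _
  @transient private var data: Map[String, Double] = _

  // placeholder for the real load (database query, file on shared storage, ...)
  private def loadReferenceData(): Map[String, Double] = Map.empty

  def lookup(key: String): Option[Double] = synchronized {
    val today = LocalDate.now()
    if (data == null || loadedFor.isBefore(today)) { // first access after midnight triggers a reload
      data = loadReferenceData()
      loadedFor = today
    }
    data.get(key)
  }
}

// Driver side: broadcast the lookup code once and use it in the streaming logic, e.g.
// val cacheLookup = ssc.sparkContext.broadcast(new CacheLookup)
// stream.map(key => cacheLookup.value.lookup(key))

Because only the small CacheLookup object is broadcast, the broadcast itself never has to be re-created; each executor refreshes its own copy of the data.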

Broadcasting updates on Spark jobs

Broadcast variables are immutable, but you can create a new broadcast variable and use it in the next iteration.

All you need to do is point the reference at the newly created broadcast, unpersist the old broadcast from the executors, and destroy it on the driver.
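
A minimal sketch of that cycle, assuming a loadReferenceData() helper and a Map[String, Double] payload; the BroadcastRefresher object and the commented-out foreachRDD usage are placeholders for your own code:

import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext

object BroadcastRefresher {

  // kept at object level so the driver can swap the reference between iterations
  var refData: Broadcast[Map[String, Double]] = null

  // placeholder for however the fresh data is produced
  private def loadReferenceData(): Map[String, Double] = Map.empty

  def refresh(ssc: StreamingContext): Unit = {
    if (refData != null) {
      refData.unpersist(blocking = true) // remove the old copies from the executors first
      refData.destroy()                  // then release the old broadcast on the driver
    }
    refData = ssc.sparkContext.broadcast(loadReferenceData()) // new broadcast, new reference
  }
}

// Inside the streaming logic, e.g. once per batch or only when the data has changed:
// stream.foreachRDD { rdd =>
//   BroadcastRefresher.refresh(ssc)
//   val bc = BroadcastRefresher.refData // capture a local val for the closure
//   rdd.foreach(key => println(bc.value.get(key)))
// }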

Define the variable at class level; this allows you to change the reference to the broadcast variable in the driver and to call its destroy method.

import org.apache.spark.broadcast.Broadcast

object Main {

  // defined and initialized at object level so the driver can later
  // point this reference at a newly created broadcast variable and destroy the old one
  var previousData: Broadcast[Map[String, Double]] = null

  def main(args: Array[String]): Unit = {
    // your streaming code
  }
}

You were not able to call the destroy method on the variable because the reference to it no longer existed in the driver. Keeping a reference at class level and pointing it at the new broadcast variable resolves that issue.

Unpersist only removes the data from the executors, so when the variable is re-accessed, the driver resends it to them.

blocking = true makes the application wait until the data has been completely removed from the executors before the next access.

sc.broadcast() - there is no official documentation saying that it is blocking, but as soon as it is called the application starts broadcasting the data to the executors before running the next line of code. So if the data is very large, it may slow down your application; be careful about how you use it.

It is good practice to call unpersist before destroy. This helps you get rid of the data completely, from both the executors and the driver.
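
The difference between the two calls can be seen in a small self-contained sketch; the local[*] master and the toy Map payload are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

object UnpersistVsDestroy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("unpersist-vs-destroy").setMaster("local[*]"))
    val bc = sc.broadcast(Map("a" -> 1.0, "b" -> 2.0))

    bc.unpersist(blocking = true) // removes the data from the executors only, and waits until it is gone
    println(bc.value("a"))        // still usable: the data is resent to the executors when next needed

    bc.unpersist(blocking = true) // clear the executors again before destroying
    bc.destroy()                  // now removed from the driver as well; any further use throws
    sc.stop()
  }
}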


