Best Practice to Run Multiple Spark Instances at a Time in the Same JVM

What is the best practice for running multiple Spark instances at a time in the same JVM?

Do you mean two Spark applications, or one Spark application with two Spark contexts? Two Spark applications, each with its own driver and SparkContext, should be achievable unless your requirements demand something shared between them.

When you have two Spark applications, they are just like any other applications, and the resources need to be shared like with any other pair of applications.

Multiple SparkSessions in single JVM

It is not supported and won't be. SPARK-2243 is resolved as Won't Fix.

If you need multiple contexts, there are different projects which can help you (Mist, Livy).

How to submit multiple Spark applications in parallel without spawning separate JVMs?

With a use case, this is much clearer now. There are two possible solutions:

If you require shared data between those jobs, use the FAIR scheduler and a (REST) frontend (as SparkJobServer, Livy, etc. do). You don't need to use SparkJobServer either; it should be relatively easy to code if you have a fixed scope. I've seen projects go in that direction. All you need is an event loop and a way to translate your incoming queries into Spark queries. In a way, I would expect there to be demand for a library covering this use case, since it's pretty much always the first thing you have to build when you work on a Spark-based application/framework.
In this case, you can size your executors according to your hardware, and Spark will manage the scheduling of your jobs. With YARN's dynamic resource allocation, YARN will also free resources (kill executors) should your framework/app be idle.
For more information, read here: http://spark.apache.org/docs/latest/job-scheduling.html
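As an illustrative sketch of the first option, in-application fair scheduling is enabled via Spark properties plus an allocation file. The file path, pool name, and weights below are placeholders, not part of the answer above:

```
# spark-defaults.conf (or the equivalent SparkConf settings)
spark.scheduler.mode            FAIR
spark.scheduler.allocation.file /path/to/fairscheduler.xml
```

```xml
<!-- fairscheduler.xml: pool name and weights are illustrative -->
<allocations>
  <pool name="userQueries">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>1</minShare>
  </pool>
</allocations>
```

Jobs submitted after calling `sc.setLocalProperty("spark.scheduler.pool", "userQueries")` on the shared context then share executors fairly instead of queueing FIFO.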

If you don't need shared data, use YARN (or another resource manager) to assign resources fairly to both jobs. YARN has a fair scheduling mode, and you can set the resource demands per application. If you think this suits you but you still need shared data, you might want to think about using Hive or Alluxio to provide a data interface. In this case you would run two spark-submits and maintain multiple drivers in the cluster. Building additional automation around spark-submit can make this less annoying and more transparent to end users. This approach also has higher latency, since resource allocation and SparkSession initialization take up a more or less constant amount of time.
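For the second option, the two independent submissions could look roughly like this sketch (queue names, jar names, and resource sizes are made up for illustration):

```shell
# each spark-submit starts its own driver; YARN's scheduler arbitrates resources
spark-submit --master yarn --deploy-mode cluster --queue analytics \
  --num-executors 4 --executor-memory 4g --executor-cores 2 job-a.jar

spark-submit --master yarn --deploy-mode cluster --queue reporting \
  --num-executors 4 --executor-memory 4g --executor-cores 2 job-b.jar
```

Since each command carries its own resource demands, the two applications can be sized independently, which is exactly what a single shared JVM cannot give you.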

Do multiple Spark sessions querying the same partition of a Hadoop table make the query slower?

In theory, the performance should be the same in both cases, but other factors may be influencing the scheduling of the jobs. From my experience, you could be running into any of these constraints:

  1. The job pool in which you are firing your Spark jobs has fewer resources than required by 8 instances of your job. Let's say each instance/run requires 4 executors; running 8 instances of the job means you are requesting 32 executors in total. If your job queue is configured to support only 20 executors (executors could be limited by the number of cores, the memory allocated to the queue, or both), your 8 jobs cannot achieve maximum parallelism, and hence appear slow compared to a single instance running at the same time, which gets all the resources it needs and can achieve maximum parallelism.
  2. Your cluster is running at full capacity and utilising most of its resources for other production jobs, due to which your 8 instances of the same job are all waiting for resources at the same time.
  3. If you ran a single test with 8 instances, there could have been a write job happening on the parent directory or Hive table, which could have caused your jobs to enter a wait state.

These are some of the problems I have seen in my career; there could be other factors as well. I would suggest getting access to the YARN scheduler or queues from your sysadmin and checking whether one of these is the culprit.

How many SparkSessions can a single application have?

TL;DR You can have as many SparkSessions as needed.

You can have one and only one SparkContext on a single JVM, but the number of SparkSessions is pretty much unbounded.

But can you elaborate on what you mean by a single SparkContext on a single JVM?

It means that at any given time in the lifecycle of a Spark application there can be one and only one driver, which in turn means that there is one and only one SparkContext available on that JVM.

The driver of a Spark application is where the SparkContext lives (or rather the opposite: the SparkContext defines the driver; the distinction is pretty blurry).

You can only have one SparkContext at a time. Although you can start and stop it on demand as many times as you want, I remember an issue that said you should not close the SparkContext unless you're done with Spark (which usually happens at the very end of your Spark application).

In other words, have a single SparkContext for the entire lifetime of your Spark application.
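As a minimal sketch of that rule (assuming a standard Spark SQL application; the app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// one SparkSession (and thus one SparkContext) for the whole application;
// getOrCreate returns the already-running session if one exists on this JVM
val spark = SparkSession.builder().appName("my-app").getOrCreate()

// ... all Spark jobs run here ...

spark.stop() // stop only once, at the very end of the application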

There was a similar question What's the difference between SparkSession.sql vs Dataset.sqlContext.sql? about multiple SparkSessions that can shed more light on why you'd want to have two or more sessions.

I was able to call sparkSession.sparkContext().stop(), and also stop the SparkSession.

So?! How does this contradict what I said?! You stopped the only SparkContext available on the JVM. Not a big deal. You could, but that's just one part of "you can only have one and only one SparkContext available on a single JVM", isn't it?

SparkSession is a mere wrapper around SparkContext to offer Spark SQL's structured/SQL features on top of Spark Core's RDDs.

From the point of view of a Spark SQL developer, the purpose of a SparkSession is to be a namespace for query entities like tables, views or functions that your queries use (as DataFrames, Datasets or SQL) and Spark properties (that could have different values per SparkSession).

If you'd like to have the same (temporary) table name used for different Datasets, creating two SparkSessions would be what I'd consider the recommended way.
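A sketch of that scenario, assuming a spark-shell with the implicit `spark` session available (view name and row counts are illustrative):

```scala
// two sessions sharing one SparkContext, each with its own temp-view namespace
val s1 = spark
val s2 = spark.newSession()

s1.range(5).createOrReplaceTempView("t")
s2.range(100).createOrReplaceTempView("t")

s1.sql("select count(*) from t").show() // counts the 5-row view
s2.sql("select count(*) from t").show() // counts the 100-row view
```

Neither session sees the other's `t`, which is the namespacing behaviour described above.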

I've just worked on an example to showcase how whole-stage codegen works in Spark SQL, and have created the following, which simply turns the feature off.

// both where and select operators support whole-stage codegen
// the plan tree (with the operators and expressions) meets the requirements
// That's why the plan has WholeStageCodegenExec inserted
// You can see stars (*) in the output of explain
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)
scala> q.explain
== Physical Plan ==
*Project [_2#89 AS c0#93]
+- *Filter (_1#88 = 0)
   +- LocalTableScan [_1#88, _2#89, _3#90]

// Let's break the requirement of having up to spark.sql.codegen.maxFields
// I'm creating a brand new SparkSession with one property changed
val newSpark = spark.newSession()
import org.apache.spark.sql.internal.SQLConf.WHOLESTAGE_MAX_NUM_FIELDS
newSpark.sessionState.conf.setConf(WHOLESTAGE_MAX_NUM_FIELDS, 2)

scala> println(newSpark.sessionState.conf.wholeStageMaxNumFields)
2

// Let's see what the initial value is
// Note that I use spark value (not newSpark)
scala> println(spark.sessionState.conf.wholeStageMaxNumFields)
100

import newSpark.implicits._
// the same query as above but created in SparkSession with WHOLESTAGE_MAX_NUM_FIELDS as 2
val q = Seq((1,2,3)).toDF("id", "c0", "c1").where('id === 0).select('c0)

// Note that there are no stars in the output of explain
// No WholeStageCodegenExec operator in the plan => whole-stage codegen disabled
scala> q.explain
== Physical Plan ==
Project [_2#122 AS c0#126]
+- Filter (_1#121 = 0)
   +- LocalTableScan [_1#121, _2#122, _3#123]

I then created a new SparkSession and used a new SparkContext. No error was thrown.

Again, how does this contradict what I said about a single SparkContext being available? I'm curious.

What exactly does stopping the spark context do, and why can you not create a new one once you've stopped one?

You can no longer use it to run Spark jobs (to process large and distributed datasets), which is pretty much exactly the reason why you use Spark in the first place, isn't it?

Try the following:

  1. Stop SparkContext
  2. Execute any processing using Spark Core's RDD or Spark SQL's Dataset APIs

An exception? Right! Remember that you closed the "doors" to Spark, so how could you have expected to be inside?! :)
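A sketch of that experiment in spark-shell (the exact exception message may vary by Spark version):

```scala
spark.sparkContext.stop()

// any subsequent job now fails, e.g.:
spark.range(1).count()
// java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
```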

Spark: launch from single JVM jobs with different memory/cores configs simultaneously

Spark standalone uses a simple FIFO scheduler for applications. By default, each application uses all the available nodes in the cluster. The number of nodes can be limited per application, per user, or globally. Other resources, such as memory, CPUs, etc., can be controlled via the application's SparkConf object.
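In standalone mode, the per-application cap mentioned above is typically set via properties like these (the values are illustrative):

```
# spark-defaults.conf: keep the FIFO scheduler from giving
# all cores to the first submitted application
spark.cores.max        8
spark.executor.memory  2g
```

With `spark.cores.max` bounded, a second application submitted later can still obtain cores instead of queueing behind the first.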

Apache Mesos has master and slave processes. The master makes offers of resources to the application (called a framework in Apache Mesos), which either accepts the offer or not. Thus, claiming available resources and running jobs is determined by the application itself. Apache Mesos allows fine-grained control of the resources in a system, such as CPUs, memory, disks, and ports. Apache Mesos also offers coarse-grained control of resources, where Spark allocates a fixed number of CPUs to each executor in advance, which are not released until the application exits. Note that in the same cluster, some applications can be set to use fine-grained control while others are set to use coarse-grained control.

Apache Hadoop YARN has a ResourceManager with two parts, a Scheduler and an ApplicationsManager. The Scheduler is a pluggable component. Two implementations are provided: the CapacityScheduler, useful in a cluster shared by more than one organization, and the FairScheduler, which ensures all applications, on average, get an equal share of resources. Both schedulers assign applications to queues, and each queue gets resources that are shared equally between them. Within a queue, resources are shared between the applications. The ApplicationsManager is responsible for accepting job submissions and starting the application-specific ApplicationMaster. In this case, the ApplicationMaster is the Spark application. In the Spark application, resources are specified in the application's SparkConf object.

For your case, with standalone mode alone this is not possible. There may be some on-premise solutions, but I haven't come across any.


