Resolving Dependency Problems in Apache Spark

When building and deploying Spark applications, all dependencies require compatible versions.

  • Scala version. All packages have to use the same major Scala version (2.10, 2.11, 2.12).

    Consider the following (incorrect) build.sbt:

    name := "Simple Project"

    version := "1.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.0.1",
      "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
      "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )

    Here spark-streaming is built for Scala 2.10 while the remaining packages are built for Scala 2.11. A valid file could be

    name := "Simple Project"

    version := "1.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.0.1",
      "org.apache.spark" % "spark-streaming_2.11" % "2.0.1",
      "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )

    but it is better to specify the Scala version globally and use %% (which appends the Scala version to the artifact name for you):

    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.11.7"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "2.0.1",
      "org.apache.spark" %% "spark-streaming" % "2.0.1",
      "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1"
    )

Similarly in Maven:

    <project>
      <groupId>com.example</groupId>
      <artifactId>simple-project</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <properties>
        <spark.version>2.0.1</spark.version>
      </properties>
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.bahir</groupId>
          <artifactId>spark-streaming-twitter_2.11</artifactId>
          <version>${spark.version}</version>
        </dependency>
      </dependencies>
    </project>
  • Spark version. All packages have to use the same major Spark version (1.6, 2.0, 2.1, ...).

    Consider the following (incorrect) build.sbt:

    name := "Simple Project"

    version := "1.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "1.6.1",
      "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
      "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )

    Here spark-core is built against Spark 1.6 while the remaining components target Spark 2.0. A file that is valid with respect to the Spark version (though it still mixes Scala 2.10 and 2.11 artifacts, as described above) could be

    name := "Simple Project"

    version := "1.0"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % "2.0.1",
      "org.apache.spark" % "spark-streaming_2.10" % "2.0.1",
      "org.apache.bahir" % "spark-streaming-twitter_2.11" % "2.0.1"
    )

    but it is better to use a variable (still incorrect, since the Scala versions are still mixed):

    name := "Simple Project"

    version := "1.0"

    val sparkVersion = "2.0.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" % "spark-core_2.11" % sparkVersion,
      "org.apache.spark" % "spark-streaming_2.10" % sparkVersion,
      "org.apache.bahir" % "spark-streaming-twitter_2.11" % sparkVersion
    )
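
    A fully corrected file combines both fixes, the global Scala version and the Spark version variable (a sketch using the versions from the examples above):

    name := "Simple Project"

    version := "1.0"

    scalaVersion := "2.11.7"

    val sparkVersion = "2.0.1"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion,
      "org.apache.spark" %% "spark-streaming" % sparkVersion,
      "org.apache.bahir" %% "spark-streaming-twitter" % sparkVersion
    )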

Similarly in Maven:

    <project>
      <groupId>com.example</groupId>
      <artifactId>simple-project</artifactId>
      <modelVersion>4.0.0</modelVersion>
      <name>Simple Project</name>
      <packaging>jar</packaging>
      <version>1.0</version>
      <properties>
        <spark.version>2.0.1</spark.version>
        <scala.version>2.11</scala.version>
      </properties>
      <dependencies>
        <dependency> <!-- Spark dependency -->
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
        <dependency>
          <groupId>org.apache.bahir</groupId>
          <artifactId>spark-streaming-twitter_${scala.version}</artifactId>
          <version>${spark.version}</version>
        </dependency>
      </dependencies>
    </project>
  • The Spark version used in the dependencies has to match the Spark version of the Spark installation. For example, if you use 1.6.1 on the cluster you have to use 1.6.1 to build your jars. Mismatched minor versions are not always accepted.
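
    One way to keep these in sync (a sketch, assuming sbt and a 1.6.1 cluster) is to keep the cluster's version in a single variable and mark the Spark artifacts as provided, so the installation's own jars are used at runtime:

    val sparkVersion = "1.6.1"  // must match the Spark version installed on the cluster

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
      "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
    )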

  • The Scala version used to build the jar has to match the Scala version used to build the deployed Spark. By default (downloadable binaries and default builds):

    • Spark 1.x -> Scala 2.10
    • Spark 2.x -> Scala 2.11
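
    For a stock Spark 2.x installation that means pinning the build to Scala 2.11, for example (the exact patch version, 2.11.8 here, is an assumption):

    scalaVersion := "2.11.8"  // use a 2.10.x version for Spark 1.x installations
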
  • Additional packages have to be accessible on the worker nodes if they are not included in the fat jar. There are a number of options, including:

    • --jars argument for spark-submit - to distribute local jar files.
    • --packages argument for spark-submit - to fetch dependencies from a Maven repository.

    When submitting in cluster mode you should include the application jar in --jars.
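
    For example (a sketch; the jar path, package coordinates, and application jar are placeholders):

        # distribute a local jar file to the workers
        spark-submit --jars /path/to/spark-streaming-twitter_2.11-2.0.1.jar my-app.jar

        # or fetch the dependency (and its transitive dependencies) from a Maven repository
        spark-submit --packages org.apache.bahir:spark-streaming-twitter_2.11:2.0.1 my-app.jar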

How to solve SBT Dependency Problem with Spark and whisklabs/docker-it-scala

I tried two approaches:

Approach 1: Shading the dependency in the xxxxxxx project

I added the assembly plugin to plugins.sbt:

    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.7")

and added some shading rules to build.sbt, creating a fat jar for the xxxxxxx project:

assemblyShadeRules in assembly := Seq(
  ShadeRule
    .rename("com.fasterxml.jackson.**" -> "embedded.com.fasterxml.jackson.@1")
    .inAll
)

That shading worked: all com.fasterxml.jackson dependencies were rewritten to embedded.com.fasterxml.jackson.* inside the xxxxxxx project. (I unzipped the jar and decompiled the classes to see what happened.)

Unfortunately that rewriting didn't solve the problem in the root project (and I didn't know why). So I tried:

Approach 2: Using dependencyOverrides in commonSettings

I added the following dependencies to the root project:

val jacksonCore     = "com.fasterxml.jackson.core"   % "jackson-core"           % "2.9.6"
val jacksonDatabind = "com.fasterxml.jackson.core"   % "jackson-databind"       % "2.9.6"
val jacksonModule   = "com.fasterxml.jackson.module" %% "jackson-module-scala"  % "2.9.6"

I did not exclude the com.fasterxml.jackson dependencies from either Apache Spark or xxxxxxx.

I added the following setting to the common settings:

lazy val commonSettings = Seq(
  scalaVersion := library.version.scala,
  ...

  dependencyOverrides ++= Seq(
    library.jacksonDatabind,
    library.jacksonCore,
    library.jacksonModule
  ),

  ...
)

That worked; the exceptions are gone. Unfortunately I can't explain why (and how) this works, or why the shading didn't. :( (A plausible explanation: dependencyOverrides pins every transitive Jackson artifact, including Spark's, to 2.9.6, so only one binary-compatible version ends up on the classpath.)

Jar dependencies error using Spark 2.3 structured streaming

You also have to add this library to your project:

org.apache.kafka:kafka-clients:0.10.0.0

Maven:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
</dependency>

sbt:

libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.10.0.0"
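
Putting both pieces together in sbt (a sketch; Spark 2.3.0 and Scala 2.11 are assumptions, adjust them to your installation):

libraryDependencies ++= Seq(
  // Kafka source for structured streaming
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0",
  // Kafka client library the connector uses at runtime
  "org.apache.kafka" % "kafka-clients" % "0.10.0.0"
)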

Solution found by the poster:

Adding kafka-clients:0.10.0.0.jar to HDFS instead of $SPARK_HOME/jars/

Maven dependency hell for Spark MLlib ALS algorithm

Finally I was able to make it work:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.scala-lang</groupId>
    <artifactId>scala-library</artifactId>
    <version>2.11.8</version>
  </dependency>
</dependencies>

You have to use these exact versions, otherwise it will crash in various ways.
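
For reference, an sbt equivalent of those coordinates would be (a sketch; the explicit scala-library dependency is covered by scalaVersion):

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.4.0",
  "org.apache.spark" %% "spark-sql"   % "2.4.0",
  "org.apache.spark" %% "spark-mllib" % "2.4.0"
)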

Spark structured streaming Kafka dependency cannot be resolved

But after I tried it on my school's server, I got the following messages and errors:

Server access error at url https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.3.1/spark-sql-kafka-0-10_2.11-2.3.1.pom (java.net.ConnectException: Connection refused)

Your school has a firewall preventing remote packages from being downloaded; that URL works for me, for example.

You'll need to download the Kafka jars outside of school, then use the --jars flag to submit with them.
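
A sketch of such a submission (the local path and application jar are placeholders; the jar name matches the artifact from the error above):

spark-submit --jars /path/to/spark-sql-kafka-0-10_2.11-2.3.1.jar my-app.jar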

Maven could not resolve Spark dependencies

The problem is that the dependency:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-assembly_2.10</artifactId>
  <version>1.1.0</version>
</dependency>

is not a jar, it is only a pom file, which means you can't define it like this. You can see that in the error message:

Failure to find org.apache.spark:spark-assembly_2.10:jar:1.1.0

which shows that Maven tried to download a jar file. That means you have to define it like this:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-assembly_2.10</artifactId>
  <version>1.1.0</version>
  <type>pom</type>
</dependency>

But I'm not sure whether this will solve all your problems. You should take a closer look at the documentation to check whether this is the right path.

Update:
You can also use it as a BOM via:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-assembly_2.10</artifactId>
  <version>1.1.0</version>
  <type>pom</type>
  <scope>import</scope>
</dependency>
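
Note that Maven honors <scope>import</scope> only inside a <dependencyManagement> section, so the BOM variant would presumably be declared like this:

<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-assembly_2.10</artifactId>
      <version>1.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>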

