
Spark fat jar to run multiple versions on YARN

I have an older version of Spark set up with YARN that I don't want to wipe out, but I still want to use a newer version. I found a couple of posts describing how a fat jar can be used for this.

Many SO posts point to either Maven (officially supported) or sbt for building a fat jar, because one is not directly available for download. There seem to be multiple plugins for doing it with Maven: maven-assembly-plugin, maven-shade-plugin, onejar-maven-plugin, etc.

However, I can't figure out whether I really need a plugin and, if so, which one and how exactly to go about it. I tried compiling the GitHub source directly using 'build/mvn' and 'build/sbt', but the resulting 'spark-assembly_2.11-2.0.2.jar' file is just 283 bytes.

My goal is to run the pyspark shell using the newer version's fat jar, in a similar way as mentioned here.

Answer1:

As of Spark 2.0.0, creating a fat jar is no longer supported; you can find more information in "Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?"

The recommended way in your case (running on YARN) is to create a directory on HDFS with the contents of Spark's jars/ directory and add its path to spark-defaults.conf:

spark.yarn.jars hdfs:///path/too/jars/directory/on/hdfs/*.jar

Then, when you run the pyspark shell, it will use the previously uploaded libraries, so it will behave just like the fat jar from Spark 1.x.
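The upload step described above could be scripted roughly as follows. This is only a sketch under assumptions: the function name `plan_jar_upload`, the local install path `/opt/spark-2.0.2`, and the HDFS directory `/apps/spark-2.0.2/jars` are all hypothetical, and the script only prints the `hdfs dfs` commands rather than running them, so you can review them first.

```python
import glob
import os


def plan_jar_upload(spark_home, hdfs_dir):
    """Build the hdfs commands and the spark-defaults.conf line for one Spark install.

    spark_home and hdfs_dir are placeholders chosen by the caller; nothing is executed here.
    """
    # Every jar shipped in the new Spark distribution's jars/ directory.
    jars = sorted(glob.glob(os.path.join(spark_home, "jars", "*.jar")))

    # Create the target directory on HDFS (no error if it already exists).
    commands = [["hdfs", "dfs", "-mkdir", "-p", hdfs_dir]]
    if jars:
        # Copy the local jars into that directory, overwriting older copies.
        commands.append(["hdfs", "dfs", "-put", "-f"] + jars + [hdfs_dir])

    # The property line to add to the new install's conf/spark-defaults.conf.
    conf_line = "spark.yarn.jars hdfs://" + hdfs_dir.rstrip("/") + "/*.jar"
    return commands, conf_line


if __name__ == "__main__":
    # Hypothetical paths; adjust them to your own layout.
    cmds, line = plan_jar_upload("/opt/spark-2.0.2", "/apps/spark-2.0.2/jars")
    for cmd in cmds:
        print(" ".join(cmd))
    print(line)
```

If you would rather not edit spark-defaults.conf, the same property can probably be passed per launch instead, e.g. `pyspark --master yarn --conf spark.yarn.jars=...`, started from the newer Spark's bin/ directory so the older install stays untouched.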

Answer2:

The easiest solution (without changing your Spark on YARN architecture and speaking to your YARN admins) is to:

1. Define a library dependency on Spark 2 in your build system, be it sbt or maven.

2. Assemble your Spark application to create a so-called uber-jar or fat jar with Spark libraries inside.

It works and I personally tested it at least once in a project.

The only (?) downside is that the build process takes longer (you have to run sbt assembly, not sbt package) and your Spark application's deployable fat jar is... well... much bigger. That also makes deployment slower, since spark-submit has to ship it to YARN over the wire.

All in all, it works but takes longer (which may still be shorter than convincing your admin gods to, say, forget about what is available in commercial offerings like Cloudera's CDH, Hortonworks' HDP, or the MapR distro).
