I have an older version of Spark setup with YARN that I don't want to wipe out but still want to use a newer version. I found a couple posts referring to how a fat jar can be used for this.
Many SO posts point to either maven(officially supported) or sbt to build a fat jar because it's not directly available for download. There seem to be multiple plugins to do it using maven: maven-assembly-plugin, maven-shade-plugin, onejar-maven-plugin etc.
However, I can't figure out if I really need a plugin and if so, which one and how exactly to go about it. I tried directly compiling github source using 'build/mvn' and 'build/sbt' but the 'spark-assembly_2.11-2.0.2.jar' file is just 283 bytes.
My goal is to run pyspark shell using the newer version's fat jar in a similar way as mentioned here.
From spark version 2.0.0 creating far jar is no longer supported, you can find more information in Do we still have to make a fat jar for submitting jobs in Spark 2.0.0?
The recommended way in your case (running on YARN) is to create directory on HDFS with content of spark's
jars/ directory and add this path to
Then if you run pyspark shell it will use previously uploaded libraries so it will behave exactly like fat jar from Spark 1.X.
The easiest solution (without changing your Spark on YARN architecture and speaking to your YARN admins) is to:<ol> <li>
Define a library dependency on Spark 2 in your build system, be it sbt or maven.</li> <li>
Assemble your Spark application to create a so-called uber-jar or fatjar with Spark libraries inside.</li> </ol>
It works and I personally tested it at least once in a project.
The only (?) downside of it is that the build process takes longer (you have to
sbt assembly not
sbt package) and the size of your Spark application's deployable fatjar is...well...much bigger. That also makes the deployment longer since you have to
spark-submit it to YARN over the wire.
All in all, it works but takes longer (which may still be shorter than convincing your admin gods to, say forget about what is available in commercial offerings like Cloudera's CDH or Hortonworks' HDP or MapR distro).