Running Apache Spark and S3 locally

If you don't have Apache Spark installed locally, follow the steps to install Spark on macOS first.

First of all, you need the command to run an Apache Spark job locally. I usually run mine like this:

spark-submit --deploy-mode client --master local[1] --class com.sample.App --name App target/path/to/your.jar argument1 argument2

Another consideration before we start is to use the correct S3 handler. There are a few of them, and some are already deprecated; this guide uses s3a, so make sure that all your S3 URLs look like s3a://bucket/path...
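For example, assuming a standard SparkSession and a placeholder bucket and path, a read using the s3a scheme would look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("App").getOrCreate()
// Note the s3a:// scheme; the bucket and path are just placeholders
val df = spark.read.option("header", "true").csv("s3a://my-bucket/path/to/data.csv")
df.show()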

Now, if you run the spark-submit command using the default Apache Spark installation, you will get the following error:

java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated

You need to include the required dependencies in your Spark installation. You could also declare them as dependencies of your own project, but these packages are usually already provided by the Spark installation (especially if you are using AWS EMR).
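If you do want to declare them in your own project, a minimal build.sbt sketch could look like the following (the versions are only examples and must match the Hadoop version of your Spark installation; marking them as provided keeps them out of your assembled jar):

// build.sbt -- versions are examples, match them to your Hadoop version (see Step 1)
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"   % "2.7.3" % "provided",
  "com.amazonaws"     % "aws-java-sdk" % "1.7.4" % "provided"
)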

Step 1: Download Hadoop AWS

Check the Hadoop version you are currently using. You can get it from any of the Hadoop jars present in your Spark installation. Since my Spark is installed in /usr/local/Cellar/apache-spark/2.4.5/libexec, I check the version as follows:

$ ls /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-*

/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-annotations-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-auth-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-aws-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-client-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-common-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-hdfs-2.7.3.jar

As you can see, the Hadoop version is 2.7.3, so I downloaded the jar from the following URL: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3

Copy the jar to your Spark installation's jars directory (i.e. /usr/local/Cellar/apache-spark/2.4.5/libexec/jars).
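For example, assuming the jar ended up in your Downloads folder:

$ cp ~/Downloads/hadoop-aws-2.7.3.jar /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/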

Step 2: Install AWS Java SDK

You have to download the exact same aws-java-sdk version that was used to build the hadoop-aws package. You can find it in the dependencies listed at the bottom of the hadoop-aws page you downloaded the jar from.

In this case the aws-java-sdk version is 1.7.4; you can download it from https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4. Once downloaded, copy it to the same jars directory.

Step 3: Configure AWS access and secret keys

You can choose any of the available methods to provide S3 credentials. You can use environment variables:

export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key

Or set them in your code:

sc.hadoopConfiguration.set("fs.s3a.access.key", "my.aws.key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "my.secret.key")

There are many ways to do it; just use the one that best fits your needs.

We are done! If everything has worked fine, you will be able to run the spark-submit command again with no errors!
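
A quick way to double-check the setup is to open a spark-shell and read something from one of your buckets (the bucket and file below are placeholders):

$ spark-shell
scala> spark.read.textFile("s3a://my-bucket/path/to/file.txt").count()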

Extra ball

If, after following all the steps, you are getting the following exception:

Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException

Download jets3t from the Maven repository (https://mvnrepository.com/artifact/net.java.dev.jets3t/jets3t) and copy it to your jars directory.


You can also find me on Twitter if you’d like to read similar technical tricks!
