If you don't have Apache Spark installed locally, follow the steps to install Spark on macOS first.
First of all, you need the command to run an Apache Spark job locally. I usually run them with:
spark-submit --deploy-mode client --master local --class com.sample.App --name App target/path/to/your.jar argument1 argument2
Another consideration before we start is to use the correct S3 handler, since a few of them are already deprecated. This guide uses s3a, so make sure that all your S3 URLs look like s3a://<bucket>/<path>.
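For reference, here is how the different schemes look side by side; this guide assumes s3a everywhere (the bucket and key names below are placeholders):

```shell
# The S3 filesystem schemes you may come across; only s3a works with this setup.
good="s3a://my-bucket/input/data.csv"   # s3a: the current, supported connector
# s3n://my-bucket/input/data.csv        # s3n: deprecated
# s3://my-bucket/input/data.csv         # s3:  legacy scheme (on EMR it maps to EMRFS instead)
echo "$good"
```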
Now, if you run the spark-submit command using the default Apache Spark installation, you will get the following error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
You need to include the required dependencies in your Spark installation. You could instead declare them as dependencies of your own project, but these packages are usually already provided by the Spark installation (especially if you are using AWS EMR).
Step 1: Download Hadoop AWS
Check the Hadoop version that you are currently using. You can get it from any jar present in your Spark installation. Since my Spark is installed in /usr/local/Cellar/apache-spark/2.4.5/libexec, I check the version as follows:
$ ls /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-*
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-annotations-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-auth-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-aws-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-client-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-common-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-hdfs-2.7.3.jar
As you can see, the Hadoop version is 2.7.3, so I downloaded the jar from the following URL: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3
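If you prefer not to eyeball the listing, the version can be read straight off a bundled jar's filename with plain shell string stripping (the filename below is the one from this guide's installation; substitute your own):

```shell
# Read the Hadoop version off a bundled jar's filename, e.g. hadoop-common-2.7.3.jar
jar="hadoop-common-2.7.3.jar"
version="${jar#hadoop-common-}"   # strip the prefix -> "2.7.3.jar"
version="${version%.jar}"         # strip the extension -> "2.7.3"
echo "$version"
```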
Copy the jar to the jars directory of your Spark installation (i.e. /usr/local/Cellar/apache-spark/2.4.5/libexec/jars).
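The copy is a one-liner; the download location (~/Downloads) is an assumption, and the jars directory is the one from this guide's Homebrew installation, so adjust both to your setup:

```shell
# Copy hadoop-aws into Spark's bundled jars directory (adjust paths to your setup)
cp ~/Downloads/hadoop-aws-2.7.3.jar /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/
```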
Step 2: Install AWS Java SDK
You have to download exactly the same version that was used to build the hadoop-aws package. You can find it among the compile dependencies at the bottom of the page you downloaded hadoop-aws from, as you can see in the image.
As you can see, the aws-java-sdk version is 1.7.4. You can download it from https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 . Once downloaded, copy it to the same jars directory.
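As in step 1, this is a single copy, plus a quick sanity check that both jars now sit next to each other (paths are the ones from this guide, and ~/Downloads is an assumption; adjust to your installation):

```shell
# Copy the SDK next to hadoop-aws and verify that both jars are in place
SPARK_JARS=/usr/local/Cellar/apache-spark/2.4.5/libexec/jars  # adjust to your install
cp ~/Downloads/aws-java-sdk-1.7.4.jar "$SPARK_JARS/"
ls "$SPARK_JARS"/hadoop-aws-*.jar "$SPARK_JARS"/aws-java-sdk-*.jar
```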
Step 3: Configure AWS access and secret keys
You can choose any of the available methods to provide S3 credentials. You can use environment variables:
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
Or set them in your code:
sc.hadoopConfiguration.set("fs.s3a.access.key", "my.aws.key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "my.secret.key")
There are many ways of doing it; just use the one that best fits your needs.
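One more option, if you would rather not touch code or the environment: Spark forwards any --conf prefixed with spark.hadoop. into the Hadoop configuration, so the same s3a keys can be passed on the command line (the key values below are placeholders):

```shell
# Sketch: provide s3a credentials via Spark's spark.hadoop.* passthrough
spark-submit \
  --deploy-mode client --master local \
  --conf spark.hadoop.fs.s3a.access.key=my.aws.key \
  --conf spark.hadoop.fs.s3a.secret.key=my.secret.key \
  --class com.sample.App --name App target/path/to/your.jar argument1 argument2
```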
We are done! If everything has worked fine, you will be able to run the spark-submit command again with no errors!
If, after following all the steps, you are still getting the following exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
download jets3t from the Maven repository (https://mvnrepository.com/artifact/net.java.dev.jets3t/jets3t) and copy it to your jars path as well.
You can also find me on Twitter if you’d like to read similar technical tricks!