Running Apache Spark and S3 locally
If you don't have Apache Spark installed locally, follow the steps to install Spark on macOS.
First of all, you need the command to run an Apache Spark job locally. I usually run it like this:
spark-submit --deploy-mode client --master local[1] --class com.sample.App --name App target/path/to/your.jar argument1 argument2
Another consideration before we start is to use the correct S3 handler, since a few of them are already deprecated. This guide uses s3a, so make sure that all your S3 URLs look like s3a://bucket/path...
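For example, assuming you already have a SparkSession named spark, a read from S3 would look like this (the bucket and path are just placeholders):
// Use the s3a handler; the older schemes are the deprecated ones
val df = spark.read.csv("s3a://my-bucket/path/to/data.csv")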
Now, if you run the spark-submit command against the default Apache Spark installation, you will get the following error:
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
You need to add the required dependencies to your Spark installation. You could also declare them as dependencies of your own package, but they are usually already provided by the Spark installation (especially if you are using AWS EMR).
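For reference, if you build your job with sbt, such a dependency would be declared roughly like this (a sketch assuming an sbt build; the provided scope keeps the jar out of your package because the cluster already ships it):
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.7.3" % "provided"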
Step 1: Download Hadoop AWS
Check the Hadoop version that you are currently using. You can get it from any Hadoop jar present in your Spark installation. Since my Spark is installed in /usr/local/Cellar/apache-spark/2.4.5/libexec, I check the version as follows:
$ ls /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-*
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-annotations-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-auth-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-aws-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-client-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-common-2.7.3.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-hdfs-2.7.3.jar
As you can see, the Hadoop version is 2.7.3, so I downloaded the jar from the following URL: https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3
Copy the jar to the jars directory of your Spark installation (i.e. /usr/local/Cellar/apache-spark/2.4.5/libexec/jars).
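Assuming the jar landed in your Downloads folder, that is just:
$ cp ~/Downloads/hadoop-aws-2.7.3.jar /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/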
Step 2: Install AWS Java SDK
You have to download the exact same aws-java-sdk version that was used to build the hadoop-aws package. You can find it in the list of compile dependencies at the bottom of the Maven page you downloaded hadoop-aws from.
For hadoop-aws 2.7.3, the aws-java-sdk version is 1.7.4; you can download it from https://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk/1.7.4 . Once downloaded, copy it to the jars directory as well.
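As a quick sanity check (assuming the same Homebrew path as before), both jars should now show up next to the bundled Hadoop ones:
$ ls /usr/local/Cellar/apache-spark/2.4.5/libexec/jars/*aws*
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/aws-java-sdk-1.7.4.jar
/usr/local/Cellar/apache-spark/2.4.5/libexec/jars/hadoop-aws-2.7.3.jar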
Step 3: Configure AWS access and secret keys
You can choose any of the available methods to provide S3 credentials. You can use environment variables:
export AWS_ACCESS_KEY_ID=my.aws.key
export AWS_SECRET_ACCESS_KEY=my.secret.key
Or set them in your code (sc being your SparkContext):
sc.hadoopConfiguration.set("fs.s3a.access.key", "my.aws.key")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "my.secret.key")
There are many ways to do it; just use the one that best fits your needs.
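To put the pieces together, here is a minimal sketch of such a job (the object name, bucket and path are placeholders; the keys are read from the environment variables shown above instead of being hard-coded):
import org.apache.spark.sql.SparkSession

object S3ReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3ReadExample")
      .getOrCreate()

    // Hand the credentials to the s3a filesystem via the Hadoop configuration
    val sc = spark.sparkContext
    sc.hadoopConfiguration.set("fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))

    // Any S3 path must use the s3a scheme
    val df = spark.read.csv("s3a://my-bucket/path/to/data.csv")
    df.show()

    spark.stop()
  }
}
It runs with the same spark-submit command shown at the beginning of this guide.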
We are done! If everything has worked fine, you will be able to run the spark-submit command again with no errors!
Extra ball
If, after following all the steps, you are still getting the following exception:
Exception in thread "main" java.lang.NoClassDefFoundError: org/jets3t/service/S3ServiceException
Download jets3t from the Maven repository (https://mvnrepository.com/artifact/net.java.dev.jets3t/jets3t) and copy it to your jars directory as well.
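As a final note, if you prefer not to touch the jars of your Spark installation at all, spark-submit can also pick up extra jars at launch time through the --jars option; something along these lines (the jar paths are placeholders):
spark-submit --deploy-mode client --master local[1] --jars /path/to/hadoop-aws-2.7.3.jar,/path/to/aws-java-sdk-1.7.4.jar --class com.sample.App --name App target/path/to/your.jar argument1 argument2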
You can also find me on Twitter if you’d like to read similar technical tricks!