Spark supports submitting jobs natively to a k8s scheduler since version 2.3.0, but even with its nice documentation you still have to piece some parts together to get it working properly. In this article, I intend to provide a local working example and explain why it made perfect sense for OLX to go this way.
Handling data is at the heart of what we do to deliver the best possible experience to our customers around the world (to understand how important this is for us, you can have a look at every data-related article written in the OLX Tech Blog). And handling data at a global company scale can be quite challenging, which pushes us to seek ways to improve on how we do it on a continuous basis.
Apache Spark plays a fundamental role in that space, being the most pervasive data processing technology inside OLX.
Why Spark on k8s?
Another technology at the core of OLX engineering is kubernetes, the de-facto standard platform for running most of our apps. K8s brings us many benefits: alignment (even on vocabulary, which is often overlooked but very important when dealing with multiple teams and applications), operational excellence, and easier standardisation of monitoring, alerting, logging and auto-scaling (cluster-wide solutions offload SREs from having to think about every single one of these aspects for each new application).
Given our current expertise in k8s, all we needed was “a little push” to start running spark there as well. And this push came in the form of:
- Cost savings, as many teams often deploy their own Spark clusters using Amazon EMR, which abstracts them from dealing with the complexity of provisioning and bootstrapping at the expense of a per-instance, per-second fee. Naturally, such costs can grow quite high on large datasets; and
- Serving https://kylin.apache.org/ properly without risking it becoming unmanageable in the future (we didn't want to trade EMR costs for the operational costs that a more complex solution would require).
Spark supports submitting jobs natively to a k8s scheduler since version 2.3.0. Yet, companies are only starting to embrace running spark on kubernetes (understandable, as the documentation still states that k8s scheduling support is experimental). By the way, if you haven't done so, go ahead and check https://spark.apache.org/docs/latest/running-on-kubernetes.html for a comprehensive guide about the concepts and configuration options.
At OLX, we decided to take it for a spin and piece everything together so you can test it easily, as this could lead to massive cost savings in the future.
These are the requirements for running spark on k8s:
- spark >= 2.3.0;
- a docker registry to host the spark images built by the docker-image-tool.sh script;
- a k8s cluster (for local tests, minikube will do) and tooling around it (such as kubectl and helm);
- [Recommended] if not running a local test, you'll want cluster-autoscaler, so that nodes are scaled up only when jobs are actually submitted;
- [Recommended] if running in AWS, you'll want something like kube2iam to assign proper permissions to the pods created by spark-submit;
- a sample spark job for testing.
TL;DR: check https://github.com/olx-global/spark-on-k8s. We've put the same steps you'll see here into a repo with a Makefile that simplifies downloading and installing all the tooling you'll need locally, without messing with your environment. For the long version, keep reading.
First, let's download and unpack spark in your local directory:
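A sketch of that step, assuming Spark 2.4.4 with Hadoop 2.7 binaries (adjust the version to whatever you're targeting):

```shell
# Download and unpack a Spark distribution; the version is an assumption.
SPARK_VERSION=2.4.4
curl -LO "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"
tar -xzf "spark-${SPARK_VERSION}-bin-hadoop2.7.tgz"
export SPARK_HOME="$PWD/spark-${SPARK_VERSION}-bin-hadoop2.7"
```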
At the time this article was written, we faced an issue that prevented spark from launching pods on recent kubernetes versions. The fix had already been merged, but not yet released.
To work around it without building spark yourself, replace the kubernetes-related jars in $SPARK_HOME/jars/ with newer versions:
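One way to sketch that swap, assuming the fabric8 kubernetes-client 4.x jars are the ones you need (pick the version that carries the fix):

```shell
# Replace the bundled fabric8 kubernetes jars with newer ones from Maven
# Central; the 4.4.2 version below is an assumption.
K8S_CLIENT_VERSION=4.4.2
cd "$SPARK_HOME/jars"
rm -f kubernetes-client-*.jar kubernetes-model-*.jar kubernetes-model-common-*.jar
for artifact in kubernetes-client kubernetes-model kubernetes-model-common; do
  curl -LO "https://repo1.maven.org/maven2/io/fabric8/${artifact}/${K8S_CLIENT_VERSION}/${artifact}-${K8S_CLIENT_VERSION}.jar"
done
cd -
```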
Now, let's start minikube (this might take a few minutes if doing it for the first time):
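For example (the CPU and memory sizing is an assumption; spark executors are memory-hungry, so give the VM a bit of room):

```shell
# Start a local single-node cluster with enough headroom for spark pods.
minikube start --cpus 4 --memory 8192
```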
And deploy a docker-registry on it using helm:
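A sketch using the stable/docker-registry chart; the release name is an assumption, and the `--name` flag is Helm 2 syntax (with Helm 3, drop it):

```shell
# Expose the registry on a NodePort so the docker daemon on your host
# can reach it at <minikube ip>:<node port>.
helm install stable/docker-registry --name registry \
  --set service.type=NodePort

# Check which port the registry was exposed on:
kubectl get svc registry-docker-registry
```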
Before continuing, you must now add your newly created registry to the insecure registries list of your docker daemon. As an example, if you're using Docker for Mac this is under Preferences -> Daemon:
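The file-based equivalent of that preferences pane (e.g. /etc/docker/daemon.json on Linux) would look like this; the address and port are assumptions, so use your minikube IP and the NodePort your registry actually got:

```json
{
  "insecure-registries": ["192.168.99.105:5000"]
}
```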
Replace 192.168.99.105 with the result from:
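That is:

```shell
# Prints the IP of the minikube VM, e.g. 192.168.99.105
minikube ip
```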
And restart the docker daemon.
Once the docker daemon is back, you're finally ready to build the spark docker images and push them to your registry! Here's how you do it:
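A sketch of those commands, run from $SPARK_HOME; the registry address, port and image tag are assumptions, so match them to your setup:

```shell
# docker-image-tool.sh ships with the spark distribution and builds the
# spark, spark-py and spark-r images, prefixed with the repo you pass in.
REGISTRY="$(minikube ip):5000"
cd "$SPARK_HOME"
./bin/docker-image-tool.sh -r "$REGISTRY" -t v2.4.4 build
./bin/docker-image-tool.sh -r "$REGISTRY" -t v2.4.4 push
```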
Now it's time to run a sample spark job! You can run either from your machine (outside the cluster) or from within a pod inside the cluster.
Running spark-submit from your machine
This is fairly straightforward:
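For example, submitting the SparkPi job from the examples jar baked into the image. The image name, tag and jar path below assume the 2.4.4 images built above, and the spark service account is assumed to exist with permission to manage pods (see the section on running from within the cluster):

```shell
# k8s:// plus the API server URL taken from the current kubeconfig context.
./bin/spark-submit \
  --master "k8s://$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')" \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image="$(minikube ip):5000/spark:v2.4.4" \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.4.jar
```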
While the spark job is running, try opening a new terminal to check the pods running:
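For instance:

```shell
# You should see a spark-pi driver pod plus its executor pods come and go.
kubectl get pods -w
```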
If everything went well, you should see a message in the end that states “Pi is roughly 3.14…”
Running spark-submit from within the cluster
If you want to run spark-submit from within a pod, you'll have to grant the pod access to the k8s API. This is done by creating a Role with the permissions and attaching it to the pod through a service account:
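One way to sketch that manifest; the names are assumptions, and the resource list is the minimal set spark typically needs (the official docs alternatively suggest binding the built-in edit ClusterRole to the service account):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
subjects:
- kind: ServiceAccount
  name: spark
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
```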
Save this as a yaml file, and apply it with kubectl apply -f.
Then, launch your pod using serviceAccountName: spark (this will link the pod you're running to the role you just created).
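A minimal sketch of such a pod; the pod name and image are assumptions (the image should be one you pushed to your registry, with spark-submit available inside):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-submitter
spec:
  serviceAccountName: spark   # links the pod to the role above
  containers:
  - name: spark
    image: 192.168.99.105:5000/spark:v2.4.4   # assumption: your pushed image
    command: ["sleep", "infinity"]            # keep it alive for kubectl exec
```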
For more information on how role-based access control works in kubernetes, check https://kubernetes.io/docs/reference/access-authn-authz/rbac/.
- In cluster mode, the driver pod is not the same pod that originally ran spark-submit. This helps, since the driver pod needs to be reachable from the executors and cluster mode takes care of that (which is not true for client mode). However, if your application relies on spark-submit output for logging or for deciding whether the job finished successfully, this won’t work (kylin relies on such output, for example).
- In client mode, the documentation suggests creating a headless service so executors can reach the spark driver. We found this setup is not ideal if you have more than one pod running spark-submit at the same time (executors might reach back to the wrong pod, causing intermittent job failures). The way out was to set the pod’s FQDN in $SPARK_HOME/conf/spark-defaults.conf, so each executor resolves its calling pod instead of an arbitrary one (that change allowed us to run multiple spark-submit jobs in parallel).
- If you rely on the performance of spark on top of HDFS, one of the key performance features is data locality, in other words the capability to schedule tasks as close as possible to the HDFS blocks they need to read. That capability is currently lost when deploying on kubernetes. Make sure to benchmark the actual impact this causes in your case before investing too much too early.
- One significant limitation we found was the lack of support for k8s tolerations. This prevents you from running spark jobs on specialised nodes exclusively (such as spot instances in AWS, or GPU-powered nodes). For that reason, we decided to patch Apache Spark and add this feature (currently available here: https://github.com/apache/spark/pull/26505), since it allows us to reduce costs significantly.
EDIT: spark 3.0.0-preview supports setting pod templates, which can be used to set tolerations among other configurations. Given it's already there for the next major release, there's no reason for the spark maintainers to merge this PR into branch-2.4. Check https://spark.apache.org/docs/3.0.0-preview/running-on-kubernetes.html#pod-template for more information.
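The client-mode FQDN fix mentioned above can be sketched as a single spark-defaults.conf entry; the pod, headless-service and namespace names below are placeholders for your own:

```
# $SPARK_HOME/conf/spark-defaults.conf inside the submitting pod:
# point executors at this pod's own FQDN instead of the shared service name.
spark.driver.host  spark-submitter.spark-driver-headless.default.svc.cluster.local
```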
When designing a spark cluster from scratch, using k8s may not currently be optimal from a performance point of view. However, it doesn't seem the lack of data locality will be an issue for much longer. Initiatives such as https://github.com/GoogleCloudPlatform/spark-on-k8s-operator (although beta, it's currently under heavy development) should eventually address this.
If you're already running applications in k8s (with a properly set up cluster, which comprises cluster-autoscaler and kube2iam at least), you'd be surprised by how stable running spark on k8s is. And, in OLX's case, it means zero extra maintenance and support, as everything can be managed by leveraging the same operational standards we've already mastered.
Running Spark on Kubernetes: a fully functional example and why it makes sense for OLX was originally published in OLX Group Engineering on Medium.