Apache Spark running on Kubernetes

Rahul Dey
9 min read · Jan 14, 2021

Let’s understand how the most powerful analytics engine for Big Data processing runs on Kubernetes.

Before that, we need to know what Docker is and why it has become one of the most popular tools in the DevOps world.

Docker Introduction

Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.

Containers are isolated from one another and bundle their own software, libraries and configuration files; they can communicate with each other through well-defined channels.

Consider setting up an end-to-end stack that includes various technologies.

Application stack: NodeJS, MongoDB, Redis, Ansible.

Issues while building the application: compatibility with the underlying OS, version compatibility of the libraries and dependencies, and having to re-check the underlying infrastructure every time a component such as the database is upgraded. This is the compatibility matrix issue.

It is also hard to set up new dev environments: you need the right OS and the right versions of every component, otherwise the application won't run the same way in different environments.

Developing, building and shipping the application becomes tough (the "matrix from hell").

Containers are isolated environments: they can have their own processes and services, networking interfaces and mounts, like a VM, but they share the same OS kernel.

An OS consists of a set of software plus the OS kernel. The kernel interacts with the underlying hardware; for example, distributions such as Fedora and Ubuntu are built on the Linux kernel.

The software layer may consist of different user interfaces, drivers, compilers, file managers, dev tools, etc.

A Docker image is an immutable (unchangeable) file that contains the source code, libraries, dependencies, tools, and other files needed for an application to run. Due to their read-only quality, these images are sometimes referred to as snapshots. They represent an application and its virtual environment at a specific point in time. Images are, in a way, just templates: you cannot start or run them. What you can do is use that template as a base to build a container.

A Docker container is a virtualized run-time environment where users can isolate applications from the underlying system. Containers are runnable instances of images.

Docker resolves the compatibility matrix issue: each piece of software, together with its libraries and dependencies, ships in its own Docker image.

Docker is lightweight because resources, such as the kernel, are shared between containers.

Unlike a VM, Docker does depend on the underlying OS kernel. Containerised versions of most applications are available on Docker Hub: images of operating systems, databases and other services.

Install Docker, pull an image, and run an instance of the required component version on the Docker host. You can even run multiple instances of a component at different versions and put a load balancer in front of them, as sketched below.
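A rough sketch of this, with hypothetical application image names and ports:

```bash
# Pull two versions of the same component image from Docker Hub
docker pull node:12
docker pull node:14

# Run an instance of each application version on the same Docker host,
# mapped to different host ports (my-app:v1 and my-app:v2 are hypothetical)
docker run -d --name app-v1 -p 8081:3000 my-app:v1
docker run -d --name app-v2 -p 8082:3000 my-app:v2

# A load balancer (nginx, HAProxy, ...) would then spread traffic
# across ports 8081 and 8082
docker ps   # list the running container instances
```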

Containers are running instances of images; they are isolated and have their own environment and set of processes. Most software products are already Dockerised and published on Docker Hub.

Docker Containers vs Virtual Machines
Each version of NodeJS has its own Docker image, which makes version upgrades easier.

Kubernetes — The Docker Orchestrator

With Docker, we can run a single instance of an application with the docker run command.

If the number of users increases and a single instance cannot handle the load, we have to deploy additional instances of the application manually and monitor their health; if a container fails because of an instance failure, we have to detect that ourselves and start another instance of the application (Node.js).

We also have to watch the health of the Docker host itself: if the host crashes, the containers hosted on it become inaccessible too.

Orchestration: monitoring the state, performance and health of containers by hand requires dedicated engineers. For large applications with thousands of containers, what is the practical approach?

We need container orchestration: a set of tools and scripts that can host containers in a production environment. An orchestration platform manages multiple Docker hosts, each running several containers, and allows you to deploy hundreds of instances of your application.

Some orchestration solutions can also scale hosts, as well as containers, up or down depending on the number of users. Apart from clustering and scaling, orchestration supports advanced networking between containers across different hosts, load balancing of user requests across hosts, and sharing storage between hosts.

Docker Swarm, however, lacks the complex auto-scaling features required for production-grade applications.

Kubernetes is a general-purpose container orchestration platform: it lets you deploy, manage and scale containerised applications.

docker run starts a single instance of an application; with Kubernetes, we can run 1,000 instances of the same application.

Kubernetes can auto-scale based on user load; it can upgrade 2,000 instances of an application in a rolling, one-by-one fashion, and if something goes wrong, it can roll back.

Kubernetes can also help test new features of an application by upgrading only a small percentage of the running instances. A few representative commands are sketched below.
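A minimal sketch, assuming a Deployment named my-app already exists in the cluster (the name and image tag are hypothetical):

```bash
# Scale out to 1000 replicas of the application
kubectl scale deployment my-app --replicas=1000

# Rolling upgrade to a new image version, one batch at a time
kubectl set image deployment/my-app my-app=my-app:v2

# Watch the rollout, and roll back if something goes wrong
kubectl rollout status deployment/my-app
kubectl rollout undo deployment/my-app

# Auto-scale between 10 and 100 replicas based on CPU load
kubectl autoscale deployment my-app --min=10 --max=100 --cpu-percent=80
```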

Kubernetes uses Docker hosts to run applications in the form of Docker containers.

Running 1,000 instances of an application.

Kubernetes Architecture

A node is a machine on which Kubernetes is installed; it is the worker machine on which Kubernetes launches containers. A cluster has multiple nodes, and the containers on them run inside units called pods (a pod is the smallest deployable Kubernetes object and wraps one or more containers). We keep more than one node in a cluster to achieve high availability in case of node failure. The master is responsible for managing the cluster: monitoring node failures and moving the workload of a failed node to another worker node.

Master helps in orchestration of containers on worker nodes.

Master-Slave Cluster Architecture.

Kubernetes consists of the following components:

API Server: acts as the front end for Kubernetes. Users, management devices and command-line interfaces all connect to the API server.

etcd: a distributed, reliable key-value store used by Kubernetes to store all the data needed to manage the cluster. It holds the entire state of the cluster: its configuration, specifications, and the statuses of the running workloads.

Scheduler: responsible for distributing work/containers across multiple nodes. The scheduler watches for newly created Pods that have no Node assigned and, for every Pod it discovers, becomes responsible for finding the best Node for that Pod to run on.

Controllers: the brains behind orchestration. They are responsible for noticing and responding when nodes or containers go down, and they bring up new containers.

Container runtime: the underlying software used to run containers; in our case, Docker.

Kubelet: the agent that runs on each node of the cluster. It makes sure the containers are running on its node as expected. Kubelet is part of Kubernetes, but at heart it is a container-oriented process watcher; you can even use it in isolation to manage containers running on a single host.

Components of Kubernetes

Running Spark on Kubernetes

There are two ways of running Spark on Kubernetes —

1) spark-submit

Prerequisites: a distribution of Spark 2.3 or above; a Kubernetes cluster at version >= 1.6 with access to it configured through kubectl; appropriate permissions to list, create, edit and delete pods in your cluster; and service account credentials for the driver pods that allow them to create pods, services, etc.
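For example, the Spark documentation creates a dedicated service account and grants it the edit ClusterRole so that driver pods may create executor pods and services (the spark name and default namespace are just conventions):

```bash
# Service account used by the Spark driver pods
kubectl create serviceaccount spark

# Allow it to create and delete pods/services in the default namespace
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=default:spark \
  --namespace=default
```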

•Prefixing the master string with k8s:// will cause the Spark application to launch on the Kubernetes cluster, with the API server being contacted at api_server_url.

•The location of the image used to create the Spark Docker containers must be given, via the spark.kubernetes.container.image configuration property.

Finally, notice that in the example below we specify a jar with a URI using the local:// scheme. This URL is the location of the example jar that is already inside the Docker image.

Spark-submit command
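A minimal sketch of such a command, modeled on the SparkPi example from the official docs (the API server URL and image name are placeholders to fill in):

```bash
./bin/spark-submit \
  --master k8s://https://<api_server_host>:<api_server_port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar
```

Here k8s:// tells Spark to submit to Kubernetes, the container image hosts both the driver and the executors, and the local:// jar path points inside that image (the exact examples jar name depends on your Spark build).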

2) YAML

In this approach, the Spark application is described declaratively in a YAML file and submitted to the cluster with kubectl; kubectl also shows the number of nodes running in the Kubernetes cluster.
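One common way to do this is through the Kubernetes Operator for Apache Spark, which accepts a SparkApplication manifest. The sketch below assumes that operator is installed; the names, image and versions are illustrative:

```bash
# Submit a Spark application declaratively from a YAML manifest
kubectl apply -f - <<'EOF'
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.0.0
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar
  sparkVersion: "3.0.0"
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark
  executor:
    cores: 1
    instances: 2
    memory: 512m
EOF

# Command to see the nodes running in the Kubernetes cluster
kubectl get nodes
```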

Spark on Kubernetes Architecture

You submit a Spark application by talking directly to Kubernetes (precisely to the Kubernetes API server on the master node) which will then schedule a pod (simply put, a container) for the Spark driver. Once the Spark driver is up, it will communicate directly with Kubernetes to request Spark executors, which will also be scheduled on pods (one pod per executor). If dynamic allocation is enabled the number of Spark executors dynamically evolves based on load, otherwise it’s a static number.
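For instance, with Spark 3.0+ on Kubernetes, dynamic allocation can be enabled through shuffle tracking, since there is no external shuffle service. A sketch of the extra flags for the spark-submit command shown earlier (the executor bounds are arbitrary):

```bash
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.shuffleTracking.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=50
```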

Spark integration with Kubernetes

Comparison: Spark running on YARN vs Spark running on Kubernetes

[Benchmark chart: query performance, AWS-managed Kubernetes service vs Apache YARN]

Advantages of running Spark on Kubernetes

By packaging a Spark application as a container, you reap the benefits of containers: your dependencies ship along with your application as a single entity. Concerns around library version mismatches with respect to Hadoop version compatibility become easier to manage using containers.

You gain the ability to run Spark applications in the Kubernetes cluster in full isolation from each other (e.g. on different Spark versions) while enjoying the cost-efficiency of a shared infrastructure.

If running on Kubernetes: you’re free to choose your Spark version for each separate Docker image! This is in contrast to YARN, where you must use the same global Spark version for the entire cluster.

Unifying your entire tech infrastructure under a single cloud agnostic tool (if you already use Kubernetes for your non-Spark workloads). Kubernetes allows for the unification of infrastructure management under a single type of cluster for multiple workload types. You can run Spark, of course, but you can also run Python or R code, notebooks and even web applications.

In the traditional Spark-on-YARN world, you need a dedicated Hadoop cluster for your Spark processing and something else for Python, R, etc. With Kubernetes, you can build an end-to-end lifecycle solution using a single orchestrator and easily reproduce the stack in other regions, in other clouds, or even on-premises. Running Spark on Kubernetes means building once and deploying anywhere.

Dependency management, which is notoriously painful with Spark, also gets easier. You can choose to build a new Docker image for each app, or use a smaller set of Docker images that package most of your needed libraries and dynamically add your application-specific code on top, as sketched below.
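A rough sketch of that layering, with hypothetical registry, image and file names:

```bash
# Build a shared base image once, then layer only the app-specific code on top
cat > Dockerfile <<'EOF'
# Hypothetical in-house base image: Spark plus commonly used libraries
FROM my-registry/spark-base:3.0.0
# Add only this application's code
COPY target/my-spark-app.jar /opt/app/my-spark-app.jar
EOF

docker build -t my-registry/my-spark-app:1.0 .
docker push my-registry/my-spark-app:1.0
```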

Security: you can use Kubernetes node selectors to secure infrastructure dedicated to Spark workloads. In addition, since driver pods create executor pods, you can use a Kubernetes service account to control permissions, with a Role or ClusterRole defining fine-grained access control, and run your workload securely alongside other workloads.
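Both of these knobs surface as Spark configuration properties. A sketch, with a hypothetical node label (the service account matches the one created earlier):

```bash
# Pin Spark pods to nodes labelled for Spark workloads, and run the
# driver under a restricted service account
--conf spark.kubernetes.node.selector.workload=spark \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
```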

Disadvantages of Spark Running on Kubernetes

At the time of writing, this is still a beta feature and not ready for production yet. Several features are still missing or maturing, such as full dynamic resource allocation, in-cluster staging of dependencies, and complete support for PySpark and SparkR.

Support for Kerberized HDFS clusters, as well as client mode and popular interactive notebook execution environments, is still being worked on and not yet available.

Use Case

Spark is superior to Hadoop MapReduce primarily because of its memory re-use and caching. It is blazing fast, but only as long as you provide enough RAM for the workers. The moment it spills data to disk during calculations, and keeps it there instead of in memory, getting the query result becomes slower. So it stands to reason that you want to keep as much memory as possible available for your Spark workers. But it is very expensive to keep a static datacenter of either bare-metal or virtual machines with enough memory capacity to run the heaviest queries. Since the queries that demand peak resource capacity are not frequent, it would be a waste of resources and money to try to accommodate the maximum required capacity in a static cluster.

The better solution is to create a dynamic Kubernetes cluster which will use the needed resources on demand and release them when the heavy lifting is not required anymore. This covers many common situations, like a summary query that runs just once a day but consumes 15 terabytes of memory across the cluster to complete the calculations on time and present the summary report to a customer's portal. It would be expensive to keep 15 terabytes of memory as a static cluster (equivalent to 30 huge bare-metal machines, or 60–80 extra-large virtual machines on AWS), when your average daily memory needs are just around 1–2 terabytes and fluctuating.

Credits: Data Mechanics (Spark on Kubernetes Architecture)
