Running Spark on Amazon Web Services — Big Data on Cloud

Rahul Dey
9 min read · Nov 18, 2020

The cluster computing framework Spark and its Resilient Distributed Datasets (RDDs) were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm. Since then, Spark has been processing data up to 100x faster than MapReduce, and it typically uses YARN as the resource manager when running on Hadoop clusters. Ever wondered how a fully functional data pipeline written in Spark and Scala is implemented on a cloud platform like AWS? We will walk through that in this blog, but first let us understand the AWS ecosystem.

What is AWS?

AWS is a set of scalable, on-demand servers and services hosted by Amazon and rented by companies. Amazon Web Services (AWS) is a secure cloud services platform offering compute power, database storage, content delivery, and other functionality to help businesses scale and grow, for example by running web and application servers in the cloud to host dynamic websites. AWS lets you select the operating system, programming language, web application platform, database, and other services you need: you receive a virtual environment into which you load the software and services your application requires. Netflix, for instance, uses AWS for nearly all of its computing and storage needs.

Regions and AZs in AWS

AWS now spans 77 Availability Zones within 24 geographic regions around the world, and the AWS Mumbai Region alone has over 100,000 active customers in India. An Availability Zone (AZ) is one or more discrete data centers with redundant power, networking, and connectivity in an AWS Region. All traffic between AZs is encrypted, and network performance is sufficient to accomplish synchronous replication between AZs. AZs make partitioning applications for high availability easy: if an application is partitioned across AZs, companies are better isolated and protected from issues such as power outages, lightning strikes, tornadoes, earthquakes, and more. AZs are physically separated from each other by a meaningful distance, many kilometers, although all are within 100 km (60 miles) of one another. Enabling Multi-AZ deployments therefore makes an application fault tolerant and highly available.

Region and AZs under them
Code for different regions
User Interface of Amazon Web Services.
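To make the region and AZ codes concrete, here is a minimal boto3 sketch (not from the original post; the Mumbai region code ap-south-1 is used only as an example) that lists the available regions and the AZs under one of them:

```python
# List region codes and the AZs under the configured region using boto3.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")  # ap-south-1 is the Mumbai region

# Region codes such as us-east-1, eu-west-1, ap-south-1, ...
regions = [r["RegionName"] for r in ec2.describe_regions()["Regions"]]
print("Regions:", regions)

# AZs under the configured region, e.g. ap-south-1a, ap-south-1b, ap-south-1c
zones = [z["ZoneName"] for z in ec2.describe_availability_zones()["AvailabilityZones"]]
print("AZs in ap-south-1:", zones)
```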

Amazon S3 (Simple Storage Service)

S3, or Simple Storage Service, is the AWS service primarily used as a data lake. S3 is a durable, scalable, highly available, key-based data storage infrastructure: when you store data, you assign a unique object key that can later be used to retrieve it. Amazon S3 offers a range of storage classes designed for different use cases. These include:

• S3 Standard for general-purpose storage of frequently accessed data. USE — Big Data analytics

• S3 Intelligent-Tiering for data with unknown or changing access patterns. USE — Datastore for disaster recovery

• S3 Standard-Infrequent Access (S3 Standard-IA) for less frequently accessed data. USE — Storing secondary backup copies of on-premises data.

• S3 One Zone-Infrequent Access (S3 One Zone-IA) for long-lived but less frequently accessed data. Data is stored in a single AZ and is lost if that AZ is destroyed.

• Amazon S3 Glacier (S3 Glacier). USE — Archiving/backup purposes. The classes differ from each other in durability, availability, cost, and the number of AZs used.

Storage classes of S3
Buckets in S3
Objects in a S3 Bucket
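As a rough illustration of key-based storage and storage classes, here is a hedged boto3 sketch; the bucket name, object key, and file below are made up for illustration:

```python
# Store an object under a unique key with an explicit storage class, then read it back.
import boto3

s3 = boto3.client("s3")

with open("sales.csv", "rb") as f:
    s3.put_object(
        Bucket="my-datalake-bucket",            # hypothetical bucket name
        Key="raw/sales/2020/11/18/sales.csv",   # the key used later to retrieve the data
        Body=f,
        StorageClass="STANDARD_IA",             # or STANDARD, INTELLIGENT_TIERING, ONEZONE_IA, GLACIER
    )

# Retrieve the same object by its key
obj = s3.get_object(Bucket="my-datalake-bucket", Key="raw/sales/2020/11/18/sales.csv")
print(obj["Body"].read()[:100])
```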

Features of S3

Lifecycle policy to delete older files
Setting the transition period
Cross region replication of S3 buckets
Setting the Replication Rule.
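A lifecycle policy like the one in the screenshots can also be set from code. The sketch below is only an assumption of how such a rule might look (the bucket name, prefix, and periods are placeholders):

```python
# Transition older files to cheaper storage classes and finally delete them.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-datalake-bucket",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # delete files older than a year
            }
        ]
    },
)
```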

HDFS vs S3

  • Scalability — HDFS relies on local storage and scales horizontally: if you want to increase your storage space, you either have to add larger hard drives to existing nodes or add more machines to the cluster. This is feasible, but more costly and complicated than S3. S3 scales automatically and elastically according to your current data usage, without any need for action on your part. — S3 Wins.
  • Durability — A statistical model for HDFS data durability suggests that the probability of losing a block of data (64 megabytes by default) on a large 4,000 node cluster (16 petabytes total storage, 250,736,598 block replicas) is 0.00000057 (5.7 x 10^-7) in the next 24 hours, and 0.00021 (2.1 x 10^-4) in the next 365 days. S3 provides a durability of 99.999999999% of objects per year, which means that if you store 10,000,000 objects, you can on average expect to lose a single object once every 10,000 years. — S3 Wins.
  • Cost — In order to preserve data integrity, HDFS stores three copies of each block of data by default. This means exactly what it sounds like: HDFS requires triple the amount of storage space for your data — and therefore triple the cost. While you don’t have to enable data replication in triplicate, storing just one copy is highly risky, putting you in danger of data loss. Amazon handles the issue of data backups on S3 itself, so you pay for only the storage that you actually need. S3 also supports storing compressed files, which can help slash your storage costs — S3 Wins.

EC2 Instance Types (Nodes of the Cluster)

  • General purpose instances provide a balance of compute, memory, and networking resources and can be used for a variety of diverse workloads. These instances are ideal for applications that use these resources in equal proportions, such as web servers and code repositories (denoted by the m5 family).
  • Memory optimized instances are designed to deliver fast performance for workloads that process large data sets in memory. Amazon EC2 R6g instances are powered by Arm-based AWS Graviton2 processors and deliver up to 40% better price performance over current-generation R5 instances for memory-intensive applications.
  • The other types are compute optimized (high-performance web servers, high performance computing (HPC), scientific modeling, dedicated gaming servers) and storage optimized (workloads that require high, sequential read and write access to very large data sets on local storage).
General Purpose Instance types and their configuration

EC2 Instance Creation

When creating an instance, choose the instance type, subnet (which determines the AZ), IAM role, CloudWatch monitoring, and security groups (for SSH login). Once it is running, monitor its health, RAM used, disk used, and access logs.

Running instance in AWS
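For reference, the same instance creation can be scripted with boto3. Everything below (the AMI ID, key pair, security group, subnet, and IAM role names) is a placeholder you would replace with your own values:

```python
# Launch a single general purpose instance with monitoring and an IAM role attached.
import boto3

ec2 = boto3.client("ec2", region_name="ap-south-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # placeholder AMI
    InstanceType="m5.xlarge",                    # general purpose instance
    KeyName="my-key-pair",                       # key pair for SSH login
    SecurityGroupIds=["sg-0123456789abcdef0"],   # security group allowing SSH
    SubnetId="subnet-0123456789abcdef0",         # the subnet pins the instance to an AZ
    IamInstanceProfile={"Name": "my-ec2-role"},  # IAM role, e.g. with S3 access
    Monitoring={"Enabled": True},                # detailed CloudWatch monitoring
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
```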

IAM — Identity and Access Management

IAM has three building blocks: users, groups, and roles, and permissions are granted through policies written in JSON. The account we work from day to day should be an IAM account, not the root account: the root user has unlimited access, while an IAM user has only the access it is granted. IAM is a global service, and policies can be attached to or detached from users, groups, and roles. For example, an RDS/EMR/EC2 role can be given read-write access to S3.

Setting user permission for certain services/user/groups/roles.
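To make the users/groups/roles/policies idea concrete, here is a hedged sketch that creates a JSON policy granting read-write access to one hypothetical S3 bucket and attaches it to an existing role; the role and bucket names are illustrative only:

```python
# Create a JSON policy for S3 read-write access and attach it to a role.
import json
import boto3

iam = boto3.client("iam")

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-datalake-bucket",
                "arn:aws:s3:::my-datalake-bucket/*",
            ],
        }
    ],
}

policy = iam.create_policy(
    PolicyName="s3-read-write",
    PolicyDocument=json.dumps(policy_document),
)

# Attach to a role (use attach_user_policy / attach_group_policy for users and groups)
iam.attach_role_policy(
    RoleName="emr-ec2-role",            # hypothetical role assumed by EMR/EC2/Lambda
    PolicyArn=policy["Policy"]["Arn"],
)
```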

EMR Cluster — Elastic MapReduce

It is the industry-leading cloud big data platform for processing vast amounts of data using open-source tools such as Apache Spark, Apache Hive, Apache HBase, Apache Flink, and Presto. With EMR you can run petabyte-scale analysis at less than half the cost of traditional on-premises solutions and over 3x faster than standard Apache Spark. A few points a solutions architect considers for minimizing the cost:

•For short-running jobs, you can spin up and spin down clusters and pay per second for the instances used.

•For long-running workloads, you can create highly available clusters that automatically scale to meet demand. You can also create a Zeppelin notebook on top of the cluster for interactive analysis.

•It is essentially a managed Hadoop framework running on EC2 instances.

  • The states of an EMR cluster are STARTING, RUNNING, WAITING, TERMINATING, TERMINATED (all steps completed), and TERMINATED_WITH_ERRORS.
  • The states of the underlying EC2 instances are Running, Stopping, Stopped, and Shutting-Down.
  • Change the security groups of the master and slave nodes to allow SSH access, using a key pair for login.
  • Calculate the cluster configuration before provisioning.
EMR cluster UI — a 2-node cluster with 1 master and 1 slave node

So we can store the final processed data in S3 and use the HDFS of the EMR cluster only for the temporary data generated while shuffling data across the cluster during processing.
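A minimal PySpark sketch of that pattern might look like the following; the S3 paths and column names are assumptions, and the shuffle files produced by the groupBy stay on the cluster's HDFS:

```python
# Read raw data from S3, transform it on the EMR cluster, write only final output back to S3.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("emr-s3-example").getOrCreate()

# Raw data previously landed in the S3 data lake
raw = spark.read.json("s3://my-datalake-bucket/raw/events/")

# The groupBy triggers a shuffle; its temporary files live on the EMR cluster's HDFS
daily_counts = raw.groupBy("event_date", "event_type").count()

# Only the final processed data is persisted back to S3
daily_counts.write.mode("overwrite").parquet("s3://my-datalake-bucket/processed/daily_counts/")
```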

AWS Lambda

•AWS Lambda lets you run code without provisioning or managing servers, and you pay only for the compute time you consume. With Lambda, you can run code for virtually any type of application or backend service, all with zero administration: just upload your code and Lambda takes care of everything required to run and scale it with high availability. You can set up your code to be triggered automatically from other AWS services or call it directly from any web or mobile app.

Serverless — scaling is automated, meaning that servers are added automatically according to the requirements of your application. Serverless also means that we no longer have to manage the servers; we only have to manage the functions.

•Example — Virtual servers like EC2 instances are limited by CPU and RAM and are continuously running. Auto Scaling requires intervention to configure how servers are added and removed, and Elastic Load Balancing redirects traffic to different servers based on user traffic. With Lambda, both of these happen automatically.

•With Lambda functions we do not have to manage the servers, we can run services on demand, and scaling is completely automated.

•Calculate the cluster configuration you need, then write the Lambda function in Python that provisions it.

Question — Lambda is going to access S3 and EMR. How many policies do we need to attach to its role?

Lambda script in Python to spin up EMR cluster
Giving the cluster configuration in the code
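The exact script in the screenshots is not reproduced here, but a hedged sketch of such a Lambda function, using boto3's run_job_flow call with assumed names, instance sizes, and S3 paths, could look like this:

```python
# Lambda handler that spins up a transient EMR cluster and submits one Spark step.
import boto3

def lambda_handler(event, context):
    emr = boto3.client("emr", region_name="ap-south-1")

    response = emr.run_job_flow(
        Name="ingestion-cluster",
        ReleaseLabel="emr-5.31.0",
        Applications=[{"Name": "Spark"}],
        LogUri="s3://my-datalake-bucket/emr-logs/",
        Instances={
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the steps finish
        },
        Steps=[{
            "Name": "spark-ingestion",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         "s3://my-datalake-bucket/jobs/ingest.py"],
            },
        }],
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return {"ClusterId": response["JobFlowId"]}
```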

Triggering Lambda functions through Amazon CloudWatch.

Triggering Lambda functions when S3 has new incoming data

When a file is uploaded to the S3 bucket, the Lambda function will be triggered automatically

How S3 trigger for Lambda function works

After the S3 bucket trigger is created and the S3 full-access policy is attached to the AWS Lambda function's IAM role, the function is invoked automatically for every matching object event in the bucket.
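A minimal sketch of the Lambda handler on the receiving end of that trigger is shown below; it relies only on the standard S3 event structure, and what you do with each new file is up to your pipeline:

```python
# Lambda handler invoked by an S3 "ObjectCreated" trigger: each record carries
# the bucket name and object key of the newly uploaded file.
def lambda_handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New file arrived: s3://{bucket}/{key}")
        # ... spin up the EMR cluster / submit a Spark step for this file here
    return {"processed": len(event["Records"])}
```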

AWS Step Functions

AWS Step Functions is a serverless function orchestrator that makes it easy to sequence AWS Lambda functions and multiple AWS services into business-critical applications. The output of one step acts as the input to the next, and each step in your application executes in order, as defined by your business logic. CloudWatch can be used to trigger a Step Function at regular intervals, and the cluster configuration can be calculated per source before the corresponding Lambda runs.

Visualization of the data pipeline running through the Step Function (each node is a Lambda function here).
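As a rough sketch of how the Lambdas could be chained, the state machine below sequences hypothetical ingest, transform, and load functions using the Amazon States Language; the function ARNs and the role are placeholders:

```python
# Create a Step Functions state machine that runs three Lambdas in sequence,
# with each state's output becoming the next state's input.
import json
import boto3

sfn = boto3.client("stepfunctions")

definition = {
    "Comment": "Spark data pipeline",
    "StartAt": "Ingest",
    "States": {
        "Ingest": {"Type": "Task",
                   "Resource": "arn:aws:lambda:ap-south-1:123456789012:function:ingest",
                   "Next": "Transform"},
        "Transform": {"Type": "Task",
                      "Resource": "arn:aws:lambda:ap-south-1:123456789012:function:transform",
                      "Next": "Load"},
        "Load": {"Type": "Task",
                 "Resource": "arn:aws:lambda:ap-south-1:123456789012:function:load",
                 "End": True},
    },
}

sfn.create_state_machine(
    name="spark-data-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-functions-role",  # placeholder role
)
```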

Architecture of a Spark data pipeline running on the AWS ecosystem.

We have four data sources here: a MySQL database, an SFTP server, S3, and MongoDB. Four different Ingestion Lambda functions, once triggered, spin up four EMR clusters in parallel and run Spark jobs in each of them as data starts arriving from these sources, ultimately storing the ingested data in its raw format in an S3 bucket. The next step is to trigger the Transform Lambda function, which again spins up an EMR cluster and runs Spark jobs to transform the data; there can be one or more transformation levels depending on the complexity of the data. At the end, the Load Lambda function is triggered, which stores the final transformed data in Amazon Redshift. We can club all the Lambda functions together in a single Step Function and define their trigger sequence: Ingestion Lambdas for ingesting raw data, Transform Lambdas for transforming the data, and finally Load Lambdas for loading the data into the target tables. This Step Function can be scheduled at regular intervals using Amazon CloudWatch.

A CloudWatch Logs metric filter is used to aggregate failures and raise an alarm. Subscribe an SNS topic to the alarm to send an email to your account if a particular Spark job or the entire data pipeline fails.
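A hedged sketch of wiring that up with boto3 is shown below; the log group name, email address, and thresholds are assumptions:

```python
# Count "ERROR" lines in a pipeline log group, alarm when the count is non-zero,
# and notify an SNS email subscription.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

topic_arn = sns.create_topic(Name="pipeline-failures")["TopicArn"]
sns.subscribe(TopicArn=topic_arn, Protocol="email", Endpoint="me@example.com")

logs.put_metric_filter(
    logGroupName="/aws/lambda/transform",  # placeholder log group
    filterName="spark-job-failures",
    filterPattern="ERROR",
    metricTransformations=[{
        "metricName": "SparkJobFailures",
        "metricNamespace": "DataPipeline",
        "metricValue": "1",
    }],
)

cloudwatch.put_metric_alarm(
    AlarmName="spark-pipeline-failure",
    Namespace="DataPipeline",
    MetricName="SparkJobFailures",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[topic_arn],
)
```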

Spark Data Pipeline running on AWS Ecosystem


Rahul Dey

Data Engineer at JP Morgan Chase. I work on real-time, ridiculously large data (Big Data).