In a world of interconnected devices, the amount of data being generated is skyrocketing. In order to analyze and process this data, engineers often employ Machine Learning (ML) techniques that allow them to gather valuable insights and actionable information. However, as the volume of data continues to grow, the need to run big data processing pipelines arises, a demand your laptop can’t cope with.
Therefore, the use of distributed frameworks in fleets of instances is becoming increasingly popular. In order to run a distributed framework in a cluster platform like AWS EMR effectively, it is important to have a good understanding of the underlying architecture and infrastructure. In the second part of our two-blog series, we present a hands-on tutorial where we will dive into the details of setting up and configuring the necessary infrastructure to launch an EMR cluster using Lambda functions to run a sample pySpark script, the one we introduced in part one. You will need an AWS account to follow this tutorial as we’ll be using their services.
EMR (Elastic MapReduce) is an AWS solution for running big data frameworks. It provides a managed cluster platform that simplifies running tools such as Apache Hadoop, Apache Spark and Presto in order to process and analyze large volumes of data. Additionally, it provides several options for storing data, including Amazon S3, Hadoop Distributed File System (HDFS), and HBase.
EMR offers flexibility and scalability against changing processing demands by allowing you to easily adjust the number and type of node-instances in the cluster. The costs will depend on the quantity and type of instances launched. You have the opportunity to further minimize the cost by buying Reserved Instances or Spot Instances instead of on-demand. Spot Instances can provide considerable cost savings, sometimes being as low as one-tenth of on-demand pricing.
It can integrate with other AWS services to provide different capabilities and functionalities. IAM integration can be used to manage fine-grained access, also provides data encryption in transit and at rest. Instances can be launched in a virtual network of your choosing. Furthermore, it provides monitoring and logging capabilities that enable you to track cluster performance and diagnose issues.
Overall, AWS EMR is a great platform for processing and analyzing data since it provides the ability to handle complex workloads, scale as needed, lots of customization and other services integrations. We chose it for today’s tutorial and for our Spark pipeline. The parallel aspect of the framework allows us to process data and train ML models much faster and cost-effective.
There are many ways in which one could trigger an EMR Cluster. In this tutorial we are going to be launching the cluster through a Lambda, another very popular tool that we will be describing next. We opted for this method to launch the cluster because it simplifies the process of updating cluster configurations as well as the various parameters required for the script that we will be running.
Prior to creating the EMR cluster, you must create a new VPC or select an existing one that will include the subnets in which the cluster will be launched. It is recommended that you use private subnets, as having a direct route to the internet would make the cluster vulnerable to external threats.
It’s also important to ensure that the networking configuration is compatible with the requirements of the EMR cluster. This includes ensuring that the subnet has sufficient IP addresses to accommodate the number of nodes in the cluster, and that the networking is configured to allow communication between the nodes.
Furthermore, you may want to consider implementing additional network infrastructure such as a VPN or direct connect to allow secure access to the cluster from on-premises data centers or other cloud environments.
Another must is having a reliable storage system to store the scripts and data we will be using on the cluster. In this blog we will use S3. Amazon S3 is a scalable and durable object storage service that is designed to store and retrieve any amount of data from anywhere on the web.
It enables easy data transfer and sharing between different applications and services. The EMR cluster can read data from the S3 bucket and write back the results to it. Using an S3 bucket to store your EMR scripts and data provides a secure, scalable, and cost-effective solution that we can’t stop recommending.
Lastly, we will need 3 different roles, one for implementing the Lambda and two different roles for running the EMR Cluster. While it’s possible to use a single role with permissions for both services, it is not a recommended practice.
Lambda functions require an execution role. It is a permission set that defines the level of access an AWS Lambda function has to AWS services and resources. When you create a Lambda function, you need to specify an execution role that grants permission for the function to interact with other AWS services, such as S3 and EMR. As a minimum it must include the following AWS managed permissions:
AWSLambdaBasicExecutionRole, AmazonEMRFullAccessPolicy_v2 and AmazonElasticMapReduceFullAccess.
EMR will require a job flow role. This role is used to manage clusters and execute jobs. The role allows EMR to access resources that are necessary for cluster creation and operation, such as launching and terminating Amazon EC2 instances, reading and writing data to Amazon S3, and accessing other AWS services. We will select the EMR Role for EC2 option as trusted entity for this role and include AmazonElasticMapReduceforEC2Role in permitted policies.
EMR will also require a service role. This role will be responsible for granting permissions to the EMR service to perform tasks that are necessary for EMR’s operation, such as creating, modifying, or deleting resources used by EMR clusters. This includes creating temporary security groups, network interfaces, and instance profiles for the EMR cluster’s nodes. We can use the AWS pre-defined EMR_DefaultRole.
AWS Lambda is another AWS Service that allows developers to run code without the need to provision or manage servers since it manages all aspects of the infrastructure and resources, including server maintenance, operating system updates, automatic scaling, logging and more. Instead of deploying an entire server or container, you simply write code and have it executed in response to events.
It is highly available and implements a pay-as-you-go pricing model, which means you only pay for the compute time that your function uses, making it cost-effective. Use cases include file and stream processing, web applications, IoT and mobile backends, among others.
Log in to your AWS account and type Lambda on the search bar. Access the service and select the Create function button. Fill in basic information for your Lambda function. On this occasion we will create a script from scratch and will be using Python as our runtime language.
Regarding the execution role, you could create a new role or use an existing one.
We chose the role emr-launching-lambda-role we previously created with the mentioned attached policies. After you create the lambda, you will be redirected to the Code source section inside the function, which will contain an environment with a lambda_function.py file.
AWS Lambda handlers are the entry point for the code that runs in an AWS Lambda function. They are the functions that AWS Lambda invokes when it executes your function code. Since we specified a Python runtime, the file has a .py extension and we will be writing Python code. Under the Code source section you will find a Runtime settings section. There you will find the Handler property that defines a method inside a file that is our entry point, initially being the lambda_handler method inside the lambda_function file.
The AWS Lambda function handler has two parameters, event and context. Event represents the input data, it contains information about the triggering event that caused the invocation of the function, for example an HTTP request, a scheduled event or a message from another AWS service. Context parameter provides information about the current invocation and execution environment as well as the Lambda function’s runtime, such as the function name, version, and ARN, the AWS request ID, the CloudWatch Logs stream name, and the function’s remaining execution time.
As per this blog, we will be including some arguments in our Event parameter while configuring our test event, it could include any sort of information in JSON format.
The code we include on the handler file, will begin by importing some libraries, then defining the EMR Cluster and lastly implementing a return value.
By making use of boto3, a Python library that provides an interface to easily create, manage and configure AWS resources, we will be defining and creating our EMR cluster. The policies attached to the execution role we previously associated, grant the Lambda function permissions to access other AWS services, so in order to enable integration with EMR we previously attached the AmazonEMRFullAccessPolicy_v2 policy.
The first thing we will do is extract the parameters from the event that will be used in the script. These are max_iter, which refers to the maximum number of iterations the model can run over the data to learn. X1 and X2, which are meant to be the coefficients in our polynomial function. All parameters default to 10 in case they can’t be obtained from the event.
Now is time to configure the EMR cluster. We will begin by defining the basics.
Name is the name that will be adopted by the cluster. LogUri will point to an S3 folder where to store the logs, although this is not required, we highly recommend you to have EMR logs available to ease debugging. ReleaseLabel attribute is used to specify the version of software applications and Hadoop components used by the EMR cluster. Finally, Applications refers to the packages we will use and need installed on the cluster.
Now we will be defining the group of instances used on the cluster. The amount of computing resources that will be used is an absolute overkill for the task in hand. It’s done that way to show how to set up multiple instances in a cluster.
EMR always requires a master node to coordinate the overall cluster and distribute tasks to worker nodes, there are typically one or more worker nodes. These are responsible for executing data processing tasks and storing data in the HDFS. We will allocate m4.large instances for both master and slave nodes. There can only be a single master node at a time but a large number of worker nodes.
KeepJobFlowAliveWhenNoSteps determines whether the cluster should terminate automatically when there are no more running steps. Having the cluster alive can save you some time as long as you are submitting steps but it will continue to incur costs, even if there are no running steps.
Ec2SubnetId should be a subnet created within the VPC we previously mentioned.
Here we are defining a step for the EMR cluster, a step is a unit of work that you can add to a cluster to perform a specific task or processing on your data. It consists of an executable program or script, along with any necessary configuration and input/output data. There are several types of steps. The code defines one that involves submitting a Spark job to execute a script stored in S3 with some previously defined parameters.
On the last definition block we will set the VisibleToAllUsers flag to true, that way allowing us to see the cluster that was created by the lambda without the need to deal with IAM policies. We will also define JobFlowRole and ServiceRole, both previously described. You can also see how to include a tag although this is not necessary but definitely recommended if you want to use tools such as the AWS Cost Explorer more efficiently.
Let’s finish by returning a statusCode and a body with the EMR Cluster Id.
Putting it all together, the handler would look like this:
Here is a small snippet for how the Lambda’s test event could look like:
You just need to test it after deploying the changes by pressing the Test button and poof! You got yourself an EMR running cluster that will train a regression model to try to approximate a polynomial function.
In order to see the output of the model just open the cluster logs after it has finished. On S3 go to steps > s-ID_OF_NODE > stdout.gz and you will see the desired output.
In a world of interconnected devices, the amount of data being generated is skyrocketing. In order to analyze and process this data, engineers often employ Machine Learning (ML) techniques that allow them to gather valuable insights and actionable information. However, as the volume of data continues to grow, the need to run big data processing pipelines arises, a demand your laptop can’t cope with.
Therefore, the use of distributed frameworks in fleets of instances is becoming increasingly popular. In order to run a distributed framework in a cluster platform like AWS EMR effectively, it is important to have a good understanding of the underlying architecture and infrastructure. In the second part of our two-blog series, we present a hands-on tutorial where we will dive into the details of setting up and configuring the necessary infrastructure to launch an EMR cluster using Lambda functions to run a sample pySpark script, the one we introduced in part one. You will need an AWS account to follow this tutorial as we’ll be using their services.
EMR (Elastic MapReduce) is an AWS solution for running big data frameworks. It provides a managed cluster platform that simplifies running tools such as Apache Hadoop, Apache Spark and Presto in order to process and analyze large volumes of data. Additionally, it provides several options for storing data, including Amazon S3, Hadoop Distributed File System (HDFS), and HBase.
EMR offers flexibility and scalability against changing processing demands by allowing you to easily adjust the number and type of node-instances in the cluster. The costs will depend on the quantity and type of instances launched. You have the opportunity to further minimize the cost by buying Reserved Instances or Spot Instances instead of on-demand. Spot Instances can provide considerable cost savings, sometimes being as low as one-tenth of on-demand pricing.
It can integrate with other AWS services to provide different capabilities and functionalities. IAM integration can be used to manage fine-grained access, also provides data encryption in transit and at rest. Instances can be launched in a virtual network of your choosing. Furthermore, it provides monitoring and logging capabilities that enable you to track cluster performance and diagnose issues.
Overall, AWS EMR is a great platform for processing and analyzing data since it provides the ability to handle complex workloads, scale as needed, lots of customization and other services integrations. We chose it for today’s tutorial and for our Spark pipeline. The parallel aspect of the framework allows us to process data and train ML models much faster and cost-effective.
There are many ways in which one could trigger an EMR Cluster. In this tutorial we are going to be launching the cluster through a Lambda, another very popular tool that we will be describing next. We opted for this method to launch the cluster because it simplifies the process of updating cluster configurations as well as the various parameters required for the script that we will be running.
Prior to creating the EMR cluster, you must create a new VPC or select an existing one that will include the subnets in which the cluster will be launched. It is recommended that you use private subnets, as having a direct route to the internet would make the cluster vulnerable to external threats.
It’s also important to ensure that the networking configuration is compatible with the requirements of the EMR cluster. This includes ensuring that the subnet has sufficient IP addresses to accommodate the number of nodes in the cluster, and that the networking is configured to allow communication between the nodes.
Furthermore, you may want to consider implementing additional network infrastructure such as a VPN or direct connect to allow secure access to the cluster from on-premises data centers or other cloud environments.
Another must is having a reliable storage system to store the scripts and data we will be using on the cluster. In this blog we will use S3. Amazon S3 is a scalable and durable object storage service that is designed to store and retrieve any amount of data from anywhere on the web.
It enables easy data transfer and sharing between different applications and services. The EMR cluster can read data from the S3 bucket and write back the results to it. Using an S3 bucket to store your EMR scripts and data provides a secure, scalable, and cost-effective solution that we can’t stop recommending.
Lastly, we will need 3 different roles, one for implementing the Lambda and two different roles for running the EMR Cluster. While it’s possible to use a single role with permissions for both services, it is not a recommended practice.
Lambda functions require an execution role. It is a permission set that defines the level of access an AWS Lambda function has to AWS services and resources. When you create a Lambda function, you need to specify an execution role that grants permission for the function to interact with other AWS services, such as S3 and EMR. As a minimum it must include the following AWS managed permissions:
AWSLambdaBasicExecutionRole, AmazonEMRFullAccessPolicy_v2 and AmazonElasticMapReduceFullAccess.
EMR will require a job flow role. This role is used to manage clusters and execute jobs. The role allows EMR to access resources that are necessary for cluster creation and operation, such as launching and terminating Amazon EC2 instances, reading and writing data to Amazon S3, and accessing other AWS services. We will select the EMR Role for EC2 option as trusted entity for this role and include AmazonElasticMapReduceforEC2Role in permitted policies.
EMR will also require a service role. This role will be responsible for granting permissions to the EMR service to perform tasks that are necessary for EMR’s operation, such as creating, modifying, or deleting resources used by EMR clusters. This includes creating temporary security groups, network interfaces, and instance profiles for the EMR cluster’s nodes. We can use the AWS pre-defined EMR_DefaultRole.
AWS Lambda is another AWS Service that allows developers to run code without the need to provision or manage servers since it manages all aspects of the infrastructure and resources, including server maintenance, operating system updates, automatic scaling, logging and more. Instead of deploying an entire server or container, you simply write code and have it executed in response to events.
It is highly available and implements a pay-as-you-go pricing model, which means you only pay for the compute time that your function uses, making it cost-effective. Use cases include file and stream processing, web applications, IoT and mobile backends, among others.
Log in to your AWS account and type Lambda on the search bar. Access the service and select the Create function button. Fill in basic information for your Lambda function. On this occasion we will create a script from scratch and will be using Python as our runtime language.
Regarding the execution role, you could create a new role or use an existing one.
We chose the role emr-launching-lambda-role we previously created with the mentioned attached policies. After you create the lambda, you will be redirected to the Code source section inside the function, which will contain an environment with a lambda_function.py file.
AWS Lambda handlers are the entry point for the code that runs in an AWS Lambda function. They are the functions that AWS Lambda invokes when it executes your function code. Since we specified a Python runtime, the file has a .py extension and we will be writing Python code. Under the Code source section you will find a Runtime settings section. There you will find the Handler property that defines a method inside a file that is our entry point, initially being the lambda_handler method inside the lambda_function file.
The AWS Lambda function handler has two parameters, event and context. Event represents the input data, it contains information about the triggering event that caused the invocation of the function, for example an HTTP request, a scheduled event or a message from another AWS service. Context parameter provides information about the current invocation and execution environment as well as the Lambda function’s runtime, such as the function name, version, and ARN, the AWS request ID, the CloudWatch Logs stream name, and the function’s remaining execution time.
As per this blog, we will be including some arguments in our Event parameter while configuring our test event, it could include any sort of information in JSON format.
The code we include on the handler file, will begin by importing some libraries, then defining the EMR Cluster and lastly implementing a return value.
By making use of boto3, a Python library that provides an interface to easily create, manage and configure AWS resources, we will be defining and creating our EMR cluster. The policies attached to the execution role we previously associated, grant the Lambda function permissions to access other AWS services, so in order to enable integration with EMR we previously attached the AmazonEMRFullAccessPolicy_v2 policy.
The first thing we will do is extract the parameters from the event that will be used in the script. These are max_iter, which refers to the maximum number of iterations the model can run over the data to learn. X1 and X2, which are meant to be the coefficients in our polynomial function. All parameters default to 10 in case they can’t be obtained from the event.
Now is time to configure the EMR cluster. We will begin by defining the basics.
Name is the name that will be adopted by the cluster. LogUri will point to an S3 folder where to store the logs, although this is not required, we highly recommend you to have EMR logs available to ease debugging. ReleaseLabel attribute is used to specify the version of software applications and Hadoop components used by the EMR cluster. Finally, Applications refers to the packages we will use and need installed on the cluster.
Now we will be defining the group of instances used on the cluster. The amount of computing resources that will be used is an absolute overkill for the task in hand. It’s done that way to show how to set up multiple instances in a cluster.
EMR always requires a master node to coordinate the overall cluster and distribute tasks to worker nodes, there are typically one or more worker nodes. These are responsible for executing data processing tasks and storing data in the HDFS. We will allocate m4.large instances for both master and slave nodes. There can only be a single master node at a time but a large number of worker nodes.
KeepJobFlowAliveWhenNoSteps determines whether the cluster should terminate automatically when there are no more running steps. Having the cluster alive can save you some time as long as you are submitting steps but it will continue to incur costs, even if there are no running steps.
Ec2SubnetId should be a subnet created within the VPC we previously mentioned.
Here we are defining a step for the EMR cluster, a step is a unit of work that you can add to a cluster to perform a specific task or processing on your data. It consists of an executable program or script, along with any necessary configuration and input/output data. There are several types of steps. The code defines one that involves submitting a Spark job to execute a script stored in S3 with some previously defined parameters.
On the last definition block we will set the VisibleToAllUsers flag to true, that way allowing us to see the cluster that was created by the lambda without the need to deal with IAM policies. We will also define JobFlowRole and ServiceRole, both previously described. You can also see how to include a tag although this is not necessary but definitely recommended if you want to use tools such as the AWS Cost Explorer more efficiently.
Let’s finish by returning a statusCode and a body with the EMR Cluster Id.
Putting it all together, the handler would look like this:
Here is a small snippet for how the Lambda’s test event could look like:
You just need to test it after deploying the changes by pressing the Test button and poof! You got yourself an EMR running cluster that will train a regression model to try to approximate a polynomial function.
In order to see the output of the model just open the cluster logs after it has finished. On S3 go to steps > s-ID_OF_NODE > stdout.gz and you will see the desired output.