As we have seen in previous posts (Launching an EMR cluster using Lambda functions to run PySpark scripts, parts 1 and 2), EMR is a very popular big data platform, designed to simplify running Big Data frameworks such as Hadoop and Spark. Taking advantage of this platform allows us to scale our compute capacity up and down, leverage cheap spot instances and integrate seamlessly with our Data Lake. When we combine EMR with Lambdas, we can launch our clusters programmatically using serverless FaaS, reacting to new data becoming available or triggering on a cron schedule.
In this post, we propose to take this approach of using Lambdas to trigger EMR clusters one step further, and use AWS Step Functions to orchestrate our ML process. AWS Step Functions is a fully managed service that makes it easy to build and run multi-step workflows. It allows us to define a series of steps that make up our workflow, and then Step Functions executes them in order. These steps can run sequentially or in parallel, wait for previous stages to finish, or run conditionally depending on the outcome of previous steps. The image below shows an example of a Step Function: a series of steps, where we can see a Start, Steps, Decisions, Alternatives, and an End. This is ideal for real-life workflows, where different decisions need to be made as the process progresses.
Another advantage of this approach is the seamless integration of Step Functions with other services, such as Lambda, which allows us to build a complex workflow without having to orchestrate it manually. Step Functions also provides easy, out-of-the-box monitoring and error handling, simplifying our retry logic and overall process monitoring.
So, hoping I have convinced you to try this new approach, let’s get hands on!
The Step Functions can be generated in two different ways: using the Step Functions visual editor, or leveraging CloudFormation. When I began this post, I wanted to use CloudFormation, but I ran into a CloudFormation limit: it doesn’t allow us to create Step Functions with a definition as long as the one I needed. This is why I will use the Step Functions visual editor, combined with the Amazon States Language.
From the AWS website:
The Amazon States Language is a JSON-based, structured language used to define your state machine, a collection of states, that can do work (Task states), determine which states to transition to next (Choice states), stop an execution with an error (Fail states), and so on.
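To make these state types concrete, here is a minimal, purely illustrative state machine (all names and values are made up for this example) that combines a Pass state standing in for real work, a Choice state, and a Fail state:

```json
{
  "Comment": "Tiny illustrative state machine; names and values are made up.",
  "StartAt": "CheckInput",
  "States": {
    "CheckInput": {
      "Type": "Pass",
      "Result": { "is_valid": true },
      "ResultPath": "$.validation",
      "Next": "IsValid"
    },
    "IsValid": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.validation.is_valid", "BooleanEquals": true, "Next": "Done" }
      ],
      "Default": "InvalidInput"
    },
    "InvalidInput": {
      "Type": "Fail",
      "Error": "InvalidInput",
      "Cause": "The input did not pass validation"
    },
    "Done": { "Type": "Succeed" }
  }
}
```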
We will define a workflow made up of three steps: data preprocessing, model training and model evaluation. What follows is a set of examples, coded in Python using PySpark, and a step-by-step guide to launching our first Step Function!
For this, we will need to create five things:
– preprocessing_script.py (a minimal sketch of this script is shown after this list)
– training_script.py
– evaluation_script.py
– An SNS topic for error alerts
– And the required permissions: at a minimum, the Step Function’s role needs to create and terminate EMR clusters, pass the EMR service roles, and publish to our SNS topic.
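As a reference for what these scripts might contain, here is a minimal sketch of a hypothetical preprocessing_script.py – the S3 paths, column names and transformations are placeholders, not the actual ones used in this workflow:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder paths: replace with your actual Data Lake locations
INPUT_PATH = "s3://my-bucket/raw/events/"
OUTPUT_PATH = "s3://my-bucket/preprocessed/events/"

spark = SparkSession.builder.appName("preprocessing").getOrCreate()

# Read the raw data from the Data Lake
df = spark.read.parquet(INPUT_PATH)

# Illustrative cleaning: drop rows with missing values and derive a feature
clean_df = (
    df.dropna()
      .withColumn("event_date", F.to_date("event_timestamp"))
)

# Write the preprocessed data back to S3 for the training step
clean_df.write.mode("overwrite").parquet(OUTPUT_PATH)

spark.stop()
```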
Once we have all of these elements ready, we’ll need to replace their values in our Step Function definition, which I share below.
At a high level, we define three steps for each stage of our workflow: cluster creation, code execution and cluster termination. We also define a “catch-all” step that will alert via SNS if there is an error in any of the code execution steps. Something interesting to highlight is that Step Functions has a predefined step type for each of the steps we need (createCluster, addStep and terminateCluster), which significantly reduces the amount of code we need to write for our orchestration:
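Since every environment is different, treat the following as a minimal sketch of such a definition rather than a drop-in implementation: the bucket, account id, SNS topic, roles, instance types and release label are all placeholders that you’ll need to replace with your own values.

```json
{
  "Comment": "Sketch: create an EMR cluster, run the PySpark steps, alert via SNS on error, terminate.",
  "StartAt": "CreateCluster",
  "States": {
    "CreateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "MLWorkflowCluster",
        "ReleaseLabel": "emr-5.30.0",
        "Applications": [{ "Name": "Spark" }],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "LogUri": "s3://my-bucket/emr-logs/",
        "Instances": {
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceGroups": [
            { "InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1 },
            { "InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2 }
          ]
        }
      },
      "ResultPath": "$.cluster",
      "Next": "PreprocessingStep"
    },
    "PreprocessingStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.cluster.ClusterId",
        "Step": {
          "Name": "Preprocessing",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/preprocessing_script.py"]
          }
        }
      },
      "ResultPath": "$.preprocessing",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleErrors" }],
      "Next": "TrainingStep"
    },
    "TrainingStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.cluster.ClusterId",
        "Step": {
          "Name": "Training",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/training_script.py"]
          }
        }
      },
      "ResultPath": "$.training",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleErrors" }],
      "Next": "EvaluationStep"
    },
    "EvaluationStep": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.cluster.ClusterId",
        "Step": {
          "Name": "Evaluation",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/scripts/evaluation_script.py"]
          }
        }
      },
      "ResultPath": "$.evaluation",
      "Catch": [{ "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleErrors" }],
      "Next": "TerminateCluster"
    },
    "HandleErrors": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sns:publish",
      "Parameters": {
        "TopicArn": "arn:aws:sns:us-east-1:123456789012:ml-workflow-alerts",
        "Message": "The ML workflow failed. Check the Step Functions console for details."
      },
      "ResultPath": "$.alert",
      "Next": "TerminateCluster"
    },
    "TerminateCluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster",
      "Parameters": { "ClusterId.$": "$.cluster.ClusterId" },
      "End": true
    }
  }
}
```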
Some interesting aspects are worth noting. For example, the parameter "ClusterId.$": "$.cluster.ClusterId" in each addStep task makes reference to the step above: the CreateCluster step stores its output under $.cluster through its ResultPath, which allows us to create the cluster in one step, obtain its id and then use it in another step.
If we wanted to do more complex things, the idea is the same – we have to reference the data through the step’s ResultPath or OutputPath. More details about this can be found in the AWS documentation on input and output processing.
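As a tiny, purely illustrative example of these two fields working together (the state and values below are made up), ResultPath grafts a result into the state’s input, and OutputPath then selects which part of it gets passed on:

```json
{
  "StartAt": "ExtractClusterId",
  "States": {
    "ExtractClusterId": {
      "Type": "Pass",
      "Comment": "ResultPath grafts the result into the input at $.cluster; OutputPath then passes only $.cluster on.",
      "Result": { "ClusterId": "j-EXAMPLE" },
      "ResultPath": "$.cluster",
      "OutputPath": "$.cluster",
      "End": true
    }
  }
}
```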
Going back to our state machine creation – once we’ve replaced the values, we can author a new Step Function. Navigate to the Step Functions console, select “Create State Machine”, and choose the option “Write your workflow in code”. This will open the screen we see below, which includes a basic “Hello World” application. Replace the code in the definition, and refresh the graph.
Once you’ve refreshed the graph, you should see the following:
The next steps are to name your Step Function and assign it a role (the one we created above would be ideal), then create it with the “Create State Machine” button. Allow a few minutes for it to complete, and you’re ready to go!
You’ll find a button to “Start Execution” – which will trigger your steps. Once it starts executing, the flowchart will update with the different statuses – blue for steps in progress, green for successful steps, orange for caught errors, and red for unexpected errors.
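If we’d rather trigger the workflow programmatically – for example, from one of the Lambdas mentioned at the beginning – a minimal sketch using boto3 could look like this (the state machine ARN, execution name and input are placeholders):

```python
import json
import boto3

# Placeholder ARN: replace with your state machine's actual ARN
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:ml-workflow"

sfn = boto3.client("stepfunctions")

# Start an execution; the input becomes the initial state of the workflow
response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    name="ml-workflow-run-001",  # optional, must be unique per state machine
    input=json.dumps({"run_date": "2020-01-01"}),
)
print(response["executionArn"])
```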
It’s important to highlight that the EMR clusters will be available via the usual EMR console, which simplifies process monitoring for people who are familiar with it. The fact that they were launched via Step Functions is completely transparent to the end users.
Going back to our Step Function and error handling, in the image below we can see two different errors. The step “TrainingStep” caught an error (which is part of the expected workflow), but the step “HandleErrors” failed. This allows us to easily find what failed and where.
In this other image, we can see that the workflow failed, but the error handling succeeded.
This is important to highlight because caught errors are part of our workflow and we have prepared for them (e.g. with alerting). Unexpected errors might leave our work in an incomplete state that requires manual intervention.
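To make the distinction concrete, here is the TrainingStep state from the sketch above, extended with a hypothetical Retry policy (the retry values are illustrative): errors matched by Retry are retried, errors matched by Catch become part of the expected workflow, and anything unmatched fails the execution.

```json
"TrainingStep": {
  "Type": "Task",
  "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
  "Parameters": {
    "ClusterId.$": "$.cluster.ClusterId",
    "Step": {
      "Name": "Training",
      "ActionOnFailure": "CONTINUE",
      "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": ["spark-submit", "s3://my-bucket/scripts/training_script.py"]
      }
    }
  },
  "Retry": [
    { "ErrorEquals": ["States.Timeout"], "IntervalSeconds": 60, "MaxAttempts": 2, "BackoffRate": 2.0 }
  ],
  "Catch": [
    { "ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": "HandleErrors" }
  ],
  "ResultPath": "$.training",
  "Next": "EvaluationStep"
}
```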
Well, thanks for making it this far and I hope you’re excited to try this out!
We can see that spinning up a set of EMR clusters is quite easy with Step Functions, and it allows us to leverage the power of EMR in a simple and straightforward way. As a final recommendation, I’d like to highlight some best practices for using these two services together: always terminate your clusters once the work is done (our terminateCluster step takes care of this), and catch errors explicitly so that failures trigger alerts instead of leaving work in an incomplete state.
With all of this in mind, onwards to your own experiments! And don’t hesitate to write with any issues or questions – happy to help!
Happy coding! 🙂