Building a Big Data Processing Pipeline – Chapter 1

Juan Irabedra
July 27, 2022

First steps towards building a real-time big data processing pipeline with Apache Kafka, Apache Flink and Scala: the project

Every business can take advantage of data. An important challenge for organizations is to design powerful systems that can transform data into strategic decisions as fast as possible. The truth is, large amounts of data can become overwhelming.

In this article we will introduce two modern technologies that may serve as allies when it comes to processing large volumes of data in real time: Apache Flink and Apache Kafka.

Through a series of articles we will build a pipeline that works as the foundation for many interesting applications.


Why Apache Flink?

Apache Spark is a tried and true open source framework that allows near real-time data processing. Despite Spark being a popular choice among big data developers, Apache Flink is becoming one of the top technologies for a particular niche: real-time processing.

Unlike Flink, Apache Spark does not offer native stream processing. Spark uses micro-batch processing, which can approximate real time, but with slightly higher latency than Flink's. Flink also optimizes jobs automatically, so there is no need to keep an eye on shuffles, broadcasts and so on.

Why Apache Kafka?

Many software products can serve as message brokers. Each has its own perks and use cases: some offer easy integration, while others offer, for example, robustness.

One of the most popular message brokers is RabbitMQ, which can serve as a reliable source of message streaming in many use cases. Big data applications, however, require certain guarantees that RabbitMQ does not provide: partition tolerance, high throughput, horizontal scalability and strong redundancy. Kafka offers all of these, making it an ideal candidate for real-time big data processing.

Why Scala?

Scala is a fast-growing language that blends the best of the object-oriented paradigm with the best of functional programming. It is versatile, succinct and remarkably performant. Scala compiles to Java bytecode, which is then executed by the JVM.

Spark made Scala one of the top programming languages for big data applications: Scala is generally the best-performing language for writing Spark jobs, which led the big data ecosystem to regard it not only as an elegant language but also as a fast one. As a plus, both Flink and Kafka are written in Scala (as well as in Java).

The project

We will briefly discuss how to build a short real-time processing pipeline out of four main components. The following diagram depicts those components and how they should interact.

The first component is a Spring Boot REST API. This API exposes an endpoint that accepts HTTP POST requests. The body of these requests can be any serialized object format (JSON, Protobuf, XML and so on). In this example, the endpoint will expect a JSON object containing the Kafka topic to publish the message to, the Kafka message key and the Kafka message value.
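As a rough sketch of what sits behind that endpoint, the handler could simply map the deserialized body onto a Kafka producer record. The case class, object and field names below are illustrative assumptions; only the broker address (a local standalone broker on localhost:9092) reflects the setup described in this series.

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    // Hypothetical request payload: the topic to publish to, the message key and the message value.
    case class PublishRequest(topic: String, key: String, value: String)

    object KafkaGateway {
      // A minimal producer pointing at a standalone broker on localhost:9092.
      private val props = new Properties()
      props.put("bootstrap.servers", "localhost:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      private val producer = new KafkaProducer[String, String](props)

      // What the POST handler does once the JSON body has been deserialized.
      def publish(request: PublishRequest): Unit =
        producer.send(new ProducerRecord(request.topic, request.key, request.value))
    }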

The second element involved in the pipeline is a Kafka topic. This topic will act as a message exchange. To keep things simple, the Kafka setup for this example will run in local standalone mode.
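To make this concrete, the topic could be created with Kafka's AdminClient as sketched below. The topic name, single partition and replication factor of 1 are assumptions that fit a local standalone broker; the kafka-topics shell script achieves the same thing.

    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

    object CreateTopic {
      def main(args: Array[String]): Unit = {
        // Assumes a single standalone broker listening on localhost:9092.
        val props = new Properties()
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
        val admin = AdminClient.create(props)

        // One partition and replication factor 1 are enough for a local setup.
        val topic = new NewTopic("pipeline-messages", 1, 1.toShort)
        admin.createTopics(Collections.singleton(topic)).all().get()
        admin.close()
      }
    }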

The third element is a Flink job, which will run on a local cluster in standalone mode. The Flink job is responsible for subscribing to the Kafka topic as a consumer. It will process the messages and push them to a Cassandra table running in a Docker container.
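A skeleton of that job might look roughly like the following. It assumes Flink's Kafka and Cassandra connectors are on the classpath, a broker on localhost:9092, and the hypothetical topic, keyspace and table names used throughout these sketches; upcoming articles will flesh out the real implementation.

    import java.util.Properties
    import org.apache.flink.api.common.serialization.SimpleStringSchema
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.cassandra.CassandraSink
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

    object KafkaToCassandraJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Consume from the hypothetical topic created earlier.
        val kafkaProps = new Properties()
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092")
        kafkaProps.setProperty("group.id", "flink-pipeline")

        val messages: DataStream[String] = env.addSource(
          new FlinkKafkaConsumer[String]("pipeline-messages", new SimpleStringSchema(), kafkaProps)
        )

        // A trivial "processing" step: turn each message into an (id, body) tuple
        // matching the Cassandra table sketched in the next section.
        val rows: DataStream[(String, String)] =
          messages.map(value => (java.util.UUID.randomUUID().toString, value))

        CassandraSink
          .addSink(rows)
          .setQuery("INSERT INTO pipeline.messages(id, body) VALUES (?, ?);")
          .setHost("127.0.0.1")
          .build()

        env.execute("kafka-to-cassandra")
      }
    }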

Finally, as mentioned above, we have a Cassandra table within a keyspace. Cassandra is a NoSQL distributed database. It is an ideal candidate for applications that need a manageable replication factor for the data.
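For reference, the keyspace and table assumed in the Flink sketch above could be created from Scala with the DataStax driver, as shown below. The names, and the replication factor of 1, are assumptions that suit a single Cassandra node running in a local container.

    import com.datastax.oss.driver.api.core.CqlSession

    object CassandraSetup {
      def main(args: Array[String]): Unit = {
        // With no explicit contact point, the driver connects to 127.0.0.1:9042,
        // which matches a Cassandra container publishing the default port locally.
        val session = CqlSession.builder().build()

        // Replication factor 1 is enough for a single local node.
        session.execute(
          "CREATE KEYSPACE IF NOT EXISTS pipeline " +
            "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
        )
        session.execute(
          "CREATE TABLE IF NOT EXISTS pipeline.messages (id text PRIMARY KEY, body text)"
        )
        session.close()
      }
    }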

We will cover the setup for Kafka, Flink and Cassandra. The producer for the Kafka topic can be any application capable of writing to Kafka, or the Kafka console producer script itself.

This setup can fairly easily be hosted in the cloud, but in this article we chose to run everything locally to better portray what is going on behind the scenes. We built a similar pipeline with a managed Cassandra keyspace hosted on AWS Keyspaces. Since Flink does not allow customizing the consistency level for Scala tuples, the cloud version of this pipeline used the Cassandra Flink connector for Java to sink POJOs.

In upcoming articles we will build the components involved in the pipeline. For the sake of simplicity, this example will cover an on-premise setup, starting with Kafka.

Stay tuned for more articles about top-notch technology and practices related to making the most out of your data!
