Building a Real-time Streaming Pipeline with Spark, Kafka, and Cassandra: A Comprehensive Guide

This tutorial walks through building a real-time streaming pipeline with Spark, Kafka, and Cassandra: Kafka ingests the data, Spark Streaming processes it, and Cassandra stores the results. We go through the setup of each component, the key configuration, and the Spark shell code that ties them together. By the end of this guide, you’ll have a working starting point for your own streaming applications.





1. Start Kafka


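Start ZooKeeper first, then the Kafka broker. Each command runs a start script under bin/ with a properties file under config/:

bin/ config/

bin/ config/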

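Create a topic

bin/ --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic exampletopic

List the topics

bin/ --list --zookeeper localhost:2181

Start a console producer

bin/ --broker-list localhost:9092 --topic exampletopic

Start a console consumer (older Kafka releases connect through ZooKeeper)

bin/ --zookeeper localhost:2181 --topic exampletopic --from-beginning

On newer Kafka releases, point the consumer at the broker instead

bin/ --bootstrap-server localhost:9092 --topic exampletopic --from-beginning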

2. Start Spark


3. Start Cassandra

bin/cassandra -f

Create the keyspace and table (run these statements in cqlsh)

CREATE KEYSPACE sparkdata WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE sparkdata;
CREATE TABLE cust_data (fname text, lname text, url text, product text, cnt counter, PRIMARY KEY (fname, lname, url, product));

Check the table contents

SELECT * FROM cust_data;

Spark Kafka Cassandra Streaming Code

Start the Spark shell with the command below

bin/spark-shell --packages "com.datastax.spark:spark-cassandra-connector_2.11:2.0.2","org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0"

Run this code in the Spark shell
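A minimal sketch of what such a streaming job could look like, using the two packages loaded above. Several things here are assumptions rather than part of the setup shown so far: messages on exampletopic are taken to be comma-separated lines in the order fname,lname,url,product,cnt; Cassandra is taken to be reachable on localhost (the connector's default host); and the 5-second batch interval is arbitrary. The connector is expected to apply writes to a counter column as increments, so each message should add its cnt value to the matching row.

```scala
// Paste into the Spark shell started above (use :paste for multi-line input).
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._            // SomeColumns
import com.datastax.spark.connector.streaming._  // saveToCassandra on DStreams

// Reuse the shell's SparkContext (sc). Assumption: Cassandra runs on localhost;
// otherwise start the shell with --conf spark.cassandra.connection.host=<host>.
// @transient keeps the shell from trying to serialize the context with closures.
@transient val ssc = new StreamingContext(sc, Seconds(5)) // assumed batch interval

// Direct (receiver-less) stream from the broker used earlier in this guide.
// With default settings it only sees messages produced after the job starts.
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("exampletopic")

val lines = KafkaUtils
  .createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
  .map(_._2) // keep the message value, drop the key

// Assumed message layout: fname,lname,url,product,cnt
// Lines with the wrong field count or a non-numeric cnt are dropped.
val records = lines
  .map(_.split(",").map(_.trim))
  .filter(f => f.length == 5 && f(4).matches("\\d{1,18}"))
  .map(f => (f(0), f(1), f(2), f(3), f(4).toLong))

records.print() // show a sample of each batch in the shell

// Write each batch to the table created in step 3.
records.saveToCassandra("sparkdata", "cust_data",
  SomeColumns("fname", "lname", "url", "product", "cnt"))

ssc.start()
ssc.awaitTermination() // blocks the shell until the job is stopped (Ctrl+C)
```

To try it, type a line in the assumed format into the console producer, for example the made-up record john,doe,http://example.com,book,1, and then run the select statement from step 3 in cqlsh to see whether the counter changed.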

