Apache Spark Streaming

Note: Spark Streaming is the previous generation of Spark’s streaming engine. It is now a legacy project that no longer receives updates. Spark provides a newer and easier-to-use streaming engine called Structured Streaming.

Spark Structured Streaming is a scalable, fault-tolerant stream processing system built on the Spark SQL engine. A streaming computation can be expressed the same way as a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously, updating the final result as streaming data continues to arrive. Structured Streaming reads from many sources and can be programmed in Scala, Java, Python, or R to perform tasks such as time-based windows, aggregations, and joins between streaming and static data. These tasks are executed by the same optimized Spark SQL engine.
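The idea of writing one query and letting the engine apply it incrementally can be sketched in plain Python. This is only a conceptual illustration, not the Spark API: the names `WINDOW_SECONDS`, `window_start`, `batch_counts` and `streaming_counts`, as well as the sample events, are hypothetical and chosen for this example. It assumes events are `(timestamp, key)` pairs counted per 10-second tumbling window.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, key); hypothetical event shape

WINDOW_SECONDS = 10  # assumed tumbling-window size


def window_start(timestamp: int) -> int:
    """Return the start of the tumbling window that contains `timestamp`."""
    return timestamp - timestamp % WINDOW_SECONDS


def batch_counts(events: Iterable[Event]) -> Counter:
    """Batch view: count events per (window, key) over a complete dataset."""
    return Counter((window_start(ts), key) for ts, key in events)


def streaming_counts(micro_batches: Iterable[List[Event]]) -> Iterator[Counter]:
    """Streaming view: fold each newly arrived micro-batch into the running result."""
    state: Counter = Counter()
    for batch in micro_batches:
        # The same query logic is reused; only the new data is processed.
        state.update(batch_counts(batch))
        # Emit a snapshot of the updated result after each micro-batch.
        yield Counter(state)


events = [(1, "a"), (4, "b"), (12, "a"), (15, "a"), (21, "b")]
micro_batches = [events[:2], events[2:4], events[4:]]

snapshots = list(streaming_counts(micro_batches))
final_result = snapshots[-1]

# Processing the data incrementally yields the same result as one batch run.
assert final_result == batch_counts(events)
```

In Structured Streaming itself, the state handling shown in `streaming_counts` is done by the engine, so only the equivalent of `batch_counts` is written as a DataFrame query.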

Learn more: Official website
Related tags: Streaming

Spark Streaming part 3: DevOps, tools and tests for Spark applications

Categories: Big Data, Data Engineering, DevOps & SRE | Tags: DevOps, Learning and tutorial, Spark, Apache Spark Streaming

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…

By Oskar RYNKIEWICZ

May 31, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

Jun 27, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Python, Streaming

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019
