Streaming
Related articles
Spark Streaming part 4: clustering with Spark MLlib
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Streaming, Clustering, Machine Learning, Scala
Spark MLlib is an Apache’s Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, Spark framework can serve as a platform for…
Jul 11, 2019
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: Spark, Apache Spark Streaming, DevOps, Learning and tutorial
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
Jun 19, 2019
Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop
Categories: Data Engineering, Learning | Tags: Spark, Apache Spark Streaming, Streaming, Python
Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…
May 28, 2019
Spark Streaming part 1: build data pipelines with Spark Structured Streaming
Categories: Data Engineering, Learning | Tags: Kafka, Spark, Apache Spark Streaming, Big Data, Streaming
Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…
Apr 18, 2019
Apache Flink: past, present and future
Categories: Data Engineering | Tags: Flink, Pipeline, Streaming, Kubernetes, Machine Learning, SQL
Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…
Nov 5, 2018
Apache Beam: a unified programming model for data processing pipelines
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Spark, Pipeline
In this article, we will review the concepts, the history and the future of Apache Beam, that may well become the new standard for data processing pipelines definition. At Dataworks Summit 2018 in…
May 24, 2018
What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, Spark, PySpark, Docker, Streaming, Tuning, Spark MLlib, Data Science, Kubernetes, pandas
Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…
May 23, 2018
Node CSV version 0.2.1
Categories: Node.js | Tags: CoffeeScript, CSV, Streaming, Release and features
After the announcement of the version 0.2.0 of the Node.js CSV parser at the beginning of october, we are releasing today a new version 0.2.1. This is mostly a bug fix release with enhanced…
By David WORMS
Jul 24, 2012
Node CSV version 0.1 and future developments
Categories: Node.js | Tags: CoffeeScript, CSV, Markdown, Streaming, Release and features
The Node CSV parser has just reach version 0.1 which close the 0.0.x releases. Started almost 2 years ago, the project has received a tremendous amount of participation in the form of bug reports…
By David WORMS
Jul 21, 2012
Node CSV version 0.2 with streaming API
Categories: Node.js | Tags: CSV, Data Engineering, Markdown, Streaming, Node.js
The Node CSV parser in its version 0.2 has just been released. This version is a major enhancement as it aligned the parser with the best Node.js practice in respect of streams. The CSV parser behave…
By David WORMS
Jul 2, 2012