Blog: latest published articles

Auto-scaling Druid with Kubernetes

Apache Druid is an open-source analytics data store that can leverage the auto-scaling abilities of Kubernetes thanks to its distributed nature and its reliance on memory. I was inspired by the talk “Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes” given by Jinchul Kim during DataWorks Summit 2019 Europe in Barcelona. […]

Spark Streaming part 4: clustering with Spark MLlib

Spark MLlib is Apache Spark's library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. The Spark framework can thus serve as a platform for developing Machine Learning systems. An ML model developed with Spark MLlib can be combined with a low-latency streaming pipeline created with Spark Structured Streaming. The K-means clustering algorithm [...]

July 11th, 2019 | Categories: Big Data, Data Engineering, ML | 1 Comment
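As a companion to this excerpt, here is a minimal Scala sketch of training a K-means model with Spark MLlib on a static DataFrame; the feature columns, the number of clusters and the local master are illustrative assumptions, and the fitted model could later be used to score records flowing through a Structured Streaming pipeline.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object KMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kmeans-sketch")
      .master("local[*]") // local run for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical training data: two numeric features per row
    val training = Seq((1.0, 1.1), (1.2, 0.9), (8.0, 8.2), (7.9, 8.1))
      .toDF("x", "y")

    // Assemble the feature columns into the single vector column MLlib expects
    val assembler = new VectorAssembler()
      .setInputCols(Array("x", "y"))
      .setOutputCol("features")

    // Fit K-means with k=2; cluster centers summarize the learned model
    val model = new KMeans().setK(2).setSeed(1L)
      .fit(assembler.transform(training))

    model.clusterCenters.foreach(println)
    spark.stop()
  }
}
```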

Google Cloud Summit Paris Notes

Google organized the 2019 edition of its yearly Summit in Paris on the 18th of June. This year's event was the biggest yet in Paris, which reflects Google's commitment to positioning itself in the French market. In terms of Cloud market share, Google Cloud Platform (GCP) is still far behind its competitors Amazon AWS and Microsoft Azure. [...]

June 26th, 2019 | Categories: Events | 0 Comments

Spark Streaming part 3: tools and tests for Spark applications

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data from the real world, hence uncertainty is intrinsic to the application's input. Testing is essential to discover as many software defects and as much flawed logic as possible before [...]

June 19th, 2019 | Categories: Big Data, Data Engineering | 4 Comments
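To illustrate the kind of test discussed in this post, below is a hedged Scala sketch that pushes records into a streaming query through MemoryStream (an in-memory source from Spark's internal org.apache.spark.sql.execution.streaming package, commonly used in tests) and checks the result from a memory sink; the transformation and the sink name are invented for the example.

```scala
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.execution.streaming.MemoryStream

object StreamingTestSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-test-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._
    implicit val sqlCtx: SQLContext = spark.sqlContext

    // MemoryStream lets a test feed a streaming DataFrame without Kafka or sockets
    val input = MemoryStream[Int]
    val doubled = input.toDF().withColumn("doubled", $"value" * 2)

    // Write results to an in-memory sink so the test can query them back
    val query = doubled.writeStream
      .format("memory")
      .queryName("doubled_sink")
      .outputMode("append")
      .start()

    input.addData(1, 2, 3)
    query.processAllAvailable() // wait until the micro-batch is processed

    val results = spark.sql("select doubled from doubled_sink").as[Int].collect().sorted
    assert(results.sameElements(Array(2, 4, 6)))

    query.stop()
    spark.stop()
  }
}
```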

Druid and Hive integration

This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description: Hive and Hive LLAP. Hive is an environment allowing SQL queries on data stored in HDFS. The following executors can be configured in Hive: Map [...]

June 17th, 2019 | Categories: Blog, Data Engineering | 0 Comments
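As a hedged illustration of querying a Druid-backed table through Hive Interactive (LLAP), the Scala sketch below opens a JDBC connection to HiveServer2 Interactive and runs a query; the endpoint, credentials and the druid_wikipedia table are assumptions, and the Hive JDBC driver must be present on the classpath.

```scala
import java.sql.DriverManager

object HiveDruidQuerySketch {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver (requires the hive-jdbc dependency)
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Hypothetical HiveServer2 Interactive (LLAP) endpoint; adjust host, port and credentials
    val url = "jdbc:hive2://hive-llap.example.com:10500/default"
    val conn = DriverManager.getConnection(url, "user", "password")
    try {
      val stmt = conn.createStatement()
      // Hypothetical Hive table backed by a Druid datasource
      val rs = stmt.executeQuery(
        "SELECT `__time`, page, SUM(added) AS total_added FROM druid_wikipedia " +
        "GROUP BY `__time`, page LIMIT 10")
      while (rs.next()) {
        println(s"${rs.getTimestamp(1)} ${rs.getString(2)} ${rs.getLong(3)}")
      }
    } finally {
      conn.close()
    }
  }
}
```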

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming data processing chain in a distributed environment will be presented. A cluster environment demands attention to aspects such as monitoring, stability, [...]

May 28th, 2019 | Categories: Big Data, Data Engineering | 2 Comments
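The following Scala sketch shows the shape of a Structured Streaming query one could submit to a YARN cluster, writing Parquet files and checkpoints to HDFS; the built-in rate source stands in for a real source such as Kafka, and the HDFS paths are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object HdfsSinkSketch {
  def main(args: Array[String]): Unit = {
    // When launched with spark-submit --master yarn, no master needs to be set here
    val spark = SparkSession.builder()
      .appName("structured-streaming-on-hadoop-sketch")
      .getOrCreate()

    // Built-in "rate" source keeps the sketch self-contained; a real job would read from Kafka
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Hypothetical HDFS paths: output data and the mandatory checkpoint directory
    val query = stream.writeStream
      .format("parquet")
      .option("path", "hdfs:///user/spark/streaming/output")
      .option("checkpointLocation", "hdfs:///user/spark/streaming/checkpoints")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .start()

    query.awaitTermination()
  }
}
```

Such an application would typically be packaged as a jar and launched with spark-submit against the YARN resource manager.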

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame API. The Structured Streaming engine shares the same API as the Spark SQL engine and is just as easy to use. Spark Structured Streaming [...]

April 18th, 2019 | Categories: Big Data, Data Engineering | 9 Comments
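To make the shared DataFrame API concrete, here is a minimal Scala sketch of a first Structured Streaming pipeline: a word count over a socket source printed to the console; the host and port are illustrative, and locally the stream can be fed with a tool such as netcat.

```scala
import org.apache.spark.sql.SparkSession

object FirstPipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("first-structured-streaming-sketch")
      .master("local[*]") // local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Socket source for experimentation: feed it with `nc -lk 9999` on localhost
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", "9999")
      .load()

    // The streaming DataFrame is manipulated with the same API as a static one
    val wordCounts = lines.as[String]
      .flatMap(_.split("\\s+"))
      .groupBy("value")
      .count()

    // Console sink prints each micro-batch; complete mode re-emits all counts
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```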