Data Engineering

Data Collection, Data Preparation, Data Lake, Data Governance

Data Science

Writing algorithms, Spark, Machine Learning, exploration, statistics, Python, R

Data Streaming

Message Bus, Key Performance Indicator (KPI), Threshold Detection, Time Window Queries, Intelligent Behaviors

Data Analytics

Visualization, notebooks

Latest articles

Auto-scaling Druid with Kubernetes

July 16th, 2019 | Categories: Big Data, Container, DataWorks Summit 2019

Apache Druid is an open-source analytics data store that could leverage the auto-scaling abilities of Kubernetes thanks to its distributed nature and its reliance on memory. I was inspired by the talk “Apache Druid Auto [...]

Spark Streaming part 4: clustering with Spark MLlib

July 11th, 2019 | Categories: Big Data, Data Engineering, ML

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed [...]

Spark Streaming part 3: tools and tests for Spark applications

June 19th, 2019 | Categories: Big Data, Data Engineering

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data from the real world, hence uncertainty is intrinsic to [...]

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

May 28th, 2019 | Categories: Big Data, Data Engineering

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming [...]

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

April 18th, 2019 | Categories: Big Data, Data Engineering

Spark Structured Streaming is a stream-processing engine introduced with Apache Spark 2. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming [...]