Data Engineering

Data Collection, Data Preparation, Data Lake, Data Governance

Data Science

Writing algorithms, Spark, Machine Learning, exploration, statistics, Python, R

Data Streaming

Message Bus, Key Performance Indicator (KPI), Threshold Detection, Time Window Queries, Intelligent Behaviors

Data Analytics

Visualization, notebooks

Latest articles

Spark Streaming part 3: tools and tests for Spark applications

June 19th, 2019 | Categories: Big Data, Data Engineering

Whenever services are unavailable, businesses can experience large financial losses. Spark Streaming applications can break like any other software application. A streaming application operates on real-world data, so uncertainty is intrinsic to [...]

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

May 28th, 2019 | Categories: Big Data, Data Engineering

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming [...]

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

April 18th, 2019 | Categories: Big Data, Data Engineering

Spark Structured Streaming is a stream-processing engine introduced with Apache Spark 2. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming [...]

Gatsby.js, React and GraphQL for documentation websites

April 1st, 2019 | Categories: Front End

Over the last few months, I have started to redesign some of our open source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our [...]

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

March 25th, 2019 | Categories: Big Data, Data Engineering

The distributed and in-memory nature of the Spark engine makes it an excellent candidate for exposing data to clients that expect low latencies. Dashboards, notebooks, BI studios and KPI-based reporting tools commonly speak the JDBC/ODBC protocols [...]