Data Engineering

Data Collection, Data Preparation, Data Lake, Data Governance

Data Science

Writing algorithms, Spark, Machine Learning, exploration, statistics, Python, R

Data Streaming

Message Bus, Key Performance Indicator (KPI), Threshold Detection, Time Window Queries, Intelligent Behaviors

Data Analytics

Visualization, notebooks

Latest articles

Spark Streaming part 3: tools and tests for Spark applications

June 19th, 2019 | Categories: Big Data, Data Engineering

When services are unavailable, businesses can suffer large financial losses. Like any other software, Spark Streaming applications can break. A streaming application operates on real-world data, so uncertainty is intrinsic to [...]

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

May 28th, 2019 | Categories: Big Data, Data Engineering

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming [...]

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

April 18th, 2019 | Categories: Big Data, Data Engineering

Spark Structured Streaming is an engine for processing streaming data, introduced with Apache Spark 2. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming [...]

Gatsby.js, React and GraphQL for documentation websites

April 1st, 2019 | Categories: Front End

Over the last few months, I have started redesigning some of our open source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our [...]

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

March 25th, 2019 | Categories: Big Data, Data Engineering

The distributed and in-memory nature of the Spark engine makes it an excellent candidate for exposing data to clients that expect low latencies. Dashboards, notebooks, BI studios and KPI-based reporting tools commonly speak the JDBC/ODBC protocols [...]

Multihoming on Hadoop

March 5th, 2019 | Categories: Adaltas Summit 2018, Big Data, Data Engineering

Multihoming, which means attaching multiple networks to one node, is one of the main components for managing the heterogeneous network usage of an Apache Hadoop cluster. This article is an introduction to the concept [...]