Data Engineering

Data Collection, Data Preparation, Data Lake, Data Governance

Data Science

Writing algorithms, Spark, Machine Learning, exploration, statistics, Python, R

Data Streaming

Message Bus, Key Performance Indicator (KPI), Threshold Detection, Time Window Queries, Intelligent Behaviors

Data Analytics

Visualization, notebooks

Latest articles

Rook with Ceph doesn’t provision my Persistent Volume Claims!

September 9th, 2019 | Categories: DevOps

A Ceph installation inside Kubernetes can be provisioned using Rook. While doing an internship at Adaltas, I took part in setting up a Kubernetes (k8s) cluster. To avoid breaking anything on our [...]

Users and RBAC authorizations in Kubernetes

August 7th, 2019 | Categories: Container, Data Governance

Having your Kubernetes cluster up and running is just the start of your journey: you now need to operate it. To secure its access, user identities must be declared along with authentication and authorization properly [...]

TensorFlow installation on Docker

August 5th, 2019 | Categories: Container, Data Science, Learning

TensorFlow is open source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations, and edges represent N-dimensional data arrays (tensors). TensorFlow runs on CPU or GPU (using CUDA®). The [...]

Running Apache Hive 3, new features and tips and tricks

July 25th, 2019 | Categories: Big Data, DataWorks Summit 2019

Apache Hive 3 brings a number of welcome new features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since [...]

Auto-scaling Druid with Kubernetes

July 16th, 2019 | Categories: Big Data, Container, DataWorks Summit 2019

Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk “Apache Druid Auto [...]

Spark Streaming part 4: clustering with Spark MLlib

July 11th, 2019 | Categories: Big Data, Data Engineering, ML

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed [...]

Spark Streaming part 3: tools and tests for Spark applications

June 19th, 2019 | Categories: Big Data, Data Engineering

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data from the real world, hence uncertainty is intrinsic to [...]