Blog

Rook with Ceph doesn’t provision my Persistent Volume Claims!

A Ceph installation inside Kubernetes can be provisioned using Rook. During my internship at Adaltas, I took part in the setup of a Kubernetes (k8s) cluster. To avoid breaking anything on our production cluster, we decided to experiment with the installation of a k8s cluster on 3 virtual machines (one master node n1, [...]

September 9th, 2019 | Categories: DevOps | 0 Comments

Users and RBAC authorizations in Kubernetes

Having your Kubernetes cluster up and running is just the start of your journey: you now need to operate it. To secure access to the cluster, user identities must be declared, and authentication and authorization must be properly managed. This article focuses on how to create users with X.509 client certificates and how to manage authorizations with the [...]

TensorFlow installation on Docker

TensorFlow is open source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations, and edges represent N-dimensional data arrays (tensors). TensorFlow runs on CPU or GPU (using CUDA®). The architecture is flexible and highly scalable. It can be deployed on smartphones, desktops, servers, or even server clusters. Installation CPU Only [...]

August 5th, 2019 | Categories: Container, Data Science, Learning | 0 Comments

Running Apache Hive 3, new features and tips and tricks

Apache Hive 3 brings a number of welcome new features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since July 2018 as part of HDP3 (Hortonworks Data Platform version 3). I will first review the new features available with [...]

July 25th, 2019 | Categories: Big Data, DataWorks Summit 2019 | 0 Comments

Auto-scaling Druid with Kubernetes

Apache Druid is an open-source analytics data store that could leverage the auto-scaling abilities of Kubernetes thanks to its distributed nature and its reliance on memory. I was inspired by the talk “Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes” by Jinchul Kim during DataWorks Summit 2019 Europe in Barcelona. […]

Spark Streaming part 4: clustering with Spark MLlib

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed with Spark MLlib can be combined with a low-latency streaming pipeline created with Spark Structured Streaming. The K-means clustering algorithm [...]

July 11th, 2019 | Categories: Big Data, Data Engineering, ML | 1 Comment

Google Cloud Summit Paris Notes

Google held the 2019 edition of its yearly Summit in Paris on the 18th of June. This year's event was the biggest yet in Paris, which reflects Google's commitment to positioning itself in the French market. In terms of cloud market share, Google Cloud Platform (GCP) is still far behind its competitors Amazon AWS and Microsoft Azure. [...]

June 26th, 2019 | Categories: Events | 0 Comments

Spark Streaming part 3: tools and tests for Spark applications

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data from the real world, hence uncertainty is intrinsic to the application's input. Testing is essential to discover as many software defects and as much flawed logic as possible before [...]

June 19th, 2019 | Categories: Big Data, Data Engineering | 4 Comments

Druid and Hive integration

This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description Hive and Hive LLAP Hive is an environment allowing SQL queries on data stored in HDFS. The following executors can be configured in Hive: Map [...]

June 17th, 2019 | Categories: Blog, Data Engineering | 0 Comments