All our articles

CDP part 6: end-to-end data lakehouse ingestion pipeline with CDP

Categories: Big Data, Data Engineering, Learning | Tags: NiFi, Business intelligence, Data Engineering, Iceberg, Spark, Big Data, Cloudera, CDP, Data Analytics, Data Lake, Data Warehouse

In this hands-on lab session we demonstrate how to build an end-to-end big data solution with Cloudera Data Platform (CDP) Public Cloud, using the infrastructure we have deployed and configured over…

By Tobias CHAVARRIA

Jul 24, 2023

CDP part 5: user permissions management on CDP Public Cloud

Categories: Big Data, Cloud Computing, Data Governance | Tags: Ranger, Cloudera, CDP, Data Warehouse

When you create a user or a group in CDP, it requires permissions to access resources and use the Data Services. This article is the fifth in a series of six: CDP part 1: introduction to end-to-end…

By Tobias CHAVARRIA

Jul 18, 2023

CDP part 4: user management on CDP Public Cloud with Keycloak

Categories: Big Data, Cloud Computing, Data Governance | Tags: EC2, Big Data, CDP, Docker Compose, Keycloak, SSO

Previous articles in the series cover the deployment of a CDP Public Cloud environment. All the components are ready for use and it is time to make the environment available to other users to explore…

By Tobias CHAVARRIA

Jul 4, 2023

CDP part 3: Data Services activation on CDP Public Cloud environment

Categories: Big Data, Cloud Computing, Infrastructure | Tags: Infrastructure, AWS, Big Data, Cloudera, CDP

One of the big selling points of Cloudera Data Platform (CDP) is their mature managed service offering. These are easy to deploy on-premises, in the public cloud or as part of a hybrid solution. The…

By Albert KONRAD

Jun 27, 2023

CDP part 2: CDP Public Cloud deployment on AWS

Categories: Big Data, Cloud Computing, Infrastructure | Tags: Infrastructure, AWS, Big Data, Cloud, Cloudera, CDP, Cloudera Manager

The Cloudera Data Platform (CDP) Public Cloud provides the foundation upon which full featured data lakes are created. In a previous article, we introduced the CDP platform. This article is the second…

By Albert KONRAD

Jun 19, 2023

CDP part 1: introduction to end-to-end data lakehouse architecture with CDP

Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Data Engineering, Hortonworks, Iceberg, AWS, Azure, Big Data, Cloud, Cloudera, CDP, Cloudera Manager, Data Warehouse

Cloudera Data Platform (CDP) is a hybrid data platform for big data transformation, machine learning and data analytics. In this series we describe how to build and use an end-to-end big data…

By Stephan BAUM

Jun 8, 2023

Local development environments with Terraform + LXD

Categories: Containers Orchestration, DevOps & SRE | Tags: Automation, DevOps, KVM, LXD, Virtualization, VM, Terraform, Vagrant

As a Big Data Solutions Architect and InfraOps, I need development environments to install and test software. They have to be configurable, flexible, and performant. Working with distributed systems…

By Gauthier LEONARD

Jun 1, 2023

Data platform requirements and expectations

Categories: Big Data, Infrastructure | Tags: Data Engineering, Data Governance, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Science

A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources. It is composed of several…

By David WORMS

Mar 23, 2023

Keycloak deployment in EC2

Categories: Cloud Computing, Data Engineering, Infrastructure | Tags: Security, EC2, Authentication, AWS, Docker, Keycloak, SSL/TLS, SSO

Why use Keycloak? Keycloak is an open-source identity provider (IdP) using single sign-on (SSO). An IdP is a tool to create, maintain, and manage identity information for principals and to provide…

By Stephan BAUM

Mar 14, 2023

Operating Kafka in Kubernetes with Strimzi

Categories: Big Data, Containers Orchestration, Infrastructure | Tags: Kafka, Big Data, Kubernetes, Open source, Streaming

Kubernetes is not the first platform that comes to mind to run Apache Kafka clusters. Indeed, Kafka’s strong dependency on storage might be a pain point regarding Kubernetes’ way of doing things when…

By Leo SCHOUKROUN

Mar 7, 2023

Kubernetes: debugging with ephemeral containers

Categories: Containers Orchestration, Tech Radar | Tags: Debug, Kubernetes

Anyone who has ever had to work with Kubernetes has been confronted with resolving pod errors. The methods provided for this purpose are efficient and make it possible to overcome the most…

By Pierre BERLAND

Feb 7, 2023

Dive into tdp-lib, the SDK in charge of TDP cluster management

Categories: Big Data, Infrastructure | Tags: Programming, Ansible, Hadoop, Python, TDP

All the deployments are automated and Ansible plays a central role. With the growing complexity of the code base, a new system was needed to overcome the Ansible limitations which will enable us to…

By Guillaume BOUTRY

Jan 24, 2023

Adaltas Summit 2022 Morzine

Categories: Big Data, Adaltas Summit 2022 | Tags: Data Engineering, Infrastructure, Iceberg, Container, Data lakehouse, Docker, Kubernetes

For its third edition, the whole Adaltas crew is gathering in Morzine for a whole week, with two days dedicated to technology on the 15th and 16th of September 2022. The speakers choose one of the…

By David WORMS

Jan 13, 2023

How to build your OCI images using Buildpacks

Categories: Containers Orchestration, DevOps & SRE | Tags: CNCF, OCI, CI/CD, Docker, Kubernetes

Docker has become the new standard for building your application. In a Docker image we place our source code, its dependencies, some configurations and our application is almost ready to be deployed…

Big data infrastructure internship

Categories: Big Data, Data Engineering, DevOps & SRE, Infrastructure | Tags: Infrastructure, Hadoop, Big Data, Cluster, Internship, Kubernetes, TDP

Job description: Big Data and distributed computing are at the core of Adaltas. We accompany our partners in the deployment, maintenance, and optimization of some of the largest clusters in France…

By Stephan BAUM

Dec 2, 2022

Traefik, Docker and dnsmasq to simplify container networking

Categories: Containers Orchestration, Infrastructure, Tech Radar | Tags: DNS, Gatsby, JAMstack, Linux, Docker, Network

Good tech adventures start with some frustration, a need, or a requirement. This is the story of how I simplified the management and access of my local web applications with the help of Traefik and…

By David WORMS

Nov 17, 2022

WasmEdge: WebAssembly runtimes are coming for the edge

Categories: Containers Orchestration, Adaltas Summit 2021, Infrastructure, Tech Radar | Tags: JAMstack, Linux, Docker, Rust Lang, WebAssembly

With many security challenges solved by design in its core conception, lots of projects benefit from using WebAssembly. WasmEdge runtime is an efficient Virtual Machine optimized for edge computing…

By Guillaume BOUTRY

Sep 29, 2022

Ingresses and Load Balancers in Kubernetes with MetalLB and nginx-ingress

Categories: Containers Orchestration, Infrastructure, Tech Radar | Tags: Ingress, Kubeadm, Cluster, Deployment, Kubernetes

When it comes to exposing services from a Kubernetes cluster and making them accessible from outside the cluster, the recommended option is to use a load-balancer type service to redirect incoming…

By Kellian COTTART

Sep 8, 2022

Spark on Hadoop integration with Jupyter

Categories: Adaltas Summit 2021, Infrastructure, Tech Radar | Tags: Infrastructure, Jupyter, Spark, YARN, CDP, HDP, Notebook, TDP

For several years, Jupyter notebook has established itself as the notebook solution in the Python universe. Historically, Jupyter is the tool of choice for data scientists who mainly develop in Python…

By Aargan COINTEPAS

Sep 1, 2022

Framework laptop with NixOS, a user feedback

Categories: Learning, Tech Radar | Tags: CLI, DevOps, Learning and tutorial, Linux, Packaging, NixOS, Open source

A new job comes with a new laptop. As such, I was given a Framework Laptop DIY Edition with the objective to install and configure it entirely with NixOS. I will share my first impressions after…

By Carlos JESUS CARO

Aug 22, 2022

Ceph object storage within a Kubernetes cluster with Rook

Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Ceph, Cluster, Data Lake, Kubernetes, Storage

Ceph is a distributed all-in-one storage system. Reliable and mature, its first stable version was released in 2012 and has since then been the reference for open source storage. Ceph’s main perk is…

By Luka BIGOT

Aug 4, 2022

MinIO object storage within a Kubernetes cluster

Categories: Big Data, Data Governance, Learning | Tags: Amazon S3, Big Data, Cluster, Data Lake, Kubernetes, Storage

MinIO is a popular object storage solution. Often recommended for its simple setup and ease of use, it is not only a great way to get started with object storage: it also provides excellent…

By Luka BIGOT

Jul 9, 2022

Architecture of object-based storage and S3 standard specifications

Categories: Big Data, Data Governance | Tags: Database, API, Amazon S3, Big Data, Data Lake, Storage

Object storage has been growing in popularity among data storage architectures. Compared to file systems and block storage, object storage faces no limitations when handling petabytes of data. By…

By Luka BIGOT

Jun 20, 2022

TDP workshop: Become a TDP power user from your terminal

Categories: Events, Learning | Tags: DevOps, Ansible, Hadoop, Open source, TDP

The TDP CLI is used to deploy and operate your TDP services. It relies on tdp-lib to provide control and flexibility at your fingertips. Some time ago, we announced the public release of TDP - Trunk…

By Paul FARAULT

Jun 17, 2022

Comparison of database architectures: data warehouse, data lake and data lakehouse

Categories: Big Data, Data Engineering | Tags: Data Governance, Infrastructure, Iceberg, Parquet, Spark, Data Lake, Data lakehouse, Data Warehouse, File Format

Database architectures have experienced constant innovation, evolving with the appearance of new use cases, technical constraints, and requirements. From the three database structures we are comparing…

By Gonzalo ETSE

May 17, 2022

NixOS: Enabling LXD virtual machines using Flakes

Categories: Hack, Learning | Tags: GitHub, Learning and tutorial, Linux, LXD, Packaging, VM, NixOS, Open source

Nixpkgs is an ever-increasing collection of software packages for Nix and NixOS. Even with more than 80,000 packages, you easily run into a situation where there is a functionality that is not yet…

By Kellian COTTART

May 13, 2022

Databricks logs collection with Azure Monitor at a Workspace Scale

Categories: Cloud Computing, Data Engineering, Adaltas Summit 2021 | Tags: Metrics, Monitoring, Spark, Azure, Databricks, Log4j

Databricks is an optimized data analytics platform based on Apache Spark. Monitoring the Databricks platform is crucial to ensure data quality, job performance, and security issues by limiting access to…

By Claire PLAYE

May 10, 2022

Introducing Trunk Data Platform: the Open-Source Big Data Distribution Curated by TOSIT

Categories: Big Data, DevOps & SRE, Infrastructure | Tags: Ranger, DevOps, Hortonworks, Ansible, Hadoop, HBase, Knox, Spark, Cloudera, CDP, CDH, Open source, TDP

Ever since Cloudera and Hortonworks merged, the choice of commercial Hadoop distributions for on-prem workloads essentially boils down to CDP Private Cloud. CDP can be seen as the “best of both worlds…

By Leo SCHOUKROUN

Apr 14, 2022

Blockchain 102: Cryptocurrencies, Wallets and DApps

Categories: Adaltas Summit 2021, Infrastructure | Tags: Cryptography, Infrastructure, Blockchain, Consensus

A lot of people own cryptocurrencies today. But holding some tokens on an exchange does not mean interacting with the blockchain. The assets you trade are only numbers stored inside the exchange’s…

By Gauthier LEONARD

Apr 12, 2022

JS monorepos in prod 7: Continuous Integration and Continuous Deployment with GitHub Actions

Categories: DevOps & SRE, Front End | Tags: CI/CD, Monorepo, Node.js, Unit tests

The value of CI/CD lies in the ability to control and coordinate changes and feature addition in multiple, iterative releases while simultaneously having multiple services being actively developed in…

By Alexander HOFFMANN

Apr 6, 2022

Nix package creation: install a not yet supported font

Categories: Hack | Tags: Learning and tutorial, Linux, Packaging, GitOps, NixOS, Open source

The Nix packages collection is large with over 60 000 packages. However, chances are that sometimes the package you need is not available. You must integrate it yourself. I needed some fonts which…

By David WORMS

Mar 29, 2022

Deploy your containerized AI applications with nvidia-docker

Categories: Containers Orchestration, Data Science | Tags: containerd, DevOps, Learning and tutorial, NVIDIA, Docker, Keras, TensorFlow

More and more products and services are taking advantage of the modeling and prediction capabilities of AI. This article presents the nvidia-docker tool for integrating AI (Artificial Intelligence…

By Robert Walid SOARES

Mar 24, 2022

Ansible variables: choosing the right location

Categories: DevOps & SRE | Tags: Infrastructure, Ansible, IaC, YAML

Defining variables for your Ansible playbooks and roles can become challenging as your project grows. Browsing the Ansible documentation, the diversity of Ansible variable locations is confusing, to…

By Xavier HERMAND

Mar 15, 2022

Apache HBase: RegionServers co-location

Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, Infrastructure, Tuning, Hadoop, HBase, Big Data, HDP, Storage

RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…

By Pierre BERLAND

Feb 22, 2022

Reliable and reproducible Linux installation with NixOS

Categories: Infrastructure, Learning | Tags: Linux, Packaging, VM, NixOS, TDP

When using an operating system, upgrading packages or installing new ones are common tasks that introduce the risk of affecting the stability of the system. NixOS is a Linux distribution that ensures…

By Florent MOUAFFO

Feb 8, 2022

Nix introduction, main concepts and commands

Categories: Infrastructure, Learning | Tags: Arch Linux, CentOS, Linux, OS X, Packaging, Ubuntu, NixOS, TDP

Nix is a functional package manager for Linux and other Unix systems, making the management of packages more reliable and easy to reproduce. With a traditional package manager, when updating a package…

By Florent MOUAFFO

Feb 1, 2022

Blockchain 101: Blockchains and Consensus Mechanisms

Categories: Adaltas Summit 2021, Infrastructure, Learning | Tags: Cryptography, Infrastructure, Blockchain, Consensus

Cryptocurrencies are booming in 2021, with a market cap moving from 750 to more than 3,000 billion dollars. Let’s face it, this is mainly due to speculation. A lot of people involved do not have a…

By Gauthier LEONARD

Jan 18, 2022

GitOps in practice, deploy Kubernetes applications with ArgoCD

Categories: Containers Orchestration, DevOps & SRE, Adaltas Summit 2021 | Tags: Argo CD, CI/CD, Git, GitOps, IaC, Kubernetes

GitOps is a set of practices to deploy applications using Git. Application definitions, configurations, and connectivity are to be stored in a version control software such as Git. Git then serves as…

By Paul-Adrien CORDONNIER

Dec 16, 2021

JS monorepos in prod 6: CI/CD, continuous integration and deployment with Travis CI

Categories: DevOps & SRE, Front End | Tags: CI/CD, Monorepo, Node.js, Unit tests

Implementing continuous integration (CI) and continuous deployment (CD) on a monorepo is quite complex due to the diversity of multiple responsibilities between developers and the need to coordinate…

By David WORMS

Dec 6, 2021

Spring 2022 internship - building a Data Lab

Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL

Job description: Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…

By David WORMS

Nov 24, 2021

CSV package for Node.js version 6

Categories: Node.js | Tags: Data Engineering, Refactoring, CSV, File Format, Release and features

Version 6 of the CSV package for Node.js is released along with its sub-projects. Here are the latest versions…

By David WORMS

Nov 15, 2021

H2O in practice: a protocol combining AutoML with traditional modeling approaches

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost

H2O comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approaches. The objective…

Internship in Big Data infrastructure with TDP

Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP

Job description: Big Data and distributed computing are at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…

By Daniel HARTY

Oct 25, 2021

Internship in Data Engineering

Categories: Front End, Learning | Tags: Metrics, Monitoring, Hive, Kafka, Delta Lake, Elasticsearch, IaC, Internship, Kubernetes, Streaming

Job description: Data is a valuable business asset. Some call it the new oil. The data engineer collects, transforms and refines raw data into information that can be used by business analysts and data…

By David WORMS

Oct 25, 2021

Internship in Web Technologies

Categories: Front End, Learning | Tags: DevOps, LDAP, React.js, CI/CD, Docker, GraphQL, IaC, Internship, Kubernetes, Node.js, OAuth2

Job description: As part of its Big Data activities, Adaltas Academy is an information-sharing platform bringing together articles, training content, and a knowledge base. The users of the platform are…

By David WORMS

Oct 14, 2021

H2O in practice: a Data Scientist feedback

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python

Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientists’ toolbox. A few months ago, I introduced H2O, an open-source platform for…

Adaltas Summit 2021, 2nd edition in Corsica

Categories: Adaltas Summit 2021, Learning | Tags: Ansible, Hadoop, Spark, Azure, Blockchain, Deep Learning, Docker, Terraform, Kubernetes, Node.js

For its second edition, the whole Adaltas crew is gathering in Corsica for a whole week, with two days dedicated to technology on the 23rd and 24th of September 2021. After a year and a half of sanitary…

By David WORMS

Sep 21, 2021

Running your Travis CI builds locally with Docker

Categories: DevOps & SRE, Front End | Tags: Bash, Tools, CI/CD, Node.js, Unit tests

Setting up the environment to run the tests on a CI/CD can take a few roundtrips between your host machine and the CI/CD running remotely. For every attempt, you’ll have to commit and publish your…

By David WORMS

Sep 6, 2021

Using Cloudera Deploy to install Cloudera Data Platform (CDP) Private Cloud

Categories: Big Data, Cloud Computing | Tags: Ansible, Cloudera, CDP, Cluster, Data Warehouse, Vagrant, IaC

Following our recent Cloudera Data Platform (CDP) overview, we cover how to deploy CDP Private Cloud on your local infrastructure. It is entirely automated with the Ansible cookbooks published by…

By Alexander HOFFMANN

Jul 23, 2021

An overview of Cloudera Data Platform (CDP)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: SDX, Big Data, Cloud, Cloudera, CDP, CDH, Data Analytics, Data Hub, Data Lake, Data lakehouse, Data Warehouse

Cloudera Data Platform (CDP) is a cloud computing platform for businesses. It provides integrated and multifunctional self-service tools in order to analyze and centralize data. It brings security and…

By Alexander HOFFMANN

Jul 19, 2021

Modern Python part 3: run a CI pipeline & publish your package to PyPI

Categories: DevOps & SRE | Tags: GitHub, CI/CD, Git, Python, Release and features, Unit tests

To propose a well-maintained and usable Python package to the open-source community or even inside your company, you are expected to accomplish a set of critical steps. First ensure that your code is…

By Faouzi BRAZA

Jun 28, 2021

Modern Python part 2: write unit tests & enforce Git commit conventions

Categories: DevOps & SRE | Tags: Git, pandas, Python, Unit tests

Good software engineering practices always bring a lot of long-term benefits. For example, writing unit tests permits you to maintain large codebases and ensures that a specific piece of your code…

By Faouzi BRAZA

Jun 24, 2021

Modern Python part 1: start a project with pyenv & poetry

Categories: DevOps & SRE | Tags: Git, Python, Release and features, Unit tests

When learning a programming language, the focus is essentially on understanding the syntax, the code style, and the underlying concepts. With time, you become sufficiently comfortable with the…

By Faouzi BRAZA

Jun 9, 2021

Desacralizing the Linux overlay filesystem in Docker

Categories: Containers Orchestration, Infrastructure | Tags: DevOps, File system, Linux, Docker

Overlay filesystems (also called union filesystems) are a fundamental technology in Docker to create images and containers. They allow creating a union of directories to create a filesystem. Multiple…

By David WORMS

Jun 3, 2021

Self-Paced training from Databricks: a guide to self-enablement on Big Data & AI

Categories: Data Engineering, Learning | Tags: Cloud, Data Lake, Databricks, Delta Lake, MLflow

Self-paced training courses are offered by Databricks through their Academy program. The price is $2,000 USD for unlimited access to the training courses for a period of 1 year, but also free for customers…

By Anna KNYAZEVA

May 26, 2021

JS monorepos in prod 5: merging Git repositories and preserve commit history

Categories: DevOps & SRE, Node.js | Tags: Bash, DevOps, GitHub, Packaging, Git, GitOps, JavaScript, Monorepo

At Adaltas, we maintain several open-source Node.js projects organized as Git monorepos and published on NPM. We shared our experience working with Lerna monorepos in a set of articles: Part…

By Sergei KUDINOV

May 21, 2021

Find your way into data related Microsoft Azure certifications

Categories: Cloud Computing, Data Engineering | Tags: Data Governance, Azure, Data Science

Microsoft Azure has certification paths for many technical job roles such as developer, Data Engineer, Data Scientist and solution architect among others. Each of these certifications consists of…

By Barthelemy NGOM

Apr 14, 2021

Bridging the DBnomics Swagger/OpenAPI schema with GraphQL

Categories: DevOps & SRE, Front End | Tags: Data Engineering, JAMstack, GraphQL, JavaScript, Node.js, Schema, REST

While writing a long and tedious document today, I came across DBnomics, an open platform federating economic datasets. Browsing its website and APIs, I found their OpenAPI schema (aka Swagger…

By David WORMS

Apr 8, 2021

Apache Liminal: when MLOps meets GitOps

Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow

Apache Liminal is open-source software that proposes a solution to deploy end-to-end Machine Learning pipelines. It centralizes all the steps needed to construct Machine Learning…

By Aargan COINTEPAS

Mar 31, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

TensorFlow Extended (TFX): the components and their functionalities

Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow

Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time consuming than the modeling itself, yet…

JS monorepos in prod 4: unit testing with Mocha and Should.js

Categories: DevOps & SRE, Front End | Tags: Automation, CI/CD, Git, GitOps, Monorepo, Node.js, Unit tests

Unit testing is essential for every long-term project and allows you to break down the functionality of your code into isolated testable units. Indeed the main goal of a unit test is to verify if an…

By David WORMS

Feb 25, 2021

JS monorepos in prod 3: commit enforcement and changelog generation

Categories: DevOps & SRE, Front End | Tags: CI/CD, Git, JavaScript, Monorepo, Node.js, Release and features, Unit tests

Conventional Commits introduces a structured format for commit messages. It standardizes the messages among all the contributors. This makes them more readable and easy to automate. It simplifies the…

By David WORMS

Feb 2, 2021

JS monorepos in prod 2: project versioning and publishing

Categories: DevOps & SRE, Front End | Tags: CI/CD, Git, GitOps, JavaScript, Monorepo, Node.js, Release and features, Unit tests

One great advantage of a monorepo is to maintain coherent versions between packages and to automate the version creation and the publication of packages. This article covers the versioning and…

By David WORMS

Jan 11, 2021

JS monorepos in prod 1: project initialization

Categories: DevOps & SRE, Front End | Tags: Git, GitOps, JavaScript, Monorepo, Node.js, Release and features

Every project journey begins with the step of initialization. When your overall project is composed of multiple projects, it is tempting to create one Git repository per project. In Node.js, a project…

By David WORMS

Jan 5, 2021

Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin

Categories: Big Data, Infrastructure | Tags: Maven, Hadoop, HBase, Hive, Spark, Git, Release and features, TDP, Unit tests

The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3 compatible object storages are growing in popularity, HDFS and YARN…

By Leo SCHOUKROUN

Dec 18, 2020

Faster model development with H2O AutoML and Flow

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python

Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…

OAuth2 and OpenID Connect for microservices and public applications (Part 2)

Categories: Containers Orchestration, Cyber Security | Tags: CNCF, LDAP, Micro Services, JavaScript Object Notation (JSON), OAuth2, OpenID Connect

When using OAuth2 and OpenID Connect, it is important to understand how the authorization flow takes place, who should call the Authorization Server, and how to store the tokens. Moreover, microservices and…

By David WORMS

Nov 20, 2020

OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)

Categories: Containers Orchestration, Cyber Security | Tags: CNCF, Go Lang, JAMstack, LDAP, Kubernetes, OAuth2, OpenID Connect

Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architect your application with the given access, refresh and id tokens…

By David WORMS

Nov 17, 2020

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2

As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on-premises Big Data platform. Microsoft Azure recently…

By Gauthier LEONARD

Nov 5, 2020

Rebuilding HDP Hive: patch, test and build

Categories: Big Data, Infrastructure | Tags: Maven, GitHub, Java, Hive, Git, Release and features, TDP, Unit tests

The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…

By Leo SCHOUKROUN

Oct 6, 2020

Data versioning and reproducible ML with DVC and MLflow

Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage

Our talk on data versioning and reproducible Machine Learning, proposed to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on November 17-19…

Experiment tracking with MLflow on Databricks Community Edition

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn

Introduction to Databricks Community Edition and MLflow. Every day, the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…

Version your datasets with Data Version Control (DVC) and Git

Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Git, GitOps, SCM

Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…

By Grégor JOUET

Sep 3, 2020

Plugin architecture in JavaScript and Node.js with Plug and Play

Categories: Front End, Node.js | Tags: Asynchronous, DevOps, Programming, Agile, JavaScript, Open source, Release and features

Plug and Play helps library and application authors to introduce a plugin architecture into their code. It simplifies complex code execution with well-defined interception points, also called hooks…

By David WORMS

Aug 28, 2020

Installing Hadoop from source: build, patch and run

Categories: Big Data, Infrastructure | Tags: Maven, Java, LXD, Hadoop, HDFS, Docker, TDP, Unit tests

Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HPE and IBM BigInsights…

By Leo SCHOUKROUN

Aug 4, 2020

Download datasets into HDFS and Hive

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data Engineering, Data structures, Database, Hadoop, HDFS, Hive, Big Data, Data Analytics, Data Lake, Data lakehouse, Data Warehouse

Introduction. Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big Data technologies (Hadoop, Spark, …). This explains the explosion of the data volume and the…

By Aida NGOM

Jul 31, 2020

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers

In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…

By Aida NGOM

Jul 23, 2020

Automate a Spark routine workflow from GitLab to GCP

Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Learning and tutorial, Airflow, Spark, CI/CD, GitLab, GitOps, GCP, Terraform

A workflow consists of automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly applies to operational environments…

By Ferdinand DE BAECQUE

Jun 16, 2020

Importing data to Databricks: external tables and Delta Lake

Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python

During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…

Introducing Apache Airflow on AWS

Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: PySpark, Learning and tutorial, Airflow, Oozie, Spark, AWS, Docker, Python

Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…

By Aargan COINTEPAS

May 5, 2020

Expose a Rook-based Ceph cluster outside of Kubernetes

Categories: Containers Orchestration | Tags: Debug, Rook, Ceph, Docker, Kubernetes

We recently deployed an LXD-based Hadoop cluster and we wanted to be able to apply size quotas on some filesystems (i.e., service logs, user homes). Quota is a built-in feature of the Linux kernel used…

By Leo SCHOUKROUN

Apr 16, 2020

Snowflake, the Data Warehouse for the Cloud, introduction and tutorial

Categories: Business Intelligence, Cloud Computing | Tags: Cloud, Data Lake, Data Science, Data Warehouse, Snowflake

Snowflake is a SaaS-based data-warehousing platform that centralizes, in the cloud, the storage and processing of structured and semi-structured data. The increasing generation of data produced over…

By Jules HAMELIN-BOYER

Apr 7, 2020

Optimization of Spark applications in Hadoop YARN

Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python

Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…

By Ferdinand DE BAECQUE

Mar 30, 2020

MLflow tutorial: an open source Machine Learning (ML) platform

Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn

Introduction and principles of MLflow. With increasingly cheaper computing power and storage, and at the same time increasing data collection in all walks of life, many companies integrated Data Science…

Introduction to Ludwig and how to deploy a Deep Learning model via Flask

Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python

Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…

By Robert Walid SOARES

Mar 2, 2020

Install and debug Kubernetes inside LXD

Categories: Containers Orchestration | Tags: Debug, Linux, LXD, Docker, Kubernetes, Node

We recently deployed a Kubernetes cluster with the need to maintain cluster isolation on our bare metal nodes across our infrastructure. We knew that Virtual Machines would provide the required…

By Leo SCHOUKROUN

Feb 4, 2020

Policy enforcing with Open Policy Agent

Categories: Cyber Security, Data Governance | Tags: Ranger, Kafka, Authorization, Cloud, Kubernetes, SSL/TLS, REST

Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…

By Leo SCHOUKROUN

Jan 22, 2020

Cloudera CDP and Cloud migration of your Data Warehouse

Categories: Big Data, Cloud Computing | Tags: Azure, Cloudera, Data Hub, Data Lake, Data Warehouse

While one of our customers is anticipating a move to the Cloud, and with the recent announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate…

By David WORMS

Dec 16, 2019

Logstash pipelines remote configuration and self-indexing

Categories: Data Engineering, Infrastructure | Tags: Docker, Elasticsearch, Kibana, Logstash, Log4j

Logstash is a powerful data collection engine that integrates in the Elastic Stack (Elasticsearch - Logstash - Kibana). The goal of this article is to show you how to deploy a fully managed Logstash…

By Paul-Adrien CORDONNIER

Dec 13, 2019

Should you move your Big Data and Data Lake to the Cloud

Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Azure, Cloud, CDP, Databricks, GCP

Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New York, a general focus was put on moving customer’s Big…

By Joris RUMMENS

Dec 9, 2019

Hadoop Ozone part 3: advanced replication strategy with Copyset

Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes, Node

Hadoop Ozone provides a way of setting a ReplicationType for every write you make on the cluster. HDFS and Ratis are currently supported, but more advanced replication strategies can be achieved. In this…

Hadoop Ozone part 2: tutorial and getting started of its features

Categories: Infrastructure | Tags: CLI, Learning and tutorial, HDFS, Ozone, Amazon S3, Cluster, REST

The releases of Hadoop Ozone come with a handy docker-compose file to try out Ozone. The below instructions provide details on how to use it. You can also use the Katacoda training sandbox which…

Hadoop Ozone part 1: an introduction of the new filesystem

Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes

Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…

InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS

Categories: Big Data, Containers Orchestration | Tags: DevOps, LXD, Hadoop, Kafka, Spark, Ceph, Internship, Kubernetes, NoSQL

Context. The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PaaS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…

By David WORMS

Nov 26, 2019

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: Flink, DevOps, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python

Context. The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…

By David WORMS

Nov 26, 2019

Insert rows in BigQuery tables with complex columns

Categories: Cloud Computing, Data Engineering | Tags: GCP, BigQuery, Schema, SQL

Google’s BigQuery is a cloud data warehousing system designed to process enormous volumes of data with several features available. Out of all those features, let’s talk about the support of Struct…

By César BEREZOWSKI

Nov 22, 2019

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod

Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow

The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…

By Grégor JOUET

Nov 15, 2019

Kerberos and Spnego authentication on Windows with Firefox

Categories: Cyber Security | Tags: Firefox, HTTP, Kerberos, FreeIPA

In Greek mythology, Kerberos, also called Cerberus, guards the gates of the Underworld to prevent the dead from leaving. He is commonly described as a three-headed dog with a serpent’s tail, mane of…

By David WORMS

Nov 4, 2019

Notes on the Cloudera Open Source licensing model

Categories: Big Data | Tags: CDSW, License, Cloudera Manager, Open source

Following the publication of its Open Source licensing strategy on July 10, 2019 in an article called “our Commitment to Open Source Software”, Cloudera broadcasted a webinar yesterday October 2…

By David WORMS

Oct 25, 2019

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…

By David WORMS

Oct 8, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema

“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

Sep 30, 2019

Rook with Ceph doesn't provision my Persistent Volume Claims!

Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes

Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…

By Eyal CHOJNOWSKI

Sep 9, 2019

Users and RBAC authorizations in Kubernetes

Categories: Containers Orchestration, Data Governance | Tags: Cyber Security, RBAC, Authentication, Authorization, Kubernetes, SSL/TLS

Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate. To secure its access, user identities must be declared along with authentication and…

By Robert Walid SOARES

Aug 7, 2019

TensorFlow installation on Docker

Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Jupyter, Linux, AI, Deep Learning, Docker, TensorFlow

TensorFlow is Open Source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…

By Pierre SAUVAGE

Aug 5, 2019

Running Apache Hive 3, new features and tips and tricks

Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: JDBC, LLAP, Hadoop, Hive, Kafka, Release and features, Druid

Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It has been available since…

By Gauthier LEONARD

Jul 25, 2019

Auto-scaling Druid with Kubernetes

Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: CNCF, Helm, Metrics, OLAP, Operation, Container Orchestration, EC2, Cloud, Data Analytics, Kubernetes, Prometheus, Python, Druid

Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…

By Leo SCHOUKROUN

Jul 16, 2019

Mount Aladdin eToken in Firefox on Archlinux

Categories: Hack | Tags: Arch Linux, Cyber Security, Firefox, Security, Smart card, 2FA

Given you’re on Archlinux and have an Aladdin eToken, let’s see how we can mount it in Firefox for web authentication. An Aladdin eToken is a cryptographic device (token, smart card) that stores…

By César BEREZOWSKI

Jul 12, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Apache Spark Streaming, Spark, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

Jun 27, 2019

Google Cloud Summit Paris Notes

Categories: Events | Tags: AWS, Azure, Cloud, GCP, Kubernetes, On-premises

Google organized its yearly Summit edition 2019 in Paris on the 18th of June. This year’s event was the biggest yet in Paris, which reflects Google’s commitment to position itself in the French market…

By Tariq SAHNOUNI

Jun 26, 2019

Druid and Hive integration

Categories: Big Data, Business Intelligence, Tech Radar | Tags: LLAP, OLAP, Hive, Data Analytics, SQL, Druid

This article covers the integration between Hive Interactive (LLAP) and Druid. One can see it as a complement to the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…

By Pierre SAUVAGE

Jun 17, 2019

Spark Streaming part 3: DevOps, tools and tests for Spark applications

Categories: Big Data, Data Engineering, DevOps & SRE | Tags: Apache Spark Streaming, DevOps, Learning and tutorial, Spark

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…

By Oskar RYNKIEWICZ

May 31, 2019

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Categories: Data Engineering, Learning | Tags: Apache Spark Streaming, Spark, Python, Streaming

Spark can process streaming data on a multi-node Hadoop cluster relying on HDFS for the storage and YARN for the scheduling of jobs. Thus, Spark Structured Streaming integrates well with Big Data…

By Oskar RYNKIEWICZ

May 28, 2019

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Categories: Data Engineering, Learning | Tags: Apache Spark Streaming, Kafka, Spark, Big Data, Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The…

By Oskar RYNKIEWICZ

Apr 18, 2019

Recover from an EFI failure on a dedicated server

Categories: Hack | Tags: Infrastructure, Linux, Cloud

A few weeks ago, before upgrading our Ubuntu systems, we sort of messed around with our EFI partitions and the impacted servers never came back online on system reboot after the upgrade. Provisioning…

By Grégor JOUET

Apr 16, 2019

First Class Functions in Python

Categories: Hack, Learning | Tags: Programming, Python

I recently watched a talk by Dave Cheney about first class functions in Go. Python supports first class functions too, so can we use them in the same ways? Absolutely. I have been using Python for a…

By Arthur BUSSER

Apr 15, 2019

Gatsby.js, React and GraphQL for documentation websites

Categories: Adaltas Summit 2018, Front End | Tags: Gatsby, HTTP, JAMstack, React.js, SEO, API, GitOps, GraphQL, JavaScript, Markdown, Node.js

In the last few months, I have started to redesign some of our Open Source project websites. This includes the websites of the Node.js CSV project, the Node.js HBase client and the Nikita project, our…

By David WORMS

Apr 1, 2019

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

Categories: Data Engineering | Tags: Thrift, JDBC, Hadoop, Hive, Spark, SQL

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports…

By Oskar RYNKIEWICZ

Mar 25, 2019

Multihoming on Hadoop

Categories: Infrastructure | Tags: Kerberos, Hadoop, HDFS, Network

Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an…

By Joris RUMMENS

Mar 5, 2019

Introduction to Cloudera Data Science Workbench

Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…

By Mehdi ELALAMI

Feb 28, 2019

Apache Knox made easy!

Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ranger, Kerberos, LDAP, Active Directory, Knox, REST

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview. Apache Knox is an application gateway for interacting in a…

By Michael HATOUM

Feb 4, 2019

Installing Kubernetes on CentOS 7

Categories: Containers Orchestration | Tags: CentOS, cgroups, CNCF, DevOps, Infrastructure, Namespaces, Red Hat, VM, Ceph, Docker, Kubernetes

This article explains how to install a Kubernetes cluster. I will dive into what each step does so you can build a thorough understanding of what is going on. This article is based on my talk from the…

By Arthur BUSSER

Jan 29, 2019

Self-sovereign identities with verifiable claims

Categories: Data Governance | Tags: Authentication, Blockchain, Cloud, IAM, Ledger

Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues. Self-sovereign identities are an attempt to solve a couple of issues. The first is the…

By Nabil MELLAL

Jan 23, 2019

Applying Deep Reinforcement Learning to Poker

Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python

We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…

By Oscar BLAZEJEWSKI

Jan 9, 2019

LXD: The Missing Piece

Categories: Containers Orchestration | Tags: CPU, Linux, LXD, VM, Docker, Kubernetes

LXD stands for Linux Container Daemon. Yet another container technology. But LXD is very different. It stands apart from the pack. It is not necessarily better nor much faster nor more secure! But it…

By Tariq SAHNOUNI

Dec 28, 2018

Monitoring a production Hadoop cluster with Kubernetes

Categories: DevOps & SRE | Tags: Thrift, Grafana, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Kubernetes, Node, Node.js, Prometheus, Python

Monitoring a production grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…

By Paul-Adrien CORDONNIER

Dec 21, 2018

CodaLab – Data Science competitions

Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…

By Robert Walid SOARES

Dec 17, 2018

Native modules for Node.js with N-API

Categories: Adaltas Summit 2018, Front End | Tags: C++, Kerberos, NPM, JavaScript, Node.js

How to create native modules for Node.js? How to use N-API, the future of native addons development? Writing C/C++ addons is a useful and powerful feature of the Node.js runtime. Let’s explore them…

By Xavier HERMAND

Dec 12, 2018

Microsoft introduces Cloud Native Application Bundles

Categories: Containers Orchestration | Tags: CLI, Helm, Packaging, Docker, Kubernetes

At DockerCon EU 2018 in Barcelona, Matt Butcher, Principal Engineer at Microsoft and inventor of Helm, introduced CNAB, Cloud Native Application Bundles, a packaging format for distributed…

By Arthur BUSSER

Dec 4, 2018

Jumbo, the Hadoop cluster bootstrapper

Categories: Infrastructure | Tags: Ambari, Automation, Ansible, Cluster, Vagrant, HDP, REST

Introducing Jumbo, a Hadoop cluster bootstrapper for developers. Jumbo helps you deploy development environments for Big Data technologies. It takes a few minutes to get a custom virtualized Hadoop…

By Gauthier LEONARD

Nov 29, 2018

Main advantages of GraphQL as an alternative to REST

Categories: Front End | Tags: gRPC, API, GraphQL, JavaScript Object Notation (JSON), Node.js, Registry, REST

GraphQL is based on a simple idea, moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and he builds…

By David WORMS

Nov 27, 2018

Node.js CSV version 4 - re-writing and performance

Categories: Node.js | Tags: CLI, Data Engineering, Refactoring, CSV, Release and features

Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete rewrite of the project focusing on performance. It also comes with new functionalities as well as…

By David WORMS

Nov 19, 2018

Hadoop cluster takeover with Apache Ambari

Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, iptables, Kerberos, Nikita, Systemd, Cluster, HDP, Node, Node.js, REST

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari; we called this the Ambari Takeover. This is a risky process and we will detail why this…

By Leo SCHOUKROUN

Nov 15, 2018

Managing User Identities on Big Data Clusters

Categories: Cyber Security, Data Governance | Tags: Kerberos, LDAP, Active Directory, Ansible, FreeIPA, IAM

Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to…

By David WORMS

Nov 8, 2018

Apache Flink: past, present and future

Categories: Data Engineering | Tags: Flink, Pipeline, Kubernetes, Machine Learning, SQL, Streaming

Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…

By César BEREZOWSKI

Nov 5, 2018

One week to discuss technology in a Moroccan riad

Categories: Adaltas Summit 2018, Learning | Tags: Flink, CDSW, Gatsby, React.js, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js

Adaltas is organizing its first conference this year, from October 22 to 26. On the agenda of these 5 days of conference: discussing technology in one of the most beautiful riads of Marrakech. Mix the…

By David WORMS

Oct 11, 2018

Nvidia and AI on the edge

Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TensorFlow

In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…

By Yliess HATI

Oct 10, 2018

Deploying a secured Flink cluster on Kubernetes

Categories: Big Data | Tags: Flink, Encryption, Kerberos, HDFS, Kafka, Elasticsearch, SSL/TLS

When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…

By David WORMS

Oct 8, 2018

KVM machines for Vagrant on Archlinux

Categories: DevOps & SRE | Tags: Arch Linux, KVM, Linux, Virtualization, VM, Vagrant

Vagrant supports different providers to manage virtualization. In a Linux environment, you can dramatically improve VM performance by using the libvirt provider and the KVM hypervisor. This tutorial…

By Gauthier LEONARD

Sep 19, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: Micro Services, Open API, Deep Learning, Internship, Kubernetes, Neural Network, Node.js

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…

By Yliess HATI

Sep 18, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Rolling Upgrade, HDFS, Spark, YARN, Docker

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

Jul 24, 2018

Curing the Kafka blindness with the UI manager

Categories: Big Data | Tags: Ambari, Ranger, Hortonworks, HDF, JMX, UI, Kafka, HDP

Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This article covers a new graphical interface to oversee Kafka. It was…

By Lucas BAKALIAN

Jun 20, 2018

A CoreOS development cluster with Vagrant and VirtualBox

Categories: Hack, Infrastructure | Tags: Arch Linux, CoreOS, Linux, VirtualBox, etcd, Vagrant

Following CoreOS’s instructions on how to set up a development environment in VirtualBox did not work out well for me. Here are the steps I followed to get Container Linux up and running with Vagrant…

By Arthur BUSSER

Jun 20, 2018

Guide to Keybase encrypted directories

Categories: Cyber Security, Hack | Tags: Cryptography, Encryption, File system, Keybase, PGP, Authorization

This is a guide to using Keybase’s encrypted directories to store and share files. Keybase is a group, file and chat application whose goal is to bring public key crypto based on PGP to everyone in…

By Arthur BUSSER

Jun 18, 2018

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: NiFi, Data Governance, HDF, Operation, Avro, Hive, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features

This article covers the “Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the DataWorks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Accelerating query processing with materialized views in Apache Hive

Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, OLAP, Hive, Release and features, SQL, Druid

The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk “Accelerating query processing with materialized views in Apache Hive” about it…

By Paul-Adrien CORDONNIER

May 31, 2018

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

By Grégor JOUET

May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a talk given at the DataWorks Summit 2018 in Berlin. It was about the new…

By Yliess HATI

May 29, 2018

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, NiFi, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, Spark, Data Science, Elasticsearch, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak

Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…

By Joris RUMMENS

May 28, 2018

Omid: Scalable and highly available transaction processing for Apache Phoenix

Categories: Big Data, DataWorks Summit 2018 | Tags: Omid, Phoenix, Transaction, ACID, HBase, SQL

Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…

By Xavier HERMAND

May 24, 2018

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Flink, Pipeline, Spark

In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. At DataWorks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, REST

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Leo SCHOUKROUN

May 23, 2018

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming

Let’s dive into the new features offered by the 2.3 release of Apache Spark. This article is a synthesis of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Essential questions about Time Series

Categories: Big Data | Tags: Grafana, HBase, Hive, ORC, Data Science, Elasticsearch, IOT, Druid

Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…

By David WORMS

Mar 18, 2018

Execute Python in an Oozie workflow

Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST

Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky. Let’s see how to do that. I’ve recently designed a workflow that would interact…

By César BEREZOWSKI

Mar 6, 2018