Articles published in 2018

Ambari - How to blueprint

Categories: Big Data, DevOps & SRE | Tags: Ambari, Automation, DevOps, Operation, Ranger, CDH, HDP, IaC, PostgreSQL, REST

As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…

By Joris RUMMENS

Jan 17, 2018

Execute Python in an Oozie workflow

Categories: Data Engineering | Tags: Oozie, Elasticsearch, Python, REST

Oozie workflows allow you to use multiple actions to execute code, however doing so with Python can be a bit tricky, let’s see how to do that. I’ve recently designed a workflow that would interact…

By César BEREZOWSKI

Mar 6, 2018

Publishing guidelines

Categories: DevOps & SRE | Tags: Arch Linux, KVM, VM, GitLab, Vagrant, Markdown

This is as much a set of guidelines targeting everyone publishing content on the web as rules for reviewers to ensure no validation is forgotten before submitting for publication. It mostly targets…

By David WORMS

Feb 28, 2018

Essential questions about Time Series

Categories: Big Data | Tags: Druid, HBase, Hive, ORC, Data Science, Elasticsearch, Grafana, IOT

Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…

By David WORMS

Mar 18, 2018

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Python, Streaming

Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Apache Beam: a unified programming model for data processing pipelines

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Apex, Beam, Java, Pipeline, Flink, Spark, Batch processing, Python, Streaming, TCO

In this article, we will review the concepts, the history and the future of Apache Beam, that may well become the new standard for data processing pipelines definition. At Dataworks Summit 2018 in…

By Gauthier LEONARD

May 24, 2018

Present and future of Hadoop workflow scheduling: Oozie 5.x

Categories: Big Data, DataWorks Summit 2018 | Tags: Hadoop, Hive, Oozie, Sqoop, CDH, HDP, Python, REST

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…

By Leo SCHOUKROUN

May 23, 2018

Omid: Scalable and highly available transaction processing for Apache Phoenix

Categories: Big Data, DataWorks Summit 2018 | Tags: Omid, Phoenix, Transaction, ACID, HBase, SQL

Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…

By Xavier HERMAND

May 24, 2018

Running Enterprise Workloads in the Cloud with Cloudbreak

Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, Operation, Hadoop, AWS, Azure, GCP, HDP, OpenStack

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…

By Joris RUMMENS

May 28, 2018

Apache Metron in the Real World

Categories: Cyber Security, DataWorks Summit 2018 | Tags: Algorithm, Solr, Storm, pcap, RDBMS, HDFS, Kafka, Metron, NiFi, Spark, Data Science, Elasticsearch, SQL

Apache Metron is a storage and analytic platform specialized in cyber security. This talk was about demonstrating the usages and capabilities of Apache Metron in the real world. The presentation was…

By Michael HATOUM

May 29, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow

The integration of TensorFlow With Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…

By Yliess HATI

May 29, 2018

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: arXiv, GPU, MXNet, YARN, Docker, Grafana, Machine Learning, Neural Network, Storage, TensorFlow

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

By Grégor JOUET

May 30, 2018

Accelerating query processing with materialized views in Apache Hive

Categories: Business Intelligence, DataWorks Summit 2018 | Tags: Calcite, OLAP, Druid, Hive, Release and features, SQL

The new materialized view feature is coming in Apache Hive 3.0. Jesus Camacho Rodriguez from Hortonworks held a talk ”Accelerating query processing with materialized views in Apache Hive” about it…

By Paul-Adrien CORDONNIER

May 31, 2018

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features

This article covers the ”Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Data Lake ingestion best practices

Categories: Big Data, Data Engineering | Tags: Data Governance, HDF, Operation, Avro, Hive, NiFi, ORC, Spark, Data Lake, File Format, Protocol Buffers, Registry, Schema

Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…

By David WORMS

Jun 18, 2018

Guide to Keybase encrypted directories

Categories: Cyber Security, Hack | Tags: Cryptography, Encryption, File system, Keybase, PGP, Authorization

This is a guide to using Keybase’s encrypted directories to store and share files. Keybase is a group, file and chat application who’s goal is to bring public key crypto based on PGP to everyone in…

By Arthur BUSSER

Jun 18, 2018

A CoreOS development cluster with Vagrant and VirtualBox

Categories: Hack, Infrastructure | Tags: Arch Linux, CoreOS, Linux, VirtualBox, Clustering, Consensus, etcd, Vagrant

Following CoreOS’s instructions on how to set up a development environment in VirtualBox did not work out well for me. Here are the steps I followed to get Container Linux up and running with Vagrant…

By Arthur BUSSER

Jun 20, 2018

Curing the Kafka blindness with the UI manager

Categories: Big Data | Tags: Ambari, Hortonworks, HDF, JMX, UI, Kafka, Ranger, HDP

Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This articles covers a new graphical interface to oversee Kafka. It was…

By Lucas BAKALIAN

Jun 20, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost

With the arrival of Hadoop 3, YARN offer more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

Jul 24, 2018

Clusters and workloads migration from Hadoop 2 to Hadoop 3

Categories: Big Data, Infrastructure | Tags: Slider, Erasure Coding, Operation, Rolling Upgrade, SLA, Hadoop, HBase, HDFS, Oozie, Spark, YARN, Docker, TCO

Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…

By Lucas BAKALIAN

Jul 25, 2018

KVM machines for Vagrant on Archlinux

Categories: DevOps & SRE | Tags: Arch Linux, KVM, Linux, Virtualization, VM, Vagrant

Vagrant supports different providers to manage virtualization. In a Linux environment, you can dramatically improve VM performance by using the libvirt provider and the KVM hypervisor. This tutorial…

By Gauthier LEONARD

Sep 19, 2018

Deploying a secured Flink cluster on Kubernetes

Categories: Big Data | Tags: Encryption, Flink, HDFS, Kafka, Elasticsearch, Kerberos, SSL/TLS

When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…

By David WORMS

Oct 8, 2018

Nvidia and AI on the edge

Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TCO, TensorFlow

In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…

By Yliess HATI

Oct 10, 2018

One week to discuss technology in a Moroccan riad

Categories: Adaltas Summit 2018, Learning | Tags: CDSW, Gatsby, React.js, Flink, Hadoop, Knox, Data Science, Deep Learning, Kubernetes, Node.js

Adaltas organise the year its first conference between the 22 and 26 of October. On the agenda of these 5 days of conference: discuss technology in one of the most beautiful riad of Marrakech. Mix the…

By David WORMS

Oct 11, 2018

Managing User Identities on Big Data Clusters

Categories: Cyber Security, Data Governance | Tags: LDAP, Active Directory, Ansible, FreeIPA, IaC, IAM, Kerberos

Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific when others are available across all clusters. It is not always easy to…

By David WORMS

Nov 8, 2018

Apache Flink: past, present and future

Categories: Data Engineering | Tags: Consistency, Micro Services, Pipeline, Flink, Batch processing, Kubernetes, Ledger, Machine Learning, Scikit-learn, SQL, Storage, Streaming

Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…

By César BEREZOWSKI

Nov 5, 2018

Hadoop cluster takeover with Apache Ambari

Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, iptables, Nikita, Systemd, Cluster, HDP, IaC, Kerberos, Node, Node.js, REST, SCM

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari, we called this the Ambari Takeover. This is a risky process and we will detail why this…

By Leo SCHOUKROUN

Nov 15, 2018

Node.js CSV version 4 - re-writing and performance

Categories: Node.js | Tags: CLI, Data Engineering, Refactoring, CSV, Release and features

Today, we release a new major version of the Node.js CSV parser project. Version 4 is a complete re-writing of the project focusing on performance. It also comes with new functionalities as well as…

By David WORMS

Nov 19, 2018

Main advantages of GraphQL as an alternative to REST

Categories: Front End | Tags: gRPC, API, GraphQL, JavaScript Object Notation (JSON), Node.js, Registry, REST

GraphQL is based on a simple idea, moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and he builds…

By David WORMS

Nov 27, 2018

Jumbo, the Hadoop cluster bootstrapper

Categories: Infrastructure | Tags: Ambari, Automation, Ansible, Cluster, Vagrant, HDP, IaC, Python, REST, SCM

Introducing Jumbo, a Hadoop cluster bootstrapper for developers. Jumbo helps you deploy development environments for Big Data technologies. It takes a few minutes to get a custom virtualized Hadoop…

By Gauthier LEONARD

Nov 29, 2018

Native modules for Node.js with N-API

Categories: Adaltas Summit 2018, Front End | Tags: C++, NPM, JavaScript, Kerberos, Node.js

How to create native modules for Node.js? How to use N-API, the future of native addons development? Writing C/C++ addon is a useful and powerful feature of the Node.js runtime. Let’s explore them…

By Xavier HERMAND

Dec 12, 2018

LXD: The Missing Piece

Categories: Containers Orchestration | Tags: CPU, Linux, LXD, VM, Docker, Kubernetes

LXD stands for Linux Container Daemon. Yet another container technology. But LXD is very different. It stands apart from the pack. It is not necessarily better nor much faster nor more secure! But it…

By Tariq SAHNOUNI

Dec 28, 2018

Monitoring a production Hadoop cluster with Kubernetes

Categories: DevOps & SRE | Tags: Thrift, Shinken, Hadoop, Knox, Cluster, Docker, Elasticsearch, Grafana, Kubernetes, Node, Node.js, Prometheus, Python

Monitoring a production grade Hadoop cluster is a real challenge and needs to be constantly evolving. The software we use today is based on Nagios. Very efficient when it comes to the simplest…

By Paul-Adrien CORDONNIER

Dec 21, 2018

CodaLab – Data Science competitions

Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…

By Robert Walid SOARES

Dec 17, 2018

Microsoft introduces Cloud Native Application Bundles

Categories: Containers Orchestration | Tags: CLI, Helm, Packaging, Docker, Kubernetes

At DockerCon EU 2018 in Barcelona, Matt Butcher, Principal Engineer at Microsoft and inventor of Helm, introduced CNAB, Cloud Native Application Bundles, a packaging format for distributed…

By Arthur BUSSER

Dec 4, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: CockroachDB, FoundationDB, Micro Services, NATS, Open API, React.js, Speech to text, Swagger, Vue.js, Kafka, Deep Learning, GitLab, IaC, Internship, JWT, Kubernetes, Neural Network, Node.js, Python

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning technics to summarize contents. It allows users to…

By Yliess HATI

Sep 18, 2018