Hadoop

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Spark Structured Streaming thus integrates well with Big Data infrastructures. This article presents a streaming data processing chain in a distributed environment. A cluster environment demands attention to aspects such as monitoring, stability, [...]

May 28th, 2019 | Categories: Big Data, Data Engineering | 2 Comments

Multihoming on Hadoop

Multihoming, which means having multiple networks attached to one node, is one of the main components for managing the heterogeneous network usage of an Apache Hadoop cluster. This article is an introduction to the concept and its applications for real-world businesses. […]

March 5th, 2019 | Categories: Adaltas Summit 2018, Big Data, Data Engineering | 0 Comments

Monitoring a production Hadoop cluster with Kubernetes

Monitoring a production-grade Hadoop cluster is a real challenge, and the monitoring setup needs to evolve constantly. The software we use today is based on Nagios. While very efficient for the simplest checks, it cannot meet the need for more complex verification. In this article, we will propose an architecture [...]

Hadoop cluster takeover with Apache Ambari

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari; we called this the Ambari Takeover. This is a risky process, and we will detail why this operation was required and how we did it. […]

November 15th, 2018 | Categories: Adaltas Summit 2018, Big Data | 0 Comments

One week to discuss technology in a Moroccan riad

This year, Adaltas is organising its first conference, from the 22nd to the 26th of October. On the agenda for these 5 days: discussing technology in one of the most beautiful riads of Marrakech. Mixing business with pleasure, learning and sharing with our feet in the swimming pool. The rule is simple, each participant [...]

October 11th, 2018 | Categories: Adaltas Summit 2018 | 0 Comments

Hive, Calcite and Druid

BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional solutions: RDBMS (MySQL..): don't scale, need caching, but ad hoc queries remain slow; key/value stores (HBase...): quick, but take forever to compute (pre-materialization of data). Context: created in 2011, open-sourced [...]

July 14th, 2016 | Categories: Big Data | 0 Comments

Red Hat Storage Gluster and its integration with Hadoop

I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have compiled my notes here, at least partially. I will conclude with the integration between Red Hat Storage and Hadoop, especially what we can expect before conducting an [...]

July 3rd, 2016 | Categories: Big Data | 0 Comments

Storage and massive processing with Hadoop

Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects at a growing number of web players (Yahoo!, eBay, Facebook, LinkedIn, Twitter), and their size continues to increase. Yahoo! has 45,000 machines with the [...]

November 26th, 2010 | Categories: Big Data | 0 Comments