Blog

Yahoo’s Vespa Engine

Vespa is Yahoo’s fully autonomous and self-sufficient big data processing and serving engine. It aims to serve the results of queries over huge amounts of data in real time, for example search results or recommendations for a user. Yahoo – or Oath – recently made Vespa open source on GitHub. At [...]

October 16th, 2017 | Categories: Tech Radar

Exposing Kafka on two different networks

A Big Data setup usually requires multiple network interfaces; let’s see how to set up Kafka on more than one of them. Kafka is an open-source stream-processing platform that functions as a distributed publish/subscribe messaging system. It is designed for high throughput, with built-in partitioning, replication, and fault tolerance. [...]

July 22nd, 2017 | Categories: Blog

Change Ambari’s topbar color

We recently had a client with multiple environments (Production, Integration, Testing, ...) running on HDP and managed using one Ambari instance per cluster. One of the questions that came up was the following: we need a way to distinguish our environments in Ambari, and the cluster name alone is not visually distinctive enough; how can [...]

July 9th, 2017 | Categories: Hack

MiNiFi: Data at Scales & the Values of Starting Small

This post is part of the series on the Dataworks Summit 2017 (ex-Hadoop Summit). The speaker is Aldrin Piri from Hortonworks. This talk briefly presented Apache NiFi and explained where MiNiFi came from: basically, it's a minimal NiFi agent to deploy on small devices to bring data into a cluster's NiFi pipeline (e.g. IoT). Here are [...]

July 8th, 2017 | Categories: Blog, Events

HDP cluster supervision

About: With the current growth of Big Data technologies, more and more companies are building their own clusters in the hope of getting some value from their data. One main concern while building these infrastructures is the capacity to continuously monitor the cluster's health and report issues as fast as possible. This is where supervision comes in. [...]

July 5th, 2017 | Categories: Big Data

Get in control of your workflows with Apache Airflow

Presentation by Christian Trebing from BlueYonder. Introduction. Use case: how to handle data coming in regularly from customers? Option 1: use cron (only time triggers; error handling is hard; inconvenient when runs overlap). Option 2: write a workflow processing tool; the start is easy, but you soon reach limits: invest much more than envisioned of work with [...]

July 17th, 2016 | Categories: Events

Apache Apex: next gen Big Data analytics

Presentation by Thomas Weise from DataTorrent (developers of Apex). Introduction: Apache Apex is an in-memory distributed parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind: YARN is used for resource management and scheduling, and HDFS is used to store persistent state. Application development model: a stream [...]

July 17th, 2016 | Categories: Events

EclairJS – Putting a Spark in Web Apps

Presentation by David Fallside from IBM; images extracted from the presentation. Introduction: web app development has moved from Java to Node.js and JavaScript, which provide a simple and rich environment with npm. EclairJS is a Node.js library that provides bindings to a Spark application: an RDD is bound to a JS object that is made [...]

July 17th, 2016 | Categories: Events