Big Data

TensorFlow on Spark 2.3: The Best of Both Worlds

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. […]

Running Enterprise Workloads in the Cloud with Cloudbreak

This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Cloudbreak, Hortonworks’ automated deployment tool for cloud environments, describes and comments on the features that Peter and Richard explained in their talk, and gives some personal guidelines on when and why [...]

May 28th, 2018 | Categories: Big Data, DataWorks Summit 2018 | 1 Comment

Omid: Scalable and highly available transaction processing for Apache Phoenix

Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. […]

May 24th, 2018 | Categories: Big Data, DataWorks Summit 2018, Events | 1 Comment

Apache Beam: a unified programming model for data processing pipelines

In this article, we will review the concepts, the history and the future of Apache Beam, which may well become the new standard for defining data processing pipelines. […]

Present and future of Hadoop workflow scheduling: Oozie 5.x

During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covered the new features released in Oozie 5.0 as well as features planned for Oozie 5.x, which is the main subject of this article. The speakers also spent some time discussing Apache Ambari’s Workflow Scheduler and its way [...]

May 23rd, 2018 | Categories: Big Data, DataWorks Summit 2018 | 2 Comments

Essential questions about Time Series

Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. […]

March 19th, 2018 | Categories: Big Data, Data Engineering | 0 Comments

HDP cluster supervision

With the current growth of Big Data technologies, more and more companies are building their own clusters in the hope of getting some value from their data. One main concern while building these infrastructures is the capacity to continuously monitor the cluster's health and report issues as fast as possible. This is where supervision comes in. [...]

July 5th, 2017 | Categories: Big Data | 2 Comments

Hive, Calcite and Druid

BI/OLAP requires interactive visualization of complex data streams: real-time bidding events, user activity streams, voice call logs, network traffic flows, firewall events, application KPIs. Traditional solutions: RDBMS (MySQL...): don't scale, need caching, but ad hoc queries remain slow; key/value store (HBase...): quick but takes forever to compute (pre-materialization of data). Context: Created in 2011, open-sourced [...]

July 14th, 2016 | Categories: Big Data | 0 Comments

Oracle to Apache Hive with the Oracle SQL Connector

In a previous article published last week, I introduced the choices available to connect Oracle and Hadoop. In a follow-up article, I covered the Oracle SQL Connector, its installation and integration with Apache Hadoop, and more specifically how to declare a file present inside HDFS, the Hadoop filesystem, as a database table inside the [...]

July 27th, 2013 | Categories: Big Data | 2 Comments