Big Data

Oracle and Hive: how is data published?

In the past few days, I've published three related articles: the first covers the options for integrating Oracle and Hadoop, the second explains how to install and use the Oracle SQL Connector with HDFS, and the third explains how to install and use the Oracle SQL Connector with Hive. Those last two [...]

By | 2018-06-05T22:37:17+00:00 July 6th, 2013|Categories: Big Data|0 Comments

Components for CDH and HDP

I wanted to compare the components distributed by Cloudera and Hortonworks. The comparison also gives an idea of the versions packaged by the two distributions. At the time of this writing, April 2013, I am comparing the Cloudera distribution 4.2.0 and the Hortonworks Data Platform 2.0.0. CDH 4.2.0 bigtop-jsvc 1.0.10-cdh4.2.0 bigtop-tomcat 6.0.35-cdh4.2.0 datafu [...]

By | 2017-11-21T20:17:08+00:00 July 2nd, 2013|Categories: Big Data|0 Comments

HDFS and Hive storage: comparing file formats and compression methods

A few days ago, we ran a test to compare the different file formats and compression methods available in Hive. Some of these formats are native to HDFS and apply to all Hadoop users. The test suite is composed of similar Hive queries that create a [...]

By | 2018-06-05T22:37:21+00:00 July 15th, 2012|Categories: Big Data|0 Comments

Two Hive UDAFs to convert an aggregation to a map

I am publishing two new Hive UDAFs to help with maps in Apache Hive. The source code is available on GitHub in two Java classes, “UDAFToMap” and “UDAFToOrderedMap”, or you can download the jar file. The first function converts an aggregation into a map and internally uses a Java HashMap. The second function extends [...]

By | 2018-06-05T22:37:23+00:00 March 6th, 2012|Categories: Big Data|0 Comments
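
To illustrate the idea behind the two UDAFs described above, here is a minimal sketch in plain Java. It is not the published source and it does not use the Hive UDAF API: the class name `ToMapSketch` and both method names are hypothetical. The first method collapses (key, value) rows into a `HashMap`, as the excerpt states. The second assumes that the ordered variant keeps keys sorted and uses a `TreeMap` for that, which is a guess based only on the name “UDAFToOrderedMap”.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToMapSketch {

    // Collapse (key, value) rows into an unordered map.
    // If a key appears more than once, the last value seen wins.
    public static Map<String, Integer> toMap(List<Map.Entry<String, Integer>> rows) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            result.put(row.getKey(), row.getValue());
        }
        return result;
    }

    // Same aggregation, but keys are kept in their natural sort order.
    // Assumption: a TreeMap stands in for whatever the real ordered UDAF uses.
    public static Map<String, Integer> toOrderedMap(List<Map.Entry<String, Integer>> rows) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> row : rows) {
            result.put(row.getKey(), row.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> rows = List.of(
                Map.entry("b", 2),
                Map.entry("a", 1),
                Map.entry("c", 3));
        // Prints {a=1, b=2, c=3}: iteration follows key order, not insertion order.
        System.out.println(toOrderedMap(rows));
    }
}
```

In Hive itself the rows would come from a `GROUP BY`, and the aggregation state would be merged across mappers and reducers; the sketch only shows the final map-building step.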

Timeseries storage in Hadoop and Hive

In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billion rows. Before importing the dataset into Hive, we will explore different optimization options expected to impact speed and storage size. [...]

By | 2018-06-05T22:37:29+00:00 January 10th, 2012|Categories: Big Data|0 Comments

Introduction to MapReduce

Information systems have more and more data to store and process. Companies such as Google, Facebook, Twitter and many others store astronomical amounts of information coming from their customers, and must be able to serve them the best recommendations while ensuring the longevity of their systems. [...]

By | 2018-06-05T22:37:36+00:00 June 26th, 2010|Categories: Big Data|0 Comments