Big Data
Data, and the insight it offers, is essential for business to innovate and differentiate. Coming from a variety of sources, from inside the firewall out to the edge, the growth of data in terms of volume, variety and speed leads to innovative approaches. Today, Data Lakes allow organizations to accumulate huge reservoirs of information for future analysis. At the same time, the Cloud provides easy access to technologies to those who do not have the necessary infrastructure and Artificial Intelligence promises to proactively simplify management.
With Big Data technologies, Business Intelligence is entering a new era. Hadoop and the likes, NoSQL databases, and Cloud managed infrastrutures store and represent structured and unstructured data and time series such as logs and sensors. From collect to visualization, the whole processing chain is processed in batch and real time.
Articles related to Big Data
Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, CI/CD, Data Engineering, Deep Learning, Pipeline, Data Science, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow
Putting Machine Learning (ML) and Deep Learning (DL) models in production certainly is a difficult task. It has been recognized as more failure-prone and time consuming than the modeling itself, yet…
Mar 3, 2021
Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin
Categories: Big Data, Infrastructure | Tags: Hive, Maven, Spark, Git, Unit tests, Hadoop, HBase, Release and features
The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3 compatible object storages are growing in popularity, HDFS and YARN…
Dec 18, 2020
Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)
Categories: Big Data, Cloud Computing, Data Engineering | Tags: HDFS, NiFi, Authentication, Authorization, Hadoop, Azure Data Lake Storage (ADLS), Azure, OAuth2
As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on premise Big Data platform. Microsoft Azure recently…
Nov 5, 2020
Rebuilding HDP Hive: patch, test and build
Categories: Big Data, Infrastructure | Tags: Hive, Maven, Git, GitHub, Java, Unit tests, Release and features
The Hortonworks HDP distribution will soon be deprecated in favor of Cloudera’s CDP. One of our clients wanted a new Apache Hive feature backported into HDP 2.6.0. We thought it was a good opportunity…
Oct 6, 2020
Installing Hadoop from source: build, patch and run
Categories: Big Data, Infrastructure | Tags: HDFS, Maven, Docker, Java, LXD, Unit tests, Hadoop
Commercial Apache Hadoop distributions have come and gone. The two leaders, Cloudera and Hortonworks, have merged: HDP is no more and CDH is now CDP. MapR has been acquired by HP and IBM BigInsights…
Aug 4, 2020
Download datasets into HDFS and Hive
Categories: Big Data, Data Engineering | Tags: Analytics, HDFS, Hive, Big Data, Data Analytics, Data Engineering, Data structures, Database, Hadoop, Data Lake, Data Warehouse
Introduction Nowadays, the analysis of large amounts of data is becoming more and more possible thanks to Big data technology (Hadoop, Spark,…). This explains the explosion of the data volume and the…
By Aida NGOM
Jul 31, 2020
Comparaison of different file formats in Big Data
Categories: Big Data, Data Engineering | Tags: Analytics, Avro, HDFS, Hive, Kafka, MapReduce, ORC, Spark, Batch processing, Big Data, CSV, Data Analytics, Data structures, Database, JSON, Protocol Buffers, Hadoop, Parquet, Kubernetes, XML
In data processing, there are different types of files formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…
By Aida NGOM
Jul 23, 2020
Automate a Spark routine workflow from GitLab to GCP
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Airflow, Spark, CI/CD, Learning and tutorial, GitLab, GCP, Terraform
A workflow consists in automating a succession of tasks to be carried out without human intervention. It is an important and widespread concept which particularly apply to operational environments…
Jun 16, 2020
Introducing Apache Airflow on AWS
Categories: Big Data, Cloud Computing, Containers Orchestration | Tags: Airflow, Oozie, Spark, PySpark, Docker, Learning and tutorial, AWS, Python
Apache Airflow offers a potential solution to the growing challenge of managing an increasingly complex landscape of data management tools, scripts and analytics processes. It is an open-source…
May 5, 2020
Cloudera CDP and Cloud migration of your Data Warehouse
Categories: Big Data, Cloud Computing | Tags: Cloudera, Data Hub, Data Lake, Data Warehouse, Azure
While one of our customer is anticipating a move to the Cloud and with the recent announcement of Cloudera CDP availability mi-september during the Strata conference, it seems like the appropriate…
By David WORMS
Dec 16, 2019
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Cloud, CDP, Databricks, GCP, Azure
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New-York, a general focus was put on moving customer’s Big…
Dec 9, 2019
InfraOps & DevOps Internship - build a Big Data & Kubernetes PaaS
Categories: Big Data, Containers Orchestration | Tags: Kafka, Spark, DevOps, LXD, NoSQL, Hadoop, Ceph, Kubernetes
Context The acquisition of a high-capacity cluster is in line with Adaltas’ desire to build a PAAS-type offering to use and to provide Big Data and container orchestration platforms. The platforms are…
By David WORMS
Nov 26, 2019
Notes on the Cloudera Open Source licensing model
Categories: Big Data | Tags: CDSW, License, Cloudera Manager, Open source
Following the publication of its Open Source licensing strategy on July 10, 2019 in an article called “our Commitment to Open Source Software”, Cloudera broadcasted a webinar yesterday October 2…
By David WORMS
Oct 25, 2019
Machine Learning model deployment
Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema
“Enterprise Machine Learning requires looking at the big picture … from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…
Sep 30, 2019
Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: Druid, Hive, Kafka, JDBC, LLAP, Hadoop, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…
Jul 25, 2019
Auto-scaling Druid with Kubernetes
Categories: Big Data, Business Intelligence, Containers Orchestration | Tags: EC2, Druid, CNCF, Container Orchestration, Data Analytics, Helm, Metrics, OLAP, Operation, Prometheus, Cloud, Kubernetes, Python
Apache Druid is an open-source analytics data store which could leverage the auto-scaling abilities of Kubernetes due to its distributed nature and its reliance on memory. I was inspired by the talk…
Jul 16, 2019
Spark Streaming part 3: DevOps, tools and tests for Spark applications
Categories: Big Data, Data Engineering, DevOps & SRE | Tags: Spark, Apache Spark Streaming, DevOps, Learning and tutorial
Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data…
Jun 19, 2019
Druid and Hive integration
Categories: Big Data, Business Intelligence, Tech Radar | Tags: Druid, Hive, Data Analytics, LLAP, OLAP, SQL
This article covers the integration between Hive Interactive (LDAP) and Druid. One can see it as a complement of the Ultra-fast OLAP Analytics with Apache Hive and Druid article. Tools description…
Jun 17, 2019
Apache Knox made easy!
Categories: Big Data, Cyber Security, Adaltas Summit 2018 | Tags: Ranger, Kerberos, LDAP, Active Directory, REST, Knox
Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? Apache Knox overview Apache Knox is an application gateway for interacting in a…
Feb 4, 2019
Hadoop cluster takeover with Apache Ambari
Categories: Big Data, DevOps & SRE, Adaltas Summit 2018 | Tags: Ambari, Automation, HDP, iptables, Kerberos, Nikita, REST, Systemd, Cluster, Node, Node.js
We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari, we called this the Ambari Takeover. This is a risky process and we will detail why this…
Nov 15, 2018
Deploying a secured Flink cluster on Kubernetes
Categories: Big Data | Tags: Flink, HDFS, Kafka, Elasticsearch, Encryption, Kerberos, SSL/TLS
When deploying secured Flink applications inside Kubernetes, you are faced with two choices. Assuming your Kubernetes is secure, you may rely on the underlying platform or rely on Flink native…
By David WORMS
Oct 8, 2018
Clusters and workloads migration from Hadoop 2 to Hadoop 3
Categories: Big Data, Infrastructure | Tags: HDFS, Slider, Spark, YARN, Docker, Erasure Coding, Rolling Upgrade
Hadoop 2 to Hadoop 3 migration is a hot subject. How to upgrade your clusters, which features present in the new release may solve current problems and bring new opportunities, how are your current…
Jul 25, 2018
Curing the Kafka blindness with the UI manager
Categories: Big Data | Tags: Ambari, Kafka, Ranger, Hortonworks, HDP, HDF, JMX, UI
Today it’s really difficult for developers, operators and managers to visualize and monitor what happens in a Kafka cluster. This articles covers a new graphical interface to oversee Kafka. It was…
Jun 20, 2018
Data Lake ingestion best practices
Categories: Big Data, Data Engineering | Tags: Avro, Hive, NiFi, ORC, Spark, Data Governance, HDF, Operation, Protocol Buffers, Data Lake, File Format, Registry, Schema
Creating a Data Lake requires rigor and experience. Here are some good practices around data ingestion both for batch and stream architectures that we recommend and implement with our customers…
By David WORMS
Jun 18, 2018
Apache Hadoop YARN 3.0 – State of the union
Categories: Big Data, DataWorks Summit 2018 | Tags: HDFS, MapReduce, YARN, Cloudera, Docker, GPU, Hortonworks, Hadoop, Data Science, Release and features
This article covers the ”Apache Hadoop YARN: state of the union” talk held by Wangda Tan from Hortonworks during the Dataworks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…
May 31, 2018
Running Enterprise Workloads in the Cloud with Cloudbreak
Categories: Big Data, Cloud Computing, DataWorks Summit 2018 | Tags: Cloudbreak, HDP, Operation, Hadoop, AWS, GCP, Azure, OpenStack
This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin. It presents Hortonworks’ automated deployment tool…
May 28, 2018
Omid: Scalable and highly available transaction processing for Apache Phoenix
Categories: Big Data, DataWorks Summit 2018 | Tags: ACID, Omid, Phoenix, Transaction, HBase, SQL
Apache Omid provides a transactional layer on top of key/value NoSQL databases. In practice, it is usually used on top of Apache HBase. Credits to Ohad Shacham for his talk and his work for Apache…
May 24, 2018
Present and future of Hadoop workflow scheduling: Oozie 5.x
Categories: Big Data, DataWorks Summit 2018 | Tags: Hive, Oozie, Sqoop, HDP, REST, Hadoop, CDH
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covers the new features released in Oozie 5.0, including future features of…
May 23, 2018
Essential questions about Time Series
Categories: Big Data | Tags: Druid, Hive, ORC, Elasticsearch, Graphana, IOT, HBase, Data Science
Today, the bulk of Big Data is temporal. We see it in the media and among our customers: smart meters, banking transactions, smart factories, connected vehicles … IoT and Big Data go hand in hand. We…
By David WORMS
Mar 19, 2018
Ambari - How to blueprint
Categories: Big Data, DevOps & SRE | Tags: Ambari, Ranger, Automation, DevOps, Operation, REST
As infrastructure engineers at Adaltas, we deploy Hadoop clusters. A lot of them. Let’s see how to automate this process with REST requests. While really handy for deploying one or two clusters, the…
Jan 17, 2018
Cloudera Sessions Paris 2017
Categories: Big Data, Events | Tags: EC2, Cloudera, Altus, CDSW, SDX, PaaS, CDH, Data Science, Azure
Adaltas was at the Cloudera Sessions on October 5, where Cloudera showcased their new products and offerings. Below you’ll find a summary of what we witnessed. Note: the information were aggregated in…
Oct 16, 2017
Change Ambari's topbar color
Categories: Big Data, Hack | Tags: Ambari, Front-end
We recently had a client that has multiple environments (Production, Integration, Testing, …) running on HDP and managed using one Ambari instance per cluster. One of the questions that came up was…
Jul 9, 2017
MiNiFi: Data at Scales & the Values of Starting Small
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: MiNiFi, NiFi, Cloudera, C++, HDP, HDF, IOT
This conference presented rapidly Apache NiFi and explained where MiNiFi came from: basically it’s a NiFi minimal agent to deploy on small devices to bring data to a cluster’s NiFi pipeline (ex: IoT…
Jul 8, 2017
Advanced multi-tenant Hadoop and Zookeeper protection
Categories: Big Data, Infrastructure | Tags: Zookeeper, DoS, iptables, Operation, Scalability, Clustering, Consensus
Zookeeper is a critical component to Hadoop’s high availability operation. The latter protects itself by limiting the number of maximum connections (maxConns = 400). However Zookeeper does not protect…
Jul 5, 2017
HDP cluster monitoring
Categories: Big Data, DevOps & SRE, Infrastructure | Tags: Alert, Ambari, HDP, Metrics, Monitoring, REST
With the current growth of BigData technologies, more and more companies are building their own clusters in hope to get some value of their data. One main concern while building these infrastructures…
Jul 5, 2017
Hive Metastore HA with DBTokenStore: Failed to initialize master key
Categories: Big Data, DevOps & SRE | Tags: Hive, Bug, Infrastructure
This article describes my little adventure around a startup error with the Hive Metastore. It shall be reproducable with any secure installation, meaning with Kerberos, with high availability enabled…
By David WORMS
Jul 21, 2016
Get in control of your workflows with Apache Airflow
Categories: Big Data, Tech Radar | Tags: Airflow, DevOps, Cloud, Python
Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder. Introduction Use case: how to handle data coming in regularly from customers…
Jul 17, 2016
Hive, Calcite and Druid
Categories: Big Data | Tags: Analytics, Druid, Hive, Database, Hadoop
BI/OLAP requires interactive visualization of complex data streams: Real time bidding events User activity streams Voice call logs Network trafic flows Firewall events Application KPIs Traditionnal…
By David WORMS
Jul 14, 2016
Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: HDFS, GlusterFS, Red Hat, Hadoop, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have here recompiled my notes, at least partially. I will…
By David WORMS
Jul 3, 2015
Composants for CDH and HDP
Categories: Big Data | Tags: Flume, Hive, Oozie, Sqoop, Zookeeper, Cloudera, Hortonworks, HDP, Hadoop, CDH
I was interested to compare the different components distributed by Cloudera and HortonWorks. This also gives us an idea of the versions packaged by the two distributions. At the time of this writting…
By David WORMS
Sep 22, 2013
State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Kafka, Mesos, Phoenix, Pig, Hadoop, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, hbase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013
Oracle and Hive, how data are published?
Categories: Big Data | Tags: Hive, Sqoop, Oracle, Data Lake
In the past few days, I’ve published 3 related articles: a first one covering the option to integrate Oracle and Hadoop, a second one explaining how to install and use the Oracle SQL Connector with…
By David WORMS
Jul 6, 2013
The state of Hadoop distributions
Categories: Big Data | Tags: Cloudera, Hortonworks, Intel, Oracle, Hadoop
Apache Hadoop is of course made available for download on its official webpage. However, downloading and installing the several components that make a Hadoop cluster is not an easy task and is a…
By David WORMS
May 11, 2013
HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Analytics, Hive, ORC, Parquet, File Format
A few days ago, we have conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012
Hadoop and HBase installation on OSX in pseudo-distributed mode
Categories: Big Data, Learning | Tags: Big Data, Hue, Infrastructure, Hadoop, HBase, Deployment
The operating system chosen is OSX but the procedure is not so different for any Unix environment because most of the software is downloaded from the Internet, uncompressed and set manually. Only a…
By David WORMS
Dec 1, 2010
Storage and massive processing with Hadoop
Categories: Big Data | Tags: HDFS, Hadoop, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010
Node HBase, a NodeJs client for Apache HBase
Categories: Big Data, Node.js | Tags: Big Data, REST, HBase, Node.js
HBase is a “column familly” database from the Hadoop ecosystem built on the model of Google BigTable. HBase can accommodate very large volumes of data (tera or peta) while maintaining high…
By David WORMS
Nov 1, 2010
MapReduce introduction
Categories: Big Data | Tags: MapReduce, Big Data, Java, JavaScript
Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be…
By David WORMS
Jun 26, 2010