Storage
Storage is the capacity to retain digital information on a computer component. In practice, storage is organised in hierarchy, placing hot data which required fast but costly access closer to the CPU and cold data further away on slower but persistent devices sometimes accessed through the network. Fast but volatile storage is most often called "memory.".
The main characteristics of storage inclue volatility, mutability, accessibility, adressability, capacity, performance, energy use and security.
- Learn more
- Wikipedia
Related articles

Apache HBase: RegionServers co-location
Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, HDP, Infrastructure, Tuning, Hadoop, HBase, Big Data, Storage
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…
Feb 22, 2022

OAuth2 and OpenID Connect, a gentle and working introduction (Part 1)
Categories: Containers Orchestration, Cyber Security | Tags: CNCF, Go Lang, JAMstack, LDAP, Kubernetes, OpenID Connect
Understanding OAuth2, OpenID and OpenID Connect (OIDC), how they relate, how the communications are established, and how to architecture your application with the given access, refresh and id tokens…
By David WORMS
Nov 17, 2020

Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning proposed to the Data + AI Summit (formerly known as Spark+AI) is accepted. The summit will take place online the 17-19th November…
Sep 30, 2020

Rook with Ceph doesn't provision my Persistent Volume Claims!
Categories: DevOps & SRE | Tags: PVC, Linux, Rook, Ubuntu, Ceph, Cluster, Internship, Kubernetes
Ceph installation inside Kubernetes can be provisioned using Rook. Currently doing an internship at Adaltas, I was in charge of participating in the setup of a Kubernetes (k8s) cluster. To avoid…
Sep 9, 2019

Running Apache Hive 3, new features and tips and tricks
Categories: Big Data, Business Intelligence, DataWorks Summit 2019 | Tags: Druid, Hive, JDBC, LLAP, Hadoop, Kafka, Release and features
Apache Hive 3 brings a bunch of new and nice features to the data warehouse. Unfortunately, like many major FOSS releases, it comes with a few bugs and not much documentation. It is available since…
Jul 25, 2019

Apache Flink: past, present and future
Categories: Data Engineering | Tags: Flink, Pipeline, Kubernetes, Machine Learning, SQL, Streaming
Apache Flink is a little gem which deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading to by following the keynotes and presentations at Flink…
Nov 5, 2018

YARN and GPU Distribution for Machine Learning
Categories: Data Science, DataWorks Summit 2018 | Tags: YARN, GPU, Machine Learning, Neural Network, Storage
This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…
By Grégor JOUET
May 30, 2018

Notes after Katacoda Training on Kubernetes Container Orchestration
Categories: Containers Orchestration, Learning | Tags: Helm, Ingress, Kubeadm, CNI, Micro Services, Minikube, Kubernetes
A few weeks ago, I dedicated two days to follow the turorials available on Katacoda, the interactive learning platform for Kubernetes or any other container orchestration platform. I’m sharing my…
By David WORMS
Dec 14, 2017

Kubernetes Storage Primitives for Stateful Workloads
Categories: Cloud Computing, Containers Orchestration, Open Source Summit Europe 2017 | Tags: Container Storage Interface (CSI), PVC, Azure, Docker, GCE, Kubernetes, Storage
This article is based on the presentation “Introduction to Kubernetes Storage Primitives for Stateful Workloads” from the OSS Convention Prague 2017 by the {Code} team. So, let’s start, what is…
Oct 28, 2017

Kubernetes 1.8
Categories: Containers Orchestration, Open Source Summit Europe 2017 | Tags: containerd, CRD, OCI, RBAC, Kubernetes, Network, Release and features
The 1.8 release of Kubernetes brings a lot of new things. With 2500+ pull request, 2000+ commits, 400+ commiters, Kubernetes added 39 new features in this version. This is the richest release in terms…
Oct 24, 2017

Hive, Calcite and Druid
Categories: Big Data | Tags: Druid, Hive, Business intelligence, Database, Hadoop
BI/OLAP requires interactive visualization of complex data streams: Real time bidding events User activity streams Voice call logs Network trafic flows Firewall events Application KPIs Traditionnal…
By David WORMS
Jul 14, 2016

Red Hat Storage Gluster and its integration with Hadoop
Categories: Big Data | Tags: GlusterFS, Red Hat, Hadoop, HDFS, Storage
I had the opportunity to be introduced to Red Hat Storage and Gluster in a joint presentation by Red Hat France and the company StartX. I have here recompiled my notes, at least partially. I will…
By David WORMS
Jul 3, 2015

State of the Hadoop open-source ecosystem in early 2013
Categories: Big Data | Tags: Flume, Mesos, Phoenix, Pig, Hadoop, Kafka, Mahout, Data Science
Hadoop is already a large ecosystem and my guess is that 2013 will be the year where it grows even larger. There are some pieces that we no longer need to present. ZooKeeper, hbase, Hive, Pig, Flume…
By David WORMS
Jul 8, 2013

Merging multiple files in Hadoop
Categories: Hack | Tags: File system, Hadoop, HDFS
This is a command I used to concatenate the files stored in Hadoop HDFS matching a globing expression into a single file. It uses the “getmerge” utility of but contrary to “getmerge”, the final…
By David WORMS
Jan 12, 2013

HDFS and Hive storage - comparing file formats and compression methods
Categories: Big Data | Tags: Hive, ORC, Business intelligence, Parquet, File Format
A few days ago, we have conducted a test in order to compare various Hive file formats and compression methods. Among those file formats, some are native to HDFS and apply to all Hadoop users. The…
By David WORMS
Mar 13, 2012

Two Hive UDAF to convert an aggregation to a map
Categories: Data Engineering | Tags: Hive, Java, HBase, File Format
I am publishing two new Hive UDAF to help with maps in Apache Hive. The source code is available on GitHub in two Java classes: “UDAFToMap” and “UDAFToOrderedMap” or you can download the jar file. The…
By David WORMS
Mar 6, 2012

Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: Hive, CRM, timeseries, Tuning, Hadoop, HDFS, File Format
In the next few weeks, we will be exploring the storage and analytic of a large generated dataset. This dataset is composed of CRM tables associated to one timeserie table of about 7,000 billiard rows…
By David WORMS
Jan 10, 2012

Storage and massive processing with Hadoop
Categories: Big Data | Tags: Hadoop, HDFS, Storage
Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects…
By David WORMS
Nov 26, 2010