Tuning
Related articles

Apache HBase: RegionServers co-location
Categories: Big Data, Adaltas Summit 2021, Infrastructure | Tags: Ambari, Database, HDP, Infrastructure, Tuning, Hadoop, HBase, Big Data, Storage
RegionServers are the processes that manage the storage and retrieval of data in Apache HBase, the non-relational column-oriented database in Apache Hadoop. It is through their daemons that any CRUD…
Feb 22, 2022

Optimization of Spark applications in Hadoop YARN
Categories: Data Engineering, Learning | Tags: Tuning, Hadoop, Spark, Python
Apache Spark is an in-memory data processing tool widely used in companies to deal with Big Data issues. Running a Spark application in production requires user-defined resources. This article…
Mar 30, 2020

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod
Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow
The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…
By Grégor JOUET
Nov 15, 2019

Introduction to Cloudera Data Science Workbench
Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook
Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…
Feb 28, 2019

TensorFlow on Spark 2.3: The Best of Both Worlds
Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, YARN, C++, CPU, GPU, Tuning, Spark, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow
The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a talk given at the DataWorks Summit 2018 in Berlin. It was about the new…
By Yliess HATI
May 29, 2018

What's new in Apache Spark 2.3?
Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, ORC, PySpark, Tuning, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming
Let’s dive into the new features offered by the 2.3 release of Apache Spark. This article draws on the following talks given at the DataWorks Summit 2018 and on additional research: Apache…
May 23, 2018

Timeseries storage in Hadoop and Hive
Categories: Data Engineering | Tags: Hive, CRM, timeseries, Tuning, Hadoop, HDFS, File Format
In the next few weeks, we will be exploring the storage and analysis of a large generated dataset. This dataset is composed of CRM tables associated with one time series table of about 7,000 billion rows…
By David WORMS
Jan 10, 2012