Apache Hadoop MapReduce

MapReduce is a distributed data processing framework. It is part of the Apache Hadoop framework and works on top of Apache HDFS.

This framework enables efficient processing of large amounts of data distributed across multiple nodes.

During a MapReduce job, the data is split into chunks that are processed in parallel by the MapReduce tasks. The two main tasks of MapReduce are:

  • Mapper: The mapper tasks process input records one by one and output key/value pairs. The key is derived from the input record and determines how results are grouped; the value is the result of the operation on that record.
  • Reducer: The reducer tasks process the output of the mappers, grouped by key. A reducer performs an aggregation operation on each group.
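
The two tasks above can be sketched in plain Python. This is a single-process illustration only, not the Hadoop API: the word-count task, the sample data, and the names `mapper`, `shuffle`, `reducer` and `run_job` are assumptions made for the example. In a real job the chunks would be processed in parallel on different nodes, and the framework itself would handle the grouping step.

```python
from collections import defaultdict
from itertools import chain


def mapper(record):
    """Process one record and emit (key, value) pairs: one (word, 1) per word."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group mapper output by key (hypothetical stand-in for the framework's grouping)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Aggregate all values that share the same key."""
    return key, sum(values)


def run_job(chunks):
    # Each chunk stands in for a split of the input data handled by one mapper
    # task; here the chunks are processed sequentially for simplicity.
    mapped = chain.from_iterable(
        mapper(record) for chunk in chunks for record in chunk
    )
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Hypothetical sample input: two chunks of text records.
chunks = [
    ["hadoop stores data", "mapreduce processes data"],
    ["hadoop runs mapreduce"],
]
counts = run_job(chunks)
print(counts)
```

Here each word is the key and the count `1` is the value emitted by the mapper; the reducer sums the values of each group to obtain the number of occurrences per word.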

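The input and the final output of a job are stored in HDFS, and intermediate results are written to disk between processing steps. If a task fails, MapReduce can re-run it from the previously persisted data rather than restarting the whole job. This makes the system fault tolerant.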

Related articles

Internship in Big Data infrastructure with TDP

Categories: Infrastructure, Learning | Tags: Cyber Security, DevOps, Java, Hadoop, IaC, Internship, TDP

Job Description Big Data and distributed computing is at Adaltas’ core. We support our partners in the deployment, maintenance and optimization of some of France’s largest clusters. Adaltas is also an…

By Daniel HARTY

Oct 25, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

Mar 22, 2021

Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)

Categories: Big Data, Cloud Computing, Data Engineering | Tags: NiFi, Hadoop, HDFS, Authentication, Authorization, Azure, Azure Data Lake Storage (ADLS), OAuth2

As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on premise Big Data platform. Microsoft Azure recently…

By Gauthier LEONARD

Nov 5, 2020

Comparison of different file formats in Big Data

Categories: Big Data, Data Engineering | Tags: Business intelligence, Data structures, Avro, HDFS, ORC, Parquet, Batch processing, Big Data, CSV, JavaScript Object Notation (JSON), Kubernetes, Protocol Buffers

In data processing, there are different types of file formats to store your data sets. Each format has its own pros and cons depending upon the use cases and exists to serve one or several purposes…

By Aida NGOM

Jul 23, 2020

Hadoop Ozone part 1: an introduction of the new filesystem

Categories: Infrastructure | Tags: HDFS, Ozone, Cluster, Kubernetes

Hadoop Ozone is an object store for Hadoop. It is designed to scale to billions of objects of varying sizes. It is currently in development. The roadmap is available on the project wiki. This article…

Apache Hadoop YARN 3.0 – State of the union

Categories: Big Data, DataWorks Summit 2018 | Tags: GPU, Hortonworks, Hadoop, HDFS, MapReduce, YARN, Cloudera, Data Science, Docker, Release and features

This article covers the “Apache Hadoop YARN: state of the union” talk given by Wangda Tan from Hortonworks during the DataWorks Summit 2018. What is Apache YARN? As a reminder, YARN is one of the two…

By Lucas BAKALIAN

May 31, 2018

Options to connect and integrate Hadoop with Oracle

Categories: Data Engineering | Tags: Database, Java, Oracle, R, RDBMS, Avro, HDFS, Hive, MapReduce, Sqoop, NoSQL, SQL

I will list the different tools and libraries available to us developers in order to integrate Oracle and Hadoop. The Oracle SQL Connector for HDFS described below is covered in a follow up article…

By David WORMS

May 15, 2013

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: Thrift, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce, Data Analytics

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…

By David WORMS

Jul 19, 2012

MapReduce introduction

Categories: Big Data | Tags: Java, MapReduce, Big Data, JavaScript

Information systems have more and more data to store and process. Companies like Google, Facebook, Twitter and many others store astronomical amounts of information from their customers and must be…

By David WORMS

Jun 26, 2010
