Apache Hadoop is a system for building shared storage and processing infrastructures for large volumes of data (multiple terabytes or petabytes). Hadoop clusters are used by a wide range of projects for a growing number of web players (Yahoo!, EBay, Facebook, LinkedIn, Twitter) and their size continues to increase. Yahoo! has 45,000 machines with the largest cluster of 4,000 servers and 40 PB while Facebook reported to store 20 PBs on the same HDFS cluster (for Hadoop Distributed File System).
Dotcoms were the first companies to see their data volume grow exponentially. Many have based their business model on the processing of these data. Both Google and Facebook derive most of their income from data analysis for advertising purposes. Unable to wait for traditional software editors, these companies have invested heavily in the development of new softwares to cope with this explosion while exploiting new concepts. Today, thanks to the Open Source model, these technologies are present in a large number of industries and become a key component of many government companies and services.
Hadoop is the open source implementation of two Google papers. The first, published in 2003, describes the architecture of GFS (for Google Distributed Filesystem). The second, published in 2004, introduces the Map-Reduce paradigm. At that time, Doug Cutting, today at Cloudera, was working on Nutch, an open source software of the Apache Foundation including an Internet vacuum cleaner and a search engine. Nutch’s storage and computing requirements have led to the implementation of Google’s work into what will become Hadoop.