Hadoop and R with RHadoop


RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. RHadoop is built out of 3 components which are R packages: rmr, rhdfs and rhbase. Below, we will present each of those R packages and cover their installation and basic usage.

Note: This article was updated on August 2nd, 2012 to reflect version 1.3.1 of rmr and version 1.0.4 of rhdfs.


The rmr package

The rmr package brings Hadoop MapReduce functionality to R. For Hadoop users, writing MapReduce programs in R can be easier, more productive, and more elegant than in Java, with much less code and simpler deployment; it is great for prototyping and research. For R users, it opens the door to MapReduce programming and to Big Data analysis.

The rmr package should not be confused with Hadoop Streaming, even though it relies on the streaming architecture internally. You can do Hadoop Streaming with R without any of these packages, since the language supports reading from stdin and writing to stdout. Also, rmr programs are not meant to be more efficient than those written in Java or other languages.
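To illustrate that point, here is a minimal sketch of a word-count mapper written as a plain R script for raw Hadoop Streaming, without any RHadoop package. The `tokenize` helper and the whitespace-splitting rule are assumptions of this example, not part of rmr:

```r
#!/usr/bin/env Rscript
# Word-count mapper for raw Hadoop Streaming: Hadoop pipes each input
# split to stdin and collects "key<TAB>value" lines from stdout.

# Split one input line into words (assumption: whitespace-separated tokens).
tokenize <- function(line) {
  words <- unlist(strsplit(line, "[[:space:]]+"))
  words[nchar(words) > 0]
}

con <- file("stdin", open = "r")
while (length(line <- readLines(con, n = 1)) > 0) {
  for (word in tokenize(line)) {
    cat(word, "\t1\n", sep = "")   # emit "word<TAB>1" for each occurrence
  }
}
close(con)
```

Such a script would be passed to Hadoop with `hadoop jar $HADOOP_STREAMING -mapper mapper.R ...`, together with a matching reducer.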

Finally, from the wiki:

rmr does not provide a map reduce version of any of the more than 3000 packages available for R. It does not solve the problem of parallel programming. You still have to write parallel algorithms for any problem you need to solve, but you can focus only on the interesting aspects. Some problems are believed not to be amenable to a parallel solution and using the map reduce paradigm or rmr does not create an exception.


The rhdfs package

The rhdfs package offers basic connectivity to the Hadoop Distributed File System (HDFS). It comes with convenient functions to browse, read, write, and modify files stored in HDFS.
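As a sketch of what a session looks like (assuming a reachable HDFS and the `HADOOP_CMD` variable set; the file and directory paths are illustrative):

```r
# Basic rhdfs session: initialize, browse, copy a file in and read it back.
library(rhdfs)
hdfs.init()                                # connect using the Hadoop configuration

hdfs.ls("/tmp")                            # browse a directory
hdfs.mkdir("/tmp/rhdfs-demo")              # create a directory
hdfs.put("local.txt", "/tmp/rhdfs-demo")   # copy a local file into HDFS

f <- hdfs.file("/tmp/rhdfs-demo/local.txt", "r")   # open a read handle
raw.bytes <- hdfs.read(f)                  # returns a raw vector
print(rawToChar(raw.bytes))
hdfs.close(f)

hdfs.rm("/tmp/rhdfs-demo")                 # clean up
```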


The rhbase package

The rhbase package offers basic connectivity to HBase. It comes with convenient functions to browse, read, write, and modify tables stored in HBase.
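As a sketch (assuming the HBase Thrift server runs on localhost:9090, the default port; the table and column names are illustrative, and the exact shape of the `hb.insert` mutation list should be checked against the package documentation):

```r
# Basic rhbase session: initialize, create a table, write and read a cell.
library(rhbase)
hb.init(host = "localhost", port = 9090)

hb.list.tables()                     # browse existing tables
hb.new.table("demo", "cf")           # create a table with one column family

# Insert one value: row key, column, value (assumed mutation shape)
hb.insert("demo", list(list("row1", "cf:greeting", list("hello"))))
hb.get("demo", "row1")               # read the row back

hb.delete.table("demo")              # clean up
```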


Prerequisites

You must have a working installation of Hadoop at your disposal. The setup below is recommended for, and was tested with, the Cloudera CDH3 distribution; consult the RHadoop wiki for alternative installations and future evolutions. At the time of this writing, the Cloudera CDH4 distribution is not yet compatible and is documented as a work in progress. All the common Hadoop services must be started, as well as the HBase Thrift server if you want to test rhbase.
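Before installing anything, it is worth checking that the cluster answers. A quick sanity check could look like the following; the Thrift service name is the one used by CDH3 packages and may differ on other distributions:

```shell
# HDFS reachable?
hadoop fs -ls /

# MapReduce (JobTracker) reachable? (MapReduce v1, as in CDH3)
hadoop job -list

# Start the HBase Thrift server, needed by rhbase (CDH3 service name)
/etc/init.d/hadoop-hbase-thrift start
```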

Below I have translated my Chef recipes into shell commands. Please contact me directly if you find an error or wish to read the original Chef recipes.

Thrift dependency

Note: if my memory is accurate, Maven (apt-get install maven2) might also be required.

apt-get install libboost-dev libevent-dev libtool flex bison g++ automake pkg-config
apt-get install libboost-test-dev
apt-get install libmono-dev ruby1.8-dev libcommons-lang-java php5-dev
cd /tmp
curl http://apache.multidist.com/thrift/0.8.0/thrift-0.8.0.tar.gz | tar zx
cd thrift-0.8.0
./configure
make
make install
cd /tmp
rm -rf thrift-0.8.0

R installation, environment and package dependencies

# R installation
apt-get install r-base

# External package repository, go to the R site to find your nearest location
echo 'options(repos=structure(c(CRAN="http://cran.fr.r-project.org")))' >> /etc/R/Rprofile.site

# Global Java variable
cat > /etc/profile.d/java.sh <<DELIM
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
DELIM
. /etc/profile.d/java.sh

# Global Hadoop variables
cat > /etc/profile.d/r.sh <<DELIM
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/etc/hadoop/conf
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar
DELIM
. /etc/profile.d/r.sh

# R package dependencies
apt-get install r-cran-rjava
apt-get install r-cran-rcpp
Rscript -e 'install.packages("RJSONIO");'
Rscript -e 'install.packages("itertools");'
Rscript -e 'install.packages("digest");'


cd /tmp
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz -o rmr_1.3.1.tar.gz
R CMD INSTALL rmr_1.3.1.tar.gz
rm -rf rmr_1.3.1.tar.gz


cd /tmp
curl -L http://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz -o rhdfs_1.0.4.tar.gz
R CMD INSTALL rhdfs_1.0.4.tar.gz
rm -rf rhdfs_1.0.4.tar.gz


cd /tmp
# Check Thrift: the command below should print -I/usr/local/include/thrift
# pkg-config --cflags thrift
cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
# Compile rhbase
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz -o rhbase_1.0.4.tar.gz
R CMD INSTALL rhbase_1.0.4.tar.gz
rm -rf rhbase_1.0.4.tar.gz


Testing the installation

We are now ready to test our installation. Let's use the second example from the tutorial on the RHadoop wiki. This example starts with a standard R script which generates a list of values and counts their occurrences:

groups = rbinom(100, n = 500, prob = 0.5)
tapply(groups, groups, length)

It then translates this script into a scalable MapReduce job:

groups = rbinom(100, n = 500, prob = 0.5)
groups = to.dfs(groups)
result = mapreduce(
    input = groups,
    map = function(k,v) keyval(v, 1),
    reduce = function(k,vv) keyval(k, length(vv)))

The result is now stored inside the ‘/tmp’ folder of HDFS. Here are two commands to print the file path and the file content:

# Print the HDFS file path
print(result)
# Show the file content
print(from.dfs(result, to.data.frame=T))