Hadoop and R with RHadoop
By David WORMS
Jul 19, 2012
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. RHadoop is built out of 3 components which are R packages: rmr, rhdfs and rhbase. Below, we will present each of those R packages and cover their installation and basic usage.
Note: This article has been updated on August 2nd, 2012 to reflect version 1.3.1 of rmr and version 1.0.4 of rhdfs.
The rmr package offers Hadoop MapReduce functionality in R. For Hadoop users, writing MapReduce programs in R may be easier, more productive, and more elegant than writing them in Java, with much less code and simpler deployment. It is great for prototyping and research. For R users, it opens the door to MapReduce programming and to Big Data analysis.
The rmr package must not be confused with Hadoop streaming, even if it uses the streaming architecture internally. You can do Hadoop streaming with R without any of those packages, since the language supports reading from stdin and writing to stdout. Also, rmr programs are not meant to be more efficient than those written in Java or other languages.
Finally, from the wiki:
rmr does not provide a map reduce version of any of the more than 3000 packages available for R. It does not solve the problem of parallel programming. You still have to write parallel algorithms for any problem you need to solve, but you can focus only on the interesting aspects. Some problems are believed not to be amenable to a parallel solution and using the map reduce paradigm or rmr does not create an exception.
The rhdfs package offers basic connectivity to the Hadoop Distributed File System. It comes with convenient functions to browse, read, write, and modify files stored in HDFS.
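To give an idea of what this looks like in practice, here is a minimal sketch of an rhdfs session. It assumes the HADOOP_CMD environment variable is set and a cluster is reachable; the function names follow the rhdfs 1.0.x API, but check the package documentation for the exact signatures in your version.

```r
require(rhdfs)
# Initialize the connection to HDFS (reads HADOOP_CMD and the Hadoop configuration)
hdfs.init()
# Browse a directory
print(hdfs.ls("/tmp"))
# Copy a local file into HDFS, then back out
hdfs.put("/etc/hostname", "/tmp/hostname")
hdfs.get("/tmp/hostname", "/tmp/hostname.copy")
# Remove the test file from HDFS
hdfs.rm("/tmp/hostname")
```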
The rhbase package offers basic connectivity to HBase. It comes with convenient functions to browse, read, write, and modify tables stored in HBASE.
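As a hedged sketch of an rhbase session, assuming the HBase Thrift server is running on its default port 9090 on the local machine; the function names follow the rhbase 1.0.x API described on the RHadoop wiki, and the table name "test_rhbase" is just an example:

```r
require(rhbase)
# Connect to the HBase Thrift server (default port 9090)
hb.init(host = "127.0.0.1", port = 9090)
# List the existing tables
print(hb.list.tables())
# Create, then drop, a table with one column family
hb.new.table("test_rhbase", "cf")
hb.delete.table("test_rhbase")
```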
You must have a working Hadoop installation at your disposal. RHadoop is recommended and tested with the Cloudera CDH3 distribution. Consult the RHadoop wiki for alternative installations and future evolutions. At the time of this writing, the Cloudera CDH4 distribution is not yet compatible and is documented as a work in progress. All the common Hadoop services must be started, as well as the HBase Thrift server if you want to test rhbase.
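One possible way to check these prerequisites on a CDH3 node is sketched below; the service script name is the CDH3 default and may differ on other distributions or versions:

```shell
# Check that the main Hadoop daemons are running
jps | egrep 'NameNode|DataNode|JobTracker|TaskTracker'
# On CDH3, start the HBase Thrift server through its init script
/etc/init.d/hadoop-hbase-thrift start
# The Thrift server listens on port 9090 by default
netstat -ltn | grep 9090
```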
Below I have translated my Chef recipes into shell commands. Please contact me directly if you find an error or wish to read the original Chef recipes.
Note: if my memory is accurate, Maven (apt-get install maven2) might also be required.
apt-get install libboost-dev libevent-dev libtool flex bison g++ automake pkg-config
apt-get install libboost-test-dev
apt-get install libmono-dev ruby1.8-dev libcommons-lang-java php5-dev
cd /tmp
curl http://apache.multidist.com/thrift/0.8.0/thrift-0.8.0.tar.gz | tar zx
cd thrift-0.8.0
./configure
make
make install
cd ..
rm -rf thrift-0.8.0
R installation, environment and package dependencies
# R installation
apt-get install r-base
# External package repository, go to the R site to find your nearest location
echo 'options(repos=structure(c(CRAN="http://cran.fr.r-project.org")))' >> /etc/R/Rprofile.site
# Global Java variable
cat > /etc/profile.d/java.sh <<DELIM
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
DELIM
. /etc/profile.d/java.sh
# Global Hadoop variables
cat > /etc/profile.d/r.sh <<DELIM
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/etc/hadoop/conf
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar
DELIM
. /etc/profile.d/r.sh
# R package dependencies
apt-get install r-cran-rjava
apt-get install r-cran-rcpp
Rscript -e 'install.packages("RJSONIO")'
Rscript -e 'install.packages("itertools")'
Rscript -e 'install.packages("digest")'
cd /tmp
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz -o rmr_1.3.1.tar.gz
R CMD INSTALL rmr_1.3.1.tar.gz
rm -f rmr_1.3.1.tar.gz
cd /tmp
curl -L http://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz -o rhdfs_1.0.4.tar.gz
R CMD INSTALL rhdfs_1.0.4.tar.gz
rm -f rhdfs_1.0.4.tar.gz
cd /tmp
# Check Thrift, the command below should print -I/usr/local/include/thrift
# pkg-config --cflags thrift
cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
# Compile rhbase
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz -o rhbase_1.0.4.tar.gz
R CMD INSTALL rhbase_1.0.4.tar.gz
rm -f rhbase_1.0.4.tar.gz
We are now ready to test our installation. Let’s use the second example from the tutorial on the RHadoop wiki. This example starts with a standard R script which generates a list of values and counts their occurrences:
groups = rbinom(100, n = 500, prob = 0.5)
tapply(groups, groups, length)
It then translates that script into a scalable MapReduce script:
require('rmr')
groups = rbinom(100, n = 500, prob = 0.5)
groups = to.dfs(groups)
result = mapreduce(
  input = groups,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, length(vv)))
The result is now stored inside the ‘/tmp’ folder of HDFS. Here are two commands to print the file path and the file content:
# Print the HDFS file path
print(result())
# Show the file content
print(from.dfs(result, to.data.frame = T))