Hadoop and R with RHadoop
By David WORMS
Jul 19, 2012
RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers. RHadoop is built out of 3 components which are R packages: rmr, rhdfs and rhbase. Below, we will present each of those R packages and cover their installation and basic usage.
Note: This article has been updated on August 2nd, 2012 to reflect version 1.3.1 of rmr and version 1.0.4 of rhdfs.
The rmr package offers Hadoop MapReduce functionality in R. For Hadoop users, writing MapReduce programs in R may be easier, more productive, and more elegant than writing them in Java, with much less code and simpler deployment. It is great for prototyping and research. For R users, it opens the door to MapReduce programming and to Big Data analysis.
The rmr package must not be confused with Hadoop streaming, even if it uses the streaming architecture internally. You can do Hadoop streaming with R without any of those packages, since the language supports reading from stdin and writing to stdout. Also, rmr programs are not meant to be more efficient than those written in Java or other languages.
Finally, from the wiki:
rmr does not provide a map reduce version of any of the more than 3000 packages available for R. It does not solve the problem of parallel programming. You still have to write parallel algorithms for any problem you need to solve, but you can focus only on the interesting aspects. Some problems are believed not to be amenable to a parallel solution and using the map reduce paradigm or rmr does not create an exception.
The rhdfs package offers basic connectivity to the Hadoop Distributed File System. It comes with convenient functions to browse, read, write, and modify files stored in HDFS.
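To give an idea of what this looks like in practice, here is a minimal sketch of an rhdfs session. It assumes the HADOOP_CMD environment variable is set and a cluster is reachable; the function names follow the rhdfs 1.0.x API, but check the package documentation for the exact signatures in your version.

```r
require(rhdfs)
# Initialize the connection to HDFS (reads HADOOP_CMD and the Hadoop configuration)
hdfs.init()
# Browse a directory
print(hdfs.ls("/tmp"))
# Copy a local file into HDFS, then back out
hdfs.put("/etc/hostname", "/tmp/hostname")
hdfs.get("/tmp/hostname", "/tmp/hostname.copy")
# Remove the test file from HDFS
hdfs.rm("/tmp/hostname")
```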
The rhbase package offers basic connectivity to HBase. It comes with convenient functions to browse, read, write, and modify tables stored in HBASE.
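As a hedged sketch of an rhbase session, assuming the HBase Thrift server is running on its default port 9090 on the local machine; the function names follow the rhbase 1.0.x API described on the RHadoop wiki, and the table name "test_rhbase" is just an example:

```r
require(rhbase)
# Connect to the HBase Thrift server (default port 9090)
hb.init(host = "127.0.0.1", port = 9090)
# List the existing tables
print(hb.list.tables())
# Create, then drop, a table with one column family
hb.new.table("test_rhbase", "cf")
hb.delete.table("test_rhbase")
```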
You must have a working Hadoop installation at your disposal. RHadoop is recommended and tested with the Cloudera CDH3 distribution. Consult the RHadoop wiki for alternative installations and future evolutions. At the time of this writing, the Cloudera CDH4 distribution is not yet compatible and is documented as a work in progress. All the common Hadoop services must be started, as well as the HBase Thrift server if you want to test rhbase.
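One possible way to check these prerequisites on a CDH3 node is sketched below; the service script name is the CDH3 default and may differ on other distributions or versions:

```shell
# Check that the main Hadoop daemons are running
jps | egrep 'NameNode|DataNode|JobTracker|TaskTracker'
# On CDH3, start the HBase Thrift server through its init script
/etc/init.d/hadoop-hbase-thrift start
# The Thrift server listens on port 9090 by default
netstat -ltn | grep 9090
```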
Below I have translated my Chef recipes into shell commands. Please contact me directly if you find an error or wish to read the original Chef recipes.
Note: if my memory is accurate, Maven (apt-get install maven2) might also be required.
apt-get install libboost-dev libevent-dev libtool flex bison g++ automake pkg-config
apt-get install libboost-test-dev
apt-get install libmono-dev ruby1.8-dev libcommons-lang-java php5-dev
cd /tmp
curl http://apache.multidist.com/thrift/0.8.0/thrift-0.8.0.tar.gz | tar zx
cd thrift-0.8.0
./configure
make
make install
cd ..
rm -rf thrift-0.8.0
R installation, environment and package dependencies
# R installation
apt-get install r-base
# External package repository, go to the R site to find your nearest location
echo 'options(repos=structure(c(CRAN="http://cran.fr.r-project.org")))' >> /etc/R/Rprofile.site
# Global Java variable
cat > /etc/profile.d/java.sh <<DELIM
export JAVA_HOME=/usr/lib/jvm/java-6-sun/jre
DELIM
. /etc/profile.d/java.sh
# Global Hadoop variables
cat > /etc/profile.d/r.sh <<DELIM
export HADOOP_HOME=/usr/lib/hadoop
export HADOOP_CONF=/etc/hadoop/conf
export HADOOP_CMD=/usr/bin/hadoop
export HADOOP_STREAMING=/usr/lib/hadoop/contrib/streaming/hadoop-streaming-0.20.2-cdh3u4.jar
DELIM
. /etc/profile.d/r.sh
# R package dependencies
apt-get install r-cran-rjava
apt-get install r-cran-rcpp
Rscript -e 'install.packages("RJSONIO")'
Rscript -e 'install.packages("itertools")'
Rscript -e 'install.packages("digest")'
cd /tmp
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rmr_1.3.1.tar.gz -o rmr_1.3.1.tar.gz
R CMD INSTALL rmr_1.3.1.tar.gz
rm -f rmr_1.3.1.tar.gz
cd /tmp
curl -L http://github.com/downloads/RevolutionAnalytics/RHadoop/rhdfs_1.0.4.tar.gz -o rhdfs_1.0.4.tar.gz
R CMD INSTALL rhdfs_1.0.4.tar.gz
rm -f rhdfs_1.0.4.tar.gz
cd /tmp
# Check Thrift, the command below should print -I/usr/local/include/thrift
# pkg-config --cflags thrift
cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
# Compile rhbase
curl -L https://github.com/downloads/RevolutionAnalytics/RHadoop/rhbase_1.0.4.tar.gz -o rhbase_1.0.4.tar.gz
R CMD INSTALL rhbase_1.0.4.tar.gz
rm -f rhbase_1.0.4.tar.gz
We are now ready to test our installation. Let’s use the second example from the tutorial on the RHadoop wiki. This example starts with a standard R script which generates a list of values and counts their occurrences:
groups = rbinom(100, n = 500, prob = 0.5)
tapply(groups, groups, length)
It then translates that script into a scalable MapReduce script:
require('rmr')
groups = rbinom(100, n = 500, prob = 0.5)
groups = to.dfs(groups)
result = mapreduce(
  input = groups,
  map = function(k, v) keyval(v, 1),
  reduce = function(k, vv) keyval(k, length(vv)))
The result is now stored inside the ‘/tmp’ folder of HDFS. Here are two commands to print the file path and the file content:
# Print the HDFS file path
print(result())
# Show the file content
print(from.dfs(result, to.data.frame = T))