Build your open source Big Data distribution with Hadoop, HBase, Spark, Hive & Zeppelin

By Leo SCHOUKROUN

Dec 18, 2020

The Hadoop ecosystem gave birth to many popular projects including HBase, Spark and Hive. While technologies like Kubernetes and S3-compatible object storage are growing in popularity, HDFS and YARN hold their place for on-premise and large use cases. Interested in compiling your own Hadoop-based distribution from source? This article covers the peculiarities of the build process and teaches you how to do it.

Following our previous articles Installing Hadoop from source and Rebuilding HDP Hive, this article enriches the series and looks into the process of building multiple open source Big Data projects while managing the dependencies they have with one another. The goal is to create a “mini” Big Data distribution around Hadoop-based components by building the projects from source and making the builds dependent on one another. In this article, the selected projects are Apache Hadoop, Apache HBase, Apache Spark, Apache Hive and Apache Zeppelin.

Why did we pick these projects? Hadoop gives both a distributed filesystem and resource scheduler with HDFS and YARN. HBase is a reference project for scalable NoSQL storage. Hive brings a SQL layer on top of Hadoop familiar to developers and with a JDBC/ODBC interface for analytics. Spark is perfect for in-memory compute and data transformation. Zeppelin offers a user-friendly web-based interface to interact with all the previous components.

Accurate instructions on building Apache projects are sometimes difficult to find. For example, Apache Hive’s build instructions are outdated.

This article goes over the main concepts of building different Apache projects. Every command is shared and all the steps are reproducible.

Project versions

All the projects given above are independent. However, they are designed to work together and they track each other’s important features. For example, Hive 3 is best used with Hadoop 3. It is important to be consistent when picking the versions we will build. All the official Apache releases are marked as tags in their Git repository, and the tag naming scheme differs from project to project.

The versions selected to build our platform are summarized in the following table:

Component       | Version | Git tag
Apache Hadoop   | 3.1.1   | rel/release-3.1.1
Apache HBase    | 2.2.3   | rel/2.2.3
Apache Spark    | 2.4.0   | v2.4.0
Apache Hive     | 3.1.2   | rel/release-3.1.2
Apache Zeppelin | 0.8.2   | v0.8.2

Let’s start by cloning the repositories of all the projects and checking out the tags we targeted.

git clone --branch rel/release-3.1.1 https://github.com/apache/hadoop.git
git clone --branch rel/2.2.3 https://github.com/apache/hbase.git
git clone --branch rel/release-3.1.2 https://github.com/apache/hive.git
git clone --branch v2.4.0 https://github.com/apache/spark.git
git clone --branch v0.8.2 https://github.com/apache/zeppelin.git

Note: For this article, we will build the project versions “as is”. If you want to know how to apply a patch, test and build a release, check out our previous article Installing Hadoop from source.

Versioned distribution

A distribution is a collection of multiple components that are built, assembled, and configured as a logical unit. Each component is identified by a name and a version. The distribution is a unique combination of the embedded components. This unique combination is also identified by a name and a version.

To reflect the component and distribution naming pattern, the final name of a component, which we call the release name, looks like {project_name}-{project_version}-{distrib_name}-{distrib_version}.tar.gz, for example hadoop-3.1.1-mydistrib-0.1.0.tar.gz.
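
As a minimal sketch of this naming pattern, the two helper functions below compose the release name and the matching Maven version string. The function names release_name and release_version are ours and purely illustrative.

```shell
# Compose the release name of a component from the naming pattern.
# Arguments: project name, project version, distribution name, distribution version.
release_name() {
  printf '%s-%s-%s-%s.tar.gz\n' "$1" "$2" "$3" "$4"
}

# The version string passed to Maven is the same name without the
# project prefix and without the archive extension.
# Arguments: project version, distribution name, distribution version.
release_version() {
  printf '%s-%s-%s\n' "$1" "$2" "$3"
}

release_name hadoop 3.1.1 mydistrib 0.1.0
release_version 3.1.1 mydistrib 0.1.0
```

The archives produced later in this article follow the pattern loosely: some projects add their own prefix or suffix (for example -bin) or use the .tgz extension.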

The release name is defined in the Maven pom.xml definition files of all the project submodules. They must all reflect the new release name. To update all the files consistently and easily, we use the following mvn command several times throughout the article:

mvn versions:set -DgenerateBackupPoms=false -DnewVersion=3.1.1-mydistrib-0.1.0

The -DgenerateBackupPoms=false parameter is optional: it avoids the generation of pom.xml.versionsBackup files after the update.

Once the pom.xml files are up to date, we can build and package each project following their respective instructions.

Build environment

To obtain a reproducible build environment, all the builds described in this article are run within a Docker container. The Apache Hadoop project provides a Docker image that contains everything needed to build Hadoop. We use this image to build all the projects in our distribution.

To start the build environment, the hadoop Git repository provides the start-build-env.sh script in the project root folder.

Navigate to the hadoop directory we just cloned and run:

./start-build-env.sh

 _   _           _                    ______
| | | |         | |                   |  _  \
| |_| | __ _  __| | ___   ___  _ __   | | | |_____   __
|  _  |/ _` |/ _` |/ _ \ / _ \| '_ \  | | | / _ \ \ / /
| | | | (_| | (_| | (_) | (_) | |_) | | |/ /  __/\ V /
\_| |_/\__,_|\__,_|\___/ \___/| .__/  |___/ \___| \_(_)
                              | |
                              |_|

This is the standard Hadoop Developer build environment.
This has all the right tools installed required to build
Hadoop from source.

Important: The docker run command inside start-build-env.sh mounts the project repository as a local volume. By default, only the hadoop repository is mounted. To mount the other repositories, modify the ./start-build-env.sh script and mount the parent folder. Thus, all the cloned repositories will be available within the container.
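
The change can be sketched with sed. This is only an illustration under a loud assumption: that the script mounts the repository with a flag starting with -v "${PWD}: (check your copy of start-build-env.sh, the exact line may differ). The function name mount_parent_dir and the stand-in file are ours, and the sketch prints the rewritten script rather than editing it in place.

```shell
# Print a copy of a start-build-env.sh-like script in which the parent
# directory is mounted instead of the hadoop repository alone.
# ASSUMPTION: the volume flag starts with  -v "${PWD}:  in the real script.
mount_parent_dir() {
  sed 's|-v "\${PWD}:|-v "$(dirname "${PWD}"):|' "$1"
}

# Demonstration on a stand-in file, not on the real script.
demo=$(mktemp)
cat > "$demo" <<'EOF'
docker run --rm=true -t -i \
  -v "${PWD}:/home/${USER_NAME}/hadoop" \
  "hadoop-build-${USER_ID}"
EOF
mount_parent_dir "$demo"
rm -f "$demo"
```

With such a change, the directory that used to contain only the Hadoop sources inside the container would contain every cloned repository as a subfolder.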

Build a custom Hadoop release

Apache Hadoop (HDFS/YARN) is a dependency of all the other projects in our distribution, so we start by building it.

We want to be able to differentiate our version of Apache Hadoop from the official release. Let’s do that by changing the name of the version. We use the versions:set subcommand to update the pom.xml declaration files:

mvn versions:set -DgenerateBackupPoms=false -DnewVersion=3.1.1-mydistrib-0.1.0

We can then build a release of our mydistrib-0.1.0 Hadoop version with:

mvn clean install -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip=true

-Pdist activates the dist profile defined in hadoop-project-dist/pom.xml and -Dtar sets the property that requests an archive. Together, they launch a script after the compilation that copies all the necessary files (JARs, configuration files, etc.) and makes a .tar.gz archive. Why did we use install and not package? The question is answered later in this article, in the Spark section.

Once the build is done, the archive is located on your host machine at:

./hadoop-dist/target/hadoop-3.1.1-mydistrib-0.1.0.tar.gz

The archive is available outside of the container because the directory is mounted from the host (see ./start-build-env.sh).

What comes next? Hive has a dependency on Hadoop, HBase, and Spark whereas Spark and HBase only depend on Hadoop. We should build HBase and Spark next.
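
The build order follows from the dependencies between the projects. As a sketch, the coreutils tsort command can derive a valid order from a list of pairs. The Zeppelin pairs are our reading of the Hadoop and Spark options passed to its build later in this article, and the lowercase project names are only labels.

```shell
# Each line reads "<dependency> <dependent>": the left project must be built first.
build_order() {
  tsort <<'EOF'
hadoop hbase
hadoop spark
hadoop hive
hbase hive
spark hive
hadoop zeppelin
spark zeppelin
EOF
}

# Print one valid build order, one project per line.
build_order
```

Several orders satisfy these constraints (HBase and Spark can be swapped, for instance), so the exact output may vary while Hadoop always comes first.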

Build a custom HBase release

Before building Apache HBase from source, we must not forget to change the name and the version of the release to differentiate it from the Apache distribution:

mvn versions:set -DgenerateBackupPoms=false -DnewVersion=2.2.3-mydistrib-0.1.0

We must ensure that the HBase build refers to our version of the Hadoop dependency.

The version of Hadoop used for building is defined in the root pom.xml file of the project. We need to replace the default Hadoop version from the “hadoop-3.0” profile:

<profile>
  <id>hadoop-3.0</id>
  <activation>
    <property>
      <name>hadoop.profile</name>
      <value>3.0</value>
    </property>
  </activation>
  [...]
  <properties>
    <hadoop.version>${hadoop-three.version}</hadoop.version>

To change this value for the build, we can either modify the pom.xml above permanently or override the hadoop-three.version property, which feeds hadoop.version, at build time. Let’s go with the second option:

mvn clean \
  -DskipTests -Dhadoop.profile=3.0 -Dhadoop-three.version=3.1.1-mydistrib-0.1.0 \
  package assembly:single install

Notice how the parameter to package the JARs in an archive (assembly:single) is different from the one we used for Hadoop in the previous section. Every project has its own way of functioning when it comes to release making.

We can find the HBase release at ./hbase-assembly/target/hbase-2.2.3-mydistrib-0.1.0-bin.tar.gz. Let’s make sure it embeds the correct Hadoop JARs by checking what’s inside:

cp ./hbase-assembly/target/hbase-2.2.3-mydistrib-0.1.0-bin.tar.gz /tmp
cd /tmp
tar -xvzf hbase-2.2.3-mydistrib-0.1.0-bin.tar.gz
cd hbase-2.2.3-mydistrib-0.1.0
find . -name "*hadoop*"
[...]
./lib/hadoop-yarn-client-3.1.1-mydistrib-0.1.0.jar
./lib/hadoop-mapreduce-client-core-3.1.1-mydistrib-0.1.0.jar
./lib/hadoop-hdfs-3.1.1-mydistrib-0.1.0.jar
./lib/hadoop-auth-3.1.1-mydistrib-0.1.0.jar
[...]

We can find the previously built Hadoop JARs inside the archive. This means that the “mydistrib” version of HBase is dependent on the “mydistrib” version of Hadoop which is what we wanted to achieve.
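
The same check can be scripted without extracting the archive. The function below is a minimal sketch: the name hadoop_jars_match is ours, and it assumes that the Hadoop JARs are the files whose name starts with hadoop- inside the archive.

```shell
# Succeed only if the archive contains Hadoop JARs and all of them carry
# the expected version string in their file name.
# Arguments: path to a .tar.gz archive, expected Hadoop version.
hadoop_jars_match() {
  # List the entries and keep the JARs whose file name starts with "hadoop-".
  jars=$(tar -tzf "$1" | grep '/hadoop-[^/]*\.jar$') || return 1
  # Fail if at least one of them does not contain "-<version>".
  ! printf '%s\n' "$jars" | grep -v -F -- "-$2" | grep -q .
}
```

For example, hadoop_jars_match ./hbase-assembly/target/hbase-2.2.3-mydistrib-0.1.0-bin.tar.gz 3.1.1-mydistrib-0.1.0 should succeed when the build picked up our Hadoop version.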

Build a custom Spark release

As for HBase, we need to make sure that the Apache Spark distribution we will make has a dependency on our version of Hadoop.

In the pom.xml file of Spark, we can see that the Hadoop dependencies are defined as profiles. The one we are looking for is in lines 2691-2698 of the file:

<profile>
  <id>hadoop-3.1</id>
  <properties>
    <hadoop.version>3.1.0</hadoop.version>
    <curator.version>2.12.0</curator.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
</profile>

We need to change the hadoop.version value to 3.1.1-mydistrib-0.1.0. We can do it with the sed command this time:

sed -i "s/<hadoop.version>3.1.0<\\/hadoop.version>/<hadoop.version>3.1.1-mydistrib-0.1.0<\\/hadoop.version>/" pom.xml

Which gives the following result:

<profile>
  <id>hadoop-3.1</id>
  <properties>
    <hadoop.version>3.1.1-mydistrib-0.1.0</hadoop.version>
    <curator.version>2.12.0</curator.version>
    <zookeeper.version>3.4.9</zookeeper.version>
  </properties>
</profile>
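
A sed substitution that matches nothing exits silently, so a typo in the old version would leave pom.xml untouched without warning. The wrapper below is a minimal sketch that guards against this. The function name set_pom_property is ours, and it relies on the -i option of GNU sed, as the command above does.

```shell
# Replace <name>old</name> with <name>new</name> in a pom.xml and fail
# loudly when the old value is not present.
# Arguments: file, property name, old value, new value.
set_pom_property() {
  file=$1 name=$2 old=$3 new=$4
  grep -q -F "<$name>$old</$name>" "$file" || {
    echo "no <$name>$old</$name> in $file" >&2
    return 1
  }
  # '|' is the sed delimiter, so the '/' of the closing tag needs no escaping.
  # The dots are treated as regex wildcards, which is harmless here because
  # the exact string was checked just above.
  sed -i "s|<$name>$old</$name>|<$name>$new</$name>|" "$file"
}
```

It would be called as set_pom_property pom.xml hadoop.version 3.1.0 3.1.1-mydistrib-0.1.0.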

Of course, the 3.1.1-mydistrib-0.1.0 JAR files are not available in the public Maven Repository but the mvn install command that we ran copied the compiled JAR files to our local Maven repository which is located in ~/.m2 by default.

Let’s see if we can find these JARs:

ll ~/.m2/repository/org/apache/hadoop/hadoop-hdfs-client/3.1.1-mydistrib-0.1.0 
total 6.4M
-rw-r--r-- 1 leo leo 4.8M Oct 14 17:55 hadoop-hdfs-client-3.1.1-mydistrib-0.1.0.jar
-rw-r--r-- 1 leo leo 5.9K Oct 14 17:53 hadoop-hdfs-client-3.1.1-mydistrib-0.1.0.pom
-rw-r--r-- 1 leo leo 1.4M Oct 14 17:55 hadoop-hdfs-client-3.1.1-mydistrib-0.1.0-sources.jar
-rw-r--r-- 1 leo leo 153K Oct 14 17:55 hadoop-hdfs-client-3.1.1-mydistrib-0.1.0-tests.jar
-rw-r--r-- 1 leo leo  94K Oct 14 17:55 hadoop-hdfs-client-3.1.1-mydistrib-0.1.0-test-sources.jar

The build container and the local host share the same ~/.m2 directory (see the ./start-build-env.sh script of Hadoop).
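
The path used above follows the standard layout of a Maven repository: the dots of the groupId become directories, followed by the artifactId and the version. A small sketch, with a function name of our own choosing:

```shell
# Map Maven coordinates to the artifact directory in a local repository.
# Arguments: groupId, artifactId, version, optional repository root
# (defaults to ~/.m2/repository).
m2_artifact_dir() {
  repo=${4:-$HOME/.m2/repository}
  printf '%s/%s/%s/%s\n' "$repo" "$(printf '%s' "$1" | tr . /)" "$2" "$3"
}

m2_artifact_dir org.apache.hadoop hadoop-hdfs-client 3.1.1-mydistrib-0.1.0
```

Checking that this directory exists before building a dependent project is a quick way to confirm that the install step of the previous build completed.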

Note: If we had run mvn package instead of mvn install for the Hadoop build, the JARs would not have been copied to .m2, resulting in the following error when trying to build HBase:

[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  01:09 min
[INFO] Finished at: 2020-12-02T10:42:46Z
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project hbase-hadoop2-compat: Could not resolve dependencies for project org.apache.hbase:hbase-hadoop2-compat:jar:2.2.3-mydistrib-0.1.0: Could not find artifact org.apache.hadoop:hadoop-hdfs-client:jar:3.1.1-mydistrib-0.1.0 in central (https://repo.maven.apache.org/maven2) -> [Help 1]

Again, we can change the version name for our release of Spark:

mvn versions:set -DgenerateBackupPoms=false -DnewVersion=2.4.0-mydistrib-0.1.0

Because Spark 2.x needs to be built with JDK 8 (see SPARK-25820 for details), we need to switch the Java version inside the container before building:

sudo update-java-alternatives --set /usr/lib/jvm/java-1.8.0-openjdk-amd64

Note: By default, OpenJDK 11 is the active version inside the build container. If you do not perform the step above, the build fails with the following compilation error:

[ERROR] import sun.misc.Cleaner;
[ERROR]                ^
[ERROR]   symbol:   class Cleaner
[ERROR]   location: package sun.misc

The Apache Spark project comes with a make-distribution.sh script which is essentially a wrapper around mvn. Let’s use this script to build Spark:

./dev/make-distribution.sh \
  --name my-release --tgz \
  -Phive -Phive-thriftserver -Pyarn -Phadoop-3.1

The -Phadoop-3.1 option loads the profile we just edited, while the other profiles are mandatory to use Spark on YARN and SparkSQL with Hive.

Once the build is done, the archive is available at:

./spark-2.4.0-mydistrib-0.1.0-bin-my-release.tgz

Build a custom Hive release

Moving on to Apache Hive. As declared in the project’s pom.xml file, Hive has both Hadoop and Spark as build dependencies. We need to change the *.version properties to match our previous builds:

sed -i "s/<hadoop.version>3.1.0<\\/hadoop.version>/<hadoop.version>3.1.1-mydistrib-0.1.0<\\/hadoop.version>/" pom.xml
sed -i "s/<spark.version>2.3.0<\\/spark.version>/<spark.version>2.4.0-mydistrib-0.1.0<\\/spark.version>/" pom.xml

The standalone-metastore Maven sub-module also has a dependency on Hadoop. Let’s not forget that one:

sed -i "s/<hadoop.version>3.1.0<\\/hadoop.version>/<hadoop.version>3.1.1-mydistrib-0.1.0<\\/hadoop.version>/" standalone-metastore/pom.xml

The build command for Hive is the following:

mvn clean install -Pdist -DskipTests

Once the build is done, the archive is available at:

./packaging/target/apache-hive-3.1.2-mydistrib-0.1.0-bin.tar.gz

Build a custom Zeppelin release

The final piece of our Big Data distribution: Apache Zeppelin. It is a notebook with a friendly web-based user interface.

For Zeppelin, it is a bit more complicated to make a release with custom resources than for the previous projects.

Zeppelin’s pom.xml file provides convenient profiles to build the project while choosing the versions of Hadoop and Spark which can be tweaked like the following in our case:

-Pspark-2.4 -Dspark.version=2.4.0-mydistrib-0.1.0 -Phadoop-3.1 -Dhadoop.version=3.1.1-mydistrib-0.1.0

On top of that, Zeppelin’s package stage is configured to fetch a built Spark release online and copy its python/lib/py4j-0.10.7-src.zip file into Zeppelin.

This step can be found at line 397 of spark/interpreter/pom.xml:

<configuration>
  <target>
    <delete dir="../../interpreter/spark/pyspark" />
    <copy file="${project.build.directory}/${spark.archive}/python/lib/py4j-${py4j.version}-src.zip" todir="${project.build.directory}/../../../interpreter/spark/pyspark" />
    <zip basedir="${project.build.directory}/${spark.archive}/python" destfile="${project.build.directory}/../../../interpreter/spark/pyspark/pyspark.zip" includes="pyspark/*.py,pyspark/**/*.py" />
  </target>
</configuration>

By default, it fetches the archive at https://archive.apache.org/dist/spark/${spark.archive}/${spark.archive}-bin-without-hadoop.tgz which is the value of spark.bin.download.url. We need to change it because the official Apache archive site will not be hosting our custom release. In our case, we decided to host it on a private Nexus repository. If we do not set this URL, the build fails with the following error:

[ERROR] Failed to execute goal com.googlecode.maven-download-plugin:download-maven-plugin:1.6.0:wget (download-sparkr-files) on project r: IO Error: Error while expanding /dataclan/zeppelin/rlang/target/spark-2.4.0-mydistrib-0.1.0-bin-without-hadoop.tgz: Not in GZIP format -> [Help 1]

Note: The default archive name says “without-hadoop” because we could have packaged Spark without the Hadoop/YARN libs (with the profile -Phadoop-provided). In this case, we can proceed with our previously built Spark distribution. Indeed, we only care about its pyspark files.
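
The “Not in GZIP format” message suggests that the file saved by the download step was not a gzip archive, probably because the server answered with something else, such as an error page. As a sketch, the hosted file can be validated before launching the Zeppelin build. The function name is_gzip is ours.

```shell
# Succeed only if the file is a valid gzip stream, which is what the
# build expects before it expands the Spark archive.
is_gzip() {
  gzip -t "$1" 2>/dev/null
}

# Possible usage, where SPARK_URL is a placeholder for the location of
# your hosted Spark release:
#   curl -fL -o /tmp/spark.tgz "$SPARK_URL" && is_gzip /tmp/spark.tgz
```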

The final build command is:

mvn clean package -DskipTests -Pbuild-distr -Pspark-2.4 -Dspark.version=2.4.0-mydistrib-0.1.0 -Phadoop-3.1 -Dhadoop.version=3.1.1-mydistrib-0.1.0 -Dspark.bin.download.url=https://nexus*****.com/repository/mydistrib-releases/mydistrib-0.1.0/spark-2.4.0-mydistrib-0.1.0.tgz

Once the build is done, the archive is available at:

./zeppelin-distribution/target/zeppelin-0.8.2-mydistrib-0.1.0.tar.gz

Conclusion

We have seen how to build a functional Big Data distribution including popular components like HDFS, Hive, and HBase. As stated in the introduction, the projects are built “as is” which means that no additional features are added to the official release and the build is not thoroughly tested. Where to go from here? If you are interested in seeing how to patch a project to make a new release, you can read our previous article Installing Hadoop from source in which we also take a glimpse at running its unit tests.
