Connecting to ADLS Gen2 from Hadoop (HDP) and Nifi (HDF)

As data projects built in the Cloud are becoming more and more frequent, a common use case is to interact with Cloud storage from an existing on premise Big Data platform. Microsoft Azure recently introduced Data Lake Storage Gen2, that is built on top of Azure Blob and offers HDFS-like management of data on Azure. Because it is a quite a new product (GA on Feb. 2019), connecting to ADLS Gen2 from HDP and HDF is not yet supported in public releases.

In this article, we will see how to write data to ADLS Gen2 using:

Getting access to ADLS Gen2

Azure Data Lake Storage Gen2 supports the same authorization options than Azure Blob storage. In this article, we will access ADLS using a service principal, which leaves us with 3 options:

Using an OAuth access token to authenticate with user/password, we will use this method with DistCp
Using a Shared Account Key, what is equivalent to an admin password of the storage account and is not recommended
Using a Shared Access Signature (SAS), we will use this method with NiFi

Creating a service principal in Azure AD

First, we need to create a service principal in our Azure tenant. In fact, we will create an application object, that will have one service principal for each tenant on which it is used:

Open the Azure Portal
Go to the Azure Active Directory panel (you can find it using the search bar)
1. Go to App registrations → New registration
2. Just fill in the Name field (e.g. hdp-hdf-adls-app) and click Register
Once the application is created, it appears in your “Owned applications”
To be able to authenticate as the service principal, we will need a secret
1. Go to your-app → Certificates & secrets
2. Generate a new client secret and save it for later

Giving access to an ADLS storage account

In this article, we will give full access to the storage account to the service principal (for fine grained access see Manage access rights with RBAC):

Go to your storage account page → Access Control (IAM)
Assigne the role Storage Blob Data Contributor to your application

Copying data from HDP using DistCp

Connecting to ADLS Gen2 from Hadoop is only supported since Hadoop 3.2.0. Therefore it is not possible on any of the public HDP releases that only carry Hadoop up to version 3.1.1 without the feature backport.

Gathering needed binaries

To write data to ADLS Gen2, we will need to use:

The Hadoop 3.3.0 binaries for the Hadoop CLI and MapReduce2
The hadoop-azure jar
The wildfly-openssl jar for secured connection to ADLS

Use these commands to download and upload the hadoop binaries to HDFS (requires jq package):

# First create a directory to store all this stuff
mkdir distcp-to-adls
cd !$

# Fill in those properties
user=$USER
hadoop_version=3.3.0

# Download urls
hadoop_azure_jar="https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-azure/$hadoop_version/hadoop-azure-$hadoop_version.jar"
wildfly_openssl_jar=https://repo1.maven.org/maven2/org/wildfly/openssl/wildfly-openssl/1.0.7.Final/wildfly-openssl-1.0.7.Final.jar

# Create the HDFS working directory
hdfs_dir="/user/$user/.distcp-adls"
hdfs dfs -mkdir -p "$hdfs_dir"
hdfs dfs -rm -r -f "$hdfs_dir/*"

# Get Hadoop prefered mirror url
hadoop_path="hadoop/common/hadoop-$hadoop_version/hadoop-$hadoop_version.tar.gz"
hadoop_mirror=$(curl "https://www.apache.org/dyn/closer.cgi/$hadoop_path&asjson=1" | jq -r '.preferred')
hadoop_url="$hadoop_mirror$hadoop_path"

# Download Hadoop 3.3.0 binaries
mkdir lib
wget -P lib "$hadoop_url" && wget -P lib "$hadoop_azure_jar" && wget -P lib "$wildfly_openssl_jar"

# Rename root directory in Hadoop binary to match DistCp internal behavior
cd lib && tar -xzf "hadoop-$hadoop_version.tar.gz"
mv "hadoop-$hadoop_version" hadoop
tar -czf mapreduce.tar.gz hadoop && cd ..

# Put Hadoop binary in HDFS for use in DistCp
hdfs dfs -put lib/mapreduce.tar.gz "$hdfs_dir"

# Gather only needed Hadoop conf files
mkdir hadoop-conf
cp /etc/hadoop/conf/*-site.xml hadoop-conf

If everything worked fine, you have:

All the binaries in ./lib:

ls -1 ./lib
hadoop
hadoop-3.3.0.tar.gz
hadoop-azure-3.3.0.jar
mapreduce.tar.gz
wildfly-openssl-1.0.7.Final.jar

The Hadoop binaries as mapreduce.tar.gz in HDFS at /user/$user/.distcp-adls:

hdfs dfs -ls /user/leonardg/.distcp-adls
Found 1 items
-rw-r--r--   3 leonardg users  500735363 2020-09-24 19:14 /user/leonardg/.distcp-adls/mapreduce.tar.gz

Storing credentials in a Hadoop CredentialProvider

In order not to store our service principal credentials in plain text, let’s use a Hadoop CredentialProvider in which we will store the OAuth endpoint, the application (client) ID and the application secret:

# Fill in those properties (find them on Azure Portal)
tenant=mycompany.onmicrosoft.com
app_id=MY_APP_ID

# Don't store the app secret in your history
read -s app_secret

# Store credentials in a Hadoop credential provider
cred_provider="$hdfs_dir/adls2keyfile.jceks"
oauth_endpoint="https://login.microsoftonline.com/$tenant/oauth2/token"

hdfs dfs -rm -f "$cred_provider"
hadoop credential create fs.azure.account.oauth2.client.endpoint \
  -provider "jceks://hdfs$cred_provider" -value "$oauth_endpoint"
hadoop credential create fs.azure.account.oauth2.client.id \
  -provider "jceks://hdfs$cred_provider" -value "$app_id"
hadoop credential create fs.azure.account.oauth2.client.secret \
  -provider "jceks://hdfs$cred_provider" -value "$app_secret"

If everything worked fine you should see 3 properties in your CredentialProvider:

hadoop credential list -provider "jceks://hdfs$cred_provider"
Listing aliases for CredentialProvider: jceks://hdfs/user/leonardg/.distcp-adls/adls2keyfile.jceks
fs.azure.account.oauth2.client.id
fs.azure.account.oauth2.client.endpoint
fs.azure.account.oauth2.client.secret

Copying files with DistCp

Now that everything is setup, we can copy files from HDFS to ADLS with DistCp:

# Fill in those properties
user=$USER
hdfs_dir="/user/$user/.distcp-adls"
hdfs_cp_dir=/path/to/dir/in/hdfs
storage_account=ADLS_STORAGE_ACCOUNT
container=ADLS_CONTAINER
adls_cp_dir=path/to/copy/dir

adls_cp_url="abfss://$container@$storage_account.dfs.core.windows.net/$adls_cp_dir"

lib/hadoop/bin/hadoop \
  --config "./hadoop-conf" \
  distcp \
  -Dfs.azure.account.auth.type=OAuth \
  -Dfs.azure.account.oauth.provider.type=org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider \
  -Dhdp.version="$hdp_version" \
  -Dmapreduce.application.framework.path="$hdfs_dir/mapreduce.tar.gz#mr-framework" \
  -Dhadoop.security.credential.provider.path="jceks://hdfs$hdfs_dir/adls2keyfile.jceks" \
  -libjars "./lib/hadoop-azure-3.3.0.jar,./lib/wildfly-openssl-1.0.7.Final.jar" \
  "$hdfs_cp_dir" "$adls_cp_url"

Writing data with NiFi

Native ADLS Gen2 processors were introduced in NiFi 1.12.0 (NIFI-7103). While is was possible to interact with ADLS Gen2 using the HDFS processors, this implies storing the storage account access key in plain text in the HDFS core-site.xml what is obviously not recommended. Here we will focus on using the new native processors.

Import needed processors

First, we need to make the new processors available to our NiFi instance (note: you don’t have to upgrade your whole NiFi cluster, the following procedure was tested on NiFi 1.9.0):

Download NiFi 1.12.0 binaries from the official website
Extract the archive. The 2 NARs we are interested in are nifi-azure-nar-1.12.0.nar and nifi-azure-services-api-nar-1.12.0.nar. You will find them in the lib folder.
On all your NiFi nodes:
1. Locate the NiFi home (e.g. /usr/hdf/current/nifi on HDF)
2. Look for existing Azure NARs in the lib folder:
```
ls -l /usr/hdf/current/nifi/lib | grep azure
nifi-azure-nar-1.9.0.3.4.1.1-4.nar
```
3. Remove the existing Azure NARs (make a backup)
4. Copy the 2 NiFI 1.12.0 Azure NARs mentionned in step 2. to the lib folder
5. Make sure that they are owned by the nifi user:
```
chown nifi:nifi /usr/hdf/current/nifi/lib/nifi-azure-*
```
Restart the NiFi cluster (via Ambari on HDF)
Check that the new Azure processors are available (e.g. PutAzureDataLakeStorage):

Generating a Shared Access Signature

A ADLSCredentialsControllerService is used by NiFi to store the credentials to access ADLS. It takes either an account key or a SAS token to authenticate. Even if both crendential types are encrypted by NiFi, we will use the SAS method which is recommended by Microsoft.

Let’s generate a user delegation SAS for our ADLS container. A user delegation SAS uses an Active Directory account to sign the SAS (instead of an account key for account/service SAS). It allows permissions double check on access:

Check of the access permissions of the user that signed the SAS
Check of the access permissions granted by the SAS (see Specify permissions)

The easiest way to generate a SAS is using Azure CLI:

Install Azure CLI
Login to the CLI using the service principal created in the article’s first section (password = secret)
```
az login --service-principal -u SP_ID --tenant mycompany.onmicrosoft.com
Password:
```

Use the az storage container generate-sas command (adjust permissions):

#Fill in those properties
storage_account=ADLS_STORAGE_ACCOUNT
container=ADLS_CONTAINER

az storage container generate-sas -n "$container" \
  --account-name "$storage_account" \
  --subscription "MySubscription" \
  --permission dlrw \
  --expiry 2020-09-25T00:00:00Z \
  --as-user --auth-mode login

Save the generated SAS

Note: the maximum duration of a user delegation SAS is 7 days, so you will have to automate its renewal somehow, or use a service SAS which can have longer lifetime.

Configure the dataflow

We can now interact with ADLS Gen2 from a NiFi dataflow:

Create a ADLSCredentialsControllerService
Fill in the properties:
1. Storage Account Name
2. SAS Token: Add a ? before the SAS generated above. E.g.
```
?se=2020-11-10T00%3A00%3A00Z&sp=rwdl&sv=2018-11-09&sr=...
```
Create whatever processor you need to access ADLS Gen2, using the ADLSCredentialsControllerService created above

Alternative solutions

Azure Data Factory

An alternative solution to copy data from an on-premise Hadoop cluster to Azure is to use Azure Data Factory, that offers a bunch of connectors:

In addition Data Factory lets you create a self-hosted integration runtime inside your on-premise private network.

All the setup is well explained on Azure documentation so we won’t talk more about it here!

Recap

In this article we have seen that even if is not natively supported, we can easily use later versions of a few components to interact with ADLS Gen2 from both HDP and HDF platforms.

Share this article