Managing authorizations with Apache Sentry
Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.
In this article, we show how we use Apache Sentry at Adaltas. For this demonstration, we picked a use case that we faced at one of our customers: Pages Jaunes. For the sake of privacy, the data displayed in this article is fake.
Pages Jaunes is the biggest French web-announcer, and sells its audience to its customers. To convince potential customers, sales representatives have to present the outcomes they can expect from joining the Pages Jaunes website. To this end, the Data Scientists at Pages Jaunes have to predict as accurately as possible the plausible audience of each customer. So we had to open the existing Data Lake to the marketing team.
Let’s call these Data Scientists marketing_analysts. Our objective is to give marketing_analysts read access to the database dw_audience and full access to a sandbox database that we are going to call dw_marketing_analysts.
The demonstration runs on the following environment:
- CDH 5.8 on CentOS 6
- Hue 3.10
The first thing to do is to create the appropriate Unix group and Unix users on all hosts of your cluster.
If your cluster is connected to an LDAP directory, just add the new entries there. If not, rather than creating each user by hand on every host, you can deploy them with a deployment tool such as Nikita or Ansible.
Let’s create the group grp_marketing_analysts, the user usr_marketing_analysts, then two data scientists: John Doe and Marcelus Wallace. Below, we use the command ryba exec, which itself relies on Nikita to distribute SSH commands:
./bin/ryba exec 'sudo groupadd grp_marketing_analysts'
./bin/ryba exec 'sudo adduser -g grp_marketing_analysts usr_marketing_analysts'
./bin/ryba exec 'sudo adduser -G grp_marketing_analysts jdoe'
./bin/ryba exec 'sudo adduser -G grp_marketing_analysts mwallace'
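Before moving on, it can be worth checking on a host that the accounts really ended up in the group. The following is a minimal sketch, not part of the original procedure: the helper name in_group is ours, and it only relies on the standard id command.

```shell
# Return 0 when user $1 belongs to group $2 (primary or supplementary).
# in_group is a hypothetical helper name introduced for this sketch.
in_group() {
  id -nG "$1" 2>/dev/null | tr ' ' '\n' | grep -qxF "$2"
}

# Report the membership of the three accounts created above.
for u in usr_marketing_analysts jdoe mwallace; do
  if in_group "$u" grp_marketing_analysts; then
    echo "$u: ok"
  else
    echo "$u: missing or not in grp_marketing_analysts"
  fi
done
```

Run on a single host, this only validates that host. A simpler check such as id jdoe can presumably be distributed to every node through ryba exec, in the same way as the commands above.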
First, as the HDFS superuser, create the HDFS directory that will store the database and set the right permissions. From an edge node:
sudo -u hdfs hdfs dfs -mkdir -p /user/usr_marketing_analysts/warehouse/dw_marketing_analysts
sudo -u hdfs hdfs dfs -chown -R usr_marketing_analysts:grp_marketing_analysts /user/usr_marketing_analysts
Depending on your policy, set a quota on this directory:
sudo -u hdfs hdfs dfsadmin -setSpaceQuota 100g /user/usr_marketing_analysts
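Keep in mind that an HDFS space quota counts every replica of a block, so the volume of data that actually fits is smaller than the figure passed to -setSpaceQuota. The sketch below does the arithmetic, assuming the default replication factor of 3 (an assumption, check dfs.replication on your cluster); usable_gb is a helper name of ours.

```shell
QUOTA_GB=100      # value passed to -setSpaceQuota above
REPLICATION=3     # assumption: default HDFS replication factor

# Logical data volume that fits under the quota, in whole GB
# (integer division, so the result is rounded down).
usable_gb() {
  echo $(( $1 / $2 ))
}

echo "Roughly $(usable_gb "$QUOTA_GB" "$REPLICATION") GB of data fit under a ${QUOTA_GB} GB space quota"

# On a live cluster, the quota and the remaining space can be inspected with:
#   sudo -u hdfs hdfs dfs -count -q /user/usr_marketing_analysts
```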
Then set an ACL to allow hive and impala users to write into these directories:
sudo -u hdfs hdfs dfs -setfacl -R -m user:impala:rwx /user/usr_marketing_analysts/warehouse
sudo -u hdfs hdfs dfs -setfacl -R -m user:hive:rwx /user/usr_marketing_analysts/warehouse
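The -R flag only applies the ACL to what exists at the time the command runs. Directories created later inherit nothing unless a default ACL is also set, and ACLs must be enabled on the NameNode (dfs.namenode.acls.enabled) for any of this to work. The sketch below is an optional complement, not part of the original procedure: it prints the commands that add default ACL entries and then display the result. The run wrapper and the APPLY variable are hypothetical conveniences; nothing is executed unless APPLY=1 is set.

```shell
WAREHOUSE=/user/usr_marketing_analysts/warehouse

# Print the command instead of running it, unless APPLY=1 is set.
run() {
  if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

acl_commands() {
  # Default entries are inherited by files and directories created afterwards.
  for svc in hive impala; do
    run sudo -u hdfs hdfs dfs -setfacl -R -m "default:user:${svc}:rwx" "$WAREHOUSE"
  done
  # Display the resulting ACL for a visual check.
  run sudo -u hdfs hdfs dfs -getfacl "$WAREHOUSE"
}

acl_commands
```

Review the printed commands first, then rerun with APPLY=1 from an edge node.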
Create the database as the Hive/Impala superuser, pointing it at the directory created previously:
sudo -u hive hive -e "CREATE DATABASE dw_marketing_analysts LOCATION '/user/usr_marketing_analysts/warehouse/dw_marketing_analysts'"
Create the group and the users in Hue. This part is pretty straightforward thanks to the Hue web UI.
Go to the Security > Hive Tables panel and click on Roles on the left side. Now create a new role, which has to be named after the group according to your policies.
You have to specify Hive privileges and HDFS privileges.
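For readers who prefer the command line, the same role can be expressed with Sentry's SQL statements. The sketch below is only an approximation of what was configured in Hue. It assumes the role is named after the group, that the HDFS privilege is a URI privilege on the sandbox directory, and that the NameNode address is the placeholder shown (namenode.example.com:8020 is hypothetical). The function only prints the statements.

```shell
ROLE=grp_marketing_analysts      # assumption: role named after the group
GROUP=grp_marketing_analysts
NAMENODE=namenode.example.com:8020   # hypothetical NameNode address
SANDBOX_URI="hdfs://${NAMENODE}/user/usr_marketing_analysts/warehouse/dw_marketing_analysts"

# Print the Sentry statements matching the objective stated earlier:
# read access on dw_audience, full access on the sandbox database.
sentry_statements() {
  cat <<EOF
CREATE ROLE ${ROLE};
GRANT ROLE ${ROLE} TO GROUP ${GROUP};
GRANT SELECT ON DATABASE dw_audience TO ROLE ${ROLE};
GRANT ALL ON DATABASE dw_marketing_analysts TO ROLE ${ROLE};
GRANT ALL ON URI '${SANDBOX_URI}' TO ROLE ${ROLE};
SHOW GRANT ROLE ${ROLE};
EOF
}

sentry_statements
```

To apply them, redirect the output to a file and pass it to beeline with the -f option, connected with -u to your HiveServer2 JDBC URL as a user who belongs to a Sentry admin group.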
Here we have set up privileges for named users; you can apply the exact same process to application users.
We have also set up privileges at the database level, but you can apply finer-grained authorizations on tables or columns. For more information on privileges and their hierarchies, please visit the Sentry documentation.
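As an illustration of those finer-grained authorizations, the sketch below prints a table-level and a column-level grant. The table and column names (visits, customers, city) are invented for the example and do not come from the customer's schema; the role name again assumes the role is named after the group. To our knowledge, column-level privileges in Sentry are limited to SELECT.

```shell
ROLE=grp_marketing_analysts   # assumption: role named after the group

# Read access to a whole table ($1 = table name).
table_grant() {
  echo "GRANT SELECT ON TABLE $1 TO ROLE ${ROLE};"
}

# Read access to a single column ($1 = table name, $2 = column name).
column_grant() {
  echo "GRANT SELECT($2) ON TABLE $1 TO ROLE ${ROLE};"
}

# Hypothetical table and column names, for illustration only.
echo "USE dw_audience;"
table_grant visits
column_grant customers city
```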