Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster.
In this article, we show how we use Apache Sentry at Adaltas. For this demonstration, we picked a use case that we faced at one of our customers: Pages Jaunes. For the sake of privacy, the data displayed in this article is fake.
The use case
Pages Jaunes is the largest French web advertiser and sells its audience to its customers. To convince a potential client, sales representatives have to present the results this client can expect from joining the Pages Jaunes website. To that end, the data scientists at Pages Jaunes have to predict the client's likely audience as accurately as possible. So we had to integrate the marketing team into the data lake that was already set up.
Let’s call these data scientists marketing_analysts. Our objective is to give marketing_analysts read access to the database dw_audience and full access to a sandbox database that we are going to call dw_marketing_analysts.
The environment used for this demonstration:
– CDH 5.8 on CentOS 6
– Hue 3.10
First step: Create missing Unix groups and users
The first thing to do is to create the appropriate Unix group and Unix users on all hosts of your cluster.
If your cluster is connected to an LDAP directory, just add the new entries there. If not, rather than creating users one by one on each host, you can run the commands on every host with Ryba.
Let’s create the group grp_marketing_analysts, an application user usr_marketing_analysts, and then two data scientists, John Doe and Marcelus Wallace:
./bin/ryba exec 'sudo groupadd grp_marketing_analysts'
./bin/ryba exec 'sudo adduser -g grp_marketing_analysts usr_marketing_analysts'
./bin/ryba exec 'sudo adduser -G grp_marketing_analysts jdoe'
./bin/ryba exec 'sudo adduser -G grp_marketing_analysts mwallace'
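Before moving on, it can be worth checking that the accounts really ended up in the group on a given host. The following is a minimal sketch, not part of our original procedure: the helper name in_group is hypothetical, and it relies only on the standard id command.

```shell
# Return 0 if the given user belongs to the given group on this host.
# in_group is a hypothetical helper name introduced for this sketch.
in_group() {
  local user="$1" group="$2"
  # id -Gn prints all group names of the user, separated by spaces
  id -Gn "$user" 2>/dev/null | tr ' ' '\n' | grep -qx "$group"
}

# Check the accounts created above (names taken from this article).
for user in usr_marketing_analysts jdoe mwallace; do
  if in_group "$user" grp_marketing_analysts; then
    echo "OK: $user is in grp_marketing_analysts"
  else
    echo "MISSING: $user is not in grp_marketing_analysts"
  fi
done
```

To look at every host at once, a simpler command such as id jdoe can be passed to ./bin/ryba exec in the same way as the commands above.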
Second step: Create the missing database dw_marketing_analysts
First, as the HDFS superuser, create the HDFS directory that will store the database and set the right permissions. From an edge node:
sudo -u hdfs hdfs dfs -mkdir -p /user/usr_marketing_analysts/warehouse/dw_marketing_analysts
sudo -u hdfs hdfs dfs -chown -R usr_marketing_analysts:grp_marketing_analysts /user/usr_marketing_analysts
Depending on your policy, set a space quota on this directory:
sudo -u hdfs hdfs dfsadmin -setSpaceQuota 100g /user/usr_marketing_analysts
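One point to keep in mind when choosing the value: an HDFS space quota counts every replica of a block, so the amount of data that actually fits is the quota divided by the replication factor. The sketch below does that arithmetic. The function names are hypothetical, and the replication factor of 3 is the HDFS default, which is only an assumption about your cluster.

```shell
# Raw space quota (in GB) needed to hold a given amount of logical data.
# HDFS space quotas count every replica, so the logical size is multiplied
# by the replication factor (assumed to be 3 here, the HDFS default).
raw_quota_gb() {
  local logical_gb="$1" replication="${2:-3}"
  echo $(( logical_gb * replication ))
}

# Logical data (in GB, rounded down) that fits under a given raw quota.
usable_gb() {
  local quota_gb="$1" replication="${2:-3}"
  echo $(( quota_gb / replication ))
}

echo "A 100g quota holds about $(usable_gb 100)g of data at replication 3"
echo "To hold 100g of data, the quota would be $(raw_quota_gb 100)g"
```

The current quota and the remaining space can be read back with sudo -u hdfs hdfs dfs -count -q -h /user/usr_marketing_analysts.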
Then set an ACL to allow hive and impala users to write into these directories:
sudo -u hdfs hdfs dfs -setfacl -R -m user:impala:rwx /user/usr_marketing_analysts/warehouse
sudo -u hdfs hdfs dfs -setfacl -R -m user:hive:rwx /user/usr_marketing_analysts/warehouse
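To confirm that both ACL entries are in place, the output of hdfs dfs -getfacl can be checked. This is a small sketch under the assumption that the hdfs client is on the PATH of the edge node; the function names are hypothetical.

```shell
# Return 0 if the named user has an exact "rwx" ACL entry on the HDFS path.
# has_rwx_acl and check_warehouse_acls are hypothetical helper names.
has_rwx_acl() {
  local path="$1" user="$2"
  # getfacl prints one entry per line, for example "user:impala:rwx"
  hdfs dfs -getfacl "$path" 2>/dev/null | grep -qx "user:${user}:rwx"
}

# Report the ACL status of the two service users used in this article.
check_warehouse_acls() {
  local path="$1" user
  for user in impala hive; do
    if has_rwx_acl "$path" "$user"; then
      echo "OK: $user has rwx on $path"
    else
      echo "MISSING: no rwx ACL entry for $user on $path"
    fi
  done
}
```

It would be called as check_warehouse_acls /user/usr_marketing_analysts/warehouse by a user who is allowed to read that directory.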
Create the database as the Hive/Impala superuser, with its location pointing to the directory created above:
sudo -u hive hive -e "CREATE DATABASE dw_marketing_analysts LOCATION '/user/usr_marketing_analysts/warehouse/dw_marketing_analysts'"
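Shell quoting is easy to get wrong here, because the HiveQL statement itself contains single quotes around the location. One way to keep this readable is to build the statement with printf and pass it to hive -e inside double quotes. The helper below is a hypothetical sketch, and the IF NOT EXISTS clause is our addition so that the statement can be run more than once.

```shell
# Print a CREATE DATABASE statement with the location correctly quoted.
# create_db_sql is a hypothetical helper name introduced for this sketch.
create_db_sql() {
  local db="$1" location="$2"
  printf "CREATE DATABASE IF NOT EXISTS %s LOCATION '%s';\n" "$db" "$location"
}

create_db_sql dw_marketing_analysts \
  /user/usr_marketing_analysts/warehouse/dw_marketing_analysts
```

The printed statement can then be run with sudo -u hive hive -e "$(create_db_sql ...)", and the result checked with sudo -u hive hive -e 'DESCRIBE DATABASE dw_marketing_analysts', which shows the location of the database.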
Third step: Set up privileges with Sentry through the Hue web UI
Create the group and the users in Hue. This part is pretty straightforward thanks to the Hue web UI. Go to the Security > Hive Tables panel and click on Roles on the left side. Now create a new role, named after the group according to your policies.
You have to specify both Hive privileges and HDFS privileges for this role.
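For readers who prefer the command line, the same kind of role can be described with the SQL statements that Sentry adds to Hive. The sketch below only prints the statements. It is an assumption on our side that the HDFS privileges map to a URI privilege on the database directory, and the hdfs://namenode:8020 prefix is a placeholder for your own NameNode address.

```shell
# Print the Sentry statements matching the objective of this article:
# read access on dw_audience, full access on dw_marketing_analysts.
# sentry_grants_sql is a hypothetical helper name.
sentry_grants_sql() {
  local role="$1" group="$2" uri="$3"
  cat <<EOF
CREATE ROLE ${role};
GRANT ROLE ${role} TO GROUP ${group};
GRANT SELECT ON DATABASE dw_audience TO ROLE ${role};
GRANT ALL ON DATABASE dw_marketing_analysts TO ROLE ${role};
GRANT ALL ON URI '${uri}' TO ROLE ${role};
EOF
}

# The role is named after the group, as done in Hue.
# The NameNode host and port below are placeholders, not real values.
sentry_grants_sql grp_marketing_analysts grp_marketing_analysts \
  'hdfs://namenode:8020/user/usr_marketing_analysts/warehouse/dw_marketing_analysts'
```

The output can be saved to a file and run with beeline -u followed by your HiveServer2 JDBC URL and -f followed by the file name, connected as a user who belongs to a Sentry admin group.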
Here we have set up privileges for named users; you can apply the exact same process for application users.
We have also set up privileges at the database level, but you can apply finer-grained authorizations on tables or columns. For more information on privileges and their hierarchy, please visit the Sentry documentation.
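As an illustration of finer-grained privileges, here is a short sketch of table-level and column-level SELECT grants. The table name visits and the column name page_id are invented for the example and do not come from the real dw_audience database. The function names are hypothetical as well.

```shell
# Print a table-level SELECT grant inside dw_audience.
grant_table_select_sql() {
  local role="$1" table="$2"
  printf 'USE dw_audience;\nGRANT SELECT ON TABLE %s TO ROLE %s;\n' "$table" "$role"
}

# Print a column-level SELECT grant inside dw_audience.
grant_column_select_sql() {
  local role="$1" table="$2" column="$3"
  printf 'USE dw_audience;\nGRANT SELECT(%s) ON TABLE %s TO ROLE %s;\n' \
    "$column" "$table" "$role"
}

# visits and page_id are made-up names used only for illustration.
grant_table_select_sql grp_marketing_analysts visits
grant_column_select_sql grp_marketing_analysts visits page_id
```

With grants like these in place of the database-level SELECT, the role would only see the listed table or column.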