Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications?

Apache Knox overview

Apache Knox is an application gateway for interacting in a secure way with the REST APIs and the user interfaces of one or more Hadoop clusters. Out of the box, it provides proxying and protection for the REST APIs and web UIs of most components of the Hadoop ecosystem.

On the other hand, it is not an alternative to Kerberos for strong authentication of a Hadoop cluster, nor a channel for acquiring or exporting large volumes of data.

We can define the benefits of the gateway in four different categories:

  • Enhanced security through the exposure of REST and HTTP services without revealing the details of the Hadoop cluster, the filtering of web application vulnerabilities, and the addition of SSL to services that do not support it natively;
  • Centralized control through the use of a single gateway, which facilitates auditing and authorizations (with Apache Ranger);
  • Simplified access thanks to the encapsulation of services with Kerberos or the use of a single SSL certificate;
  • Enterprise integration through leading market solutions (Microsoft Active Directory, LDAP, Kerberos, etc.) or custom solutions (Apache Shiro, etc.).

Apache Knox example of architecture

Kerberos encapsulation

Encapsulation is mainly used for products that are incompatible with the Kerberos protocol. The user provides their username and password via HTTP Basic authentication.

Kerberos encapsulation workflow

Simplified management of client certificates

The user relies solely on the Apache Knox certificate, so the certificates of the different services are centralized on the Apache Knox servers rather than distributed to all the clients. This is very useful when revoking a certificate and issuing a new one.

Certificate Management via Apache Knox

Apache Ranger integration

Apache Knox includes an Apache Ranger agent to check the permissions of users who want to access cluster resources.

Apache Ranger plugin workflow in Apache Knox

Hadoop URLs vs. Apache Knox URLs

Using Apache Knox URLs obscures the cluster architecture and allows users to remember only one URL.

Service       | Hadoop URL                           | Apache Knox URL
------------- | ------------------------------------ | ------------------------------------------------
WebHDFS       | http://namenode-host:50070/webhdfs   | https://knox-host:8443/gateway/default/webhdfs
WebHCat       | http://webhcat-host:50111/templeton  | https://knox-host:8443/gateway/default/templeton
Apache Oozie  | http://oozie-host:11000/oozie        | https://knox-host:8443/gateway/default/oozie
Apache HBase  | http://hbase-host:60080              | https://knox-host:8443/gateway/default/hbase
Apache Hive   | http://hive-host:10001/cliservice    | https://knox-host:8443/gateway/default/hive

Customizing Apache Knox

To configure the Apache Knox gateway, we need to modify the topology files. These files consist of three components: Providers, HA Provider and Services.


You can find the topology files in the <GATEWAY_HOME>/conf/topologies directory. The name of the topology file dictates the Apache Knox URL: if the file is named sandbox.xml, the URL will be https://knox-host:8443/gateway/sandbox/webhdfs.

Here is an example of a topology file:
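A minimal sketch, assuming LDAP authentication through the ShiroProvider and a single WebHDFS service; hostnames, the DN template and the realm class package (which moved to org.apache.knox in recent Knox releases) are assumptions to adapt:

```xml
<topology>
  <gateway>
    <!-- Authentication against an LDAP directory (illustrative values) -->
    <provider>
      <role>authentication</role>
      <name>ShiroProvider</name>
      <enabled>true</enabled>
      <param>
        <name>main.ldapRealm</name>
        <value>org.apache.hadoop.gateway.shirorealm.KnoxLdapRealm</value>
      </param>
      <param>
        <name>main.ldapRealm.userDnTemplate</name>
        <value>uid={0},ou=people,dc=example,dc=com</value>
      </param>
      <param>
        <name>main.ldapRealm.contextFactory.url</name>
        <value>ldap://ldap-host:389</value>
      </param>
      <param>
        <name>urls./**</name>
        <value>authcBasic</value>
      </param>
    </provider>
    <!-- Identity assertion passed on to the rest of the cluster -->
    <provider>
      <role>identity-assertion</role>
      <name>Default</name>
      <enabled>true</enabled>
    </provider>
  </gateway>
  <!-- Routed services and their backend URLs -->
  <service>
    <role>WEBHDFS</role>
    <url>http://namenode-host:50070/webhdfs</url>
  </service>
</topology>
```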

Be careful: if you use Apache Ambari, the admin, knoxsso, manager and default topologies must be modified via the web interface; otherwise, the files will be overwritten when the service restarts.


Providers add new features (authentication, federation, authorization, identity assertion, etc.) to the gateway that can be used by the different services; they usually consist of one or more filters added to one or more topologies.

The Apache Knox gateway supports federation through HTTP headers. Federation provides a quick way to implement single sign-on (SSO) by propagating user and group information; use it only in a highly controlled network environment.

The default authentication provider is Apache Shiro (ShiroProvider). It is used for authentication to an Active Directory or LDAP. For authentication via Kerberos, we will use HadoopAuth.

There are five main identity-assertion providers:

  • Default: this is the default provider for simple mapping of user names and/or groups. It is responsible for establishing the identity passed on to the rest of the cluster;
  • Concat: this provider allows the composition of a new user name and/or groups by concatenating a prefix and/or a suffix;
  • SwitchCase: this provider handles the case where the ecosystem requires a specific case for user names and/or groups;
  • Regex: it allows incoming identities to be translated using a regular expression;
  • HadoopGroupProvider: this provider looks up the groups of the user, which greatly facilitates the definition of permissions (via Apache Ranger).

If you use the HadoopGroupProvider provider, you only need to use groups when setting up permissions (via Apache Ranger). A JIRA (KNOX-821: Identity Assertion Providers must be able to be chained together) was opened to allow chaining several identity-assertion providers.
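As a sketch, enabling HadoopGroupProvider in a topology could look like this; the group-mapping parameters delegate the lookup to Hadoop's standard group mapping, and the LDAP values shown are assumptions:

```xml
<provider>
  <role>identity-assertion</role>
  <name>HadoopGroupProvider</name>
  <enabled>true</enabled>
  <!-- Group lookup delegated to Hadoop group mapping (illustrative LDAP settings) -->
  <param>
    <name>hadoop.security.group.mapping</name>
    <value>org.apache.hadoop.security.LdapGroupsMapping</value>
  </param>
  <param>
    <name>hadoop.security.group.mapping.ldap.url</name>
    <value>ldap://ldap-host:389</value>
  </param>
</provider>
```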


Services, in turn, add new routing rules to the gateway. The services are located in the <GATEWAY_HOME>/data/services directory. Here is an example of the files that make up the HIVE service:
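The layout looks like this (the version directory will vary with your distribution):

```
<GATEWAY_HOME>/data/services/hive
└── 0.13.0
    ├── rewrite.xml
    └── service.xml
```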

In the hive directory, we find the 0.13.0 directory (which corresponds to the version of the service), and inside this folder, the rewrite.xml and service.xml files.

Custom services

Let’s assume a REST application available at http://rest-application:8080/api/. We just need to create a new service and add it to the topology file.

In the service.xml file, the value of {{ knox_service_version }} must be equal to the name of the parent folder:
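A sketch for a hypothetical MYAPP service (the role, name and route below are illustrative):

```xml
<!-- <GATEWAY_HOME>/data/services/myapp/0.0.1/service.xml -->
<!-- the version attribute must match the parent folder name (here 0.0.1) -->
<service role="MYAPP" name="myapp" version="0.0.1">
  <routes>
    <route path="/myapp/**"/>
  </routes>
</service>
```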

In the rewrite.xml file, we rewrite the URL (here, adding /api):
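A matching sketch for the same hypothetical MYAPP service, mapping inbound gateway URLs to the backend /api path:

```xml
<!-- <GATEWAY_HOME>/data/services/myapp/0.0.1/rewrite.xml -->
<rules>
  <rule dir="IN" name="MYAPP/myapp/inbound" pattern="*://*:*/**/myapp/{path=**}?{**}">
    <!-- {$serviceUrl[MYAPP]} resolves to the <url> declared in the topology -->
    <rewrite template="{$serviceUrl[MYAPP]}/api/{path=**}?{**}"/>
  </rule>
</rules>
```

The topology then declares the backend with `<service><role>MYAPP</role><url>http://rest-application:8080</url></service>`, and the application becomes reachable at https://knox-host:8443/gateway/default/myapp.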

For more details on setting up custom services behind Apache Knox, here is a link to an article.

Tips and tricks

Setting up SSL

When storing the private key in the keystore, the $PRIVATEKEYALIAS value must be gateway-identity.

The password must match the one used when generating the master secret (the Apache Knox master secret); that is why we use the same password everywhere ($JKSPASS).
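As a sketch, assuming the company certificate and private key are available in a PKCS12 file (the file name is an assumption; knoxcli.sh and keytool are the standard tools):

```shell
# Generate the master secret, which protects the gateway keystore
$GATEWAY_HOME/bin/knoxcli.sh create-master

# Import the private key and certificate into the gateway keystore.
# The destination alias must be gateway-identity and the passwords
# must match the master secret ($JKSPASS).
keytool -importkeystore \
  -srckeystore knox-cert.p12 -srcstoretype PKCS12 \
  -srcstorepass "$JKSPASS" -srcalias "$PRIVATEKEYALIAS" \
  -destkeystore $GATEWAY_HOME/data/security/keystores/gateway.jks \
  -deststorepass "$JKSPASS" -destkeypass "$JKSPASS" \
  -destalias gateway-identity
```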

Common mistakes

First of all, a clean restart of Apache Knox solves some problems; purge the logs before replaying the problematic query.

Response too large

If your queries fail quickly with a 500 error, the response may be too large (8 KB by default); the failure is visible in the gateway-audit.log file.

Simply modify the topology to add a parameter to the service in question:
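For example, for the Hive service, the replayBufferSize parameter (expressed in KB) raises the buffer; the value below is an assumption to adjust to your payloads:

```xml
<service>
  <role>HIVE</role>
  <url>http://hive-host:10001/cliservice</url>
  <param>
    <name>replayBufferSize</name>
    <value>32</value> <!-- in KB, default is 8 -->
  </param>
</service>
```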

Apache Solr

If you have enabled Apache Solr auditing in Apache Ranger (xasecure.audit.destination.solr=true), Apache Knox may stop working when Apache Solr has a problem, such as no space left on its filesystem.

Apache Hive connections frequently dropped via Apache Knox

To correct this problem, you must add timeout properties to the gateway-site configuration file (these values must be adjusted to your environment).
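A sketch of such settings, assuming the HTTP client timeouts are the culprit; the property names come from the Knox documentation, but the five-minute values are assumptions to tune:

```xml
<!-- gateway-site.xml -->
<property>
  <name>gateway.httpclient.connectionTimeout</name>
  <value>300000</value> <!-- 5 minutes, in milliseconds -->
</property>
<property>
  <name>gateway.httpclient.socketTimeout</name>
  <value>300000</value>
</property>
```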

How to check my deployment?

Apache Knox provides a command-line client, knoxcli.sh, to check several aspects of the deployment of your instance.

Certificate validation

Validate the certificate used by the Apache Knox instance, to verify that it is one of your company's certificates.
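One way to inspect the certificate actually served by the gateway, for example with openssl (the host name is illustrative):

```shell
openssl s_client -connect knox-host:8443 -showcerts < /dev/null \
  | openssl x509 -noout -subject -issuer -dates
```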

Topology validation

Validates that the description of a cluster (the topology whose name equals clustername) follows the correct formatting rules.
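For example, to validate the default topology (the cluster name is illustrative):

```shell
$GATEWAY_HOME/bin/knoxcli.sh validate-topology --cluster default
# or validate a topology file directly:
$GATEWAY_HOME/bin/knoxcli.sh validate-topology --path conf/topologies/default.xml
```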

Authentication and authorization via LDAP

This command tests the ability of a cluster configuration (topology) to authenticate a user with the ShiroProvider settings. The --g parameter lists the groups of which the user is a member. The --u and --p parameters are optional; if they are not provided, the terminal will prompt you for them.
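For example (the user name, password and cluster name are illustrative):

```shell
$GATEWAY_HOME/bin/knoxcli.sh user-auth-test --cluster default --u myuser --p mypassword --g
```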

Topology / LDAP binding

This command tests the ability of a cluster configuration (topology) to bind to the LDAP directory with the system user defined in the ShiroProvider settings.
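For example (the cluster name is illustrative; --d prints additional debug details):

```shell
$GATEWAY_HOME/bin/knoxcli.sh system-user-auth-test --cluster default --d
```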

Gateway test

Using the HTTP Basic access authentication method:
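For example, a WebHDFS listing through the default topology (credentials and host are illustrative; -k skips certificate verification for a quick test):

```shell
curl -ku myuser:mypassword 'https://knox-host:8443/gateway/default/webhdfs/v1/?op=LISTSTATUS'
```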

Using the Kerberos authentication method:
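For example, after obtaining a ticket with kinit, against a Kerberos-enabled topology (here assumed to be named kdefault):

```shell
kinit myuser
curl -k --negotiate -u : 'https://knox-host:8443/gateway/kdefault/webhdfs/v1/?op=LISTSTATUS'
```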

In this case, we use the kdefault topology, which uses a HadoopAuth authentication provider.

Direct reading

The /usr/hdp/current/knox-server/data/deployments/ directory contains the folders corresponding to your different topologies. Whenever you update a topology, a new directory named {{ topologyname }}.topo.{{ timestamp }} is created.

In the %2F/WEB-INF/ subdirectory, you will find the rewrite.xml file, which is a concatenation of the topology file and the services files. In this file, you can check that your rewrite rules have been taken into account.

Apache Knox Ranger plugin debug

This configuration makes it possible to see what is transmitted to Apache Ranger via the Apache Knox plugin. You must modify the gateway-log4j.properties file as below, restart your Apache Knox instances, then check the logs in the ranger.knoxagent.log file.
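A sketch of such a configuration (the appender name and log directory variable are assumptions to adapt):

```properties
# gateway-log4j.properties
log4j.logger.ranger.knoxagent=DEBUG,KNOXAGENT
log4j.additivity.ranger.knoxagent=false
log4j.appender.KNOXAGENT=org.apache.log4j.DailyRollingFileAppender
log4j.appender.KNOXAGENT.File=${app.log.dir}/ranger.knoxagent.log
log4j.appender.KNOXAGENT.layout=org.apache.log4j.PatternLayout
log4j.appender.KNOXAGENT.layout.ConversionPattern=%d{ISO8601} %5p %c{1} - %m%n
```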

Improve response times via Apache Knox

If you observe much higher response times through Apache Knox than with direct access, you can change these settings in the gateway-site configuration file.
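A sketch of settings worth tuning; the property names come from the Knox documentation, but the values are assumptions to benchmark in your environment:

```xml
<!-- gateway-site.xml -->
<property>
  <name>gateway.httpclient.maxConnections</name>
  <value>128</value> <!-- default is 32 -->
</property>
<property>
  <name>gateway.threadpool.max</name>
  <value>512</value> <!-- default is 254 -->
</property>
```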


In conclusion, Apache Knox is a powerful tool, notably with Apache Ranger audits, to filter and audit all access to your environment(s). But it can also be used as a classic gateway in front of your various custom services, simply through configuration. For example, you can add Kerberos authentication in front of REST APIs that do not have it. Now it's up to you to try it out and give your feedback in the comments section.