Managing User Identities on Big Data Clusters

Securing a Big Data cluster involves integrating or deploying services that store user information. Some users are specific to a cluster while others are available across all clusters. It is not always easy to understand how these services fit together and whether they should be shared across multiple clusters. It is equally unclear which strategy to choose, and what its impact will be, when the organization plans to migrate the knowledge of users at a later date from a local service to an Active Directory or FreeIPA instance already present in the company.

Authentication and identification

Hadoop relies on the Kerberos protocol for authentication. Authentication is about validating that users are who they claim to be. For instance, if I submit a request to a system, the role of authentication is not to validate that I have permission to submit this request but only to confirm that I am the person I claim to be.

In Kerberos, the identity of a user is defined by a principal. A principal is attached to a realm (a Kerberos domain) and takes two forms: {username}@{realm} and {username}/{fqdn}@{realm}. The latter form extends the former with the fully qualified domain name of a host. It applies to system users installed on a node. For example, a my_app component installed on the node_01.cluster_01.company.com node might receive a principal name like my_app/node_01.cluster_01.company.com@CLUSTER01_COMPANY. This type of principal is used by systems that rely on Kerberos to authenticate their internal services in addition to the external users who access them.

Kerberos offers two mechanisms to authenticate: a password or a keytab. A keytab is a file that should be readable only by its owner. Using this file is an alternative to typing a password. The drawback is that anyone who gains access to someone else's keytab can impersonate that user. To limit the impact of a compromised keytab, each keytab is associated with a key version number. Generating a new keytab for a given principal increments this number and invalidates the previously generated keytabs.
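To make the two principal forms concrete, here is a minimal Python sketch that splits a principal into its parts. The Principal type and the parse_principal function are hypothetical names introduced for illustration. They are not part of Kerberos or Hadoop, and the parsing ignores edge cases such as escaped characters.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str             # the {username} part
    instance: Optional[str]  # the {fqdn} part, None for the short form
    realm: str               # the {realm} part


def parse_principal(principal: str) -> Principal:
    """Split '{username}@{realm}' or '{username}/{fqdn}@{realm}' into parts."""
    name, sep, realm = principal.rpartition("@")
    if not sep or not name or not realm:
        raise ValueError(f"not a valid principal: {principal!r}")
    primary, slash, instance = name.partition("/")
    if not primary or (slash and not instance):
        raise ValueError(f"not a valid principal: {principal!r}")
    return Principal(primary, instance if slash else None, realm)


# Short form, typically a person or a convenience account.
user = parse_principal("hdfs@CLUSTER01_COMPANY")
# Long form, a component tied to one host.
service = parse_principal(
    "my_app/node_01.cluster_01.company.com@CLUSTER01_COMPANY"
)
```

The presence of an instance is what distinguishes a host-bound service principal from a plain user principal.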

Once authenticated, users must be known by the underlying operating system. Identification consists of associating a user with properties. Hadoop delegates the knowledge of users and their group memberships to the system. In Linux, knowing a user implies the existence of a uid associated with each user and a gid associated with each group. For example, when YARN starts a Spark job, the files needed for this job (jar, configuration, …) are copied to each of the nodes running the job with the permissions of the user who submitted it. Since Big Data components are distributed across multiple nodes to form a cluster, the user configuration must be homogeneous across the whole cluster. By default, Linux stores its users in the /etc/passwd file. It is possible to point the system to a remote database, traditionally an LDAP directory. On RHEL and CentOS, the simplest way to configure this is to use the SSSD service.
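The homogeneity requirement can be checked mechanically. The sketch below, in Python, compares the uid of each user across nodes from text in the /etc/passwd format. The node names, accounts and uid values are made up for the example, and the helper names parse_passwd and inconsistent_users are hypothetical. In practice the file contents would be collected from each host by whatever tooling is available.

```python
from typing import Dict


def parse_passwd(text: str) -> Dict[str, int]:
    """Map each username to its uid from /etc/passwd-formatted text."""
    users: Dict[str, int] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Fields: name:password:uid:gid:gecos:home:shell
        fields = line.split(":")
        users[fields[0]] = int(fields[2])
    return users


def inconsistent_users(
    nodes: Dict[str, Dict[str, int]]
) -> Dict[str, Dict[str, int]]:
    """Return the users whose uid differs between nodes, with uid per node."""
    by_user: Dict[str, Dict[str, int]] = {}
    for node, users in nodes.items():
        for name, uid in users.items():
            by_user.setdefault(name, {})[node] = uid
    return {
        name: uids
        for name, uids in by_user.items()
        if len(set(uids.values())) > 1
    }


# Illustrative data only: yarn has a different uid on worker_02.
WORKER_01 = (
    "hdfs:x:2401:2401::/var/lib/hdfs:/bin/bash\n"
    "yarn:x:2403:2403::/var/lib/yarn:/bin/bash\n"
)
WORKER_02 = (
    "hdfs:x:2401:2401::/var/lib/hdfs:/bin/bash\n"
    "yarn:x:2999:2403::/var/lib/yarn:/bin/bash\n"
)

nodes = {
    "worker_01": parse_passwd(WORKER_01),
    "worker_02": parse_passwd(WORKER_02),
}
drift = inconsistent_users(nodes)
```

Here drift reports only the yarn user, because hdfs has the same uid on both nodes.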

The translation of a principal into a username is done through a set of rules defined in the auth-to-local property. For example, it is possible to define rules that translate both the principal hdfs@CLUSTER01_COMPANY and the principal data_node/worker_01.cluster_01.company.com@CLUSTER01_COMPANY into the hdfs identity.
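The Python sketch below models this translation in a simplified way. It is not Hadoop's implementation and does not use Hadoop's rule syntax, which is richer. The Rule type and the translate function are hypothetical names. Each rule applies to principals with a given number of name components, builds a string from a template where $0 stands for the realm and $1, $2 for the components, and returns a fixed username when that string matches a regular expression.

```python
import re
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    components: int   # number of name components the rule applies to
    template: str     # string built from $0 (realm) and $1, $2 (components)
    match: str        # regex the built string must fully match
    username: str     # local username returned when the rule matches


def translate(principal: str, rules: List[Rule]) -> Optional[str]:
    """Return the username of the first matching rule, or None."""
    name, _, realm = principal.rpartition("@")
    parts = name.split("/")
    for rule in rules:
        if len(parts) != rule.components:
            continue
        candidate = rule.template.replace("$0", realm)
        for index, part in enumerate(parts, start=1):
            candidate = candidate.replace(f"${index}", part)
        if re.fullmatch(rule.match, candidate):
            return rule.username
    return None


# Both the short hdfs principal and the host-bound data_node principal
# are mapped to the local hdfs user.
RULES = [
    Rule(1, "$1@$0", r"hdfs@CLUSTER01_COMPANY", "hdfs"),
    Rule(2, "$1@$0", r"data_node@CLUSTER01_COMPANY", "hdfs"),
]

short_form = translate("hdfs@CLUSTER01_COMPANY", RULES)
long_form = translate(
    "data_node/worker_01.cluster_01.company.com@CLUSTER01_COMPANY", RULES
)
```

The second rule drops the host part on purpose: every data_node principal of the realm, whatever the host, resolves to the same local identity.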

Authentication is handled by a Kerberos server while identification generally uses an LDAP server. The three most common solutions are:

The different types of users

We distinguish three types of users, as we define them at Adaltas:

Service users
Application users
Nominative users

Service users

A Linux service is an application that the system runs as a background process. To avoid storing unencrypted passwords on the system, the principals of Hadoop components do not have a known password and rely on a keytab instead. They are declared in Kerberos as {username}/{fqdn}@{domain}. Thus, each component is authenticated by a unique keytab. It is also possible to create users in the more classic form {username}@{domain}. For example, it is common to create a principal hdfs@MY_HADOOP_DOMAIN as a convenient user with extended permissions on the HDFS file system.

Service users are only known inside a cluster. It is therefore appropriate to group them into one Kerberos domain per cluster. This simplifies the isolation of the cluster from components configured outside it. In addition, the users created cannot conflict with same-named users of another cluster, as would otherwise be the case with the hdfs@MY_HADOOP_DOMAIN principal mentioned above.

There are some workarounds, but using a unique Kerberos domain per Hadoop cluster makes the cluster simpler to understand and administer.

The identity of these system users does not need to be delegated to an external system that centralizes its management. The user properties will not be updated and only the uid and gid are used by the host system. No groups are associated with these users other than those declared when the service is installed. For these reasons, we use the internal mechanism offered by the system, based on POSIX and the /etc/passwd file.

The only precaution is to make sure that the uid and gid values provisioned by the system do not conflict with those of the company LDAP. It is recommended either to configure the system to use a range compatible with the ones used inside the company or to use a provisioning tool (Ansible, Nikita, …) to control the generated uid and gid values.
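This precaution is easy to verify before provisioning. The Python sketch below flags local service accounts whose uid falls inside the range reserved for the company directory. The range boundaries, the account names and the find_uid_conflicts function are assumptions chosen for the example. The real range depends on how the company LDAP allocates identifiers.

```python
from typing import Dict, List

# Hypothetical range of uid values allocated by the company LDAP.
LDAP_UID_RANGE = range(10000, 60000)


def find_uid_conflicts(local_uids: Dict[str, int], reserved: range) -> List[str]:
    """Return the local accounts whose uid falls inside the reserved range."""
    return sorted(name for name, uid in local_uids.items() if uid in reserved)


# Illustrative service accounts as they would appear in /etc/passwd.
SERVICE_UIDS = {"hdfs": 2401, "yarn": 2403, "hive": 10007}

conflicts = find_uid_conflicts(SERVICE_UIDS, LDAP_UID_RANGE)
```

With these values, only hive is reported, since its uid could also be assigned to a directory user.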

Application users

Application users are associated with a project, an application, or a set of processes. In theory, they are the only ones able to launch jobs on a production cluster. This rule becomes more flexible depending on the size of the cluster, its role in the organization, the sensitivity of the data, and the SLAs committed to.

When an organization launches its first Big Data cluster, it is common for that cluster to combine the roles of production, pre-production, and even development environments. As usage grows and matures, new clusters are provisioned. Critical applications, both in terms of data sensitivity and operational impact in the event of a service interruption, may require that only a few “qualified” users are in charge of processing. At this point, developers will use a cluster dedicated to development. Likewise, users and data-consuming applications with ad-hoc processing will be moved to a cluster or a dedicated space for their needs, commonly called a Data Lab.

If the cluster supports both the roles of Data Lake and Data Lab, it is reasonable to open access to Data Analysts and Data Scientists so they can consume the data. In this case, the focus should be on multi-tenancy. The allocation of resources and the selection of the available components must guarantee a fair share of resources to each application and user. By leveraging HDFS tiering and YARN labels, Adaltas has deployed clusters that are split into two distinct spaces: the Data Lake with only application users and the Data Lab with a population made of Data Analysts and Data Scientists.

The identities of application users are usually shared between different clusters. For instance, a user acting on behalf of an application must exist on the production, pre-production, qualification, and even development clusters. However, it is not always recommended to share the authentication of a user. A keytab, or password, would be valid on all the clusters referencing the same Kerberos server. This is not always desirable, and the architecture must be carefully designed together with group membership considerations. The recommendation is to store user identities in a shared LDAP and to use a dedicated Kerberos server for each cluster. In general, application users make little use of groups, but when they do, it is also better to maintain group membership at the cluster level.

Nominative users

Nominative users are associated with an actual person. There are several user profiles. Some interact directly with the system, others through an application. Some require qualified data, others interact with raw data.

Data Analysts often consume data via analytic/BI tools. They create and consume more or less dynamic dashboards as well as KPIs. Data Scientists need the data at all stages of its life cycle to explore and produce new indicators, reductions and predictions.

The identities of registered users and their authentication mechanism are already present in most organizations, often delegated to an Active Directory or FreeIPA. It is recommended to integrate new clusters with the services already present in the company, and to do so as early as possible. This centralization provides a single place to manage passwords and guarantees that inactive users lose access. Assigning a user to a group can be managed either by a central authority or by a service attached to the cluster. In both cases, it is recommended that group membership be specific to each cluster. The same person can be a member of a group on one cluster but not on another.

Nominative user migration

If using the organization's central user directory is not immediately possible but is planned for the near future, it is relatively easy to postpone this integration. The only requirement is to keep the same usernames between the old and the new system. Thanks to the auth-to-local mechanism, the two Kerberos principals will be translated into the same username.

The process will be simple and almost transparent for end users. The only action they will have to take is to use a new keytab if they use one, or to abandon their cluster-specific password in favor of their usual password. This password is probably the same as the one used to log on to their workstation.

The platform administrator should update all Kerberos client files to define the new domain and add a translation rule to the auth-to-local property to accommodate it.
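The effect of the added translation rule can be illustrated with a small Python sketch. It is a simplified model, not Hadoop's rule engine. The realm names CLUSTER01_COMPANY and CORP_COMPANY, the user alice and the to_username function are all hypothetical. During the migration, principals from both the old cluster realm and the new company realm resolve to the same local username, so ownership and permissions on the cluster are unchanged.

```python
from typing import Optional, Set

# Hypothetical realms: the cluster-local realm being retired and the
# company-wide realm that replaces it.
ACCEPTED_REALMS = {"CLUSTER01_COMPANY", "CORP_COMPANY"}


def to_username(principal: str, accepted_realms: Set[str]) -> Optional[str]:
    """Strip the realm of a short-form principal from an accepted realm."""
    name, _, realm = principal.rpartition("@")
    if not name or realm not in accepted_realms:
        return None
    if "/" in name:
        # Host-bound service principals are out of scope for this mapping.
        return None
    return name


before = to_username("alice@CLUSTER01_COMPANY", ACCEPTED_REALMS)
after = to_username("alice@CORP_COMPANY", ACCEPTED_REALMS)
```

Both calls return the same username, which is why keeping usernames identical between the old and the new system is the only hard requirement.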

Conclusion

As usual, it is recommended to start simple and start small. To keep the evolution of the platform sustainable, the architecture can grow more complex over time by introducing new services and by integrating the platform with services that already exist in the company.
