Data governance is a set of procedures that ensures important data is formally managed throughout the company.

It builds trust in the datasets and establishes who is accountable when data quality is low. This matters particularly in a Big Data platform fully integrated into the company, where multiple datasets, multiple processing jobs and multiple users coexist.

Organization and responsibilities

The right organization eases communication and understanding between teams while promoting an agile, data-centric culture. A single point of accountability is a major principle of effective project governance, and it establishes new responsibilities (Data Council, Data Steward…).


The company is responsible for defining a set of naming rules to ensure the integrity and coherence of the system. The purpose is to guarantee that business and technical users understand names, while enforcing coherent conventions and structures. A name must: be meaningful, be comprehensible without external explanation, reflect the intended usage of the resource, differ from other names as much as possible, use full words when possible, use the same abbreviation consistently, and be singular.
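Rules of this kind are easier to enforce when they are checked automatically. The following is a minimal sketch, assuming a lowercase snake_case convention and a small table of discouraged abbreviations; both are hypothetical choices made for illustration, not rules stated above, and the singular check is deliberately naive.

```python
from __future__ import annotations

import re

# Assumed convention: lowercase words separated by underscores.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")

# Hypothetical mapping from discouraged variants to the one agreed spelling.
CANONICAL_TERMS = {
    "cust": "customer",
    "clt": "customer",
    "trx": "transaction",
    "txn": "transaction",
}


def check_name(name: str) -> list[str]:
    """Return the list of naming-rule violations for a candidate name."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("use lowercase snake_case")
    words = name.split("_")
    for word in words:
        if word in CANONICAL_TERMS:
            problems.append(f"use '{CANONICAL_TERMS[word]}' instead of '{word}'")
    # Naive singular check: flags a trailing 's' (but not 'ss', as in 'address').
    # A real rule would need a word list or a linguistic library.
    last = words[-1]
    if last.endswith("s") and not last.endswith("ss"):
        problems.append("use a singular name")
    return problems
```

An empty list means the name passes every check, so the function can be called from a review script or a table-creation hook before a resource is accepted.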


Each component of the cluster offers, by nature, its own rules for access control. The ACLs of a filesystem are different from those of a relational database or a stream processing engine. The purpose of Apache Ranger and Apache Sentry is to centralize the management of permissions for every component in a single interface. From the web interface, fine-grained permissions can be declared and deployed across the cluster: files and directories in HDFS; databases, tables and columns in Hive; topics in Kafka. HDFS, YARN, Hive, HBase, Knox, Storm, Solr and Kafka are some of the supported components. In a multi-tenant cluster, Apache Ranger makes it possible to delegate the management of permissions on selected data sources to selected users. It gives control and flexibility to the teams, who are then able to enforce appropriate governance.
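To make the idea of a fine-grained, column-level permission concrete, here is a sketch that builds a policy document as a plain Python dictionary. The field names are modelled on the JSON shape of Ranger policies as commonly documented, but they are an assumption here and should be verified against the Ranger version in use; the function name, service name, users and columns are hypothetical.

```python
import json


def hive_select_policy(service, name, database, table, columns, users):
    """Build a policy granting SELECT on specific Hive columns.

    The structure below is an assumed approximation of a Ranger policy
    document; nothing is sent to a server.
    """
    return {
        "service": service,
        "name": name,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": list(columns)},
        },
        "policyItems": [
            {
                "users": list(users),
                "accesses": [{"type": "select", "isAllowed": True}],
            }
        ],
    }


# Hypothetical example: analysts may read two columns of one table.
policy = hive_select_policy(
    service="cluster_hive",
    name="analyst_read_customer",
    database="sales",
    table="customer",
    columns=["id", "country"],
    users=["alice", "bob"],
)
print(json.dumps(policy, indent=2))
```

Keeping the policy as data means it can be versioned and reviewed like code before an administrator, or a delegated team, applies it.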

Resource allocation

Inside a multi-tenant environment, YARN is responsible for ensuring that allocated resources are available to its users and groups of users. The resources traditionally managed by YARN are memory and CPU. More recent versions of YARN extend this management to network and disk. Based on its owner, each process is assigned to a scheduling queue with a dedicated share of cluster resources. YARN enforces the availability of the resources allocated to each user.
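As an illustration of dedicated queue shares, the sketch below generates queue properties for YARN's Capacity Scheduler and checks that the shares add up to the whole cluster. It assumes the Capacity Scheduler is the scheduler in use (YARN also ships a Fair Scheduler), and the queue names and percentages are hypothetical.

```python
def capacity_scheduler_properties(queues):
    """Turn {queue_name: percent} into Capacity Scheduler properties.

    Percentages are integers and must sum to 100 so that the whole
    cluster is allocated across the top-level queues.
    """
    total = sum(queues.values())
    if total != 100:
        raise ValueError(f"queue capacities sum to {total}, expected 100")
    props = {"yarn.scheduler.capacity.root.queues": ",".join(queues)}
    for queue, percent in queues.items():
        props[f"yarn.scheduler.capacity.root.{queue}.capacity"] = str(percent)
    return props


# Hypothetical tenants sharing one cluster.
props = capacity_scheduler_properties({"etl": 50, "analytics": 30, "adhoc": 20})
for key, value in props.items():
    print(f"{key}={value}")
```

Validating the total before deployment catches a common configuration mistake early, since the scheduler expects sibling queues to account for all of their parent's capacity.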

Metadata, Data Lineage

The use of tags enables the traceability of data across its lifecycle: collection, qualification, enrichment, consumption. This process tells where the data comes from, where it has passed through, which people or applications accessed it, and how it was altered. Having all this information systematically collected makes it possible to classify data, capture user and application behavior, follow and analyze data-related actions, and ensure that usage complies with the security policies in place. These capabilities are integrated into components such as HDFS, Hive, Sqoop, Falcon and Storm, with new components being integrated regularly. For greater flexibility, metadata can be sent manually through an HTTP REST API.
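The way tags follow data through its lifecycle can be pictured with a small in-memory model. This is only a sketch of the concept, not the data model of any particular metadata tool; the class, function and dataset names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    tags: set = field(default_factory=set)
    parents: list = field(default_factory=list)


def derive(name, *parents, extra_tags=()):
    """Create a dataset that inherits the tags of all its parents."""
    tags = set(extra_tags)
    for parent in parents:
        tags |= parent.tags
    return Dataset(name, tags, list(parents))


def upstream(dataset):
    """Return the names of all upstream datasets, nearest first."""
    seen = []
    queue = list(dataset.parents)
    while queue:
        current = queue.pop(0)
        if current.name not in seen:
            seen.append(current.name)
            queue.extend(current.parents)
    return seen


# Collection -> qualification -> consumption, with a sensitive-data tag.
raw = Dataset("raw_customer", {"pii"})
qualified = derive("qualified_customer", raw)
report = derive("customer_report", qualified, extra_tags={"finance"})
```

Because the `pii` tag is inherited at each step, a policy written against the tag keeps applying to the report even though the report was never tagged by hand.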

Authentication and Identification

The integration of Kerberos inside a Hadoop cluster guarantees identity for all external and internal communications. The integration with the company's LDAP server or Active Directory simplifies the management of users and groups and of the ownership rules between them.

Data Quality

Data qualification is the responsibility of the development teams. A single point of contact must be identified to be accountable and to assume responsibility. It is crucial to establish a clear chain of responsibility in which roles are not shared. Teams can rely on an existing toolset to validate and apply the relevant schema to each and every record. Moreover, the core components must protect data against corruption at rest and in motion. For example, HDFS uses checksums to detect corrupted blocks of data, and Kafka provides ordering and delivery guarantees that can be adjusted to the nature of the data and the expected performance.
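Both ideas, per-record schema validation and checksum-based corruption detection, can be sketched in a few lines. The schema and field names are hypothetical, and CRC32 from the Python standard library merely stands in for whatever checksum the storage layer actually uses.

```python
import zlib

# Hypothetical schema: field name -> expected Python type.
SCHEMA = {"id": int, "email": str, "amount": float}


def validate(record, schema=SCHEMA):
    """Return a list of schema violations for one record."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"{name}: expected {expected.__name__}")
    return errors


def checksum(payload: bytes) -> int:
    """Compute a CRC32 checksum to store alongside the payload."""
    return zlib.crc32(payload)


def is_intact(payload: bytes, expected: int) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return zlib.crc32(payload) == expected
```

A checksum only detects corruption; recovery still depends on having another copy of the data, such as a replica.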

Data Lifecycle

Information Lifecycle Management (ILM) encompasses the whole collection and processing chain. Its purpose is to plan the processing of data across one or several clusters, and to store and archive data while securing it and enforcing retention periods. In the Hadoop stack, Apache Falcon fills this role.
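A retention rule boils down to a decision based on the age of the data. The sketch below shows that decision in isolation; the thresholds and the three action names are hypothetical and do not reflect how Apache Falcon expresses its policies.

```python
from datetime import date, timedelta


def lifecycle_action(created, today, archive_after_days, purge_after_days):
    """Decide what to do with a dataset given its age in days."""
    age = (today - created).days
    if age >= purge_after_days:
        return "purge"
    if age >= archive_after_days:
        return "archive"
    return "keep"


# Hypothetical policy: archive after 90 days, purge after one year.
created = date(2024, 1, 1)
print(lifecycle_action(created, created + timedelta(days=10), 90, 365))
print(lifecycle_action(created, created + timedelta(days=120), 90, 365))
print(lifecycle_action(created, created + timedelta(days=400), 90, 365))
```

Checking the purge threshold first guarantees that data past its retention period is never merely archived.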

Governance roles

Compliance Officer

Track, understand and protect access to sensitive data.

Am I prepared for an audit?
Who’s accessing what data?
What are they doing with the data?
Is sensitive data governed and protected?

Data Steward & Curator

Manage and organize data assets at Hadoop scale.

How do I efficiently manage the data lifecycle, from ingest to purge?
How do I classify data efficiently?
How do I make data available to my end users efficiently?
Is sensitive data governed and protected?

Data Scientists & BI Users

Effortlessly find and trust the data that matters most.

How can I explore data on my own?
Can I trust what I find?
How do I use what I find?
How do I find and use related data sets?

Hadoop Admin & DBAs

Boost user productivity and cluster performance.

How is data being used today?
How can I optimize for future workloads?
How can I quickly take advantage of Hadoop risk-free?

Effective communication is key to success.

Ideas come from many places. Make sure your team is talking to the rest of the organization.

The effort invested must balance methodological and technical excellence with practicality and usability.

The selected projects must leverage collective wisdom through discussion and decision making.