Data Governance

Data governance is a set of procedures ensuring that important data is formally managed throughout the company.

It builds trust in the datasets and makes users accountable when data quality is low. This is particularly important on a Big Data platform fully integrated into the company, where multiple datasets, multiple processing pipelines and multiple users coexist.

Governance foundation

Organization, responsibilities

An appropriate organization of people eases communication and understanding between teams while promoting an agile, data-centric culture. A single point of accountability is a key principle of effective project governance, and it introduces new responsibilities (Data Council, Data Steward…).

Authorization, ACL


Each component of the cluster inherently has its own access control mechanisms. Fine-grained access rules on a file system are not managed the same way as those of a relational database. These rules can be based on roles (RBAC), on tags, or even on the geolocation of an IP address.
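As an illustration of role-based rules, here is a minimal sketch in Python. It is not the API of any specific component: the `Rule` class, the `is_allowed` function and the sample paths and roles are all hypothetical, and resources are assumed to be matched by simple path prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    role: str           # role the rule applies to (hypothetical names)
    resource: str       # resource prefix, for example a path
    actions: frozenset  # actions granted on that resource


def is_allowed(user_roles, rules, resource, action):
    """Return True if any role held by the user grants the action on the resource."""
    return any(
        rule.role in user_roles
        and resource.startswith(rule.resource)
        and action in rule.actions
        for rule in rules
    )


# Sample rules, assumed for the example only
RULES = [
    Rule("analyst", "/data/sales", frozenset({"read"})),
    Rule("engineer", "/data", frozenset({"read", "write"})),
]
```

With these rules, an analyst can read under `/data/sales` but cannot write there, while an engineer can read and write anywhere under `/data`. A tag-based or geolocation-based policy would replace the prefix test with a different predicate.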

Authentication, Identification

Identity management covers users and their attributes, their group membership and the management rules applied to them. It is shared across the company by integrating the target platform with the company's LDAP server or Active Directory.



The company is responsible for defining a set of naming rules that ensure the integrity and coherence of the system. The purpose is to guarantee that business and technical users understand names, while enforcing consistent conventions and structures. A name must: be meaningful, be understandable without external explanation, reflect the intended usage of the resource, differ from other names as much as possible, use full words when possible, always use the same abbreviation for a given term, and be singular.
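Part of such rules can be checked automatically. The sketch below assumes a hypothetical convention (lowercase words separated by underscores) and a hypothetical table of discouraged abbreviations; a real convention would be defined by the company, and rules such as "be meaningful" still need human review.

```python
import re

# Hypothetical convention: lowercase words separated by single underscores
NAME_PATTERN = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")

# Hypothetical table mapping discouraged spellings to the agreed form
PREFERRED = {"cust": "customer", "cstmr": "customer", "trx": "transaction"}


def check_name(name):
    """Return a list of problems found in a resource name (empty if none)."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("must be lowercase words separated by underscores")
    for word in name.lower().split("_"):
        if word in PREFERRED:
            problems.append(f"use '{PREFERRED[word]}' instead of '{word}'")
    return problems
```

For example, `check_name("customer_order")` returns an empty list, while `check_name("cust_order")` reports that `customer` should be used instead of `cust`.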

Metadata, Data Lineage

Tags make data traceable across its lifecycle: collection, qualification, enrichment, consumption. This process records where the data comes from, where it has been, which people or applications accessed it and how it was altered. Collecting this information systematically makes it possible to classify data, capture user and application behavior, track and analyze data-related actions, and ensure that usage complies with the security policies in place.
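To make the idea concrete, here is a minimal in-memory sketch of lineage recording. The `LineageEvent` and `LineageLog` classes and the sample values are hypothetical; a production platform would rely on a dedicated metadata service rather than a Python list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Lifecycle stages named in the text
STAGES = ("collect", "qualification", "enrichment", "consumption")


@dataclass
class LineageEvent:
    dataset: str
    stage: str
    actor: str   # person or application touching the data
    action: str  # free-text description of what was done
    tags: tuple = ()
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class LineageLog:
    def __init__(self):
        self._events = []

    def record(self, dataset, stage, actor, action, tags=()):
        """Append one event; reject stages outside the known lifecycle."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        event = LineageEvent(dataset, stage, actor, action, tuple(tags))
        self._events.append(event)
        return event

    def history(self, dataset):
        """Events for a dataset, in the order they were recorded."""
        return [e for e in self._events if e.dataset == dataset]

    def actors(self, dataset):
        """Sorted list of distinct actors that touched the dataset."""
        return sorted({e.actor for e in self.history(dataset)})
```

Calling `history("orders")` answers "where did this data go through", and `actors("orders")` answers "who accessed it".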

Data Quality

Data qualification is the responsibility of the development teams. A single point of contact must be identified to be accountable and to assume responsibility. It is crucial to establish a clear chain of responsibility in which roles are not shared. Teams can rely on an existing toolset to validate and apply the relevant schema to each and every record. Moreover, the core components must protect data against corruption, both at rest and in motion.
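Per-record schema validation can be sketched as follows. The schema and field names are hypothetical, and the check is deliberately simple (presence and Python type only); real toolsets usually offer richer schema languages.

```python
# Hypothetical schema: field name -> expected Python type
SCHEMA = {"id": int, "email": str, "amount": float}


def validate_record(record, schema=SCHEMA):
    """Return a list of schema violations for one record (empty if valid)."""
    errors = []
    for name, expected in schema.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            actual = type(record[name]).__name__
            errors.append(f"{name}: expected {expected.__name__}, got {actual}")
    for name in record:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors
```

A record that returns an empty list conforms to the schema; any other record can be rejected or routed to a quarantine area, depending on the team's policy.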

Resource allocation

In a multi-tenant environment, YARN is responsible for ensuring that allocated resources are available to users and groups of users. The resources traditionally managed by YARN are memory and CPU. More recent versions of YARN extend this management to network and disks. Based on the user who owns it, each process is associated with a scheduling queue that holds a dedicated share of cluster resources, which lets YARN enforce the availability of the resources allocated to each user.
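The arithmetic behind dedicated queue shares can be illustrated with a small sketch. This is not YARN code or configuration: the `allocate` function, the queue names and the cluster sizes are hypothetical, and capacities are simplified to integer percentages that must sum to 100.

```python
def allocate(cluster_memory_mb, cluster_vcores, queue_capacities):
    """Split cluster memory and vcores between queues by percentage share."""
    total = sum(queue_capacities.values())
    if total != 100:
        raise ValueError(f"queue capacities must sum to 100, got {total}")
    return {
        queue: {
            "memory_mb": cluster_memory_mb * pct // 100,
            "vcores": cluster_vcores * pct // 100,
        }
        for queue, pct in queue_capacities.items()
    }
```

With a 100,000 MB, 200 vcore cluster and queues `etl` at 60% and `adhoc` at 40%, the `etl` queue is guaranteed 60,000 MB and 120 vcores, and `adhoc` the remainder.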

Data Lifecycle

Information Lifecycle Management (ILM) encompasses the whole collection and processing chain. Its purpose is to plan the processing of data across one or several clusters, and to store and archive data while securing it and honoring retention periods.
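A retention rule can be expressed as a simple function of a dataset's age. The thresholds below (archive after 90 days, delete after 365 days) and the function name are assumptions made for the example, not recommended values.

```python
from datetime import date


def lifecycle_action(created, today, archive_after_days=90, retention_days=365):
    """Decide what to do with a dataset given its age in days."""
    age = (today - created).days
    if age >= retention_days:
        return "delete"
    if age >= archive_after_days:
        return "archive"
    return "keep"
```

A scheduler could run this decision over every dataset and hand the result to the storage layer, which performs the actual archiving or deletion.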

Articles related to governance

Policy enforcing with Open Policy Agent

Categories: Cyber Security, Data Governance | Tags: Kafka, Ranger, Authorization, REST, Cloud, Kubernetes, SSL/TLS

Open Policy Agent is an open-source multi-purpose policy engine. Its main goal is to unify policy enforcement across the cloud native stack. The project was created by Styra and it is currently…



Jan 22, 2020

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…


By David WORMS

Oct 8, 2019

Users and RBAC authorizations in Kubernetes

Categories: Containers Orchestration, Data Governance | Tags: Authentication, Authorization, Cyber Security, RBAC, Kubernetes, SSL/TLS

Having your Kubernetes cluster up and running is just the start of your journey and you now need to operate. To secure its access, user identities must be declared along with authentication and…

By Robert Walid SOARES

Aug 7, 2019

Self-sovereign identities with verifiable claims

Categories: Data Governance | Tags: Authentication, Blockchain, Ledger, Cloud, IAM

Towards a trusted, personal, persistent, and portable digital identity for all. Digital identity issues Self-sovereign identities are an attempt to solve a couple of issues. The first is the…



Jan 23, 2019

Managing User Identities on Big Data Clusters

Categories: Cyber Security, Data Governance | Tags: Ansible, FreeIPA, Kerberos, LDAP, Active Directory, IAM

Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to…


By David WORMS

Nov 8, 2018

Managing authorizations with Apache Sentry

Categories: Data Governance | Tags: Ansible, Hue, Database, LDAP, Nikita, Sentry, CDH, Deployment

Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster. With this article, we will show you how we are using Apache Sentry at…



Jul 24, 2017

About the new BSD license and its difference with other BSD licenses

Categories: Data Governance | Tags: License, Open source

As a non-restrictive Open Source license, the “new BSD license” is commonly used across the Node.js community. However, this is only one of the BSD licenses available, along with the original “BSD…


By David WORMS

Aug 8, 2013

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to turn their use cases into production projects, reduce their costs and shorten their time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.