Metadata, Data Lineage
Tags make data traceable across its lifecycle: collection, qualification, enrichment, and consumption. Lineage records where the data comes from, where it has passed through, which people or applications accessed it, and how it was altered. Collecting this information systematically makes it possible to classify data, capture user and application behavior, track and analyze data-related actions, and ensure that usage complies with the security policies in place. These features are integrated into components such as HDFS, Hive, Sqoop, Falcon, and Storm, with new components being integrated regularly. For greater flexibility, metadata can also be sent manually through an HTTP REST API.
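As an illustration of sending metadata manually over HTTP, here is a minimal Python sketch that builds a lineage event and prepares the corresponding POST request. The endpoint URL, the field names (`dataset`, `source`, `stage`, `actor`, `tags`), and the helper functions are all hypothetical; the actual payload format depends on the metadata service deployed on the cluster.

```python
import json
import urllib.request

# Hypothetical endpoint: the real URL depends on the metadata service in use.
METADATA_URL = "http://metadata.example.com/api/lineage"


def build_lineage_event(dataset, source, stage, actor, tags):
    """Describe one step of a dataset's lifecycle as a plain dictionary.

    The field names are assumptions chosen for this example.
    """
    return {
        "dataset": dataset,   # which data is concerned
        "source": source,     # where it comes from
        "stage": stage,       # collection, qualification, enrichment, consumption
        "actor": actor,       # person or application acting on the data
        "tags": sorted(tags), # classification tags, sorted for stable output
    }


def build_request(event, url=METADATA_URL):
    """Prepare (but do not send) the HTTP POST carrying the event as JSON."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


event = build_lineage_event(
    dataset="sales_raw",
    source="crm_export",
    stage="collection",
    actor="ingest-service",
    tags={"pii", "finance"},
)
request = build_request(event)
# Sending is left to the caller, for example: urllib.request.urlopen(request)
```

The request is only constructed here, so the sketch can be run without a live service; authentication, which a secured cluster would require, is deliberately left out.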
Authentication and Identification
Integrating Kerberos into a Hadoop cluster guarantees the identity of every party in all external and internal communications. Integration with the company's LDAP server or Active Directory simplifies the management of users, groups, and the ownership rules between them.
Data qualification is the responsibility of the development teams. A single point of contact must be identified to be accountable and to take on these responsibilities. It is crucial to establish a clear chain of responsibility in which roles are not shared. Teams can rely on an existing toolset to validate and apply the relevant schema to each and every record. Moreover, the core components must protect against potential corruption of data at rest and in motion. For example, HDFS uses checksums to detect corruption in data blocks, and Kafka provides guarantees on the ordering and delivery of data that depend on the nature of the data and the expected performance.
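The idea behind checksum-based protection can be sketched in a few lines of Python. This is not HDFS code: it uses the standard library's plain CRC32 as a stand-in, and the chunk size and function names are assumptions made for the example. A checksum is stored for each fixed-size chunk when the data is written, then recomputed and compared when the data is read.

```python
import zlib

CHUNK_SIZE = 512  # bytes covered by each checksum (illustrative value)


def chunk_checksums(data, chunk_size=CHUNK_SIZE):
    """Return one CRC32 checksum per fixed-size chunk of data."""
    return [
        zlib.crc32(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def find_corrupt_chunks(data, expected, chunk_size=CHUNK_SIZE):
    """Return the indexes of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    if len(actual) != len(expected):
        raise ValueError("data length does not match the stored checksums")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


# Write path: compute and keep the checksums next to the data.
block = bytes(range(256)) * 8           # 2048 bytes, i.e. 4 chunks
stored = chunk_checksums(block)

# Read path: one byte was flipped somewhere in the second chunk.
damaged = bytearray(block)
damaged[700] ^= 0xFF
corrupt = find_corrupt_chunks(bytes(damaged), stored)
print(corrupt)  # [1]
```

Checksums only detect the damage. Recovering from it relies on having another healthy copy of the affected data, which is what replication provides.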
Information Lifecycle Management (ILM) encompasses the entire collection and processing chain. Its purpose is to plan the processing of data across one or several clusters, and to store and archive data while securing it and enforcing retention periods. In the Hadoop stack, Apache Falcon fills this role.
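To make the notion of retention concrete, the following Python sketch decides what should happen to a dataset based on its age. It is not Falcon's API; the function name, the two thresholds, and the three possible actions are hypothetical and only illustrate the kind of rule an ILM tool applies.

```python
from datetime import date


def lifecycle_action(created, today, archive_after_days, retention_days):
    """Return "keep", "archive", or "delete" for a dataset of a given age.

    Assumes archive_after_days <= retention_days.
    """
    age_days = (today - created).days
    if age_days >= retention_days:
        return "delete"    # retention period is over
    if age_days >= archive_after_days:
        return "archive"   # still retained, but moved to cheaper storage
    return "keep"          # recent data stays on the main cluster


today = date(2024, 6, 30)
recent = lifecycle_action(date(2024, 6, 1), today, 30, 90)
aging = lifecycle_action(date(2024, 5, 1), today, 30, 90)
expired = lifecycle_action(date(2024, 1, 1), today, 30, 90)
print(recent, aging, expired)  # keep archive delete
```

In practice such rules are declared once per dataset and evaluated on a schedule, so that archiving and deletion do not depend on manual intervention.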