Data platform requirements and expectations
By David WORMS
Mar 23, 2023
A big data platform is a complex and sophisticated system that enables organizations to store, process, and analyze large volumes of data from a variety of sources.
It is composed of several components that work together in a secured and governed platform. As such, a big data platform must meet a variety of requirements to ensure that it can handle the diverse and evolving needs of the organization.
Note that, given the breadth of the domain, it is not feasible to provide an exhaustive list of requirements. We invite you to contact us to share additional enhancements.
This area includes the ingestion of data from various sources, their treatment, and their storage in a suitable format.
Ability to consume data from various sources including databases, file systems, APIs, and data streams.
Ability to consume data in both batch and streaming.
Support for reading and writing file formats and table formats such as JSON, CSV, XML, Avro, Parquet, Delta Lake and Iceberg.
Definition of the quality requirements for the data, such as completeness, accuracy, and consistency, and assurance that the ingestion pipeline can validate and cleanse the data as needed.
Data transformation
Determine whether the data needs to be transformed or enriched before it can be stored or analyzed.
Ensure that the ingestion pipeline can handle failures or outages of the data sources or the ingestion pipeline itself, and can recover and resume ingestion without data loss.
Provide solutions capable of addressing expected volume and throughput variations.
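To illustrate the requirement above about recovering and resuming ingestion without data loss, here is a minimal Python sketch of checkpoint-based resumption. The names `ingest`, `sink`, and `checkpoint_path` are hypothetical and not tied to any specific tool. The sketch assumes that records are available as an in-memory list and that transient failures surface as `ConnectionError`.

```python
import json
import tempfile
from pathlib import Path


def ingest(records, sink, checkpoint_path, max_attempts=3):
    """Send records to `sink`, resuming after the last committed offset.

    Returns the number of records written during this run.
    """
    path = Path(checkpoint_path)
    # Resume after the last committed offset, or start from the beginning.
    offset = json.loads(path.read_text())["offset"] if path.exists() else 0
    for index in range(offset, len(records)):
        for attempt in range(1, max_attempts + 1):
            try:
                sink(records[index])
                break
            except ConnectionError:
                if attempt == max_attempts:
                    # Give up; the checkpoint still marks the last success.
                    raise
        # Commit progress only once the record has been written.
        path.write_text(json.dumps({"offset": index + 1}))
    return len(records) - offset
```

Because the checkpoint is written after the record, a crash between the two steps replays one record on restart. This gives at-least-once delivery, so the sink should be idempotent if duplicates are not acceptable.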
This area includes the storage, the management, and the retrieval of large volumes of data.
The ability to access the data reliably and with minimal downtime, ensuring high availability of the data.
The ability to ensure data is not lost due to hardware failures or other errors, with data replication and backup strategies in place.
The ability to store and retrieve data quickly and efficiently, with low latency and high throughput.
Storage and management of growing volumes of data, with the ability to scale up and down as needed by acquiring and releasing additional resources.
Data lifecycle management, including applying changes, adding missing data, and reverting to a previous version.
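The lifecycle requirement above (apply changes, add missing data, revert to a previous version) can be pictured with a small in-memory sketch. The `VersionedTable` class is hypothetical and only illustrates the idea of immutable snapshots. Table formats such as Delta Lake and Iceberg provide comparable capabilities at scale, with very different internals.

```python
class VersionedTable:
    """Keeps one immutable snapshot per write so that any version can be read."""

    def __init__(self):
        self._snapshots = [{}]  # version 0 is the empty table

    @property
    def version(self):
        return len(self._snapshots) - 1

    def upsert(self, rows):
        """Apply changes or add missing rows; `rows` maps a key to a record."""
        snapshot = dict(self._snapshots[-1])
        snapshot.update(rows)
        self._snapshots.append(snapshot)
        return self.version

    def read(self, version=None):
        """Read the latest snapshot, or the one at the given version."""
        index = self.version if version is None else version
        return dict(self._snapshots[index])

    def revert(self, version):
        """Restore an older snapshot as a new version; history is preserved."""
        self._snapshots.append(dict(self._snapshots[version]))
        return self.version
```

Reverting appends a new version instead of deleting history, so the revert itself can be undone.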
This area includes the processes for preparing and exposing the data for further analysis.
Ability to support multiple data types and formats and ability to integrate with various distributed data processing and analysis tools.
Cleanse the data to remove or correct errors, inconsistencies, and missing values.
Combine and integrate multiple data sources into a single dataset, resolving any schema or format differences.
Transform the data to prepare it for downstream processing or analysis, such as aggregating, filtering, sorting, or pivoting.
Enhance the data with additional information to provide more context and insights.
Reduce the volume of data by summarizing or sampling it, while preserving the essential characteristics and insights.
Data normalization and denormalization
Normalize the data to remove redundancies and inconsistencies so that it is stored in a consistent format, and denormalize it where needed to improve performance.
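As a small illustration of normalization and denormalization, the following Python sketch splits flat order rows into two tables and joins them back. The field names (`order_id`, `customer_id`, `customer_name`, `amount`) are assumptions made for the example.

```python
def normalize(rows):
    """Split flat order rows into a customers table and an orders table."""
    customers, orders = {}, []
    for row in rows:
        # Each customer name is stored once, whatever the number of orders.
        customers[row["customer_id"]] = {"name": row["customer_name"]}
        orders.append(
            {
                "order_id": row["order_id"],
                "customer_id": row["customer_id"],
                "amount": row["amount"],
            }
        )
    return customers, orders


def denormalize(customers, orders):
    """Join orders with customers into flat rows, avoiding a join at query time."""
    return [
        {**order, "customer_name": customers[order["customer_id"]]["name"]}
        for order in orders
    ]
```

The normalized form avoids repeating the customer name on every order, while the denormalized form trades that redundancy for simpler and faster reads.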
This area is the practice of monitoring and managing the quality, integrity, and performance of data as it flows through the platform.
Ensuring that the data is valid, accurate, and consistent, and meets the expected format and schema.
Tracking the path of data as it flows through the system to identify any issues or anomalies.
Data quality monitoring
Continuously monitoring the quality of data and raising alerts when anomalies or errors are detected.
Monitoring the performance of the system, including latency, throughput, and resource utilization, to ensure that the system is performing optimally.
Managing the metadata associated with the data, including data schema, data dictionaries, and data catalog, to ensure that it is accurate and up-to-date.
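The validation and quality monitoring requirements above can be sketched as a schema check combined with an error-rate alert. This is a minimal Python sketch under stated assumptions: `EXPECTED_SCHEMA`, `validate`, `quality_report`, and the 10% threshold are hypothetical and chosen only for illustration.

```python
# Hypothetical schema: field name -> expected Python type.
EXPECTED_SCHEMA = {"id": int, "email": str, "amount": float}


def validate(record, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in one record (empty when valid)."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record or record[field] is None:
            errors.append(f"{field}: missing")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def quality_report(records, schema=EXPECTED_SCHEMA, max_error_rate=0.1):
    """Summarize a batch and flag it when too many records are invalid."""
    invalid = [record for record in records if validate(record, schema)]
    error_rate = len(invalid) / len(records) if records else 0.0
    return {
        "total": len(records),
        "invalid": len(invalid),
        "error_rate": error_rate,
        "alert": error_rate > max_error_rate,
    }
```

In practice the report would be emitted as a metric for each batch, so that an alerting system can react when the `alert` flag is raised.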
This area includes the requirements to access, transfer, analyze and visualize the data to extract insights and actionable information.
CLI environments and graphical interfaces available to users for data processing and visualization.
Provision of data access via REST, RPC and JDBC/ODBC communication protocols.
Perform exploratory data analysis to understand data characteristics and quality, extract patterns, relationships, or insights from the data, using statistical or machine learning algorithms.
Ensure that the data is secure and protected from unauthorized access or breaches, by implementing appropriate security controls and protocols.
Visualize the data to communicate insights and findings to stakeholders, using charts, graphs, or other visualizations.
This area covers the security and the management of a big data platform.
Data regulation and compliance
The ability to ensure compliance with data governance policies and regulations, such as data privacy laws, data usage practices, data retention policies, and data access controls.
Fine-grained access control
Ability to control access and data sharing on all proposed services with management policies taking into account the characteristics and specificities of each.
Data filtering and masking
Filtering of data by row and by column, application of masks on sensitive data.
Encryption of data at rest, and in transit with SSL/TLS.
Integration into the information system
Integration of users and user groups with the corporate directory.
Isolation of the platform in the network and centralization of access through a single entry point.
Provision of a graphical interface for the configuration and monitoring of services, the management of data access controls and the governance of the platform.
Monitoring and alerts
Exposing metrics and alerts that monitor and ensure the health and performance of the various services and applications.
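To make the data filtering and masking requirement above more concrete, here is a minimal Python sketch that filters rows, restricts columns, and masks a sensitive field. The function names, the policy shape, and the masking rule are all assumptions made for the example; real platforms usually enforce such policies in the query engine or in a dedicated authorization service.

```python
def mask_email(value):
    """Keep the first character and the domain, hide the rest."""
    local, _, domain = value.partition("@")
    return local[:1] + "***@" + domain


def apply_policy(rows, allowed_columns, row_filter, masks):
    """Return only the rows, columns, and masked values a user may see."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row-level filtering
        # Column-level filtering: keep only the allowed columns.
        visible = {column: row[column] for column in allowed_columns if column in row}
        # Masking of sensitive values among the visible columns.
        for column, mask in masks.items():
            if column in visible:
                visible[column] = mask(visible[column])
        result.append(visible)
    return result
```

For example, a policy restricted to French customers that exposes only `name` and a masked `email` would be expressed as `apply_policy(rows, ["name", "email"], lambda row: row["country"] == "FR", {"email": mask_email})`.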
This area covers the acquisition of new resources as well as the maintenance requirements.
Selection between a cloud or an on-premise infrastructure, taking into account that cloud offers flexible and scalable storage and processing of large datasets with cost efficiencies, while on-premise deployment provides greater control, security and compliance over data but requires significant upfront investment and ongoing maintenance costs.
Dissociation between resources dedicated to storage and processing and, in some circumstances, collocation of processing and data.
Provision of a storage infrastructure in line with the volumes expressed.
Provision of a computing infrastructure capable of evolving with future usages brought by projects and users in the fields of data engineering, data analysis and data science.
The ability to store and manage data cost-effectively, with consideration of the cost of storage and the cost of managing and operating the storage solution.
Cost management and total cost of ownership (TCO)
Control and calculation of the total cost of the solution taking into account all the factors and specificities of the platform such as infrastructure, staff, acquisition of licenses, deadlines, use, team turnover, technical debt, …
Support for platform users with the aim of ensuring the acquisition of new skills for the teams, the validation of the architecture choices, the deployment of patches and features, and the proper use of the available resources.
Overall, a big data platform must be able to handle the diverse and evolving needs of the organization, while ensuring that the solution is highly flexible, resilient, and performant, that data is secure, compliant, and of high quality, that insights and findings are communicated effectively across the various stakeholders, and that it remains cost-effective to operate over time.