With one of our customers anticipating a move to the Cloud, and with the announcement of Cloudera CDP availability in mid-September during the Strata conference, it seems like the appropriate time to take a deep dive into the new Cloudera Big Data offering introduced after the merger of Cloudera and Hortonworks about a year ago.
CDP defines itself as the industry's first enterprise data cloud service. It delivers powerful self-service analytics across hybrid and multi-cloud environments while preserving sophisticated and granular security and governance policies.
According to Cloudera, an enterprise data platform is characterized by:
- multi-function analytics
- hybrid and multi cloud
- secured and governed
- open platform
It provides the ability to implement many types of use cases on multiple platforms with a consistent view of the data, the security and the governance controls. It democratizes data without putting compliance and regulatory requirements at risk.
The Cloudera Data Platform (CDP) consists of a control plane complemented by numerous services called experiences.
The control plane is managed by Cloudera’s Shared Data Experience (SDX), which is responsible for the security and governance capabilities in CDP. The Management Console is composed of the Workload Manager, the Replication Manager and a Data Catalog. Before deploying any cluster, an environment must be registered in the Management Console. Among other things, the environment defines the network isolation, declares the data sources stored in Amazon S3 or Azure ADLS Gen2, deploys a dedicated FreeIPA for user/group identities and password authentication, centralizes the Data Catalog and schema with the Hive Metastore, enforces the authorization policies with Ranger, sets up audit tracking with Ranger and governance with Atlas, and exposes the gateway with Knox.
Concerning user management and security, a FreeIPA server is automatically provisioned when an environment is created. It is responsible for synchronizing the users and making them available to CDP services, Kerberos service principal management, and more. External users and applications will access the various services through the Knox gateway using SAML authentication.
At the moment, the CDP service is only available in the Cloud on AWS. We will also cover Azure since some of our customers are either using it or considering it. It is also worth mentioning that GCP availability is set for the first quarter of 2020. Although CDP has been released for AWS, it is not yet open to the general public: the CDP web-based public console is accessible online but not open to us, the crowd.
Altus and Cloudbreak are the current alternatives to CDP for provisioning resources on the Cloud and deploying clusters. Our understanding is that both Altus and Cloudbreak are being deprecated. Note that the Data Hub service covered below seems to be an evolution of Cloudbreak. Additionally, the future of HDInsight, built upon the Hortonworks distribution, is uncertain after 2020.
The new platform comes with a lot of new names and acronyms. This is a non-exhaustive list of them.
- CDP - Cloudera Data Platform
CDP is an integrated data platform that is easy to deploy, manage, and use on cloud and bare metal environments, piloted from the Management Console and leveraging services such as the Data Catalog, the Replication Manager and the Workload Manager.
- SDX - Shared Data Experience
The security and governance capabilities in CDP.
- EDH - Enterprise Data Hub
Integrated suite of analytic engines ranging from stream and batch data processing to data warehousing, operational database, and machine learning.
- Cloudera Data Warehouse
An auto-scaling, highly concurrent and cost-effective analytics service based on ephemeral Kubernetes clusters with attached cloud storage.
- Data Lake
When you register an environment in CDP’s Management Console, a Data Lake is automatically deployed. The Data Lake runs in the virtual network of the environment and provides a security and governance layer for the environment’s workload resources, such as Data Hub clusters. When you start a workload cluster in the context of a CDP environment, the workload cluster is automatically “attached” to the security and governance infrastructure of the Data Lake. “Attaching” your workload resources to the Data Lake instance allows the attached cluster workloads to access data and run in the security context provided by the Data Lake. The following technologies provide capabilities for the Data Lake: Schema with Hive Metastore; Authorization Policies with Ranger; Audit Tracking with Ranger; Governance with Atlas; Gateway with Knox.
- CDP Identity Management
It includes the CDP user management system, FreeIPA, identity federation, and Knox authentication. Administrators can federate access to CDP by configuring an external identity provider. Users and groups, along with their passwords, are authenticated via SSO against a SAML-compliant identity provider (IdP, e.g., Okta or Keycloak).
- SMM - Streams Messaging Manager
An operations monitoring and management tool that provides end-to-end visibility in an enterprise Apache Kafka environment.
- SRM - Streams Replication Manager
An enterprise-grade replication solution that enables fault tolerant, scalable and robust cross-cluster Kafka topic replication.
- CFM - Cloudera Flow Management
A no-code data ingestion and management solution powered by Apache NiFi.
- CSP - Cloudera Stream Processing
It provides advanced messaging, stream processing and analytics capabilities powered by Apache Kafka as the core stream processing engine.
One of the first building blocks available in CDP is the Cloudera Data Warehouse. It creates self-service data warehouses for teams and business analysts. It makes it easy to provision a new data warehouse, to share specific subsets of the data with specific teams and departments, and to throw the warehouse away when it is not needed anymore. Data Warehouses can be provisioned with either Impala or Hive as the engine.
The goals of CDP Data Warehouses are to:
- Spend more time on creating value than on firefighting
- Prioritize between meeting SLAs and time to value
Cloud helps but comes with its own challenges:
- Ensure consistent governance
- Easily transition on-premises workloads to the Cloud
- Apps and users need to work on a consistent infrastructure or be forced to re-test and re-tool
- Loss of traceability: how to be sure we know where the data is, how it got there and what it represents
- Resource contention (or underuse) and noisy neighbors
- Business operations continuity while dealing with fluctuating and seasonal workloads, with queries running only at the end of the week/month/year which can become resource pain points and cause other workloads to fail to meet SLAs
- Move “noisy” workloads into the cloud, auto-scale for peak times and auto-suspend for lulls
- Specific ad-hoc reports can scale independently without affecting neighbors and can be suspended for cost control
- Adjust to changes of business objectives over time, infrastructure being in line with business reality
Advantages of the cloud:
- Remove resource constraints: infer and discover resources
- Speed up use-cases on-boarding: shared metadata across every platform
- Safe and secure with SDX common security and governance
- Answer demand spikes: warm pool of resources with proactive allocation
- Avoid data locked in a cluster with seamless and continuous data movement
- Cost under control by suspending unused resources automatically
- Answer resource demand and isolate users & growth in real time
The main motivations for going into the cloud are to get teams to focus on the right projects, to control infrastructure costs and to gain access to elastic compute and storage.
However, it creates new challenges. Capacity planning in a public cloud datacenter must be carefully assessed. Architecting and configuring a database to get high performance, concurrency, resource isolation, etc. in multi-tenancy environments is hard and implies the management of infrastructure costs. It is necessary to migrate data, metadata and security policies and to preserve data lineage.
If not done carefully, cloud migration can be risky and expensive and it can take a long time.
Two main stages:
Stage 1: laying the foundation
- Define the blueprint of a DW: defining required software components
- Set up replication mechanism for data and metadata
- Use blueprint and replication mechanism to provision the DW and replicate data securely
Stage 2: deploy apps and operationalize them
- Manage incremental data replication
- Maintain the Data Warehouse: software stack management, hardware capacity management, scale the system
- Onboard new apps and users, and set up monitoring: ensure that end users get the expected SLAs
The Data Hub provides near complete control of the data warehouse setup and migration experience. The DBA takes ownership of the cluster definition, the capacity planning, the data migration, the scaling, the configuration and the fine-grained tuning.
It is based on blueprint definitions similar to the ones present in Ambari and Cloudbreak. A blueprint defines the resources and the components before the cluster is deployed on virtual machines such as Amazon EC2.
The user starts by configuring an environment inside the Cloudera console. They can select an existing blueprint from a list of common templates, or create their own by duplicating an existing blueprint and customizing its content. Advanced customizations of the blueprint are supported, such as choosing a different Cloud image or modifying the number of instances per role.
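The "duplicate, then customize" workflow can be pictured with a small Python sketch. The blueprint layout, the field names (`image`, `host_groups`, `instance_count`) and the component names are hypothetical and chosen for illustration only; they are not the actual CDP blueprint schema.

```python
import copy

# Hypothetical blueprint layout, for illustration only (not the CDP schema).
BASE_BLUEPRINT = {
    "name": "data-warehouse-template",
    "image": "default-image",
    "host_groups": {
        "master": {"instance_count": 1, "components": ["hive-metastore"]},
        "worker": {"instance_count": 3, "components": ["impala-daemon"]},
    },
}


def customize_blueprint(base, name, image=None, instance_counts=None):
    """Return a customized deep copy of `base`; the template itself is not modified."""
    blueprint = copy.deepcopy(base)
    blueprint["name"] = name
    if image is not None:
        blueprint["image"] = image
    for role, count in (instance_counts or {}).items():
        if role not in blueprint["host_groups"]:
            raise KeyError(f"unknown role: {role}")
        if count < 1:
            raise ValueError(f"instance count for {role} must be at least 1")
        blueprint["host_groups"][role]["instance_count"] = count
    return blueprint


# Duplicate the template and double the number of worker instances.
custom = customize_blueprint(
    BASE_BLUEPRINT, "my-warehouse", instance_counts={"worker": 6}
)
```

Working on a deep copy mirrors the idea that the shared template stays available unchanged for other users.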
Once deployed, the administration of the cluster is done through Cloudera Manager.
The Data Warehouse experience offers less control than the Data Hub experience but it is a lot easier to use and operate.
It relies on on-demand Kubernetes clusters and the native cloud provider storage. The key benefit is how on-premises workloads can be isolated and scheduled for migration. It handles the capacity planning, the replication of data and metadata, and the provisioning of resources while keeping the costs under control. It is suited to bursty workloads, saturated workloads with degraded SLAs and workloads that impact the cluster stability.
The Workload Manager is an application performance manager for Impala and Hive workloads (the two possible engines at the moment for Data Warehouse clusters). It gives DBAs access to SQL statements with common properties, so that they can look at the performance of those queries, diagnose potential issues and fix them.
It can be used to connect to the on-premises datacenter, to classify the important SQL workloads and to migrate them to the public Cloud through the Cloudburst functionality. It automatically creates a Data Replication Plan as well as a Capacity and Provisioning Plan. From there, the Data Warehouse service creates a data warehouse (Impala or Hive) using Docker containers managed by Kubernetes on the public Cloud, and the Replication Manager service copies the data, metadata and security policies to S3 (for AWS) and RDS. Multiple settings are available. The DBA can turn the auto-scaling policies on and off. It is possible to define a timeout to auto-suspend the resources if the cluster has not received or handled queries for longer than a defined period (default 300s). It is possible to set the maximum cluster size to prevent overuse and to keep the cost per hour under a certain amount. The auto-scaling policy can be specified. Policies come in multiple flavors: an economy class experience and a business class experience. For example, HEADROOM sets a desired free capacity while WAIT TIME sets the query percentile and the desired wait time. Finally, the Hive configuration properties can be fully customized.
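To make the interaction between these settings concrete, here is a simplified Python model of the auto-suspend timeout, the maximum cluster size and a HEADROOM-style policy. It is a sketch under stated assumptions, not Cloudera's scaling algorithm: the class and function names are hypothetical, and expressing headroom as a number of spare nodes is an assumption made for the example. Only the 300-second default comes from the text.

```python
from dataclasses import dataclass


@dataclass
class ScalingSettings:
    auto_suspend_timeout_s: int = 300  # default idle period mentioned in the text
    max_nodes: int = 10                # hypothetical cap on the cluster size
    headroom_nodes: int = 1            # hypothetical: desired free capacity, in nodes


def should_suspend(settings: ScalingSettings, idle_seconds: float) -> bool:
    """Suspend once the cluster has been idle for longer than the timeout."""
    return idle_seconds > settings.auto_suspend_timeout_s


def target_nodes(settings: ScalingSettings, busy_nodes: int) -> int:
    """Keep `headroom_nodes` of spare capacity without exceeding the maximum size."""
    return min(busy_nodes + settings.headroom_nodes, settings.max_nodes)


settings = ScalingSettings()
print(should_suspend(settings, idle_seconds=420))  # idle for 7 minutes
print(target_nodes(settings, busy_nodes=4))
```

The cap in `target_nodes` is what bounds the hourly cost: however many queries arrive, the cluster never grows past `max_nodes`.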
The Azure CDP integration is planned for the end of 2019 while GCP integration is expected in early 2020.
Azure Data Lake Storage Gen2 (ADLS Gen2) is supported by Cloudera as a storage location for data. It provides consistency, a file directory structure, and POSIX-compliant ACLs. ADLS is not supported as the default filesystem: you can use ADLS as a secondary filesystem while HDFS remains the primary filesystem.
Cloudera supports disaster recovery with DistCp, using Windows Azure Storage Blob (WASB) to keep a copy of the data you have in HDFS.
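As an illustration, the following Python helper assembles a `hadoop distcp` command line that copies an HDFS directory to a WASB location. The NameNode address, container, storage account and paths are placeholders, and using `-update` (copy only files that are missing or differ at the destination) is one possible choice rather than a Cloudera recommendation. The helper only builds the command; it does not run it.

```python
import shlex


def build_distcp_command(hdfs_path, container, storage_account, target_path):
    """Build a DistCp command that copies `hdfs_path` to Azure Blob Storage over WASB."""
    # WASB URIs follow wasb://<container>@<account>.blob.core.windows.net/<path>
    target = (
        f"wasb://{container}@{storage_account}.blob.core.windows.net/"
        f"{target_path.lstrip('/')}"
    )
    return ["hadoop", "distcp", "-update", hdfs_path, target]


# All values below are placeholders for illustration.
command = build_distcp_command(
    hdfs_path="hdfs://namenode:8020/data/sales",
    container="backup",
    storage_account="examplestorage",
    target_path="/data/sales",
)
print(shlex.join(command))
```

The cluster must also be configured with credentials for the storage account, which this sketch leaves out.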
Virtual hard disks (VHDs) on Microsoft Azure Standard Storage do not provide the same throughput and performance guarantees as Premium Storage disks.
Standard Storage Disks can deliver a target throughput of up to 60 MB/s regardless of size.
Premium Storage Disks can deliver a target throughput that is metered based on the disk size. 1 TB P30 disks are provisioned for 200 MB/s throughput.
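A quick back-of-the-envelope calculation shows what these figures mean when sizing a node. The sketch below uses only the two per-disk targets quoted above and assumes that throughput adds up linearly across disks; it ignores any per-VM throughput limit, which can cap the real total. The function names are hypothetical.

```python
import math

STANDARD_DISK_MBPS = 60  # per-disk target, regardless of size (from the text)
P30_DISK_MBPS = 200      # 1 TB P30 premium disk (from the text)


def aggregate_throughput_mbps(disk_count: int, per_disk_mbps: int) -> int:
    """Total target throughput, assuming it adds up linearly across disks."""
    return disk_count * per_disk_mbps


def disks_needed(target_mbps: int, per_disk_mbps: int) -> int:
    """Smallest number of disks whose combined target reaches `target_mbps`."""
    return math.ceil(target_mbps / per_disk_mbps)


# Example: reaching a 600 MB/s target with each disk type.
print(disks_needed(600, STANDARD_DISK_MBPS))
print(disks_needed(600, P30_DISK_MBPS))
```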
Note: the Cloudera documentation mentions that “Virtual hard disks (VHDs) on Microsoft Azure Standard Storage do not provide the same throughput and performance guarantees as Premium Storage disks”. However, the Azure Premium Storage documentation it links to describes Premium Storage disks as virtual hard disks too: “An Azure managed disk is a virtual hard disk (VHD). […] The available types of disks are Ultra disk, Premium solid state drive (SSD), Standard SSD, and Standard hard disk drive (HDD).”
CDP’s Data Hub clusters provide maximum flexibility in creating and managing production data warehouses in the cloud. The Data Warehouse clusters make it easy to extend data incrementally to the cloud and to operate the data warehouse in production. They both rely on the Shared Data Experience (SDX), which is responsible for the security and governance capabilities.