This article is based on Peter Darvasi and Richard Doktorics’ talk Running Enterprise Workloads in the Cloud at the DataWorks Summit 2018 in Berlin.
It presents Cloudbreak, Hortonworks’ automated deployment tool for cloud environments, describes and comments on the features that Peter and Richard explained in their talk, and gives some personal guidelines on when and why to use it, or not.
What is Cloudbreak
Here’s how Hortonworks defines Cloudbreak:
Cloudbreak simplifies the deployment of Hortonworks platforms in cloud environments [… and] enables the enterprise to quickly run Big Data workloads in the cloud while optimizing the use of cloud resources.
In other words, Cloudbreak is used to provision your cluster infrastructure through cloud providers, install Apache Ambari, and submit a cluster deployment request.
Bear in mind that Cloudbreak only handles the infrastructure and Ambari’s installation. The actual cluster deployment is still done by Ambari.
This is both an advantage and a drawback. Hortonworks will ensure that Cloudbreak is fine-tuned for Ambari and HDP/HDF, and keeps more flexibility by having full control over the project’s repository. On the other hand, it’s going to be a lot more difficult to plug in custom-made features for your business, as you might do with Apache Ranger for example.
As Cloudbreak is made for launching cluster deployments in a cloud environment, it needs a way to create and interact with the infrastructure. This is done through the underlying cloud provider(s) on which the HDP/HDF clusters have to be pushed.
These are the providers currently supported by Cloudbreak:
How to use it
The first thing you need is an up and running Cloudbreak Deployer instance. From it, you will be able to launch your clusters on the various cloud providers.
To set it up, you have two choices: a pre-built image or a custom VM.
You can run a pre-built image on the cloud provider of your choice. It does not need to be the same as the one(s) you want to launch your clusters on. This means you may run the Cloudbreak Deployer on an AWS instance, and launch your clusters on Azure and Google Cloud.
Or you can build Cloudbreak Deployer on your own custom VM. This is useful for enterprise production environments whose system and software requirements are not met by the pre-built images.
Bear in mind that the system requirements described by Hortonworks for the Cloudbreak Deployer VM are the following:
- Minimum of 16 GB of RAM, 4 CPU cores, and 40 GB of disk.
- RHEL, CentOS, and Oracle Linux 7 (may vary for pre-built images).
- Docker 1.9.1.
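The hardware minimums above are easy to check before installing anything. The following is a minimal sketch, not part of Cloudbreak: `HostSpecs`, `MIN_SPECS` and `unmet_requirements` are hypothetical names introduced for illustration, and the figures are simply the ones quoted in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostSpecs:
    """Hypothetical description of a candidate VM (not a Cloudbreak type)."""
    memory_gb: float
    cpu_cores: int
    disk_gb: float


# Minimums quoted in the requirements list: 16 GB RAM, 4 CPU cores, 40 GB disk.
MIN_SPECS = HostSpecs(memory_gb=16, cpu_cores=4, disk_gb=40)


def unmet_requirements(actual: HostSpecs, minimum: HostSpecs = MIN_SPECS) -> List[str]:
    """Return one human-readable message per requirement the VM fails to meet."""
    problems = []
    if actual.memory_gb < minimum.memory_gb:
        problems.append(f"memory: {actual.memory_gb} GB < {minimum.memory_gb} GB")
    if actual.cpu_cores < minimum.cpu_cores:
        problems.append(f"cpu: {actual.cpu_cores} cores < {minimum.cpu_cores} cores")
    if actual.disk_gb < minimum.disk_gb:
        problems.append(f"disk: {actual.disk_gb} GB < {minimum.disk_gb} GB")
    return problems


if __name__ == "__main__":
    # An undersized VM: memory and CPU are too low, disk is fine.
    for problem in unmet_requirements(HostSpecs(memory_gb=8, cpu_cores=2, disk_gb=100)):
        print(problem)
```

The function takes the specs as plain values rather than probing the machine, so the same check can be run against a VM flavour before it is even provisioned.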
Launch a cluster part 1 – Ambari Blueprints
The cluster definition is directly passed on by Cloudbreak to Ambari, and Ambari needs a Blueprint to deploy a cluster. You can either use the default cluster configurations available in Cloudbreak, or use a custom blueprint you built beforehand. More on how to build your custom blueprint here.
Beware! As you may already know, an Ambari Blueprint’s JSON only handles part of a cluster’s definition. The other part, which is defined at the cluster’s actual launch by Ambari, has to be set separately in Cloudbreak. Some properties are automatically set by Cloudbreak, such as the mapping of hosts to their respective hostgroups, but others, like the Blueprint’s name, external authentication sources (LDAP for example) or databases, have to be defined manually in the right fields in Cloudbreak. Find the complete walkthrough in the official documentation.
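To give an idea of what the Blueprint side of the definition looks like, here is a minimal sketch that builds a two-hostgroup Blueprint as a Python dictionary and serialises it to JSON. The top-level keys (`Blueprints`, `host_groups`, `configurations`) and the component names follow Ambari’s Blueprint format; the function names, the hostgroup names and the stack version are assumptions chosen for illustration, and the component list is deliberately incomplete, so this is not a deployable cluster definition.

```python
import json
from typing import Any, Dict, List


def build_blueprint(name: str, stack_version: str) -> Dict[str, Any]:
    """Build an illustrative (incomplete) Ambari Blueprint for an HDP stack."""
    return {
        "Blueprints": {
            "blueprint_name": name,
            "stack_name": "HDP",
            "stack_version": stack_version,
        },
        "host_groups": [
            {
                "name": "master",  # hypothetical hostgroup name
                "cardinality": "1",
                "components": [
                    {"name": "NAMENODE"},
                    {"name": "RESOURCEMANAGER"},
                    {"name": "ZOOKEEPER_SERVER"},
                ],
            },
            {
                "name": "worker",  # hypothetical hostgroup name
                "cardinality": "3",
                "components": [
                    {"name": "DATANODE"},
                    {"name": "NODEMANAGER"},
                ],
            },
        ],
        # Service-level configuration overrides would go here.
        "configurations": [],
    }


def host_group_names(blueprint: Dict[str, Any]) -> List[str]:
    """List the hostgroups that the infrastructure definition must provide hosts for."""
    return [group["name"] for group in blueprint["host_groups"]]


if __name__ == "__main__":
    blueprint = build_blueprint("demo-blueprint", "2.6")
    print(json.dumps(blueprint, indent=2))
    print(host_group_names(blueprint))
```

Note that nothing in this JSON says which machines belong to `master` or `worker`: that host mapping is the part supplied at launch time.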
Launch a cluster part 2 – Infrastructure
The second piece of information concerns the cloud providers on which you want to deploy.
This part is fairly straightforward: all you need to do is follow Cloudbreak’s CLI or wizard and fill in the fields with the appropriate information:
- General information about the cluster (name, cloud provider, region, HDP version),
- Each hostgroup’s instance type and number of hosts, and which hostgroup Ambari Server is installed on,
- Network and security information (set by cloud provider, not for the cluster components themselves),
- Ambari’s administrator credentials.
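The fields listed above can be pictured as a small data model. The sketch below is only an illustration of that information and of one sanity check (exactly one hostgroup hosts Ambari Server); `HostGroup`, `ClusterRequest` and `validation_errors` are hypothetical names, not Cloudbreak’s actual request format, and network settings and the administrator password are left out for brevity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HostGroup:
    """Hypothetical per-hostgroup infrastructure settings."""
    name: str
    instance_type: str
    node_count: int
    hosts_ambari_server: bool = False


@dataclass(frozen=True)
class ClusterRequest:
    """Hypothetical summary of what the wizard or CLI asks for."""
    name: str
    cloud_provider: str
    region: str
    hdp_version: str
    host_groups: List[HostGroup]
    ambari_admin_user: str


def validation_errors(request: ClusterRequest) -> List[str]:
    """Return a list of problems; an empty list means the request looks sane."""
    errors = []
    ambari_groups = [g.name for g in request.host_groups if g.hosts_ambari_server]
    if len(ambari_groups) != 1:
        errors.append(
            f"expected exactly one hostgroup for Ambari Server, got {len(ambari_groups)}"
        )
    for group in request.host_groups:
        if group.node_count < 1:
            errors.append(f"hostgroup '{group.name}' needs at least one node")
    return errors


if __name__ == "__main__":
    request = ClusterRequest(
        name="demo-cluster",
        cloud_provider="AWS",
        region="eu-west-1",
        hdp_version="2.6",
        host_groups=[
            HostGroup("master", "m4.xlarge", 1, hosts_ambari_server=True),
            HostGroup("worker", "m4.xlarge", 3),
        ],
        ambari_admin_user="admin",
    )
    print(validation_errors(request))
```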
Features highlighted in the talk
Peter and Richard talked about several features, some already well known and production ready, others fairly new and in technical preview.
Auto-scaling and Alerts
The primary use case Cloudbreak aims to handle is dynamic cluster allocation and sizing based on workload. To address it, Cloudbreak offers a not-so-new feature: automatic scaling of a cluster’s infrastructure.
To trigger a scaling operation, you first have to define the event that will set it off, which is done through alerts. Alerts can be metric-based (e.g. YARN usage stays above 80% for at least 2 hours) or time-based (e.g. check every 12 hours whether YARN usage is above 80%).
To induce a scaling of the infrastructure, alerts need to be mapped to a scaling policy. A policy defines which hostgroups have to be rescaled, what adjustments have to be made, and which alert triggers it.
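The relationship between alerts and policies can be sketched in a few lines. This is a simplified model written for this article, not Cloudbreak’s implementation: `MetricAlert`, `ScalingPolicy` and `new_node_count` are hypothetical names, the alert is reduced to "the last N usage samples are all above a threshold", and the adjustment is a plain node delta bounded by a minimum and a maximum.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricAlert:
    """Hypothetical metric-based alert: fires when usage stays above a threshold."""
    name: str
    threshold_pct: float
    min_consecutive_samples: int  # assumed to be >= 1

    def fires(self, usage_samples: List[float]) -> bool:
        # Only the most recent samples count; too few samples means no alert.
        recent = usage_samples[-self.min_consecutive_samples:]
        return (
            len(recent) == self.min_consecutive_samples
            and all(sample > self.threshold_pct for sample in recent)
        )


@dataclass(frozen=True)
class ScalingPolicy:
    """Hypothetical policy: which hostgroup to resize, by how much, on which alert."""
    alert_name: str
    host_group: str
    node_adjustment: int  # positive to scale out, negative to scale in
    min_nodes: int
    max_nodes: int


def new_node_count(current_nodes: int, policy: ScalingPolicy) -> int:
    """Apply the policy's adjustment while staying inside its bounds."""
    target = current_nodes + policy.node_adjustment
    return max(policy.min_nodes, min(policy.max_nodes, target))


if __name__ == "__main__":
    alert = MetricAlert("yarn-high", threshold_pct=80.0, min_consecutive_samples=3)
    policy = ScalingPolicy("yarn-high", "worker", node_adjustment=2, min_nodes=1, max_nodes=10)
    samples = [70.0, 85.0, 90.0, 95.0]
    if alert.fires(samples) and policy.alert_name == alert.name:
        print(policy.host_group, new_node_count(5, policy))
```

Keeping the alert and the policy as separate objects mirrors the mapping described above: the same alert could drive several policies, and a policy stays inert until its alert fires.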
This is actually a really handy feature, one that most of the infrastructure teams we have worked with require. Some even have their own custom alerting process that, given some development, could use Cloudbreak’s “alert to scale” method.
Peter and Richard also presented two separate features concerning the images used by Cloudbreak.
The first one is the ability to use your own custom images instead of the default ones for each cloud provider (AMI for AWS, CentOS 7.X for the others). It’s usually a requirement for an enterprise-ready tool, as most companies either have a contract with a different system provider such as Red Hat or maintain their own custom-made images for internal usage.
The second one is the use of pre-warmed images (images that have Ambari and/or HDP pre-installed on them). The main benefits of pre-warmed images are a faster deployment time and no need to access repositories that might be on another network. However, there’s an important drawback in flexibility, as they are stuck with a fixed version of Ambari and HDP.
Unless you need the extra deployment speed for a short-lived cluster, I’d recommend sticking to standard cold images. The little work saved by not implementing a way to access remote repositories (with proxies for example, see next section) is not worth the operational cost of rebuilding a pre-warmed image each time either Ambari or HDP gets an upgrade.
Shared proxy configurations
Here’s a common business use case in the cloud: you have a Cloudbreak instance that manages several dynamically scaled clusters, split across multiple cloud providers and made up of dozens to hundreds of nodes of various types and usages.
Companies in this situation usually care about security and will ask for network isolation so that their infrastructure can’t be accessed from everywhere. But there are always some exceptions, like software dependencies or public repositories that are required by internal components.
This is where proxies come in and, for an infrastructure like the one described above, where the real chore of setting them up begins.
Cloudbreak has thus implemented a rather minor but handy feature that allows you to define proxy information in one place, and apply it to the Cloudbreak Deployer and to some or all of your clusters. Cloudbreak will then apply the proxy in various places such as the Ambari and yum configurations, or the HTTP_PROXY and HTTPS_PROXY environment variables.
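The idea of "define once, apply everywhere" can be illustrated with a short sketch. It does not reproduce what Cloudbreak does internally: `ProxyConfig`, `proxy_environment` and `yum_proxy_line` are hypothetical names, and the host and port are placeholder values. The only outside facts relied upon are the `HTTP_PROXY` and `HTTPS_PROXY` variable names mentioned above and the `proxy=` option of `yum.conf`.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProxyConfig:
    """Hypothetical single source of truth for a proxy definition."""
    host: str
    port: int
    protocol: str = "http"

    @property
    def url(self) -> str:
        return f"{self.protocol}://{self.host}:{self.port}"


def proxy_environment(proxy: ProxyConfig) -> Dict[str, str]:
    """Environment variables to export so command-line tools use the proxy."""
    return {"HTTP_PROXY": proxy.url, "HTTPS_PROXY": proxy.url}


def yum_proxy_line(proxy: ProxyConfig) -> str:
    """Line to place in the [main] section of yum.conf."""
    return f"proxy={proxy.url}"


if __name__ == "__main__":
    # Placeholder values; replace with your own proxy.
    proxy = ProxyConfig(host="proxy.example.internal", port=3128)
    for key, value in proxy_environment(proxy).items():
        print(f"export {key}={value}")
    print(yum_proxy_line(proxy))
```

Every consumer derives its setting from the same `ProxyConfig`, so changing the proxy means editing one object instead of hunting through each node’s configuration files.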
Only some of the features presented by Peter and Richard in their talk are discussed in this article. I chose these because they felt unique to Cloudbreak, because they addressed in some way a task I encountered in a client’s environment, or simply because they were interesting to share and discuss. There’s a lot more, such as updates to Cloudbreak’s UI and CLI, the integration of security features like Kerberos and LDAP/AD authentication, the support of HDF, etc.
All in all, Cloudbreak is an interesting project that will definitely have its use in the future, given how fast cloud and container-based solutions are growing, but in my opinion it still falls a little short of being fully enterprise and production-ready. Also, the entry cost for test purposes is too high. The only real-world usage for which I’d recommend Cloudbreak right now is for companies that would highly benefit from a single place to manage several short- or medium-lived clusters spread across various cloud providers.