As Big Data technologies grow, more and more companies are building their own clusters in the hope of extracting value from their data.
A main concern when building such infrastructure is the ability to continuously monitor the cluster’s health and report issues as quickly as possible.

This is where supervision comes in.

There are almost as many supervision policies as there are companies. Most companies have their own supervision tools, and Big Data clusters have to be adapted to them.
This article covers the integration of an HDP 2.4.2 cluster into the supervision process of one of our customers.

Ambari-Alerts: HDP’s supervision solution

On an HDP platform, many things can affect the overall health of the cluster: the processes of the various components, network communication between nodes, and each node’s CPU, RAM, and filesystem usage.

Ambari already monitors most of these and exposes their statuses through the Ambari-Alerts REST API. Alerts can be customized to fit the company’s needs, and custom alerts can be added to cover components that Ambari’s default alerts do not supervise.

To view all alert definitions available on your cluster, use:
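As a minimal sketch, the definitions can be listed from Python with a GET on the `alert_definitions` resource of the Ambari REST API. The server address, cluster name and credentials are placeholders, and the helper names are illustrative rather than part of Ambari.

```python
import base64
import json
import urllib.request


def alert_definitions_url(base_url, cluster):
    """Build the URL of the cluster's alert-definitions collection."""
    return f"{base_url.rstrip('/')}/api/v1/clusters/{cluster}/alert_definitions"


def ambari_get(url, user, password):
    """Send an authenticated GET to Ambari and return the decoded JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def print_definitions(base_url, cluster, user, password):
    """Print the id and name of every alert definition (assumed response shape)."""
    body = ambari_get(alert_definitions_url(base_url, cluster), user, password)
    for item in body.get("items", []):
        definition = item.get("AlertDefinition", {})
        print(definition.get("id"), definition.get("name"))
```

Calling `print_definitions("http://ambari.example.com:8080", "mycluster", "admin", "secret")` with real values in place of these hypothetical ones would print one line per definition.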

This will print about 70 default alerts, already enabled and reporting various health statuses on Ambari’s web-UI. Some parameters, such as the check interval or the criticality thresholds, can be changed directly in the Alerts tab of the web-UI. To view or change more alert-specific parameters, use:
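A single definition lives under the same resource, addressed by its numeric id, and can be changed with a PUT on that URL. The sketch below only builds the update request so that it can be inspected before being sent. The helper name, the sample id and the choice of the `interval` field (assumed to be in minutes) are illustrative assumptions; Ambari expects an `X-Requested-By` header on modifying requests.

```python
import base64
import json
import urllib.request


def definition_url(base_url, cluster, definition_id):
    """Build the URL of one alert definition, addressed by its numeric id."""
    return (f"{base_url.rstrip('/')}/api/v1/clusters/{cluster}"
            f"/alert_definitions/{definition_id}")


def build_interval_update(base_url, cluster, definition_id, interval_minutes,
                          user="admin", password="admin"):
    """Prepare (but do not send) a PUT that changes the check interval."""
    payload = {"AlertDefinition": {"interval": interval_minutes}}
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        definition_url(base_url, cluster, definition_id),
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",  # required by Ambari for PUT/POST/DELETE
        },
        method="PUT",
    )
```

Passing the returned object to `urllib.request.urlopen` would send the update; a plain GET on `definition_url(...)` returns the full definition.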

This will return something like:

Supervision by our customer

While Ambari’s alerts already enable full cluster supervision through its web-UI, this does not fit our customer’s policies.
A single “pilotage” team has to monitor all of the company’s environments and file an issue whenever an alert appears. The issue is then assigned to the relevant operations team, which works on resolving it.

The monitoring solution used here is HP Operations Manager (HP-OM). All environments have to provide their health checks in a form that HP-OM can access.

In our case, we decided to provide a log file to which we regularly append the state of all Ambari alerts.
A custom Python script queries the status of each enabled Ambari alert through the REST API and writes it as a single line in the log file.
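The formatting step of such a script could look like the sketch below. It assumes the alerts were fetched from `/api/v1/clusters/<cluster>/alerts?fields=*` and that each item carries an `Alert` object with `state`, `host_name`, `definition_name`, `text` and `latest_timestamp` (in milliseconds). The line layout and function names are illustrative, not the customer’s actual format.

```python
import time


def format_alert_line(alert):
    """Render one alert (the "Alert" object of a response item) as one log line."""
    timestamp_ms = alert.get("latest_timestamp", 0)
    when = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp_ms / 1000))
    # Collapse newlines and repeated spaces so the alert stays on a single line.
    text = " ".join(str(alert.get("text", "")).split())
    fields = [
        when,
        alert.get("state", "UNKNOWN"),
        alert.get("cluster_name", "-"),
        alert.get("host_name") or "-",  # cluster-wide alerts have no host
        alert.get("definition_name", "-"),
        text,
    ]
    return " | ".join(fields)


def append_alerts(items, log_path):
    """Append one line per alert to the log file read by the monitoring tool."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        for item in items:
            log_file.write(format_alert_line(item["Alert"]) + "\n")
```

Keeping the state as a separate upper-case field makes keyword matching on the log line straightforward.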

HP-OM reads each new line appended to the log file, searches for the keywords **CRITICAL** or **UNKNOWN**, and sends the line to the pilotage team if one of these terms appears.
The pilotage team member who receives the alert creates an issue and includes the log line in its description.
Finally, the operations team responsible for the environment where the alert appeared handles the issue, using the log line in the description.
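The keyword rule can be checked locally before wiring the log file into HP-OM. The sketch below only imitates the matching described above; it is not HP-OM configuration, and the function name is hypothetical.

```python
KEYWORDS = ("CRITICAL", "UNKNOWN")


def lines_to_forward(lines, keywords=KEYWORDS):
    """Return the log lines that contain at least one of the alerting keywords."""
    return [line for line in lines if any(keyword in line for keyword in keywords)]
```

Running it over a few sample lines shows which ones would reach the pilotage team and which ones (for example `OK` or `WARNING` states) would be ignored.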


Additional information

Ambari-Alerts is designed to be highly customizable, and it isn’t the only way to get information on your cluster’s health.
You can write your own scripts to collect the information you want and integrate them as alerts in Ambari. This keeps your supervision process in one place.
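As an illustration, a script-based alert is a Python module exposing an `execute` function that returns a state and a list of label strings. The sketch below follows that convention as we understand it and checks filesystem usage on a path. The parameter keys and default thresholds are hypothetical and would have to match the alert definition that references the script. It avoids recent Python syntax on the assumption that the agent’s interpreter may be an older Python.

```python
import os

# Hypothetical parameter keys; they must match the alert definition.
PATH_KEY = "check.path"
WARNING_KEY = "usage.warning.percent"
CRITICAL_KEY = "usage.critical.percent"


def get_tokens():
    """Cluster configuration properties the script needs (none in this sketch)."""
    return ()


def classify(percent, warning, critical):
    """Map a usage percentage to an alert state."""
    if percent >= critical:
        return "CRITICAL"
    if percent >= warning:
        return "WARNING"
    return "OK"


def execute(configurations=None, parameters=None, host_name=None):
    """Return (state, [label]) describing filesystem usage on the checked path."""
    parameters = parameters or {}
    path = parameters.get(PATH_KEY, "/")
    warning = float(parameters.get(WARNING_KEY, 80))
    critical = float(parameters.get(CRITICAL_KEY, 90))
    try:
        stats = os.statvfs(path)
    except OSError as error:
        return ("UNKNOWN", ["Cannot read usage of %s: %s" % (path, error)])
    if stats.f_blocks == 0:
        return ("UNKNOWN", ["No block information for %s" % path])
    percent = 100.0 * (stats.f_blocks - stats.f_bfree) / stats.f_blocks
    state = classify(percent, warning, critical)
    return (state, ["%s is %.1f%% full" % (path, percent)])
```

Separating `classify` from `execute` keeps the threshold logic testable without touching a real filesystem.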
But you can also use other sources of information.

One example is Ambari-Metrics. When it is enabled, each HDP service and host metric is monitored by Ambari-Metrics, and the result can be seen on a Grafana web-UI or queried through its REST API. Global cluster metrics are also available.

To get a list of the services’ metrics monitored by Ambari-Metrics, use:
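One possible way to do this from Python is sketched below. It assumes that the Metrics Collector listens on its default port 6188 and exposes a metadata resource at `/ws/v1/timeline/metrics/metadata`, returning a JSON object that maps each application id to a list of entries with a `metricname` field. The host name and helper names are placeholders, and the response shape should be checked against your Ambari version.

```python
import json
import urllib.request


def metadata_url(collector_host, port=6188):
    """Build the URL of the collector's metrics metadata resource (assumed path)."""
    return f"http://{collector_host}:{port}/ws/v1/timeline/metrics/metadata"


def metric_names_by_app(metadata):
    """Group metric names by application id from the assumed metadata shape."""
    return {
        app_id: sorted(entry["metricname"] for entry in entries)
        for app_id, entries in metadata.items()
    }


def fetch_metric_names(collector_host, port=6188):
    """Query the collector and return {app_id: [metric names]}."""
    with urllib.request.urlopen(metadata_url(collector_host, port)) as response:
        return metric_names_by_app(json.load(response))
```

For example, `fetch_metric_names("ams-collector.example.com")` would return the metric names grouped by service on a cluster where that hypothetical host runs the collector.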

These metrics can then be used as sources for additional custom alerts.
This was not implemented for our customer’s use case because the default alerts were sufficient. However, the approach was studied and may be implemented in the future if the company’s supervision requirements evolve.