Using Kubernetes to monitor a Hadoop cluster efficiently

Introduction

Monitoring a production Hadoop cluster is a challenge, and the monitoring itself should keep evolving. In this talk we will discuss how Kubernetes can help us build a lightweight and scalable monitoring system with as little overhead as possible.

Speaker: Paul-Adrien Cordonnier
Duration: 1h00
Format: talk

Presentation

Our role at EDF is to configure, deploy and monitor multi-tenant Hadoop clusters.

Even with tools like Ambari, ensuring that all the components of a cluster are working correctly is a real challenge. We are often confronted with the following:

A component is not working (configuration error, hardware or network issue, bug...)
A user experiences latency or errors and opens an issue
An investigation is launched to find the cause
Once the problem is fixed, we schedule tests so that we are warned the next time it happens

Through these iterations, error detection becomes more relevant, which leads to better stability and happier customers.

Our current architecture uses Shinken, a monitoring app very similar to Nagios. It is useful and simple when it comes to the simplest errors: a service is detected as down, and a command is sent to restart it within seconds. For more complex use cases, like a complete HBase connection, Shinken is overwhelmed. The need for response-time metrics, SLAs and dashboards is also growing, and Shinken does not provide them.

The goal of this talk is to present a modern, easy-to-scale architecture using Kubernetes. The objective of the architecture is to let you write a check in any form you want (Bash, a scripting language like Node or Python, a small Java app), place it in a container, and schedule it in Kubernetes. Results of the checks are automatically sent to a central logging system (like Elasticsearch) and a metrics collector (Prometheus), which are then used to raise alerts about the behaviour of the cluster.
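
To make the idea of a containerized check more concrete, here is a minimal Python sketch of what such a check could look like. It is an illustration under assumptions, not the architecture presented in the talk: the function names, the metric names (hadoop_check_success, hadoop_check_duration_seconds), the check name and the host name are all hypothetical. The check times an arbitrary test, then renders the result both as a JSON log line (which a log shipper could forward to Elasticsearch) and in the Prometheus text exposition format.

```python
import json
import socket
import time


def run_check(name, check_fn):
    """Run check_fn and time it; a check signals failure by raising."""
    start = time.monotonic()
    try:
        check_fn()
        ok, error = True, None
    except Exception as exc:
        ok, error = False, str(exc)
    return {
        "check": name,
        "ok": ok,
        "duration_seconds": time.monotonic() - start,
        "error": error,
    }


def to_log_line(result):
    """Serialize the result as a single JSON line, suitable for stdout."""
    return json.dumps(result, sort_keys=True)


def to_prometheus_text(result):
    """Render the result in the Prometheus text exposition format."""
    label = '{check="%s"}' % result["check"]  # metric names are hypothetical
    success = 1 if result["ok"] else 0
    return (
        "hadoop_check_success%s %d\n" % (label, success)
        + "hadoop_check_duration_seconds%s %.6f\n"
        % (label, result["duration_seconds"])
    )


def tcp_check(host, port, timeout=5.0):
    """Hypothetical check: succeeds if a TCP connection can be opened."""
    def check():
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return check


def main():
    # Host name is a placeholder; 16000 is the default HBase Master port.
    check = tcp_check("hbase-master.example.internal", 16000)
    result = run_check("hbase_master_tcp", check)
    print(to_log_line(result))
    print(to_prometheus_text(result), end="")
    # Return a non-zero code on failure so the container exits as failed.
    return 0 if result["ok"] else 1
```

In a container, the entry point would call main() and pass its return value to sys.exit(), so that a failed check also shows up as a failed run in Kubernetes. How the metrics reach Prometheus (for example through a push gateway or a scraped endpoint) is left open here, since the source does not specify it.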

Author

Paul-Adrien Cordonnier is a Big Data consultant at Adaltas. He currently works at EDF, where he deploys and monitors several on-prem Hadoop and Elasticsearch clusters.