Present and future of Hadoop workflow scheduling: Oozie 5.x
During the DataWorks Summit Europe 2018 in Berlin, I had the opportunity to attend a breakout session on Apache Oozie. It covered the new features released in Oozie 5.0 as well as features planned for future Oozie 5.X releases, which are the main subject of this article. The speakers also spent some time discussing Apache Ambari’s Workflow Manager and the way it lets users design and visualize Apache Oozie workflows.
The talk was given by Artem Ervits, solutions engineer at Hortonworks, and Clay Baenziger, member of the Hadoop Infrastructure team at Bloomberg. Their presentation is available here.
Apache Oozie is the most used workflow scheduler in the Apache Hadoop ecosystem. It allows users to execute a series of actions as a Directed Acyclic Graph (DAG). Oozie features built-in native actions for the most common components of Hadoop such as Hive, Sqoop, DistCp, etc. There is also a shell action that lets users run arbitrary commands. César wrote a great article showing an example of what can be achieved with it here.
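To make the DAG idea concrete, here is a minimal Python sketch (not Oozie code) that orders the actions of a hypothetical workflow so that each action runs only after the ones it depends on. The action names and the dependency map are invented for illustration. A real Oozie workflow is declared in a workflow.xml file rather than in Python, so this only illustrates the ordering constraint a DAG imposes.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it waits for.
workflow = {
    "sqoop-import": set(),
    "hive-clean": {"sqoop-import"},
    "hive-aggregate": {"hive-clean"},
    "distcp-export": {"hive-aggregate"},
    "shell-notify": {"hive-aggregate"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The two last actions have no dependency on each other, so either may come first in the result. A cycle between actions would make the graph invalid, which is why the "acyclic" part of the definition matters.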
It truly is a powerful tool, but it also comes with a few weaknesses. One of the main complaints that arises when we talk about Oozie is its outdated UI. Hopefully those issues will be addressed in future releases, starting today with Oozie 5.0 which was announced only 24 hours before Artem and Clay’s talk. Let’s open the hood and see what’s in store for us.
Finally! If you are familiar with Oozie, you will understand what it means to see this feature. Oozie launchers are no longer MapReduce map tasks but fully functional YARN ApplicationMasters (AMs). I wondered what it looked like, so I downloaded and compiled the latest version of Oozie to install it on a test cluster.
Here is a simple shell action job running with Oozie 5.0:
It might look like a simple improvement, but it has been a really long time in the making (see OOZIE-1770). The Oozie committers had to reimplement a lot of things for this to be possible. One of the (totally acceptable) drawbacks is that Oozie 5.X no longer supports Hadoop 1.X: the minimum version is now 2.6.0. As part of these changes, the old map launcher was completely removed in OOZIE-2918.
Every Oozie user knows that its web UI is a pain to use. It also relies on ExtJS, an outdated JavaScript framework, which complicates things. Unfortunately, no replacement has emerged yet, but the Oozie committers’ discussion on the subject can be followed here, and hopefully something will come out in a future version of the 5.X branch.
The Workflow Manager is actually an Ambari View but it can also be installed on a non-Ambari managed cluster, which is something we’ve done at Adaltas for one of our clients. It provides a GUI to build and edit Oozie workflows as well as a manager to easily monitor and visualize them.
As of the 4.2 release of Oozie (the latest supported in HDP 2.6.4), there is no such thing as fine-grained authorization.
Indeed, it features a very basic authorization model, as we can see from the documentation:
- Users have read access to all jobs
- Users have write access to their own jobs
- Users have write access to jobs based on an Access Control List (list of users and groups)
- Users have read access to admin operations
- Admin users have write access to all jobs
- Admin users have write access to admin operations
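The rules above can be restated as a few predicates. The following Python sketch is only a model of the documented rules, not Oozie's actual implementation. The `User` and `Job` classes and all function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)
    is_admin: bool = False


@dataclass
class Job:
    owner: str
    acl: set = field(default_factory=set)  # user and group names with write access


def can_read_job(user, job):
    # Every user has read access to all jobs.
    return True


def can_write_job(user, job):
    # Admins write to all jobs, owners to their own, others only via the ACL.
    if user.is_admin or user.name == job.owner:
        return True
    return user.name in job.acl or bool(user.groups & job.acl)


def can_read_admin_ops(user):
    # Every user has read access to admin operations.
    return True


def can_write_admin_ops(user):
    # Only admin users have write access to admin operations.
    return user.is_admin


job = Job(owner="alice", acl={"etl"})
bob = User("bob", groups={"etl"})
carol = User("carol")
print(can_read_job(carol, job), can_write_job(bob, job), can_write_job(carol, job))
```

Note that `can_read_job` ignores both of its arguments: in this model nothing restricts who can read a job, which is exactly the multi-tenancy concern.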
As long as you can access the web UI, you can look at every workflow, which is really problematic in a multi-tenant environment. Thankfully, the community is working on the subject in OOZIE-3196. A patch is available but not yet released.
Another nice new feature is the embedded diagnostic bundle tool. Originally developed as a feature for Cloudera Manager, it facilitates the aggregation of useful information while debugging a job: workflow.xml, job.properties, Oozie logs, ENV, etc.
Here is an example of how it can be run and what the output looks like:
bin/oozie-diag-bundle-collector.sh -jobs 0000002-180425152941870-oozie-oozi-W -oozie $OOZIE_URL -output bundle.zip
Checking Connection...Done
Using Temporary Directory: /tmp/1525342998391-0
Getting Sharelib Information...Done
Getting Configuration...Done
Getting OS Environment Variables...Done
Getting Java System Properties...Done
Getting Queue Dump...Done
Getting Thread Dump...Done
Getting Instrumentation...Skipping (Instrumentation is unavailable)
Getting Metrics...Done
Getting Details for 0000002-180425152941870-oozie-oozi-W...Done
Creating Zip File: /opt/oozie-5.1.0-SNAPSHOT/bundle.zip/oozie-diag-bundle-1525343435346.zip...Done
This can be quite useful to easily gather all the facts when a job is not running as intended. Unfortunately, it is not yet capable of getting the underlying YARN containers’ logs for the different actions: it is something that could be implemented in future releases.
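Once the archive is created, a quick way to see what was collected is to list its members. The Python sketch below does this with the standard `zipfile` module. The function name `summarize_bundle` is hypothetical, and since the internal layout of the bundle is not described here, the code makes no assumption about member names.

```python
import zipfile


def summarize_bundle(zip_path):
    """Return {member name: uncompressed size in bytes} for files in the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return {
            info.filename: info.file_size
            for info in archive.infolist()
            if not info.is_dir()  # skip directory entries, keep only files
        }


def format_summary(summary):
    """Render the summary as one 'size  name' line per file, sorted by name."""
    return "\n".join(f"{size:>10}  {name}" for name, size in sorted(summary.items()))
```

Calling `print(format_summary(summarize_bundle(path)))` with the path printed on the "Creating Zip File" line gives a one-line-per-file overview before unpacking anything.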
And here’s when you ask:
Awesome! But I use HDP 2.X.X/Cloudera 5.X.X, when will I be able to use these features?
And the answer is… Not yet. Hortonworks’ HDP 3.0 is set for general release sometime during summer 2018 but unfortunately (according to Artem Ervits), Oozie 5.0 missed the train and will not appear in the stack. We might get it in HDP 3.1. At the time of writing, the latest version supported by Hortonworks is Oozie 4.2.0, with some patches being back-ported (full list available here).
That’s it for the features of Oozie 5.X. I hope this overview makes you want to dig in and compile the latest version of this amazing tool. I really encourage you to do so, as building it and deploying it to your dev Hadoop environment is not as complicated as it seems.
- Slides from Artem Ervits and Clay Baenziger’s talk at DataWorks Summit Europe 2018
- Article on Oozie 5.0’s release
- Oozie’s GitHub
- Oozie JIRA 1770
- Oozie JIRA 2918
- Oozie JIRA 2683
- Oozie JIRA 3196
- Oozie 5.0 release logs
- Article on Hue’s new Oozie editor
- Hortonworks’ list of patches back-ported into Oozie 4.2.0
- Cloudera’s release notes on Oozie 4.1.0