Execute Python in an Oozie workflow

Oozie workflows allow you to use multiple actions to execute code. However, doing so with Python can be a bit tricky. Let’s see how to do it.

I’ve recently designed a workflow that interacts with ElasticSearch. The workflow is made of the following sequential actions:

  • Create an index.
  • Inject a data set.
  • Set an alias on success.
  • Delete the index on failure.

There are multiple ways to interact with ElasticSearch: the Java binary transport or the REST API. Most languages offer wrapper libraries. To get the job done in Oozie, we defined several requirements:

  • Code must be portable, meaning it bundles all its dependencies, because the cluster is offline (not connected to the Internet).
  • Code must be easy to understand and written in a widely used language to avoid technical debt.
  • Prefer a dynamic language that is comfortable with JSON and REST manipulation.
  • The application must expose multiple CLI entry points, at least one for each Oozie action.

The original idea was to use Bash, but parsing ElasticSearch’s JSON responses would have been a pain, so we chose Python.
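To illustrate the "one CLI entry point per Oozie action" requirement, here is a minimal sketch using `argparse` sub-commands. The sub-command names (`create-index`, `inject`, `set-alias`, `delete-index`) and the `--index` / `--alias` options are hypothetical, chosen only to mirror the four workflow steps listed earlier; the real ElasticSearch logic is left out.

```python
import argparse


def build_parser():
    """Build a parser with one sub-command per Oozie action."""
    parser = argparse.ArgumentParser(prog="app.zip")
    subparsers = parser.add_subparsers(dest="command")
    subparsers.required = True  # fail fast when no sub-command is given

    # Hypothetical sub-command names, one per workflow step.
    commands = {}
    for name in ("create-index", "inject", "set-alias", "delete-index"):
        commands[name] = subparsers.add_parser(name)
        commands[name].add_argument("--index", required=True, help="target index")
    # Only the alias step needs an extra option in this sketch.
    commands["set-alias"].add_argument("--alias", required=True, help="alias name")
    return parser


def main(argv=None):
    """Parse the command line and report which step would run."""
    args = build_parser().parse_args(argv)
    # A real implementation would dispatch to the ElasticSearch code here.
    print("running %s on index %s" % (args.command, args.index))
    return 0
```

Each Oozie action can then call the same bundle with a different first argument, for example `create-index --index my-index` for the first step and `delete-index --index my-index` on the failure path.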

ElasticSearch & Python

A bit off-topic but good to know: Python is really well equipped to deal with ElasticSearch.

The Python client library maintains support for ElasticSearch from version 2.x to 6.x (luckily, we’re on 2.x!) and is very easy to understand and use.

Here’s a sample opening a secure connection and creating an index :
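As a minimal sketch, assuming the `elasticsearch` Python client in a release matching the 2.x to 6.x servers mentioned above (newer client releases changed some connection options), it could look like this. The function names, the host, the credentials, the CA bundle path and the index settings are all placeholders for illustration.

```python
def index_body(shards=1, replicas=1):
    """Return a minimal index definition (settings only)."""
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }


def create_index(host, user, password, ca_certs, name):
    """Open a TLS connection with basic auth and create the index `name`."""
    # Imported lazily so this module still loads where the client is absent.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(
        [host],                      # e.g. "es.example.com:9200" (placeholder)
        http_auth=(user, password),  # basic authentication
        use_ssl=True,                # talk HTTPS to the cluster
        verify_certs=True,           # reject untrusted certificates
        ca_certs=ca_certs,           # path to the CA bundle (placeholder)
    )
    return es.indices.create(index=name, body=index_body())
```

Keeping the index definition in its own function makes it easy to check without a running cluster.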

Package Python code

Once the code is ready, we need to package it with all its dependencies. The workflow must be independent of any Internet access. Only the Python interpreter needs to be present on the nodes, which is the case natively on our target operating system, CentOS 7.

Python offers a lot of packaging options (Wheel, Egg (deprecated in favor of Wheel), Zip…), along with plenty of resources and HOWTOs. However, choosing the right packaging strategy is challenging for a newcomer. Fortunately, Python natively supports packaging a code directory into a zip archive for later execution. The generated archive behaves a bit like a .py file.

We also need to download the dependencies locally and include them in the package.

Let’s say we have a project structured as follows:

Here’s how we package it:

And finally we’d execute it like this:
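A minimal sketch of these steps, written in Python so it stays in one language: it assumes a hypothetical project directory with a `__main__.py` at its root (the entry point Python looks for when it is handed a zip archive) and a `requirements.txt` listing the dependencies. The function names and file names are illustrative only.

```python
import os
import subprocess
import sys
import zipfile


def install_dependencies(requirements, target_dir):
    """Download dependencies into target_dir (needs Internet access, so run
    this on a build machine, not on the offline cluster)."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install",
         "-r", requirements, "--target", target_dir]
    )


def build_bundle(source_dir, bundle_path):
    """Zip source_dir so that __main__.py sits at the archive root.

    bundle_path should live outside source_dir, otherwise the archive
    would try to include itself.
    """
    if not os.path.isfile(os.path.join(source_dir, "__main__.py")):
        raise ValueError("source_dir must contain a __main__.py entry point")
    with zipfile.ZipFile(bundle_path, "w", zipfile.ZIP_DEFLATED) as bundle:
        for root, _dirs, files in os.walk(source_dir):
            for filename in files:
                path = os.path.join(root, filename)
                # Store paths relative to source_dir, not absolute ones.
                bundle.write(path, os.path.relpath(path, source_dir))
    return bundle_path


def run_bundle(bundle_path, *args):
    """Run the archive the same way `python app.zip <args>` would."""
    return subprocess.check_output([sys.executable, bundle_path] + list(args))
```

One caveat with this approach: Python can import pure-Python modules from a zip archive, but not compiled extension modules, so dependencies shipping native code would need a different strategy.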

Oozie workflow

Now that we have a valid Python package with our scripts, we must integrate it with our Oozie workflow.

There’s no such thing as a Python action in Oozie. We’ll use the closest and most flexible one, the Shell action.

As for any other action, Oozie prepares a container, injects the files you specify, and executes a command.

Here’s what the action would look like:

  • configuration specifies the YARN queue in which the Oozie container runs.
  • env-var sets an environment variable pointing to the Python bundle.
  • file injects the Python bundle into the Oozie Shell action container.
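As a hedged illustration of such an action, the sketch below keeps a Shell action definition in a Python string and parses it to check that it is well-formed before it is pasted into `workflow.xml`. Everything specific is an assumption: the schema version `0.2`, the `${jobTracker}`, `${nameNode}`, `${queueName}` and `${wfPath}` parameters, the `app.zip` bundle name, the `PYTHONPATH` variable, the `create-index` argument and the transition targets.

```python
import xml.etree.ElementTree as ET

# Namespace of the Shell action extension (schema version is an assumption).
NS = {"s": "uri:oozie:shell-action:0.2"}

ACTION_XML = """
<action name="python-action">
  <shell xmlns="uri:oozie:shell-action:0.2">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <configuration>
      <property>
        <name>mapred.job.queue.name</name>
        <value>${queueName}</value>
      </property>
    </configuration>
    <exec>python</exec>
    <argument>app.zip</argument>
    <argument>create-index</argument>
    <env-var>PYTHONPATH=app.zip</env-var>
    <file>${wfPath}/app.zip#app.zip</file>
    <capture-output/>
  </shell>
  <ok to="next-action"/>
  <error to="fail"/>
</action>
"""


def shell_settings(action_xml):
    """Parse the action and return the three elements described above."""
    shell = ET.fromstring(action_xml).find("s:shell", NS)
    return {
        "queue": shell.find("s:configuration/s:property/s:value", NS).text,
        "env-var": shell.find("s:env-var", NS).text,
        "file": shell.find("s:file", NS).text,
    }
```

The `#app.zip` suffix on the `file` element gives the bundle a fixed name inside the container's working directory, which is why the `argument` can refer to it by that name.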

Of course, Python must be installed on the YARN nodes. It usually ships with the underlying Linux distribution, but it’s a best practice to install a version of your choice, using something like Anaconda.

A neat feature of the Shell action is the <capture-output/> tag. If it’s set, Oozie captures any output line formatted as property=value and lets you reuse it elsewhere in the workflow, for example to inject it into another action, with the following syntax: ${wf:actionData('python-action')['property']}.
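On the Python side, producing such a line only requires writing it to standard output. A minimal sketch, where the helper name `emit_property` and the `index` key are hypothetical:

```python
import sys


def emit_property(key, value):
    """Write one `key=value` line to stdout for <capture-output/> to pick up."""
    line = "%s=%s" % (key, value)
    sys.stdout.write(line + "\n")
    return line
```

After calling `emit_property("index", "logs-v1")` in the action named `python-action`, a later action could read the value with `${wf:actionData('python-action')['index']}`.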

2018-06-05T22:36:41+00:00 | March 6th, 2018 | Categories: Data Engineering

About the Author:

Big Data consultant @ Adaltas since 2015, I enjoy discovering stuff and experimenting with new technologies in addition to my day-to-day work.

Comments

  1. Sylvain Boucault @ StudioEtrange, March 21, 2018 at 1:16 am

    “Of course, we need to have Python installed on the YARN nodes (usually it’s shipped with the Linux distro underneath, but it’s a best practice to install one of your choice, using something like Anaconda).”

    To go further with this point:

    You should try to do the full packaging from a conda env, instead of only using pip to take care of dependencies, because conda takes care of the Python runtime.

    The same process you described (zipping and so on) can be used with a fully zipped conda env, so you don’t have to worry about having the right version of Python on every node.

    • César Berezowski, March 21, 2018 at 4:08 pm

      Thanks for the tip, I’ll check it out!
      However, your comment also strengthens my point: the Python ecosystem is great, but there are so many ways to do things that you easily get lost as a beginner.
