Execute Python in an Oozie workflow

Execute Python in an Oozie workflow

Oozie workflows allow you to use multiple actions to execute code, however doing so with Python can be a bit tricky, let’s see how to do that.

I’ve recently designed a workflow that would interact with ElasticSearch. The workflow is made of the followings sequential actions:

  • Create an index.
  • Inject a data set.
  • Set an alias on success.
  • Delete the index on failure.

There are multiple ways to interact with ElasticSearch: Java binary transport or REST API. The majority of languages offer wrapping libraries. To get the job done in Oozie, we defined multiple requirements :

  • Code must be portable, meaning include all its dependencies, because the cluster is offline, meaning not connected to the Internet.
  • Code must be easy to understand and written in a widely used language to avoid technical debt.
  • Prioritize a dynamic language comfortable with JSON and REST manipulation.
  • The application must accept multiple CLI entry points, at least one for each Oozie actions.

The original idea was using Bash. However, parsing ElasticSearch’s JSON responses would have been a pain. So we chose Python.

ElasticSearch & Python

A bit off-topic but good to know: Python is really well equipped to deal with ElasticSearch.

The library maintains support for ElasticSearch from version 2.x to 6.x (luckily, we’re on 2.x !) and is very easy to understand and use.

Here’s a sample opening a secure connection and creating an index :

Package Python code

Once the code is ready, we need to package it with all the dependencies. The workflow must be independent of any Internet access. Only the Python binary must be present, which is the case natively on our targeted Operating System, CentOS 7.

Python offers a lot of possibilities for packaging (Wheel, Egg (deprecated in favor of Wheel), Zip…), and associated resources and HOWTOs. However, chosing the right packaging strategy for a newcomer is challenging. Fortunately, Python natively supports packaging a code directory into a zip for further execution. The generated archive behave a bit like a .py  file.

Secondly, we need to download the dependencies locally and include them in the package.

Let’s say we have a project structured as following:

Here’s how we package it:

And finally we’d execute it like this:

Oozie workflow

Now that we have a valid Python package with our scripts, we must integrate it with our Oozie workflow.

There’s no such thing as a Python action in Oozie. We’ll use the closest and most flexible one, the Shell action.

As for any other action, Oozie prepares a container, injects the files you specify, and executes a command.

Here’s what the action would look like:

  • The configuration  specifies a user YARN queue to run the Oozie container.
  • env-var  sets the environment variable on the python bundle.
  • file injects the python bundle in the Oozie Shell action container.

Of course, we need to have Python installed on the YARN nodes (usually it’s shipped with the Linux distro underneath, but it’s a best practice to install one of your choice, using someting like Anaconda).

Some neat feature from Oozie on the Shell action is the  <capture-output/>  tag. If it’s set, Oozie will capture any line in the output that is formatted as  property=value  and allow to re-use it in the workflow to inject in another action with the following syntax: ${wf:actionData('python-action')['property']} .


By | 2018-03-07T13:41:45+00:00 March 6th, 2018|Categories: Data Engineering|Tags: , , |0 Comments

About the Author:

Big Data consultant @ Adaltas since 2015, Cesar enjoys discovering stuff and experimenting with new technologies in addition to his day to day work

Leave A Comment

Time limit is exhausted. Please reload the CAPTCHA.