Execute Python in an Oozie workflow

Oozie workflows let you chain multiple actions to execute code. Doing so with Python can be a bit tricky, so let’s see how to do it.

I recently designed a workflow that interacts with ElasticSearch. The workflow is made of the following sequential actions:

  • Create an index.
  • Inject a data set.
  • Set an alias on success.
  • Delete the index on failure.

There are multiple ways to interact with ElasticSearch: Java binary transport or REST API. The majority of languages offer wrapping libraries. To get the job done in Oozie, we defined multiple requirements:

  • Code must be portable and bundle all its dependencies, because the cluster is offline (not connected to the Internet).
  • Code must be easy to understand and written in a widely used language to avoid technical debt.
  • Prioritize a dynamic language comfortable with JSON and REST manipulation.
  • The application must accept multiple CLI entry points, at least one for each Oozie action.
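As a rough illustration of the last requirement, here is a minimal sketch of what one CLI entry point could look like with the standard argparse module. The --host and --index options are hypothetical names chosen for this example; the original project may use a different interface.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line of one hypothetical entry point (create_index)."""
    parser = argparse.ArgumentParser(prog="create_index")
    # --host and --index are illustrative names, not taken from the original project.
    parser.add_argument("--host", required=True, help="ElasticSearch URL")
    parser.add_argument("--index", required=True, help="Name of the index to create")
    return parser.parse_args(argv)


# Passing the list explicitly stands in for the arguments given on the command line.
args = parse_args(["--host", "https://elastic.host", "--index", "my_index"])
print(args.host, args.index)
```

Keeping one small module per action, each with its own parser, means every Oozie action can call its own entry point with only the arguments it needs.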

The original idea was using Bash. However, parsing ElasticSearch’s JSON responses would have been a pain. So we chose Python.

ElasticSearch & Python

A bit off-topic but good to know: Python is really well equipped to deal with ElasticSearch.

The official Python client (the elasticsearch package) maintains support for ElasticSearch from version 2.x to 6.x (luckily, we’re on 2.x!) and is very easy to understand and use.

Here’s a sample opening a secure connection and creating an index:

from elasticsearch import Elasticsearch

# Open a secure connection (user, pwd, elastic.host and port are placeholders)
client = Elasticsearch(["https://user:pwd@elastic.host:port"])
response = client.indices.create(index="my_index")

# ElasticSearch answers {"acknowledged": true} when the index is created
if response.get("acknowledged") is True:
    print("my_index created!")
else:
    print("Uh oh, there was an error...")
    print(response)
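The two other ElasticSearch steps of the workflow (set an alias on success, delete the index on failure) could follow the same pattern. The sketch below is an assumption about how set_alias.py and rollback.py might be written, not the original code. The function names are hypothetical, the client is passed in as a parameter, and responses are assumed to be plain dicts as in the client versions discussed here.

```python
def publish_alias(client, index, alias):
    """Point `alias` at `index`; return True when ElasticSearch acknowledges."""
    response = client.indices.put_alias(index=index, name=alias)
    return response.get("acknowledged") is True


def rollback(client, index):
    """Delete `index` after a failed load; return True when acknowledged."""
    response = client.indices.delete(index=index)
    return response.get("acknowledged") is True
```

Taking the client as an argument keeps these helpers easy to exercise with a stub object, which is handy on a cluster where no test ElasticSearch instance is reachable.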

Package Python code

Once the code is ready, we need to package it with all its dependencies, since the workflow must not depend on any Internet access. Only the Python interpreter needs to be present on the nodes, which is the case out of the box on our target operating system, CentOS 7.

Python offers a lot of possibilities for packaging (Wheel, Egg (deprecated in favor of Wheel), Zip…), along with associated resources and HOWTOs. However, choosing the right packaging strategy is challenging for a newcomer. Fortunately, Python natively supports packaging a code directory into a zip archive that can be executed later. The generated archive behaves a bit like a .py file.

We also need to download the dependencies locally and include them in the package.

Let’s say we have a project structured as follows:

my_python_project/
├── EsUtil.py
├── create_index.py
├── set_alias.py
└── rollback.py

Here’s how we package it:

cd my_python_project

# Locally install the dependencies
pip install -t ./ [dependency list]

# Compress everything
zip --recurse-paths --quiet -9 ../my_python_dist.zip ./*

And finally we’d execute it like this:

PYTHONPATH=/path/to/my_python_dist.zip python -m [filename without extension] [args]
PYTHONPATH=/path/to/my_python_dist.zip python -m create_index [args]
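To check this mechanism without touching the real project, the following sketch builds a throwaway archive and runs a module from it through PYTHONPATH. The module name hello_from_zip is hypothetical and only stands in for a script such as create_index.py. The sketch assumes Python 3.7 or later for subprocess.run(capture_output=...).

```python
import os
import subprocess
import sys
import tempfile
import zipfile

with tempfile.TemporaryDirectory() as workdir:
    bundle = os.path.join(workdir, "my_python_dist.zip")

    # Write a single module at the root of the archive, as "zip ./*" would do.
    with zipfile.ZipFile(bundle, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "hello_from_zip.py",
            "import sys\nprint('args:', sys.argv[1:])\n",
        )

    # Same effect as: PYTHONPATH=/path/to/my_python_dist.zip python -m hello_from_zip a b
    env = dict(os.environ, PYTHONPATH=bundle)
    result = subprocess.run(
        [sys.executable, "-m", "hello_from_zip", "a", "b"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    output = result.stdout.strip()

print(output)
```

The interpreter finds the module inside the archive because zip files listed in PYTHONPATH are searched like regular directories.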

Oozie workflow

Now that we have a valid Python package with our scripts, we must integrate it with our Oozie workflow.

There’s no such thing as a Python action in Oozie. We’ll use the closest and most flexible one, the Shell action.

As for any other action, Oozie prepares a container, injects the files you specify, and executes a command.

Here’s what the action would look like:

<action name="python-action">
    <shell xmlns="uri:oozie:shell-action:0.2">
        <job-tracker>${clusterJobtracker}</job-tracker>
        <name-node>${clusterNamenode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${jobQueue}</value>
            </property>
        </configuration>
        <exec>python</exec> <!-- python2 if necessary -->
        <argument>-m</argument>
        <argument>create_index</argument>
        <argument>arg2</argument>
        <argument>arg3</argument>
        <env-var>PYTHONPATH=pyBundle</env-var>
        <file>my_python_dist.zip#pyBundle</file>
    </shell>
    <ok to="end"/>
    <error to="end"/>
</action>

  • configuration sets the YARN queue in which the Oozie container runs.
  • env-var sets PYTHONPATH to the Python bundle.
  • file ships the Python bundle into the Oozie Shell action container; the #pyBundle suffix is the name it is linked under in the working directory.
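The Shell action decides between the ok and error transitions from the exit status of the command: zero follows ok, anything else follows error. A Python entry point therefore needs to turn a failed ElasticSearch call into a non-zero exit code. The helper below is a minimal sketch of that idea; its name is hypothetical and the response is assumed to be a plain dict.

```python
def exit_code_for(response):
    """Map an ElasticSearch response (assumed to be a dict) to a process exit code."""
    # 0 lets the Shell action follow its ok transition; 1 sends it to error.
    return 0 if response.get("acknowledged") is True else 1


# A real entry point would typically end with:
#     sys.exit(exit_code_for(response))
print(exit_code_for({"acknowledged": True}))
print(exit_code_for({"error": "index_already_exists_exception"}))
```

With this in place, the error transition of the injection action can point to the rollback action instead of end.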

Of course, Python must be installed on the YARN nodes. It usually ships with the underlying Linux distribution, but it’s a best practice to install a version of your choice, using something like Anaconda.

A neat feature of the Oozie Shell action is the <capture-output/> tag. If it’s set, Oozie captures any line of the output that is formatted as property=value and lets you reuse it in the workflow, for example to inject it into another action, with the following syntax: ${wf:actionData('python-action')['property']}.
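On the Python side, producing such output only requires printing key=value lines on standard output. The sketch below shows one way to do it; the helper name and the index_name and doc_count properties are hypothetical and chosen purely for illustration.

```python
def format_properties(values):
    """Render a dict as key=value lines, the shape Oozie captures from stdout."""
    return "\n".join("{}={}".format(key, value) for key, value in values.items())


# index_name and doc_count are illustrative property names.
captured = format_properties({"index_name": "my_index", "doc_count": 42})
print(captured)
```

A later action could then read the first value with ${wf:actionData('python-action')['index_name']}.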
