Get in control of your workflows with Apache Airflow

Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder.

Introduction

Use case: how to handle data coming in regularly from customers?

Option 1: use cron
- only time-based triggers
- error handling is hard
- inconvenient when runs overlap
Option 2: write your own workflow processing tool
- the start is easy
- you soon reach its limits and end up investing much more work in it than envisioned
Option 3: use an open source workflow processing tool
- multiple options
- they chose Apache Airflow @ BlueYonder

Apache Airflow

Apache Airflow is a workflow scheduler like Apache Oozie or Azkaban

Written in [Python](https://www.python.org/)
Workflows are defined in Python
Interface with a view of present & past runs and also logging
Extensible with plugins
Active development and community
Provides a nice UI and a REST interface
Relatively lightweight (2 processes on a server & a database)

Development

An Airflow job is a DAG (directed acyclic graph) composed of multiple operators, each operator being one step of the job, plus sensors that wait for inputs to become available. Since workflows are plain Python, you build your DAG yourself, operator by operator.
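To make the "DAG of steps" idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+). It is not Airflow's API: the step names and the `execution_order` helper are hypothetical, and it only illustrates how declaring dependencies between steps yields a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical steps for a "customer data delivery" job.
# Each key maps to the set of steps that must finish before it can start.
dependencies = {
    "wait_for_delivery": set(),
    "validate": {"wait_for_delivery"},
    "load": {"validate"},
    "report": {"load"},
    "archive": {"validate"},
}


def execution_order(deps):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Steps without a dependency between them (here `load` and `archive`) may appear in either order, which is exactly the freedom a scheduler can use to run them in parallel.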

Many operators are available in Airflow:

BashOperator
SimpleHttpOperator
…

and sensors:

HttpSensor
HdfsSensor
…

or you can develop your own operator/sensor in Python. Also, Airflow supports branching of the workflow through custom operators.
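As a rough illustration of what a sensor does, the sketch below polls a condition until it becomes true or a timeout expires. It is plain Python, not an Airflow class. The parameter names `poke_interval` and `timeout` are borrowed from Airflow's sensor vocabulary, while `poke_until` and `delivery_arrived` are hypothetical names made up for this example.

```python
import time


def poke_until(condition, poke_interval=0.01, timeout=1.0):
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poke_interval)


# Hypothetical condition: the delivery only shows up on the third check.
checks = {"count": 0}


def delivery_arrived():
    checks["count"] += 1
    return checks["count"] >= 3


arrived = poke_until(delivery_arrived)
```

In a real workflow the condition would check something external, such as an HTTP endpoint or a file on HDFS, and the downstream steps would only start once it succeeds.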

State handling

Variables are relative to the Airflow instance
XComs (cross-communication messages between tasks) are relative to the DAG run / task
Both kinds of state are persisted in the database
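The difference between the two scopes can be sketched with two dictionaries. This is only a toy model: in-memory dicts stand in for the database, and every name here (`set_variable`, `push_message`, `pull_message`, the DAG and task identifiers) is hypothetical rather than Airflow's API.

```python
# Instance-wide settings: one value per key, visible to every DAG and run.
variables = {}

# Task-to-task messages: scoped by DAG, run and task, so runs never collide.
messages = {}


def set_variable(key, value):
    variables[key] = value


def push_message(dag_id, run_id, task_id, key, value):
    messages[(dag_id, run_id, task_id, key)] = value


def pull_message(dag_id, run_id, task_id, key):
    # Returns None when nothing was pushed for this exact scope.
    return messages.get((dag_id, run_id, task_id, key))


set_variable("inbox_path", "/data/inbox")  # hypothetical setting
push_message("ingest", "2017-01-01", "validate", "row_count", 120)
push_message("ingest", "2017-01-02", "validate", "row_count", 95)
```

Each run of the `validate` step keeps its own `row_count`, while `inbox_path` is shared by everything running on the instance.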

Deployment

Two processes and a database:

scheduler
webserver
database PostgreSQL, SQLite, …

Notes

Airflow doesn’t handle user impersonation; you have to do it yourself
High Availability isn’t handled natively by Airflow
The presented use case had no need to connect to services with Kerberos & High Availability

Conclusion

Airflow seems to be a very nice alternative to Oozie and its XML workflows. We would have loved for it to be in JavaScript with NodeJS instead of Python!
