Get in control of your workflows with Apache Airflow

Presentation by Christian Trebing from BlueYonder

Introduction

Use case: how to handle data coming in regularly from customers?

  • Option 1: use cron
    • only time-based triggers
    • error handling is hard
    • inconvenient when runs overlap
  • Option 2: write your own workflow processing tool
    • starting is easy
    • you soon reach its limits and end up investing far more work in it than envisioned
  • Option 3: use an open source workflow processing tool
    • multiple options exist
    • BlueYonder chose Apache Airflow

Apache Airflow

Apache Airflow is a workflow scheduler like Apache Oozie or Azkaban

  • Written in Python
  • Workflows are defined in Python
  • Web interface showing current and past runs, with access to logs
  • Extensible with plugins
  • Active development and community
  • Provides a nice UI and a REST interface
  • Relatively lightweight (two processes on a server and a database)

Development

An Airflow job is composed of operators, each one being a single step of the job, and sensors, which wait for inputs to become available. A workflow is a Python file in which you build the DAG (directed acyclic graph) yourself, operator by operator.
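To make the idea of building a graph step by step more concrete, here is a minimal sketch in plain Python. It is a toy model and not Airflow's API: the `Step` class, the `run` function and the step names are hypothetical, and the `>>` chaining only imitates the dependency syntax found in recent Airflow versions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Step:
    """Toy stand-in for an operator: one named step of a job."""

    def __init__(self, name, action):
        self.name = name
        self.action = action   # callable executed when the step runs
        self.upstream = set()  # steps that must finish before this one

    def __rshift__(self, other):
        """`a >> b` declares that b runs after a; returns b so chains work."""
        other.upstream.add(self)
        return other


def run(steps):
    """Execute the steps in an order that respects every dependency."""
    graph = {step: step.upstream for step in steps}
    order = []
    for step in TopologicalSorter(graph).static_order():
        step.action()
        order.append(step.name)
    return order


log = []
fetch = Step("fetch", lambda: log.append("fetched customer file"))
validate = Step("validate", lambda: log.append("validated rows"))
load = Step("load", lambda: log.append("loaded into warehouse"))

# Build the graph one step at a time.
fetch >> validate >> load

# The steps are passed in a scrambled order on purpose;
# the dependencies decide the execution order.
executed = run([load, fetch, validate])
print(executed)
```

In a real workflow file, the steps would be Airflow operators and the scheduler, not a `run` function, would decide when each one executes.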

Many operators ship with Airflow, for example:

  • BashOperator
  • SimpleHttpOperator

and sensors:

  • HttpSensor
  • HdfsSensor

or you can develop your own operator/sensor in Python. Also, Airflow supports branching of the workflow through custom operators.
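The sketch below illustrates, in plain Python, what a custom sensor and a branching step boil down to. It is a toy model under stated assumptions and does not use Airflow: `ToySensor`, `ThirdTimeLucky` and `choose_branch` are hypothetical names, and the `poke`, `poke_interval` and `timeout` vocabulary is only loosely borrowed from Airflow's sensors.

```python
import time


class ToySensor:
    """Toy model of a sensor: poll a condition until it holds or time runs out."""

    def __init__(self, poke_interval=0.01, timeout=1.0):
        self.poke_interval = poke_interval  # seconds between two checks
        self.timeout = timeout              # give up after this many seconds

    def poke(self):
        """Return True once the awaited input is available."""
        raise NotImplementedError

    def execute(self):
        deadline = time.monotonic() + self.timeout
        while not self.poke():
            if time.monotonic() >= deadline:
                raise TimeoutError("input never showed up")
            time.sleep(self.poke_interval)
        return True


class ThirdTimeLucky(ToySensor):
    """Hypothetical sensor whose input shows up on the third check."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.pokes = 0

    def poke(self):
        self.pokes += 1
        return self.pokes >= 3


def choose_branch(row_count):
    """Toy branching step: return the name of the step to run next."""
    return "load" if row_count > 0 else "notify_empty_delivery"


sensor = ThirdTimeLucky()
sensor.execute()
print(sensor.pokes)                # number of checks before the input arrived
print(choose_branch(row_count=0))  # step a branching operator would pick
```

A custom sensor only has to answer "is the input there yet?", while the polling loop and the timeout are handled by the base class.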

State handling

  • Variables are relative to the Airflow instance
  • XComs (cross-communication messages) are relative to the DAG run / task
  • Both kinds of state are persisted in the database
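The difference between the two scopes can be sketched with an in-memory SQLite database. This is a toy model: the table layout, the helper functions and the sample values are assumptions made for illustration and do not reproduce Airflow's actual schema or API. Instance-wide variables are keyed by name only, while per-run messages (XComs in Airflow) are keyed by DAG, run and task.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE variable (key TEXT PRIMARY KEY, value TEXT);
    CREATE TABLE xcom (
        dag_id TEXT, run_id TEXT, task_id TEXT, key TEXT, value TEXT,
        PRIMARY KEY (dag_id, run_id, task_id, key)
    );
""")


def set_variable(key, value):
    """Instance-wide state: one value per key, shared by every run."""
    db.execute("INSERT OR REPLACE INTO variable VALUES (?, ?)", (key, value))


def get_variable(key):
    row = db.execute(
        "SELECT value FROM variable WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else None


def xcom_push(dag_id, run_id, task_id, key, value):
    """Run-scoped state: one value per (DAG, run, task, key)."""
    db.execute(
        "INSERT OR REPLACE INTO xcom VALUES (?, ?, ?, ?, ?)",
        (dag_id, run_id, task_id, key, value),
    )


def xcom_pull(dag_id, run_id, task_id, key):
    row = db.execute(
        "SELECT value FROM xcom "
        "WHERE dag_id = ? AND run_id = ? AND task_id = ? AND key = ?",
        (dag_id, run_id, task_id, key),
    ).fetchone()
    return row[0] if row else None


# Hypothetical values: one shared setting, two runs of the same task.
set_variable("customer_inbox", "/data/inbox")
xcom_push("import", "run_2016_07_16", "fetch", "row_count", "120")
xcom_push("import", "run_2016_07_17", "fetch", "row_count", "95")

print(get_variable("customer_inbox"))
print(xcom_pull("import", "run_2016_07_16", "fetch", "row_count"))
print(xcom_pull("import", "run_2016_07_17", "fetch", "row_count"))
```

Each run sees its own `row_count`, while `customer_inbox` is the same for all of them.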

Deployment

Two processes and a database:

  • scheduler
  • webserver
  • database (PostgreSQL, SQLite, …)

Notes

  • Airflow doesn’t handle user impersonation; you have to do it yourself
  • High Availability isn’t handled natively by Airflow
  • The presented use case had no need to connect to services with Kerberos & High Availability

Conclusion

Airflow seems to be a very nice alternative to Oozie and its XML workflows. We would have loved for it to be written in JavaScript with Node.js instead of Python!

July 17th, 2016 | Categories: Events

About the Author:

Big Data consultant @ Adaltas since 2015, Cesar enjoys discovering and experimenting with new technologies in addition to his day-to-day work.
