Get in control of your workflows with Apache Airflow

César BEREZOWSKI

Jul 17, 2016


Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder.

Introduction

Use case: how to handle data coming in regularly from customers?

  • Option 1: use CRON
    • only time triggers
    • error handling is hard
    • inconvenient when runs overlap
  • Option 2: write your own workflow processing tool
    • starting is easy
    • you soon reach its limits and end up investing much more work than envisioned
  • Option 3: use an open source workflow processing tool
    • multiple options
    • they chose Apache Airflow @ BlueYonder

Apache Airflow

Apache Airflow is a workflow scheduler, like Apache Oozie or Azkaban.

  • Written in [Python](https://www.python.org/)
  • Workflows are defined in Python
  • Interface showing present and past runs, along with their logs
  • Extensible with plugins
  • Active development and community
  • Provides a nice UI and a REST interface
  • Relatively lightweight (2 processes on a server & a database)

Development

An Airflow job is composed of multiple operators, each operator being one step of the job, plus sensors to read inputs. Since workflows are written in Python, you build your DAG (directed acyclic graph) yourself, operator by operator.
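To make the idea of building a DAG operator by operator more concrete, here is a minimal sketch in plain Python. It does not use Airflow's API: the `Operator` class, `set_upstream` method and `run_dag` function are hypothetical names chosen for illustration, and the scheduling is reduced to a topological sort from the standard library.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Operator:
    """One step of a job (hypothetical class, not Airflow's own)."""

    def __init__(self, task_id, action):
        self.task_id = task_id
        self.action = action      # callable executed when the step runs
        self.upstream = []        # operators that must run before this one

    def set_upstream(self, other):
        self.upstream.append(other)


def run_dag(operators):
    """Run each operator once all of its upstream operators have run."""
    graph = {op.task_id: {up.task_id for up in op.upstream} for op in operators}
    by_id = {op.task_id: op for op in operators}
    executed = []
    for task_id in TopologicalSorter(graph).static_order():
        by_id[task_id].action()
        executed.append(task_id)
    return executed


# Build the DAG step by step, then declare the dependencies.
log = []
fetch = Operator("fetch", lambda: log.append("fetched customer data"))
clean = Operator("clean", lambda: log.append("cleaned data"))
load = Operator("load", lambda: log.append("loaded data"))
clean.set_upstream(fetch)
load.set_upstream(clean)

# The order in which operators are listed does not matter.
order = run_dag([load, clean, fetch])
print(order)
```

Whatever order the operators are passed in, the declared dependencies decide the execution order. A real scheduler adds triggers, retries, logging and persistence on top of this.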

Many operators are available in Airflow:

  • BashOperator
  • SimpleHttpOperator

and sensors:

  • HttpSensor
  • HdfsSensor

or you can develop your own operator/sensor in Python. Airflow also supports branching within a workflow through custom operators.
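As a rough illustration of what a custom sensor and a branching operator do, the sketch below uses plain Python rather than Airflow's base classes. The names `Sensor`, `BranchOperator`, `poke`, `choose` and the two branch names are assumptions made for this example only.

```python
import time


class Sensor:
    """Polls a condition until it holds (hypothetical, not Airflow's base class)."""

    def __init__(self, poke, max_tries=5, poke_interval=0.0):
        self.poke = poke                    # callable returning True when the input is ready
        self.max_tries = max_tries
        self.poke_interval = poke_interval  # seconds to wait between two checks

    def wait(self):
        for attempt in range(1, self.max_tries + 1):
            if self.poke():
                return attempt
            time.sleep(self.poke_interval)
        raise TimeoutError("input never arrived")


class BranchOperator:
    """Runs exactly one of several downstream steps, chosen at run time."""

    def __init__(self, choose, branches):
        self.choose = choose        # callable returning the name of the branch to follow
        self.branches = branches    # mapping of branch name to callable

    def execute(self, context):
        name = self.choose(context)
        return name, self.branches[name](context)


# The simulated input becomes available on the third check.
arrivals = iter([False, False, True])
sensor = Sensor(poke=lambda: next(arrivals))
attempts = sensor.wait()

branch = BranchOperator(
    choose=lambda ctx: "full_load" if ctx["rows"] > 1000 else "incremental_load",
    branches={
        "full_load": lambda ctx: f"reloading {ctx['rows']} rows",
        "incremental_load": lambda ctx: f"appending {ctx['rows']} rows",
    },
)
chosen, result = branch.execute({"rows": 42})
print(attempts, chosen, result)
```

The sensor blocks the workflow until its input is available, while the branching operator decides which path the rest of the workflow follows.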

State handling

  • Variables are relative to the Airflow instance
  • Cross-communications (XComs) are relative to the DAG run / task
  • Both kinds of state are persisted in the database
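The two scopes of state listed above can be sketched with a small database-backed store. This is only an illustration using an in-memory SQLite database: the table layout and the helper names (`set_variable`, `get_variable`, `push_message`, `pull_message`) are assumptions for the example and do not reflect Airflow's actual schema or API.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Instance-wide state: one value per key, shared by every workflow.
db.execute("CREATE TABLE variable (key TEXT PRIMARY KEY, value TEXT)")

# Run-scoped state: values are attached to a given DAG run and task.
db.execute(
    "CREATE TABLE task_message ("
    "run_id TEXT, task_id TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (run_id, task_id, key))"
)


def set_variable(key, value):
    db.execute("INSERT OR REPLACE INTO variable VALUES (?, ?)", (key, value))


def get_variable(key):
    row = db.execute("SELECT value FROM variable WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None


def push_message(run_id, task_id, key, value):
    db.execute(
        "INSERT OR REPLACE INTO task_message VALUES (?, ?, ?, ?)",
        (run_id, task_id, key, value),
    )


def pull_message(run_id, task_id, key):
    row = db.execute(
        "SELECT value FROM task_message WHERE run_id = ? AND task_id = ? AND key = ?",
        (run_id, task_id, key),
    ).fetchone()
    return row[0] if row else None


set_variable("customer_endpoint", "https://example.invalid/data")
push_message("run_2016_07_17", "fetch", "row_count", "42")

print(get_variable("customer_endpoint"))
print(pull_message("run_2016_07_17", "fetch", "row_count"))
print(pull_message("run_2016_07_18", "fetch", "row_count"))  # another run sees nothing
```

A value pushed during one run is invisible to the next run, whereas a variable is shared by everything running on the instance.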

Deployment

Two processes (typically the scheduler and the web server) and a database.

Notes

  • Airflow doesn’t handle user impersonation; you have to do it yourself
  • High Availability isn’t handled natively by Airflow
  • The presented use case had no need to connect to services with Kerberos & High Availability

Conclusion

Airflow seems to be a very nice alternative to Oozie and its XML workflows. We would have loved for it to be in JavaScript with Node.js instead of Python!
