Take control of your workflows with Apache Airflow
Jul 17, 2016
Below is a compilation of my notes taken during the presentation of Apache Airflow by Christian Trebing from BlueYonder.
Use case: how to handle data coming in regularly from customers?
- Option 1: use cron
  - only time-based triggers
  - error handling is hard
  - inconvenient when job runs overlap
- Option 2: write your own workflow processing tool
  - getting started is easy
  - you soon reach its limits: it requires much more work than envisioned
- Option 3: use an open source workflow processing tool
  - multiple options exist
  - at BlueYonder, they chose Apache Airflow
Apache Airflow is a workflow scheduler like Apache Oozie or Azkaban
- Written in [Python](https://www.python.org/)
- Workflows are defined in Python
- Interface with a view of present & past runs and also logging
- Extensible with plugins
- Active development and community
- Provides a nice UI and a REST interface
- Relatively lightweight (2 processes on a server & a database)
An Airflow job is composed of multiple operators, each operator being one step of the job, and of sensors that wait for incoming data. In a Python workflow, you build your DAG yourself, operator by operator.
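As a minimal sketch of what such a DAG looks like (the DAG name, schedule, and task logic below are made up for illustration, using the Airflow 1.x API of the time):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

# Defaults applied to every task of this DAG
default_args = {
    'owner': 'data-team',
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

# One DAG run per day, starting July 1st 2016 (hypothetical values)
dag = DAG('customer_import',
          default_args=default_args,
          start_date=datetime(2016, 7, 1),
          schedule_interval='@daily')

def transform():
    print('transforming the customer files')  # placeholder logic

# Each operator is one step of the job
extract = BashOperator(task_id='extract',
                       bash_command='echo "fetching customer files"',
                       dag=dag)
process = PythonOperator(task_id='transform',
                         python_callable=transform,
                         dag=dag)

# Wire the DAG together, operator by operator
extract.set_downstream(process)
```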
Many operators ship with Airflow (running shell commands, calling Python functions, sending emails, querying databases, …), or you can develop your own operator/sensor in Python. Also, Airflow supports branching of the workflow through custom operators.
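For instance, a custom sensor only needs to subclass `BaseSensorOperator` and implement `poke()`. The sketch below assumes the Airflow 1.x module layout; the class name and file-based condition are hypothetical:

```python
import os

from airflow.operators.sensors import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class CustomerFileSensor(BaseSensorOperator):
    """Waits until a customer file shows up on the local filesystem."""

    @apply_defaults
    def __init__(self, filepath, *args, **kwargs):
        super(CustomerFileSensor, self).__init__(*args, **kwargs)
        self.filepath = filepath

    def poke(self, context):
        # Called every poke_interval seconds until it returns True
        return os.path.exists(self.filepath)
```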
Two kinds of state, both persisted in the database (in two separate tables):
- Variables are global to the Airflow instance
- XComs (cross-communications) are scoped to the DAG run / task
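A hedged sketch of both mechanisms (the keys, values, and task names are made up; the two callables are meant to run in `PythonOperator` tasks created with `provide_context=True`):

```python
from airflow.models import Variable

# Instance-wide state, shared by all DAGs of this Airflow installation
Variable.set('incoming_dir', '/data/customers/incoming')
incoming = Variable.get('incoming_dir')

# Run-scoped state (XComs), exchanged between tasks of one DAG run
def count_rows(**context):
    # Push a value from this task instance...
    context['ti'].xcom_push(key='row_count', value=42)

def report(**context):
    # ...and pull it in a downstream task of the same run
    count = context['ti'].xcom_pull(task_ids='count_rows', key='row_count')
    print('processed %s rows' % count)
```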
Two processes and a database (see the commands below):
- the webserver, serving the UI and the REST interface
- the scheduler, triggering the DAG runs
- a database: PostgreSQL, SQLite, …
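Concretely, with the 1.x CLI (the PostgreSQL connection string is an assumption, configured through `sql_alchemy_conn` in `airflow.cfg`):

```bash
# Initialize the metadata database declared in airflow.cfg, e.g.
# sql_alchemy_conn = postgresql+psycopg2://airflow@localhost/airflow
airflow initdb

# The two long-running processes
airflow webserver    # serves the UI and the REST interface
airflow scheduler    # triggers and monitors the DAG runs
```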
- Airflow doesn’t handle user impersonation; you have to implement it yourself
- High Availability isn’t handled natively by Airflow
- The presented use case had no need for Kerberos-secured services or High Availability