Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex.


Apache Apex is an in-memory distributed parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind :

  • Yarn is used for resource managing and ordonnancing
  • HDFS is used to store persistant states

Application development model


  • A stream is a sequence of tuples
  • An operator :
    • takes one or more input streams as input
    • performs custom computation on the tuples (logic is in Java)
    • emits one or more input streams
    • has many parallel instances, each single threaded
    • uses the DAG model to optimize computation
  • An application is a suite of operators

 Development process

A typical WordCount setup with Apex looks like this :

  • Apache Kafka brings the data
  • The Apex application processes through the following operators :
    • Kafka input
    • Parser
    • Filter
    • Word counter
    • JDBC output
  • End data is written in DB

The development process goes like this :

  • Take an operator from existing libraries or implement a custom logic
  • Connect the operators to form an application
  • Configure the operators properties
  • Configure scaling & platform attributes
  • Test functionalities, performance and iterate

Operator libraries

Apex provides very complete operator libraries through Apache Apex Malhar :

  • Messaging (Kafka, ActiveMQ, …)
  • NoSQL (HBase, Cassandra, MongoDB, Redis, CouchDB, …)
  • RDBMS (JDBC, MySQL, …)
  • FileSystem (HDFS / Hive, …)


Apache Apex Malhar operators


  • Apex uses Apache BEAM for the job implementation so he enjoys its multiple benfits :
    • dynamic partition at runtime
    • load-balancing between operators
    • windowing
  • For fault tolerance, the operators states are checkpointed and persisted
  • Apex processing guarentees for
    • at least once
    • at most once
    • exactly once