Apache Apex : next gen Big Data analytics

Presentation by Thomas Weise from DataTorrent (developers of Apex)

Introduction

Apache Apex is an in-memory, distributed, parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind:

  • YARN is used for resource management and scheduling
  • HDFS is used to store persistent state

Application development model

 

  • A stream is a sequence of tuples
  • An operator:
    • takes one or more streams as input
    • performs custom computation on the tuples (logic is in Java)
    • emits one or more output streams
    • has many parallel instances, each single-threaded
  • An application is a set of operators connected by streams into a directed acyclic graph (DAG); the DAG model is used to optimize computation
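
To make the model more concrete, here is a minimal plain-Java sketch of tuples flowing through connected operators. It does not use the Apex API: `SimpleOperator`, `connect`, `connectSink` and `ModelDemo` are hypothetical names chosen for illustration, and the sketch runs in a single thread with no partitioning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

// Hypothetical model, not the Apex API: an operator receives tuples on an
// input and emits results to whatever is connected to its output.
class SimpleOperator<I, O> {
    private final Function<I, O> logic;          // custom per-tuple computation
    private Consumer<O> downstream = t -> { };   // output stream (unconnected by default)

    SimpleOperator(Function<I, O> logic) {
        this.logic = logic;
    }

    // Connect this operator's output stream to the next operator's input.
    <N> SimpleOperator<O, N> connect(SimpleOperator<O, N> next) {
        this.downstream = next::process;
        return next;
    }

    // Connect the output stream to a terminal sink.
    void connectSink(Consumer<O> sink) {
        this.downstream = sink;
    }

    // Process one tuple and emit the result downstream.
    void process(I tuple) {
        downstream.accept(logic.apply(tuple));
    }
}

class ModelDemo {
    // Builds a two-operator application and pushes a stream of tuples through it.
    static List<Integer> run(List<String> input) {
        SimpleOperator<String, String> trim = new SimpleOperator<>(String::trim);
        SimpleOperator<String, Integer> length = new SimpleOperator<>(String::length);
        List<Integer> results = new ArrayList<>();
        trim.connect(length).connectSink(results::add);
        for (String tuple : input) {
            trim.process(tuple);
        }
        return results;
    }
}
```

Here the application is the chain trim → length → sink. In a real engine each operator would run as several parallel instances, and the chain could branch into a larger graph.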

Development process

A typical WordCount setup with Apex looks like this:

  • Apache Kafka delivers the input data
  • The Apex application processes it through the following operators:
    • Kafka input
    • Parser
    • Filter
    • Word counter
    • JDBC output
  • The final results are written to a database
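
The middle of that chain (parser, filter, word counter) can be sketched in plain Java as follows. This is only an illustration of the logic each operator would carry, not Apex or Malhar code: the class `WordCountSketch`, its method names and the stop-word filter are assumptions, and the Kafka input and JDBC output are replaced by an in-memory list and a map.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for the operator chain; not Apex or Malhar code.
class WordCountSketch {
    // Parser: split one raw line into lowercase word tuples.
    static List<String> parse(String line) {
        return Arrays.asList(line.toLowerCase().trim().split("\\s+"));
    }

    // Filter: drop empty tokens and stop words (hypothetical filtering rule).
    static boolean keep(String word, Set<String> stopWords) {
        return !word.isEmpty() && !stopWords.contains(word);
    }

    // Word counter: keeps running counts as operator state.
    static Map<String, Integer> count(List<String> lines, Set<String> stopWords) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : parse(line)) {
                if (keep(word, stopWords)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }
}
```

In an actual Apex application each of these methods would live in its own operator, with input and output operators at either end of the chain.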

The development process goes like this:

  • Take an operator from existing libraries or implement custom logic
  • Connect the operators to form an application
  • Configure the operators' properties
  • Configure scaling and platform attributes
  • Test functionality and performance, then iterate

Operator libraries

Apex provides extensive operator libraries through Apache Apex Malhar:

  • Messaging (Kafka, ActiveMQ, …)
  • NoSQL (HBase, Cassandra, MongoDB, Redis, CouchDB, …)
  • RDBMS (JDBC, MySQL, …)
  • FileSystem (HDFS / Hive, …)

 

Apache Apex Malhar operators

Notes

  • Apex uses Apache Beam for job implementation, so it enjoys several benefits:
    • dynamic partitioning at runtime
    • load-balancing between operators
    • windowing
  • For fault tolerance, operator state is checkpointed and persisted
  • Apex offers the following processing guarantees:
    • at least once
    • at most once
    • exactly once
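
As an illustration of how checkpointing relates to these guarantees, here is a small plain-Java simulation of checkpoint-and-replay recovery. It is a sketch under simplifying assumptions, not Apex code: `CheckpointSketch`, the `interval` and `failAt` parameters and the single simulated failure are all hypothetical. The operator keeps a running sum, saves its state every `interval` tuples, and after the failure restores the last saved state and replays the input from that position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of checkpoint-based recovery; not Apex code.
class CheckpointSketch {
    // Sums the input, checkpointing every `interval` tuples, and fails once
    // right before processing the tuple at index `failAt` (use -1 for no failure).
    // Returns every value emitted downstream, including re-emitted ones.
    static List<Integer> run(int[] input, int interval, int failAt) {
        List<Integer> emitted = new ArrayList<>();
        int sum = 0;        // operator state
        int savedSum = 0;   // last persisted checkpoint of the state
        int savedPos = 0;   // input position the checkpoint corresponds to
        boolean failed = false;
        int pos = 0;
        while (pos < input.length) {
            if (pos == failAt && !failed) {
                failed = true;
                sum = savedSum;   // restore state from the checkpoint
                pos = savedPos;   // replay input from the checkpointed position
                continue;
            }
            sum += input[pos];
            emitted.add(sum);
            pos++;
            if (pos % interval == 0) {
                savedSum = sum;   // persist a new checkpoint
                savedPos = pos;
            }
        }
        return emitted;
    }
}
```

With input 1, 2, 3, 4, a checkpoint every 2 tuples and a failure before the fourth tuple, the third tuple is processed twice, so its result is emitted twice downstream. That is at-least-once behaviour: the operator state ends up correct, but avoiding duplicates in the output needs extra measures on the output side.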

 

July 17th, 2016 | Categories: Events

About the Author:

Big Data consultant @ Adaltas since 2015, Cesar enjoys discovering stuff and experimenting with new technologies in addition to his day-to-day work.
