Apache Apex: next-gen Big Data analytics

Presentation by Thomas Weise from DataTorrent (the developers of Apex)

Introduction

Apache Apex is an in-memory, distributed, parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind:

  • YARN is used for resource management and scheduling
  • HDFS is used to store persistent state

Application development model


  • A stream is a sequence of tuples
  • An operator (see the sketch after this list):
    • takes one or more streams as input
    • performs custom computation on the tuples (the logic is written in Java)
    • emits one or more output streams
    • has many parallel instances, each single-threaded
  • An application is a DAG of operators; the engine uses this DAG model to optimize the computation
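
As an illustration, here is a minimal sketch of a custom operator, assuming the BaseOperator and port classes from the Apex API; the LineSplitter name and its splitting logic are illustrative:

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    // Illustrative operator: splits incoming lines into words.
    public class LineSplitter extends BaseOperator {

      // Output stream: emits one tuple per word.
      public final transient DefaultOutputPort<String> words = new DefaultOutputPort<>();

      // Input stream: process() holds the custom logic, called once per tuple.
      public final transient DefaultInputPort<String> lines = new DefaultInputPort<String>() {
        @Override
        public void process(String line) {
          for (String word : line.split("\\s+")) {
            words.emit(word);
          }
        }
      };
    }

The ports are declared transient because they are wired by the engine at deployment time rather than checkpointed along with the operator state.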

Development process

A typical WordCount setup with Apex looks like this:

  • Apache Kafka delivers the data
  • The Apex application processes it through the following operators (wiring sketched below):
    • Kafka input
    • Parser
    • Filter
    • Word counter
    • JDBC output
  • The resulting data is written to the database
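
A sketch of how these operators could be wired into an application follows; the operator classes (KafkaInput, Parser, Filter, WordCounter, JdbcOutput) and their port names are placeholders standing in for real Malhar or custom implementations:

    import org.apache.hadoop.conf.Configuration;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.api.annotation.ApplicationAnnotation;

    @ApplicationAnnotation(name = "WordCount")
    public class WordCountApp implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        // Add the operators to the DAG, each under a unique name.
        KafkaInput kafka = dag.addOperator("kafkaInput", new KafkaInput());
        Parser parser = dag.addOperator("parser", new Parser());
        Filter filter = dag.addOperator("filter", new Filter());
        WordCounter counter = dag.addOperator("wordCounter", new WordCounter());
        JdbcOutput jdbc = dag.addOperator("jdbcOutput", new JdbcOutput());

        // Streams connect an output port to the next operator's input port.
        dag.addStream("lines", kafka.output, parser.input);
        dag.addStream("parsed", parser.output, filter.input);
        dag.addStream("filtered", filter.output, counter.input);
        dag.addStream("counts", counter.output, jdbc.input);
      }
    }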

The development process goes like this:

  • Take an operator from the existing libraries or implement custom logic
  • Connect the operators to form an application
  • Configure the operators' properties
  • Configure scaling & platform attributes (see the sketch after this list)
  • Test functionality and performance, then iterate
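
Steps 3 and 4 can be done in code or in an XML configuration file; below is a sketch of the in-code route, reusing the hypothetical WordCounter operator from the earlier example and assuming the attribute names from the Apex OperatorContext API:

    import org.apache.hadoop.conf.Configuration;
    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.common.partitioner.StatelessPartitioner;

    public class ScaledWordCountApp implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        WordCounter counter = dag.addOperator("wordCounter", new WordCounter());

        // Scaling attribute: start the word counter with 4 parallel partitions.
        dag.setAttribute(counter, OperatorContext.PARTITIONER,
            new StatelessPartitioner<WordCounter>(4));

        // Platform attribute: request 1 GB of container memory per partition.
        dag.setAttribute(counter, OperatorContext.MEMORY_MB, 1024);
      }
    }

Operator properties follow the same pattern: plain Java setters on the operator instance, or configuration keys of the form dt.operator.<name>.prop.<property>, which avoids recompiling the application.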

Operator libraries

Apex provides comprehensive operator libraries through Apache Apex Malhar:

  • Messaging (Kafka, ActiveMQ, …)
  • NoSQL (HBase, Cassandra, MongoDB, Redis, CouchDB, …)
  • RDBMS (JDBC, MySQL, …)
  • FileSystem (HDFS / Hive, …)
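
For instance, an application reusing only Malhar operators could look like the sketch below; it assumes the Kafka input operator from Malhar's kafka module and the console output operator from its library module, so the exact setter and port names should be checked against the Malhar javadoc:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.apex.malhar.kafka.KafkaSinglePortInputOperator;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;
    import com.datatorrent.lib.io.ConsoleOutputOperator;

    public class KafkaToConsoleApp implements StreamingApplication {
      @Override
      public void populateDAG(DAG dag, Configuration conf) {
        KafkaSinglePortInputOperator kafka =
            dag.addOperator("kafkaInput", new KafkaSinglePortInputOperator());
        kafka.setClusters("localhost:9092"); // Kafka broker(s), assumed local here
        kafka.setTopics("words");            // topic(s) to consume

        ConsoleOutputOperator console =
            dag.addOperator("console", new ConsoleOutputOperator());

        dag.addStream("messages", kafka.outputPort, console.input);
      }
    }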


Apache Apex Malhar operators

Notes

  • Apex supports the Apache Beam model for job implementation, so it enjoys its multiple benefits:
    • dynamic partitioning at runtime
    • load balancing between operators
    • windowing
  • For fault tolerance, operator state is checkpointed and persisted (see the sketch after this list)
  • Apex offers the following processing guarantees:
    • at-least-once
    • at-most-once
    • exactly-once
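
To make the fault-tolerance point concrete, here is a minimal sketch of a stateful operator, assuming Apex's convention that non-transient operator fields are checkpointed and restored on failure while transient fields are rebuilt; the operator name and logic are illustrative:

    import java.util.HashMap;
    import java.util.Map;

    import com.datatorrent.api.DefaultInputPort;
    import com.datatorrent.api.DefaultOutputPort;
    import com.datatorrent.common.util.BaseOperator;

    public class UniqueWordCounter extends BaseOperator {

      // Checkpointed state: survives operator failure and recovery.
      private final Map<String, Long> counts = new HashMap<>();

      public final transient DefaultOutputPort<Map<String, Long>> output =
          new DefaultOutputPort<>();

      public final transient DefaultInputPort<String> input =
          new DefaultInputPort<String>() {
            @Override
            public void process(String word) {
              Long c = counts.get(word);
              counts.put(word, c == null ? 1L : c + 1L);
            }
          };

      // Called at every streaming window boundary.
      @Override
      public void endWindow() {
        output.emit(new HashMap<>(counts));
      }
    }

Since checkpoints are taken at streaming window boundaries, emitting the state in endWindow() keeps it aligned with the processing guarantees listed above.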

 


About the Author:

César is a Big Data & Hadoop Solution Architect and Data Engineer with 2 years of hands-on experience in Hadoop and distributed systems. He has been designing, developing and maintaining data processing workflows and real-time services, and bringing clients a consistent vision of data management and workflows across their different data sources and business requirements. He steps in at all levels of data platforms, from planning, design and architecture to cluster deployment, administration and maintenance, as well as prototyping and application development in collaboration with business users, analysts, data scientists, engineering and operational teams. He enjoys discovering and experimenting with new technologies in addition to his day-to-day work. He also has solid experience as an educator in knowledge transfer and training.
