Apache Apex: next gen Big Data analytics
Jul 17, 2016
Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Kafka, Storm, Tools, Hadoop, Data Science, Machine Learning [more][less]
Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex.
Introduction
Apache Apex is an in-memory distributed parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind:
- Yarn is used for resource managing and ordonnancing
- HDFS is used to store persistant states
Application development model
- A stream is a sequence of tuples
-
An operator:
- takes one or more input streams as input
- performs custom computation on the tuples (logic is in Java)
- emits one or more input streams
- has many parallel instances, each single threaded
- uses the DAG model to optimize computation
- An application is a suite of operators
Development process
A typical WordCount setup with Apex looks like this:
- Apache Kafka brings the data
-
The Apex application processes through the following operators:
- Kafka input
- Parser
- Filter
- Word counter
- JDBC output
- End data is written in DB
The development process goes like this:
- Take an operator from existing libraries or implement a custom logic
- Connect the operators to form an application
- Configure the operators properties
- Configure scaling & platform attributes
- Test functionalities, performance and iterate
Operator libraries
Apex provides very complete operator libraries through Apache Apex Malhar:
- Messaging (Kafka, ActiveMQ, …)
- NoSQL (HBase, Cassandra, MongoDB, Redis, CouchDB, …)
- RDBMS (JDBC, MySQL, …)
- FileSystem (HDFS / Hive, …)
Notes
-
Apex uses Apache BEAM for the job implementation so he enjoys its multiple benfits:
- dynamic partition at runtime
- load-balancing between operators
- windowing
- …
- For fault tolerance, the operators states are checkpointed and persisted
-
Apex processing guarentees for
- at least once
- at most once
- exactly once