Apache Apex with Apache SAMOA

Traditional Machine Learning

– Batch Oriented
– Supervised – most common
– Training and Scoring
– One-time model building
– Data set split into:
– Training: Model building
– Holdout: Parameter tuning
– Test: Accuracy
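
For illustration, a minimal sketch of such a three-way split; the toy data and the 60/20/20 ratios are arbitrary choices for the example:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class DataSetSplit {
        public static void main(String[] args) {
            // Toy data set; in practice these would be labeled examples.
            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 100; i++) data.add(i);
            Collections.shuffle(data);

            // 60% training (model building), 20% holdout (parameter tuning),
            // 20% test (accuracy estimation). Ratios are illustrative.
            int trainEnd = (int) (data.size() * 0.6);
            int holdoutEnd = (int) (data.size() * 0.8);
            List<Integer> training = data.subList(0, trainEnd);
            List<Integer> holdout = data.subList(trainEnd, holdoutEnd);
            List<Integer> test = data.subList(holdoutEnd, data.size());

            System.out.printf("train=%d holdout=%d test=%d%n",
                    training.size(), holdout.size(), test.size());
        }
    }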

Online Machine Learning

– Streaming
– Change
– Dynamically adapt to new patterns in data
– Change over time (concept drift)
– Model updates
– Approximation Algorithms
– Single pass: one data item at a time
– Sublinear space and time per data item
– Small error with high probability
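
Reservoir sampling is a classic example of such an algorithm: a single pass over the stream, constant space, constant time per item, and a uniformly random sample as the result. A minimal sketch (the class and method names are ours):

    import java.util.Random;

    // Keeps a uniform random sample of a stream in a single pass.
    public class Reservoir<T> {
        private final Object[] sample;          // O(k) space, independent of stream length
        private final Random rnd = new Random();
        private long seen = 0;

        public Reservoir(int size) {
            sample = new Object[size];
        }

        // O(1) time per data item.
        public void offer(T item) {
            seen++;
            if (seen <= sample.length) {
                sample[(int) seen - 1] = item;  // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
                if (j < sample.length) {
                    sample[(int) j] = item;     // keep item with probability k/seen
                }
            }
        }
    }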

Apache SAMOA

– What we need
– A platform for streaming learning algorithms
– Distributed and scalable

Machine learning classification

Logical Building Blocks

Each block is a processor (an algorithm).
We then create a topology of such blocks, as sketched below.
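
For illustration, a minimal processor sketch based on SAMOA's org.apache.samoa.core.Processor interface; the class, its counting logic and the wiring comments at the end are our own assumptions, not code from the talk:

    import org.apache.samoa.core.ContentEvent;
    import org.apache.samoa.core.Processor;

    // One logical building block: it consumes events, runs its piece of the
    // algorithm, and (in a real processor) emits events on an output stream.
    public class CountingProcessor implements Processor {
        private long count;

        @Override
        public boolean process(ContentEvent event) {
            count++;                        // the "algorithm" of this block
            return true;
        }

        @Override
        public void onCreate(int id) {
            count = 0;                      // called once per parallel instance
        }

        @Override
        public Processor newProcessor(Processor p) {
            return new CountingProcessor(); // used to replicate the block
        }
    }

    // Wiring into a topology (assumed TopologyBuilder API):
    //   TopologyBuilder builder = new TopologyBuilder();
    //   builder.initTopology("example");
    //   builder.addProcessor(new CountingProcessor(), 4); // 4 parallel instances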

Prequential Evaluation Tasks in SAMOA

– Interleaved test-then-train: each instance is used for testing first, then for training
– Evaluates the performance of online classifiers
– Basic: overall performance since the start of the stream
– Sliding window based: performance on the most recent items (sketched below)
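
A minimal sketch of the test-then-train loop maintaining both metrics; the Classifier interface is a stand-in for a real online learner:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class PrequentialEval {
        interface Classifier {                  // stand-in for a real learner
            int predict(double[] features);
            void train(double[] features, int label);
        }

        private long seen = 0, correct = 0;     // basic: overall accuracy
        private final Deque<Boolean> window = new ArrayDeque<>();
        private final int windowSize;
        private int windowCorrect = 0;          // sliding window: most recent items

        PrequentialEval(int windowSize) { this.windowSize = windowSize; }

        void step(Classifier model, double[] x, int label) {
            boolean hit = model.predict(x) == label; // 1. test first...
            model.train(x, label);                   // 2. ...then train on the same item

            seen++; if (hit) correct++;
            window.addLast(hit); if (hit) windowCorrect++;
            if (window.size() > windowSize) {
                if (window.removeFirst()) windowCorrect--;
            }
        }

        double overallAccuracy() { return seen == 0 ? 0 : (double) correct / seen; }
        double windowAccuracy() { return window.isEmpty() ? 0 : (double) windowCorrect / window.size(); }
    }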

Apex DSPE

A DSPE is a Distributed Stream Processing Engine, such as:
– Apex
– Storm
– Flink

Apex Application DAG

A DAG is composed of vertices (operators) and edges (streams).
– A stream is a sequence of data tuples that connects operators (see the sketch below)
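
For illustration, a minimal sketch of an Apex application declaring such a DAG; WordGenerator and WordCounter are hypothetical operator classes, only the DAG API calls are real:

    import org.apache.hadoop.conf.Configuration;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;

    public class ExampleApp implements StreamingApplication {
        @Override
        public void populateDAG(DAG dag, Configuration conf) {
            // Vertices: operators (hypothetical classes for illustration).
            WordGenerator generator = dag.addOperator("generator", new WordGenerator());
            WordCounter counter = dag.addOperator("counter", new WordCounter());

            // Edge: a stream, i.e. a sequence of tuples flowing from an
            // output port of one operator to an input port of another.
            dag.addStream("words", generator.output, counter.input);
        }
    }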

Distribution of tuples

– Compute the hash of the tuple
– Take it modulo the number of partitions (see the sketch below)
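
In code, the routing rule amounts to something like this sketch (not Apex's actual partitioning code):

    // Route a tuple to one of N partitions: hash, then modulo.
    int partitionFor(Object tuple, int numPartitions) {
        int hash = tuple.hashCode();                       // 1. hash the tuple
        return (hash & Integer.MAX_VALUE) % numPartitions; // 2. modulo, kept non-negative
    }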

Iteration support in Apex

– Machine learning needs iterations
– At the very least, a feedback loop
– An Apex topology is acyclic: a DAG

Delay Operators

– Increments the window ID for all outgoing ports
– A note on fault tolerance: the tuples buffered by the delay operator must be checkpointed so the loop can be replayed after a failure (see the sketch below)
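
A rough sketch of closing the feedback loop with a delay operator; we assume apex-core's DefaultDelayOperator and its port names, and LearnerOperator with its ports as well as the Model type are hypothetical:

    // Inside populateDAG(DAG dag, Configuration conf):
    LearnerOperator learner = dag.addOperator("learner", new LearnerOperator());
    DefaultDelayOperator<Model> delay =
            dag.addOperator("delay", new DefaultDelayOperator<Model>());

    // The delayed edge carries window id + 1, which is what makes the cycle legal.
    dag.addStream("updates", learner.modelOut, delay.input);
    dag.addStream("feedback", delay.output, learner.feedbackIn); // closes the loop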

Challenges

Adding Apache Apex as a DSPE backend for SAMOA raised several challenges:
– Differences between the topology builder APIs of SAMOA and Apex
– No concept of ports in SAMOA
– On-demand declaration of streams in SAMOA
– Cycles in the topology: the delay operator
– Serialization of processor state during checkpointing, as well as serialization of tuples (see the sketch after this list)
– The number of tuples in a single window affects the number of tuples in future windows coming from the delay operator
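
On the checkpointing point: Apex periodically serializes an operator's non-transient fields (with Kryo by default), so the state of a wrapped SAMOA processor must serialize cleanly, and non-serializable resources have to be rebuilt in setup(). A hedged sketch with a hypothetical wrapper operator:

    import java.util.HashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.common.util.BaseOperator;

    // Hypothetical operator wrapping a SAMOA processor's state.
    public class ProcessorWrapperOperator extends BaseOperator {
        private HashMap<String, Long> processorState = new HashMap<>(); // checkpointed
        private transient ExecutorService helper;                       // not checkpointed

        @Override
        public void setup(OperatorContext context) {
            // Recreate transient resources here, including after recovery.
            helper = Executors.newSingleThreadExecutor();
        }
    }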
