Apache Apex with Apache SAMOA
Jul 17, 2016
Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Samoa, Storm, Tools, Hadoop, Machine Learning
Traditional Machine Learning
- Batch Oriented
- Supervised - most common
- Training and Scoring
- One time model building
- Data set
  - Training: Model building
  - Holdout: Parameter tuning
  - Test: Accuracy
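
The three-way data set split above can be sketched as follows. This is a minimal illustration only: the function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions for the example, not something stated in the talk.

```python
import random

def split_dataset(rows, train_frac=0.6, holdout_frac=0.2, seed=0):
    """Shuffle once, then cut into training, holdout, and test partitions."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)      # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_holdout = round(len(shuffled) * holdout_frac)
    train = shuffled[:n_train]                          # model building
    holdout = shuffled[n_train:n_train + n_holdout]     # parameter tuning
    test = shuffled[n_train + n_holdout:]               # accuracy estimate
    return train, holdout, test

train, holdout, test = split_dataset(range(100))
```

Each row lands in exactly one partition, so the test accuracy is measured on data the model never saw during building or tuning.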
Online Machine Learning
- Streaming
- Change
  - Dynamically adapt to new patterns in data
  - Change over time (concept drift)
  - Model updates
- Approximation algorithms
  - Single pass: one data item at a time
  - Sublinear space and time per data item
  - Small error with high probability
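
As one concrete example of a single-pass approximation algorithm, here is a small Count-Min sketch for frequency counting. It was not named in the talk; it is swapped in here only to illustrate the properties listed above. The class name, the default `width` and `depth`, and the use of SHA-256 as the hash are assumptions for this sketch.

```python
import hashlib

class CountMinSketch:
    """Approximate counter using a fixed depth x width table, whatever the stream length."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salt the hash with the row number to get one hash function per row.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # Single pass: each item touches one cell per row, then is discarded.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += 1

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

sketch = CountMinSketch()
for word in ["apex", "storm", "apex", "flink", "apex"]:
    sketch.add(word)
```

The estimate never undercounts; it can overcount when items collide in every row, and widening the table makes that less likely.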
Apache SAMOA
- What we need
  - Platform for streaming learning algorithms
  - Distributed and scalable
Machine learning classification
- Machine learning
  - Distributed
    - Batch: Hadoop (Mahout)
    - Streaming: Apex / Flink / Storm (SAMOA)
  - Non-distributed
    - Batch: R, WEKA
    - Streaming: MOA
Logical Building Blocks
Each block is a processor (an algorithm); the blocks are then connected into a topology.
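
The idea of processors wired into a topology can be sketched in a few lines. This is not the SAMOA API: the `Processor` class, its `connect` and `process` methods, and the three example blocks are hypothetical names chosen for illustration.

```python
class Processor:
    """A block that handles one event and forwards the result downstream."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.downstream = []

    def connect(self, other):
        # Returning the target allows chaining: a.connect(b).connect(c)
        self.downstream.append(other)
        return other

    def process(self, event):
        result = self.fn(event)
        if result is None:      # a block that returns nothing ends the chain
            return
        for nxt in self.downstream:
            nxt.process(result)

collected = []
source = Processor("parse", lambda line: [float(x) for x in line.split(",")])
scale = Processor("scale", lambda xs: [x / 10.0 for x in xs])
sink = Processor("sink", lambda xs: collected.append(xs))

# Build the topology: parse -> scale -> sink
source.connect(scale).connect(sink)
source.process("10,20,30")
```

Each block only knows its own logic and its downstream neighbours, which is what lets the same topology be mapped onto different execution engines.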
Prequential Evaluation Tasks in SAMOA
- Interleaved test-then-train
- Evaluates performance for online classifiers
- Basic: performance over the whole stream so far
- Sliding-window based: performance over the most recent instances
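
Interleaved test-then-train can be sketched as below: each instance is first used to test the current model and only then to train it. The `MajorityClassClassifier`, the `prequential` function, and the window size are assumptions for this example, not SAMOA classes.

```python
from collections import deque

class MajorityClassClassifier:
    """Toy online learner: always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None                       # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1

def prequential(stream, model, window_size=3):
    """Return (overall accuracy, accuracy over the last window_size instances)."""
    correct = 0
    seen = 0
    recent = deque(maxlen=window_size)        # sliding window of hits and misses
    for x, y in stream:
        hit = model.predict(x) == y           # test first ...
        model.learn(x, y)                     # ... then train on the same instance
        correct += hit
        seen += 1
        recent.append(hit)
    overall = correct / seen if seen else 0.0
    windowed = sum(recent) / len(recent) if recent else 0.0
    return overall, windowed

stream = [(0, "a"), (1, "a"), (2, "b"), (3, "a"), (4, "a"), (5, "a")]
overall, windowed = prequential(stream, MajorityClassClassifier())
```

On this toy stream the overall accuracy is 4 out of 6, while the window over the last three instances is fully correct, which shows how the sliding-window figure reflects recent behaviour rather than the whole history.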
Apex DSPE
Distributed Stream Processing Engine:
- Apex
- Storm
- Flink
Apex Application DAG
- The DAG is composed of vertices (operators) and edges (streams)
- A stream is a sequence of data tuples that connects operators
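
A minimal sketch of such a graph, with operators as vertices and streams as directed edges, together with a check that it really is acyclic (Kahn's algorithm). The operator names and the `topological_order` helper are hypothetical; this is not how Apex itself represents or validates a DAG.

```python
from collections import deque

def topological_order(operators, streams):
    """Return operators ordered so every stream points forward; raise if cyclic."""
    indegree = {op: 0 for op in operators}
    outgoing = {op: [] for op in operators}
    for src, dst in streams:
        outgoing[src].append(dst)
        indegree[dst] += 1

    ready = deque(op for op in operators if indegree[op] == 0)  # input operators
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for dst in outgoing[op]:
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)

    if len(order) != len(operators):
        raise ValueError("streams form a cycle; not a DAG")
    return order

operators = ["reader", "parser", "counter", "writer"]
streams = [("reader", "parser"), ("parser", "counter"), ("counter", "writer")]
order = topological_order(operators, streams)
```

Adding a stream from `writer` back to `parser` makes the check fail, which is the situation a feedback loop in a learning algorithm would create.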
Distribution of tuples
- Calculate the hash of the tuple
- Take it modulo the number of partitions
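
The two steps above can be sketched as follows. The function name `partition_for` and the choice of CRC32 as the hash are assumptions for illustration; the talk does not say which hash function Apex uses.

```python
import zlib

def partition_for(key, num_partitions):
    """Map a tuple key to a partition index in [0, num_partitions)."""
    # crc32 is stable across runs, unlike the built-in hash() for strings,
    # which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

num_partitions = 4
assignments = {key: partition_for(key, num_partitions)
               for key in ["user-1", "user-2", "user-3", "user-1"]}
```

Because the mapping depends only on the key, all tuples with the same key reach the same partition, which matters when an operator keeps per-key state.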
Iteration support in Apex
- Machine learning needs iterations
- At the very least, a feedback loop
- An Apex topology is acyclic (a DAG)
Delay Operators
- Increment window id for all outgoing ports
- A note on Fault tolerance
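
The effect of incrementing the window id can be shown with a small simulation: whatever the loop produces in window n is handed back to the upstream operator as input for window n + 1, so within any single window the graph stays acyclic. The function `run_with_feedback` and the summing `step` are hypothetical; this is a sketch of the idea, not the Apex delay operator itself.

```python
def run_with_feedback(inputs_per_window, step):
    """Simulate a feedback loop where output of window n is consumed in window n + 1."""
    delayed = {}      # window id -> feedback value held back by the delay operator
    outputs = []
    for window_id, batch in enumerate(inputs_per_window):
        # The first window has no feedback yet, so start from 0.
        feedback = delayed.pop(window_id, 0)
        result = step(batch, feedback)
        outputs.append((window_id, result))
        # Delay operator: re-emit the result tagged with the next window id.
        delayed[window_id + 1] = result
    return outputs

outputs = run_with_feedback(
    [[1, 2], [3], [4, 5]],
    lambda batch, feedback: sum(batch) + feedback,
)
```

Here the three windows yield 3, 6, and 15: each window adds its own tuples to the value fed back from the previous one.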
Challenges
Adding DSPE for Apache Apex:
- Differences in the topology builder APIs of SAMOA and Apex
  - No concept of ports in SAMOA
  - On-demand declaration of streams in SAMOA
- Cycles in topology: Delay Operator
- Serialization of processor state during checkpointing, and serialization of tuples
- Number of tuples in a single window: affects the number of tuples in future windows coming from the delay operator