Traditional Machine Learning

– Batch Oriented
– Supervised – most common
– Training and Scoring
– One time model building
– Data set
– Training: Model building
– Holdout: Parameter tuning
– Test: Accuracy
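
The three-way split above can be sketched as follows. This is a minimal illustration, not a prescribed recipe: the function name three_way_split and the 60/20/20 fractions are assumptions chosen for the example.

```python
import random


def three_way_split(rows, train_frac=0.6, holdout_frac=0.2, seed=0):
    """Shuffle once, then cut the data into training, holdout and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)          # fixed seed for a repeatable split
    n_train = int(len(rows) * train_frac)
    n_holdout = int(len(rows) * holdout_frac)
    train = rows[:n_train]                     # model building
    holdout = rows[n_train:n_train + n_holdout]  # parameter tuning
    test = rows[n_train + n_holdout:]          # final accuracy estimate
    return train, holdout, test


train, holdout, test = three_way_split(range(100))
print(len(train), len(holdout), len(test))  # 60 20 20
```

The model is built once on the training part; the test part is touched only at the end.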

Online Machine Learning

– Streaming
– Change
– Dynamically adapt to new patterns in data
– Change over time (concept drift)
– Model updates
– Approximation Algorithms
– Single pass: one data item at a time
– Sublinear space and time per data item
– Small error with high probability
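
One well-known structure with these properties is the Count-Min Sketch, shown here only as an example of the approach; the notes do not say which algorithms are used. The class below is a minimal sketch: the width and depth defaults are assumptions, picked for roughly 1% error at roughly 99% confidence.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in fixed space, one pass over the stream."""

    def __init__(self, width=272, depth=5):
        # width ~ e / epsilon, depth ~ ln(1 / delta); here epsilon = delta = 0.01
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One independent-looking hash per row, derived by salting with the row id.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"), salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


sketch = CountMinSketch()
for item in ["a", "b", "a", "c", "a"]:
    sketch.add(item)          # single pass: each item is seen once, then dropped
print(sketch.estimate("a"))   # never below the true count of 3
```

Memory stays at width x depth counters however long the stream runs, and each update costs depth hash evaluations.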

Apache SAMOA

– What we need
– Platform for streaming learning Algorithms
– Distributed and scalable

Machine learning classification

Logical Building Blocks

Each block is a processor (an algorithm).
We then create a topology of blocks.
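
A toy version of the idea, assuming nothing about SAMOA's real API: the Processor class and its connect and process methods are hypothetical names used only for this illustration.

```python
class Processor:
    """One building block: wraps a function and forwards its result downstream."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.downstream = []

    def connect(self, other):
        """Add a stream from this block to `other`; returns `other` for chaining."""
        self.downstream.append(other)
        return other

    def process(self, item):
        result = self.fn(item)
        if result is None:          # a block may drop an item
            return
        for block in self.downstream:
            block.process(result)


collected = []
source = Processor("source", lambda x: x)
scale = Processor("scale", lambda x: x * 2)
keep_large = Processor("filter", lambda x: x if x > 4 else None)
sink = Processor("sink", collected.append)

# The topology: source -> scale -> filter -> sink
source.connect(scale).connect(keep_large).connect(sink)
for value in [1, 2, 3, 4]:
    source.process(value)
print(collected)  # [6, 8]
```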

Prequential Eval Tasks in SAMOA

– Interleaved test-then-train
– Evaluates performance for online classifiers
– Basic – Overall
– Sliding Windows based – Most recent
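
Interleaved test-then-train can be sketched in a few lines. This is an illustration under stated assumptions, not SAMOA's implementation: MajorityClassLearner is a hypothetical toy classifier, and the window size of 3 is arbitrary.

```python
from collections import Counter, deque


class MajorityClassLearner:
    """Toy online classifier: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential(stream, learner, window=3):
    """Return (overall accuracy, accuracy over the last `window` items)."""
    correct = seen = 0
    recent = deque(maxlen=window)       # sliding window of hit/miss flags
    for x, y in stream:
        hit = learner.predict(x) == y   # test on the item first ...
        learner.learn(x, y)             # ... then train on the same item
        seen += 1
        correct += hit
        recent.append(hit)
    if seen == 0:
        return 0.0, 0.0
    return correct / seen, sum(recent) / len(recent)


# The label switches from "a" to "b" halfway through the stream.
stream = [(None, label) for label in "aaaabbb"]
overall, recent = prequential(stream, MajorityClassLearner())
print(overall, recent)  # 0.42857142857142855 0.0
```

The overall figure still credits the early hits, while the sliding-window figure shows that the classifier is currently wrong on every item.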


Distributed Stream Processing Engines (DSPE):
– Apex
– Storm
– Flink

Apex Application DAG

A DAG is composed of vertices (operators) and edges (streams)
– A stream is a sequence of data tuples that connects operators

Distribution of tuples

– Calculate Hash of tuple
– Modulo by the number of partitions
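
The two steps above, sketched in isolation. The function name partition_for and the choice of CRC32 are assumptions for the example; Apex's own hashing and partitioning internals may differ.

```python
import zlib


def partition_for(key, num_partitions):
    """Hash the tuple's key, then take it modulo the number of partitions."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash() on str.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["user-1", "user-2", "user-3", "user-1"]
assignments = [partition_for(k, 4) for k in keys]
print(assignments)  # the first and last entries match: same key, same partition
```

Tuples with the same key always reach the same partition, which lets each partition keep per-key state locally.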

Iteration support in Apex

– Machine learning needs iterations
– At the very least, a feedback loop
– Apex topology – Acyclic: DAG

Delay Operators

– Increment window id for all outgoing ports
– A note on Fault tolerance
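
A toy model of the window-id increment, to show why it lets a feedback loop coexist with an acyclic topology. This is not Apex's API: run_with_delay and running_total are hypothetical names, and windows are simulated as plain lists.

```python
def run_with_delay(inputs_per_window, step):
    """Run `step` once per window, feeding back its output one window later."""
    pending = {}    # window id -> feedback tuples due in that window
    results = []
    for window_id, inputs in enumerate(inputs_per_window):
        feedback = pending.pop(window_id, [])
        outputs, new_feedback = step(inputs, feedback)
        # The "delay operator": tuples leaving window w re-enter in window w + 1.
        pending[window_id + 1] = new_feedback
        results.append(outputs)
    return results


def running_total(inputs, feedback):
    """State travels through the feedback stream instead of a true cycle."""
    total = sum(inputs) + sum(feedback)
    return [total], [total]


print(run_with_delay([[1, 2], [3], [4, 5]], running_total))  # [[3], [6], [15]]
```

Because feedback always lands in a later window, no window ever depends on its own output.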


Adding DSPE for Apache Apex:
– Differences in the topology builder APIs of SAMOA and Apex
– No concept of Ports in SAMOA
– On demand declaration of streams in SAMOA
– Cycles in topology: Delay Operator
– Serialization of Processor state during checkpointing; also serialization of tuples
– Number of tuples in a single window – Affects number of tuples in future windows coming from the delay operator