Apache Apex with Apache SAMOA

Traditional Machine Learning

– Batch Oriented
– Supervised – most common
– Training and Scoring
– One-time model building
– Data set split into:
– Training: Model building
– Holdout: Parameter tuning
– Test: Accuracy
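
For illustration, a minimal sketch of such a three-way split; the toy data and the 60/20/20 ratios are arbitrary choices for the example:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class DataSetSplit {
        public static void main(String[] args) {
            // Toy data set; in practice these would be labeled examples.
            List<Integer> data = new ArrayList<>();
            for (int i = 0; i < 100; i++) data.add(i);
            Collections.shuffle(data);

            // 60% training (model building), 20% holdout (parameter tuning),
            // 20% test (accuracy estimation). Ratios are illustrative.
            int trainEnd = (int) (data.size() * 0.6);
            int holdoutEnd = (int) (data.size() * 0.8);
            List<Integer> training = data.subList(0, trainEnd);
            List<Integer> holdout = data.subList(trainEnd, holdoutEnd);
            List<Integer> test = data.subList(holdoutEnd, data.size());

            System.out.printf("train=%d holdout=%d test=%d%n",
                    training.size(), holdout.size(), test.size());
        }
    }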

Online Machine Learning

– Streaming
– Change
– Dynamically adapt to new patterns in data
– Change over time (concept drift)
– Model updates
– Approximation Algorithms
– Single pass: one data item at a time
– Sublinear space and time per data item
– Small error with high probability
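
Reservoir sampling is a classic example of such an algorithm: a single pass over the stream, constant space, constant time per item, and a uniformly random sample as the result. A minimal sketch (the class and method names are ours):

    import java.util.Random;

    // Keeps a uniform random sample of a stream in a single pass.
    public class Reservoir<T> {
        private final Object[] sample;          // O(k) space, independent of stream length
        private final Random rnd = new Random();
        private long seen = 0;

        public Reservoir(int size) {
            sample = new Object[size];
        }

        // O(1) time per data item.
        public void offer(T item) {
            seen++;
            if (seen <= sample.length) {
                sample[(int) seen - 1] = item;  // fill the reservoir first
            } else {
                long j = (long) (rnd.nextDouble() * seen); // uniform in [0, seen)
                if (j < sample.length) {
                    sample[(int) j] = item;     // keep item with probability k/seen
                }
            }
        }
    }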

Apache SAMOA

– What we need
– A platform for streaming learning algorithms
– Distributed and scalable

Machine learning classification

Logical Building Blocks

Each block is a processor (an algorithm).
We then create a topology of such blocks, as sketched below.
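
For illustration, a minimal processor sketch based on SAMOA's org.apache.samoa.core.Processor interface; the class, its counting logic and the wiring comments at the end are our own assumptions, not code from the talk:

    import org.apache.samoa.core.ContentEvent;
    import org.apache.samoa.core.Processor;

    // One logical building block: it consumes events, runs its piece of the
    // algorithm, and (in a real processor) emits events on an output stream.
    public class CountingProcessor implements Processor {
        private long count;

        @Override
        public boolean process(ContentEvent event) {
            count++;                        // the "algorithm" of this block
            return true;
        }

        @Override
        public void onCreate(int id) {
            count = 0;                      // called once per parallel instance
        }

        @Override
        public Processor newProcessor(Processor p) {
            return new CountingProcessor(); // used to replicate the block
        }
    }

    // Wiring into a topology (assumed TopologyBuilder API):
    //   TopologyBuilder builder = new TopologyBuilder();
    //   builder.initTopology("example");
    //   builder.addProcessor(new CountingProcessor(), 4); // 4 parallel instances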

Prequential Evaluation Tasks in SAMOA

– Interleaved test-then-train: each instance is used for testing first, then for training
– Evaluates the performance of online classifiers
– Basic: overall performance since the start of the stream
– Sliding window based: performance on the most recent items (sketched below)
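
A minimal sketch of the test-then-train loop maintaining both metrics; the Classifier interface is a stand-in for a real online learner:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class PrequentialEval {
        interface Classifier {                  // stand-in for a real learner
            int predict(double[] features);
            void train(double[] features, int label);
        }

        private long seen = 0, correct = 0;     // basic: overall accuracy
        private final Deque<Boolean> window = new ArrayDeque<>();
        private final int windowSize;
        private int windowCorrect = 0;          // sliding window: most recent items

        PrequentialEval(int windowSize) { this.windowSize = windowSize; }

        void step(Classifier model, double[] x, int label) {
            boolean hit = model.predict(x) == label; // 1. test first...
            model.train(x, label);                   // 2. ...then train on the same item

            seen++; if (hit) correct++;
            window.addLast(hit); if (hit) windowCorrect++;
            if (window.size() > windowSize) {
                if (window.removeFirst()) windowCorrect--;
            }
        }

        double overallAccuracy() { return seen == 0 ? 0 : (double) correct / seen; }
        double windowAccuracy() { return window.isEmpty() ? 0 : (double) windowCorrect / window.size(); }
    }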

Apex DSPE

A DSPE is a Distributed Stream Processing Engine, such as:
– Apex
– Storm
– Flink

Apex Application DAG

A DAG is composed of vertices (operators) and edges (streams).
– A stream is a sequence of data tuples that connects operators (see the sketch below)
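
For illustration, a minimal sketch of an Apex application declaring such a DAG; WordGenerator and WordCounter are hypothetical operator classes, only the DAG API calls are real:

    import org.apache.hadoop.conf.Configuration;
    import com.datatorrent.api.DAG;
    import com.datatorrent.api.StreamingApplication;

    public class ExampleApp implements StreamingApplication {
        @Override
        public void populateDAG(DAG dag, Configuration conf) {
            // Vertices: operators (hypothetical classes for illustration).
            WordGenerator generator = dag.addOperator("generator", new WordGenerator());
            WordCounter counter = dag.addOperator("counter", new WordCounter());

            // Edge: a stream, i.e. a sequence of tuples flowing from an
            // output port of one operator to an input port of another.
            dag.addStream("words", generator.output, counter.input);
        }
    }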

Distribution of tuples

– Compute the hash of the tuple
– Take it modulo the number of partitions (see the sketch below)
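
In code, the routing rule amounts to something like this sketch (not Apex's actual partitioning code):

    // Route a tuple to one of N partitions: hash, then modulo.
    int partitionFor(Object tuple, int numPartitions) {
        int hash = tuple.hashCode();                       // 1. hash the tuple
        return (hash & Integer.MAX_VALUE) % numPartitions; // 2. modulo, kept non-negative
    }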

Iteration support in Apex

– Machine learning needs iterations
– At the very least, a feedback loop
– An Apex topology is acyclic: a DAG

Delay Operators

– Increments the window ID for all outgoing ports
– A note on fault tolerance: the tuples buffered by the delay operator must be checkpointed so the loop can be replayed after a failure (see the sketch below)
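
A rough sketch of closing the feedback loop with a delay operator; we assume apex-core's DefaultDelayOperator and its port names, and LearnerOperator with its ports as well as the Model type are hypothetical:

    // Inside populateDAG(DAG dag, Configuration conf):
    LearnerOperator learner = dag.addOperator("learner", new LearnerOperator());
    DefaultDelayOperator<Model> delay =
            dag.addOperator("delay", new DefaultDelayOperator<Model>());

    // The delayed edge carries window id + 1, which is what makes the cycle legal.
    dag.addStream("updates", learner.modelOut, delay.input);
    dag.addStream("feedback", delay.output, learner.feedbackIn); // closes the loop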

Challenges

Adding Apache Apex as a DSPE backend for SAMOA raised several challenges:
– Differences between the topology builder APIs of SAMOA and Apex
– No concept of ports in SAMOA
– On-demand declaration of streams in SAMOA
– Cycles in the topology: the delay operator
– Serialization of processor state during checkpointing, as well as serialization of tuples (see the sketch after this list)
– The number of tuples in a single window affects the number of tuples in future windows coming from the delay operator
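
On the checkpointing point: Apex periodically serializes an operator's non-transient fields (with Kryo by default), so the state of a wrapped SAMOA processor must serialize cleanly, and non-serializable resources have to be rebuilt in setup(). A hedged sketch with a hypothetical wrapper operator:

    import java.util.HashMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import com.datatorrent.api.Context.OperatorContext;
    import com.datatorrent.common.util.BaseOperator;

    // Hypothetical operator wrapping a SAMOA processor's state.
    public class ProcessorWrapperOperator extends BaseOperator {
        private HashMap<String, Long> processorState = new HashMap<>(); // checkpointed
        private transient ExecutorService helper;                       // not checkpointed

        @Override
        public void setup(OperatorContext context) {
            // Recreate transient resources here, including after recovery.
            helper = Executors.newSingleThreadExecutor();
        }
    }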
