Machine Learning

Spark Streaming part 4: clustering with Spark MLlib

Spark MLlib is an Apache's Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed with Spark MLlib can be combined with a low-latency streaming pipeline created with Spark Structured Streaming. The K-means clustering algorithm [...]

By |2019-07-12T08:07:03+00:00July 11th, 2019|Categories: Big Data, Data Engineering, ML|Tags: , , , , |1 Comment

Introduction to Cloudera Data Science Workbench

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main task that is deriving insights from data, without thinking about the complexity that lies in the background. CDSW was released after Cloudera’s acquisition of [...]

YARN and GPU Distribution for Machine Learning

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be useful in this context and how it can help the algorithms to run smoothly. This article stems from a conference at [...]

By |2019-08-16T21:26:37+00:00May 30th, 2018|Categories: Data Science, DataWorks Summit 2018|Tags: , , |2 Comments

Apache Apex with Apache SAMOA

Traditional Machine Learning - Batch Oriented - Supervised - most common - Training and Scoring - One time model building - Data set - Training: Model building - Holdout: Paremeter tuning - Test: Accuracy Online Machine Learning - Streaming - Change - Dynmaically adapt to new patterns in Data - Change over time (concept drift) [...]

By |2019-06-18T22:53:49+00:00July 17th, 2016|Categories: Data Science, Events|Tags: , , |0 Comments