Spark MLlib

Apache Spark MLlib is a machine learning library that runs on top of Spark Core. It supports distributed computing and can scale both vertically and horizontally. It offers APIs for Java, Scala, Python and R.

It provides tools such as:

ML Algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering
Featurization: feature extraction and selection, transformation, dimensionality reduction
Pipelines: tools for constructing, evaluating, and tuning ML pipelines
Persistence: saving and loading of algorithms, models and pipelines
Utilities: linear algebra, statistics, data handling, etc.

Learn more: MLlib documentation
Related tags: Machine Learning

What's new in Apache Spark 2.3?

Categories: Data Engineering, DataWorks Summit 2018 | Tags: Arrow, PySpark, Tuning, ORC, Spark, Spark MLlib, Data Science, Docker, Kubernetes, pandas, Streaming

Let’s dive into the new features offered by the 2.3 distribution of Apache Spark. This article is a composition of the following talks seen at the DataWorks Summit 2018 and additional research: Apache…

By César BEREZOWSKI

May 23, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

Jul 24, 2018

MLflow tutorial: an open source Machine Learning (ML) platform

Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn

Introduction and principles of MLflow. With increasingly cheaper computing power and storage, and at the same time increasing data collection in all walks of life, many companies integrated Data Science…

By Petra KAFERLE DEVISSCHERE

Mar 23, 2020
