Definitions of machine learning algorithms present in Apache Mahout


By David WORMS

Mar 8, 2013

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop using the map/reduce paradigm.

It covers several families of algorithms, which are defined below. Each family may have multiple implementations. Most of the implementations, but not all, are distributed.


Classification

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
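
To make the idea of classification concrete, here is a minimal sketch of a nearest-centroid classifier in plain Python. It is not one of Mahout's implementations; the function names `train_centroids` and `classify` and the toy data are assumptions made for illustration only.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """Average the feature vectors of each known category.

    samples is a list of (features, label) pairs, i.e. the training set.
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign a new observation to the category with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical training set: observations whose category is already known.
training = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((8.0, 9.0), "large"),
    ((9.0, 8.0), "large"),
]
centroids = train_centroids(training)
print(classify(centroids, (2.0, 1.0)))
```

The training step summarises each category by one point, and a new observation is then labelled by proximity. Real classifiers differ mainly in how they build and apply that learned summary.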


Clustering

Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
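
As an illustration of clustering, the following sketch runs a small in-memory k-means loop in plain Python. It does not reflect Mahout's distributed implementation; the function name `k_means`, the fixed starting centroids and the sample points are assumptions chosen to keep the example deterministic.

```python
from math import dist


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)), key=lambda i: dist(point, centroids[i])
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(column) / len(members) for column in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
final_centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(final_centroids)
print(clusters)
```

Here "similar" is taken to mean "close in Euclidean distance", which is only one possible choice of similarity.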

Pattern mining

Pattern mining is a data mining method that involves finding existing patterns in data. In this context patterns often mean association rules.
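
Association rules are usually judged by their support and confidence. The sketch below computes both for a rule such as "bread implies milk" over a handful of made-up shopping baskets. The helper names `support` and `confidence` and the data are hypothetical, and this is not how Mahout mines patterns at scale.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= transaction for transaction in transactions) / len(
        transactions
    )


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical transactions, one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
```

A pattern mining algorithm searches for the itemsets and rules whose support and confidence exceed chosen thresholds, rather than checking a single rule as done here.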

Regression analysis

Regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
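
The simplest case, one dependent variable and one independent variable, can be sketched with ordinary least squares. The function name `fit_line` and the sample values are assumptions for illustration; Mahout's regression implementations are not shown here.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations lying exactly on y = 2x + 1.
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)
```

The estimated slope describes how the dependent variable changes when the independent variable increases by one unit.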

Dimension reduction

Dimension reduction is the process of reducing the number of random variables under consideration and can be divided into feature selection and feature extraction.
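
Feature selection, the first of the two branches, can be illustrated by dropping columns that barely vary. The variance threshold used below is an assumed criterion picked for simplicity, and `select_features` is a hypothetical helper rather than a Mahout API.

```python
from statistics import pvariance


def select_features(rows, min_variance):
    """Keep only the columns whose population variance reaches the threshold."""
    columns = list(zip(*rows))
    kept = [
        index
        for index, column in enumerate(columns)
        if pvariance(column) >= min_variance
    ]
    reduced = [tuple(row[index] for index in kept) for row in rows]
    return kept, reduced


# Hypothetical data: the second column is constant, the third nearly so.
rows = [(1.0, 5.0, 0.1), (2.0, 5.0, 0.1), (3.0, 5.0, 0.2)]
kept, reduced = select_features(rows, min_variance=0.1)
print(kept)
print(reduced)
```

Feature extraction differs in that it builds new variables from combinations of the original ones instead of keeping a subset of them.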

Evolutionary algorithm

An evolutionary algorithm uses mechanisms inspired by biological evolution, such as reproduction, mutation, recombination, and selection. Candidate solutions to the optimization problem play the role of individuals in a population, and the fitness function determines the environment within which the solutions “live”.
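
The loop of selection, recombination and mutation can be sketched on a one-dimensional optimization problem. Everything in this example is an assumption made for illustration: the function name `evolve`, the fitness function, the population size and the mutation scale.

```python
import random


def evolve(fitness, population, generations=50, mutation_scale=0.5, rng=None):
    """Evolve a population of numbers towards higher fitness."""
    rng = rng or random.Random(0)  # fixed seed so that runs are repeatable
    for _ in range(generations):
        # Selection: the fitter half survives and reproduces.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: len(ranked) // 2]
        children = []
        while len(parents) + len(children) < len(ranked):
            a, b = rng.sample(parents, 2)
            child = (a + b) / 2  # recombination of two parents
            child += rng.gauss(0, mutation_scale)  # mutation
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


def fitness(x):
    """Hypothetical fitness function with its maximum at x = 3."""
    return -((x - 3.0) ** 2)


initial = [float(x) for x in range(-10, 10, 2)]
best = evolve(fitness, initial)
print(best, fitness(best))
```

Because the parents are carried over unchanged, the best individual never gets worse from one generation to the next.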

Recommenders / Collaborative filtering

Collaborative filtering is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, data sources, etc.
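
In a recommender, the collaborating agents are typically users whose past ratings are used to guess what another user will like. The sketch below predicts a missing rating as a similarity-weighted average of other users' ratings. The names `cosine` and `predict` and the ratings are hypothetical, and the code is not Mahout's recommender API.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse rating dictionaries."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[item] * b[item] for item in common)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)


def predict(ratings, user, item):
    """Weight each other user's rating of the item by similarity to the user."""
    weighted_sum = total_weight = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        weight = cosine(ratings[user], their_ratings)
        weighted_sum += weight * their_ratings[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight else None


# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "alice": {"a": 5, "b": 1},
    "bob": {"a": 5, "b": 1, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
}
prediction = predict(ratings, "alice", "c")
print(prediction)
```

Alice's tastes match Bob's more closely than Carol's, so the prediction for item "c" lands nearer to Bob's rating.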

Vector Similarity

Vector similarity compares one or more vectors with another set of vectors and scores how close they are, for example with a measure such as cosine similarity.
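
One common way to compare vectors is cosine similarity, sketched below for a single vector against a set of candidates. The function names and the vectors are assumptions for illustration, and cosine is only one of several possible measures.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Return the candidate vector with the highest similarity to the query."""
    return max(candidates, key=lambda v: cosine_similarity(query, v))


# Hypothetical vectors.
query = (1.0, 2.0)
candidates = [(2.0, 4.0), (0.0, 1.0), (-1.0, 0.0)]
print(most_similar(query, candidates))
```

Cosine similarity ignores vector length and looks only at direction, which is why `(2.0, 4.0)` is treated as pointing the same way as the query.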


Collocations

A collocation is a sequence of words or terms that co-occur more often than would be expected by chance.
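
"More often than expected by chance" can be quantified by comparing how often a pair of words appears together with how often it would if the words were independent. The sketch below uses pointwise mutual information (PMI) as a simple stand-in measure. It is not claimed to be the scoring Mahout applies, and the function name `bigram_pmi` and the sample sentence are assumptions.

```python
from collections import Counter
from math import log2


def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (first, second), count in bigrams.items():
        observed = count / n_bigrams
        # Probability of the pair if both words occurred independently.
        expected = (unigrams[first] / n_unigrams) * (unigrams[second] / n_unigrams)
        scores[(first, second)] = log2(observed / expected)
    return scores


# Hypothetical text.
tokens = "new york is big and new york is busy but the city is old".split()
pmi = bigram_pmi(tokens)
print(pmi[("new", "york")], pmi[("is", "big")])
```

A positive score means the pair occurs together more often than independence would predict. PMI tends to overrate pairs of rare words, which is one reason other statistics are often preferred on real corpora.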
