Apache Spark

Apache Spark is a unified in-memory analytics platform for Big Data processing, data streaming, SQL, Machine Learning and graph processing.

The open source project, a top-level project of the Apache Software Foundation since 2014, originated at UC Berkeley's AMPLab. It has since become a major player in the Big Data ecosystem as an alternative to, and an evolution of, MapReduce.

Thanks to its distributed architecture, Apache Spark executes across a cluster to process large amounts of data in parallel and with high performance. It processes the data in memory and is optimized to limit disk usage.

Many users work with Spark DataFrames, which are available in Scala, Python and Java and were unified with the Dataset API in Spark version 2. Comparable to R data frames or pandas DataFrames, they enable data to be queried in a table structure. Their integration with Machine Learning enables analytical models to be applied to Big Data with Apache Spark. This is why Spark is often referred to as the Swiss Army knife of data processing.

Spark runs on various platforms, including standalone hosts and clusters, Hadoop clusters with YARN, and the Databricks platform.

