Spark

Spark Streaming part 4: clustering with Spark MLlib

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed with Spark MLlib can be combined with a low-latency streaming pipeline created with Spark Structured Streaming. The K-means clustering algorithm [...]

July 11th, 2019 | Categories: Big Data, Data Engineering, ML | 1 Comment

Spark Streaming part 3: tools and tests for Spark applications

Whenever services are unavailable, businesses experience large financial losses. Spark Streaming applications can break, like any other software application. A streaming application operates on data from the real world, hence uncertainty is intrinsic to the application's input. Testing is essential to discover as many software defects and as much flawed logic as possible before [...]

June 19th, 2019 | Categories: Big Data, Data Engineering | 4 Comments

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming data processing chain in a distributed environment will be presented. A cluster environment demands attention to aspects such as monitoring, stability, [...]

May 28th, 2019 | Categories: Big Data, Data Engineering | 2 Comments

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Spark Structured Streaming is a new engine, introduced with Apache Spark 2, for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming engine shares the same API as the Spark SQL engine and is as easy to use. Spark Structured Streaming [...]

April 18th, 2019 | Categories: Big Data, Data Engineering | 10 Comments

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, and KPI-based reporting tools commonly speak the JDBC/ODBC protocols and are examples of such clients. Spark Thrift Server may be used in various fashions. It can run independently as Spark standalone [...]

March 25th, 2019 | Categories: Big Data, Data Engineering | 1 Comment

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging available resources. This article is based on the presentation of Wangda Tan, Apache Hadoop PMC member, at the DataWorks Summit 2018. It mostly focuses on [...]

July 24th, 2018 | Categories: Data Science, DataWorks Summit 2018 | 0 Comments

TensorFlow on Spark 2.3: The Best of Both Worlds

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. […]

EclairJS – Putting a Spark in Web Apps

Presentation by David Fallside from IBM; images extracted from the presentation. Introduction: Web app development has moved from Java to NodeJS and JavaScript. NodeJS provides a simple and rich environment with NPM. EclairJS is a NodeJS library that provides bindings to a Spark application: an RDD is bound to a JS object that is made [...]

July 17th, 2016 | Categories: Data Engineering, Events | 0 Comments