The integration of TensorFlow With Spark has a lot of potential and creates new opportunities.

This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new features of the 2.3 release of Apache Spark, an open source framework for Big Data computation on clusters. And one of the aspects the talker insisted on was the integration of Tensorflow in Spark and the benefits of combining them.

Since I am mostly interested in Deep Learning, I was wondering why should I use this software combo to train and use my models. And it turns out we can use both power for a multitude of usecases. So first, let’s try to use a deep learning model for computer vision using TensorFlow in Apache Spark 2.3.


TensorFlow is Google’s open source software library dedicated to high performance numerical computation. Taking advantages of GPUs, TPUs and CPUs power and usable on servers, clusters and even mobiles, it is mainly used for Machine Learning and especially Deep Learning.

It provides support for C++, Python and recently for Javascript with TensorFlow.js. The library is now in its 1.8th version and comes with an official high level wraper called Keras.

Tensorflow uses Cuda and CudNN from Nvidia to allow communication with the GPUs. Contrary to CPUs, GPUs are designed to perform parallel tasks and Matrix Operations, which are heavily present in Machine Learning, Deep Learning and Data computation, can profit of parallel computing.

Since matrix multiplication is one of the most heavy computation for matrices let’s have a look at some benchmarks and then try to implement it with and without Tensorflow.

Matrix Multiplication

As we can see, past a certain dimension, matrix multiplication computation time explodes with CPUs whereas for GPU it stays very low.

Without Tensorflow

With Tensorflow

Apache Spark

Apache Spark is an efficient scalable open source framework dedicated to Big Data. It is good at running heavy computations on clusters and distributing them. It is simple of usage and fast. Simpler than Hadoop, Spark allows you to develop in Java, Scala, Python and R. It is 100 times faster than Hadoop MapReduce.

Spark is also general. It can be used as a SQL query engine and can handle streaming thanks to Spark Streaming. Moreover, it comes with a Machine Learning Library called MLLib.

The framework can be executed on Hadoop 2 clusters with YARN, Mesos, Kubernetes or in standalone and allows access to mutliple data sources such as HDFS, Cassandra, HBase and Amazon S3.

The new release of Spark: Apache Spark 2.3, comes with multiple features.

Retrieve Images from a file system

Here’s a sample code to retrieve images from filesystem with PySpark

TensorFlow x Apache Spark 2.3: TensorFrame

With the 2.3 release, Tensorflow is supported by Apache Spark. Thus we can combine the power of DataFrames, Transformers and Estimators to TensorFlow and Keras.

As you can see it is pretty straightforward. You just have to turn your model into a transformer or an estimator.

Use cases

Hyper parameter selection can be time consuming. Moreover, one small change can have huge consequences. As it’s nearly impossible to test out every combination, it’s at least possible to perform some fine tuning by giving each node of a Spark cluster a different set , run them synchronously, then cross validate to pick the best one.

Tensorflow in Spark 2.3 can also be usefull for model deployment and scalability. Thanks to Spark, we can broadcast a pretrained model to each node and distribute the predictions over all the nodes. Regarding scaling, Spark allows new nodes to be added  to the cluster if needed.


To conclude, this new  Spark release offers the possibility to apply the benefits of distributed computation to the Deep Learning/Machine Learning field.