PySpark

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Spark Structured Streaming is a new engine introduced with Apache Spark 2 used for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming engine shares the same API as with the Spark SQL engine and is as easy to use. Spark Structured Streaming [...]

By |2019-07-11T22:14:25+00:00April 18th, 2019|Categories: Big Data, Data Engineering|Tags: , , , , |9 Comments

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios, KPIs-based reports tools commonly speak the JDBC/ODBC protocols and are such examples. Spark Thrift Server may be used in various fashions. It can run independently as Spark standalone [...]

By |2019-03-25T14:50:18+00:00March 25th, 2019|Categories: Big Data, Data Engineering|Tags: , , , , |1 Comment