Big Data

Spark Streaming part 4: clustering with Spark MLlib

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for developing Machine Learning systems. An ML model developed with Spark MLlib can be combined with a low-latency streaming pipeline created with Spark Structured Streaming. The K-means clustering algorithm [...]

July 11th, 2019 | Categories: Big Data, Data Engineering, ML | 1 Comment

Spark Streaming part 2: run Spark Structured Streaming pipelines in Hadoop

Spark can process streaming data on a multi-node Hadoop cluster, relying on HDFS for storage and YARN for job scheduling. Thus, Spark Structured Streaming integrates well with Big Data infrastructures. A streaming data processing chain in a distributed environment will be presented. A cluster environment demands attention to aspects such as monitoring, stability, [...]

May 28th, 2019 | Categories: Big Data, Data Engineering | 2 Comments

Spark Streaming part 1: build data pipelines with Spark Structured Streaming

Spark Structured Streaming is a new engine, introduced with Apache Spark 2, for processing streaming data. It is built on top of the existing Spark SQL engine and the Spark DataFrame. The Structured Streaming engine shares the same API as the Spark SQL engine and is just as easy to use. Spark Structured Streaming [...]

April 18th, 2019 | Categories: Big Data, Data Engineering | 9 Comments

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

The distributed and in-memory nature of the Spark engine makes it an excellent candidate to expose data to clients which expect low latencies. Dashboards, notebooks, BI studios and KPI-based reporting tools are examples of such clients, and they commonly speak the JDBC/ODBC protocols. Spark Thrift Server may be used in various fashions. It can run independently as Spark standalone [...]

March 25th, 2019 | Categories: Big Data, Data Engineering | 1 Comment

Introduction to Cloudera Data Science Workbench

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. It thus enables them to focus on their main task, deriving insights from data, without thinking about the complexity that lies in the background. CDSW was released after Cloudera’s acquisition of [...]

Apache Knox made easy!

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? […]

Hadoop cluster takeover with Apache Ambari

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari; we called this the Ambari Takeover. This is a risky process, and we will detail why this operation was required and how we did it. […]

November 15th, 2018 | Categories: Adaltas Summit 2018, Big Data | 0 Comments

Managing User Identities on Big Data Clusters

Securing a Big Data Cluster involves integrating or deploying specific services to store users. Some users are cluster-specific while others are available across all clusters. It is not always easy to understand how these different services fit together and whether they should be shared across multiple clusters. Also, which strategy to choose and what are [...]

November 8th, 2018 | Categories: Big Data, Cyber Security | 0 Comments

One week to discuss technology in a Moroccan riad

This year, Adaltas is organising its first conference, from the 22nd to the 26th of October. On the agenda for these 5 days of conference: discussing technology in one of the most beautiful riads of Marrakech. Combine business with pleasure: learn and share with your feet in the swimming pool. The rule is simple: each participant [...]

October 11th, 2018 | Categories: Adaltas Summit 2018 | 0 Comments

TensorFlow on Spark 2.3: The Best of Both Worlds

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. […]