Big Data

Publish Spark SQL DataFrame and RDD with Spark Thrift Server

The distributed, in-memory nature of the Spark engine makes it an excellent candidate for exposing data to clients that expect low latency. Dashboards, notebooks, BI studios and KPI-based reporting tools, which commonly speak the JDBC/ODBC protocols, are typical examples. Spark Thrift Server can be used in several ways. It can run independently as Spark standalone [...]

March 25th, 2019 | Categories: Big Data, Data Engineering

Multihoming on Hadoop

Multihoming, which means having multiple networks attached to one node, is one of the main components to manage the heterogeneous network usage of an Apache Hadoop cluster. This article is an introduction to the concept and its applications for real-world businesses. […]

March 5th, 2019 | Categories: Adalas Summit 2018, Big Data, Data Engineering

Introduction to Cloudera Data Science Workbench

Cloudera Data Science Workbench (CDSW) is a platform that allows data scientists to create, manage, run and schedule data science workflows from their browser. It lets them focus on their main task, deriving insights from data, without thinking about the complexity that lies in the background. CDSW was released after Cloudera’s acquisition of [...]

Apache Knox made easy!

Apache Knox is the secure entry point of a Hadoop cluster, but can it also be the entry point for my REST applications? […]

CodaLab – Data Science competitions

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface through which users can submit code or results and compare themselves to others. Let’s see how it works and how to install CodaLab on-premises. […]

December 17th, 2018 | Categories: Big Data, Data Science

Main advantages of GraphQL as an alternative to REST

GraphQL is based on a simple idea: moving the assembly of a request from the server to the client. The client sees the overall strongly-typed schema instead of multiple REST endpoints and builds the query it needs. My first REST-based web application, SPAs for Single Page Applications as we are calling it lately, [...]

November 27th, 2018 | Categories: Big Data, Data Science

Hadoop cluster takeover with Apache Ambari

We recently migrated a large production Hadoop cluster from a “manual” automated install to Apache Ambari, an operation we called the Ambari Takeover. This is a risky process, and we will detail why the operation was required and how we did it. […]

November 15th, 2018 | Categories: Adalas Summit 2018, Big Data

Managing User Identities on Big Data Clusters

Securing a Big Data cluster involves integrating or deploying specific services to store users. Some users are cluster-specific, while others are available across all clusters. It is not always easy to understand how these different services fit together and whether they should be shared across multiple clusters. Also, which strategy to choose and what are [...]

November 8th, 2018 | Categories: Big Data, Cyber Security

Apache Flink: past, present and future

Apache Flink is a little gem that deserves a lot more attention. Let’s dive into Flink’s past, its current state and the future it is heading toward by following the keynotes and presentations at Flink Forward 2018. […]

November 5th, 2018 | Categories: Big Data, Data Engineering