Data Science

Apache Apex: next gen Big Data analytics

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction: Apache Apex is an in-memory, distributed, parallel stream processing engine, like Flink or Storm. However, it is built with native Hadoop integration in mind: YARN is used for resource management [...]

July 17th, 2016 | Categories: Data Science, Events, Tech Radar

Apache Apex with Apache SAMOA

Traditional Machine Learning - Batch oriented - Supervised (most common) - Training and scoring - One-time model building - Data set - Training: model building - Holdout: parameter tuning - Test: accuracy. Online Machine Learning - Streaming - Change - Dynamically adapt to new patterns in data - Change over time (concept drift) [...]

July 17th, 2016 | Categories: Data Science, Events

Hadoop and R with RHadoop

RHadoop is a bridge between R, a language and environment for statistically exploring data sets, and Hadoop, a framework for the distributed processing of large data sets across clusters of computers. RHadoop consists of three components, each an R package: rmr, rhdfs and rhbase. Below, we will present each of those [...]

July 19th, 2012 | Categories: Data Science

Installing and using MADlib with PostgreSQL on OSX

We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an open source database with enterprise features that MySQL often lacks. MADlib is an open source library that extends a PostgreSQL or Greenplum database with scalable in-database analytics. [...]

July 7th, 2012 | Categories: Data Science