Data Science

Data Science, and more generally Artificial Intelligence (AI), differs from traditional programming and analysis in its ability to extract knowledge from data and modify its behavior and learn without specific programming. While traditional software predefines the logic that governs their processes, Data Science's algorithms build and discover models and are able to continually improve them.

Data Science brings together a set of skills including Machine Learning, Natural Language Processing (NLP), speech, images and faces recognition (among other applications). In some applications, the algorithms go so far as to simulate human intelligence.

Convergence of business, data and statitics

Articles related to Data Science

Data versioning and reproducible ML with DVC and MLflow

Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Git, Databricks, Delta Lake, Machine Learning, MLflow, Storage

Our talk on data versioning and reproducible Machine Learning proposed to the Data + AI Summit (formerly known as Spark+AI) is accepted. The summit will take place online the 17-19th November…

Experiment tracking with MLflow on Databricks Community Edition

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Deep Learning, Databricks, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn

Introduction to Databricks Community Edition and MLflow Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…

Version your datasets with Data Version Control (DVC) and Git

Categories: Data Science, DevOps & SRE | Tags: DevOps, Git, Infrastructure, Operation, SCM

Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…

Grégor JOUET

By Grégor JOUET

Sep 3, 2020

Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python

During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…

MLflow tutorial: an open source Machine Learning (ML) platform

Categories: Data Engineering, Data Science, Learning | Tags: Deep Learning, AWS, Databricks, Deployment, Machine Learning, Azure, MLflow, MLOps, Python, Scikit-learn

Introduction and principles of MLflow With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…

Introduction to Ludwig and how to deploy a Deep Learning model via Flask

Categories: Data Science, Tech Radar | Tags: Deep Learning, Learning and tutorial, Ludwig Deep Learning Toolbox, Machine Learning, Python

Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…

Robert Walid SOARES

By Robert Walid SOARES

Mar 2, 2020

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: Flink, Kafka, Spark, DevOps, Hadoop, HBase, Kubernetes, Python

Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…


By David WORMS

Nov 26, 2019

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod

Categories: Data Science | Tags: Deep Learning, GPU, Horovod, Keras, TensorFlow

The Deep Learning training process can be greatly speed up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…

Grégor JOUET

By Grégor JOUET

Nov 15, 2019

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business usecases, the behaviors and the insights that will be extracted from existing data. The stakes are…


By David WORMS

Oct 8, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, Schema, AI, Cloud, Machine Learning, MLOps, On-premises

“Enterprise Machine Learning requires looking at the big picture … from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…



Sep 30, 2019

TensorFlow installation on Docker

Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Deep Learning, Docker, Jupyter, Linux, AI, TensorFlow

TensorFlow is an Open Source software from Google for numerical computation using a graph representation: Vertex (nodes) represent mathematical operations Edges represent N-dimensional data array…



Aug 5, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Scala, Streaming, Clustering, Machine Learning

Spark MLlib is an Apache’s Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, Spark framework can serve as a platform for…



Jul 11, 2019

Introduction to Cloudera Data Science Workbench

Categories: Data Science | Tags: Cloudera, Docker, Git, Kubernetes, Machine Learning, Azure, Notebook

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…



Feb 28, 2019

Applying Deep Reinforcement Learning to Poker

Categories: Data Science | Tags: Algorithm, Deep Learning, Gaming, Q-learning, Machine Learning, Neural Network, Python

We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…



Jan 9, 2019

CodaLab – Data Science competitions

Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, MySQL, Node.js, Machine Learning, Python

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…

Robert Walid SOARES

By Robert Walid SOARES

Dec 17, 2018

Nvidia and AI on the edge

Categories: Data Science | Tags: Caffe, Deep Learning, Edge computing, GPU, NVIDIA, AI, Keras, PyTorch, TensorFlow

In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…

Yliess HATI

By Yliess HATI

Oct 10, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: Deep Learning, Micro Services, Node.js, Open API, Kubernetes, Neural Network

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning technics to summarize contents. It allows users to…

Yliess HATI

By Yliess HATI

Sep 18, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: Spark, YARN, Deep Learning, GPU, Hadoop, Spark MLlib, PyTorch, TensorFlow, XGBoost, MXNet

With the arrival of Hadoop 3, YARN offer more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…



Jul 24, 2018

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: YARN, GPU, Machine Learning, Neural Network, Storage

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

Grégor JOUET

By Grégor JOUET

May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, Spark, YARN, C++, CPU, GPU, JavaScript, Tuning, Keras, Kubernetes, Machine Learning, Python, TensorFlow

The integration of TensorFlow With Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…

Yliess HATI

By Yliess HATI

May 29, 2018

Apache Apex with Apache SAMOA

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Samoa, Storm, Tools, Hadoop, Machine Learning

Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Paremeter tuning Test: Accuracy Online…



Jul 17, 2016

Apache Apex: next gen Big Data analytics

Categories: Data Science, Events, Tech Radar | Tags: Apex, Flink, Kafka, Storm, Tools, Hadoop, Data Science, Machine Learning

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…



Jul 17, 2016

Definitions of machine learning algorithms present in Apache Mahout

Categories: Data Science | Tags: Algorithm, Сlassification, Hadoop, Mahout, Clustering, Machine Learning

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classfication and batch based collaborative filtering are implemented on top of Apache Hadoop…


By David WORMS

Mar 8, 2013

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: HDFS, MapReduce, Thrift, Data Analytics, Learning and tutorial, R, Hadoop, HBase

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…


By David WORMS

Jul 19, 2012

Installing and using MADlib with PostgreSQL on OSX

Categories: Data Science | Tags: Database, Greenplum, Statistics, PostgreSQL, SQL

We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an Open Source database with enterprise…


By David WORMS

Jul 7, 2012

Canada - Morocco - France

International locations

10 rue de la Kasbah
2393 Rabbat

We are a team of Open Source enthusiasts doing consulting in Big Data, Cloud, DevOps, Data Engineering, Data Science…

We provide our customers with accurate insights on how to leverage technologies to convert their use cases to projects in production, how to reduce their costs and increase the time to market.

If you enjoy reading our publications and have an interest in what we do, contact us and we will be thrilled to cooperate with you.