Data Science

Articles related to data science

Deploy your containerized AI applications with nvidia-docker

Categories: Containers Orchestration, Data Science | Tags: containerd, DevOps, Learning and tutorial, NVIDIA, Docker, Keras, TensorFlow

More and more products and services are taking advantage of the modeling and prediction capabilities of AI. This article presents the nvidia-docker tool for integrating AI (Artificial Intelligence…

By Robert Walid SOARES

March 24, 2022

Spring 2022 internship - building a Data Lab

Categories: Data Science, Learning | Tags: MongoDB, Spark, Argo CD, Elasticsearch, Internship, Keycloak, Kubernetes, OpenID Connect, PostgreSQL

Job Description Over the last few years, we developed the ability to use computers to process large amounts of data. The ecosystem evolved over a large offering of tools and libraries and the creation…

By David WORMS

November 24, 2021

H2O in practice: a protocol combining AutoML with traditional modeling approaches

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python, XGBoost

H2O comes with a lot of functionalities. The second part of the series H2O in practice proposes a protocol to combine AutoML modeling with traditional modeling and optimization approaches. The objective…

By Petra KAFERLE DEVISSCHERE

November 12, 2021

H2O in practice: a Data Scientist feedback

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python

Automated machine learning (AutoML) platforms are gaining popularity and becoming a new important tool in the data scientists’ toolbox. A few months ago, I introduced H2O, an open-source platform for…

By Petra KAFERLE DEVISSCHERE

September 29, 2021

Apache Liminal: when MLOps meets GitOps

Categories: Big Data, Containers Orchestration, Data Engineering, Data Science, Tech Radar | Tags: Data Engineering, CI/CD, Data Science, Deep Learning, Deployment, Docker, GitOps, Kubernetes, Machine Learning, MLOps, Open source, Python, TensorFlow

Apache Liminal is open-source software that offers a solution for deploying end-to-end Machine Learning pipelines. It centralizes all the steps needed to construct Machine Learning…

By Aargan COINTEPAS

March 31, 2021

Storage size and generation time in popular file formats

Categories: Data Engineering, Data Science | Tags: Avro, HDFS, Hive, ORC, Parquet, Big Data, Data Lake, File Format, JavaScript Object Notation (JSON)

Choosing an appropriate file format is essential, whether your data transits on the wire or is stored at rest. Each file format comes with its own advantages and disadvantages. We covered them in a…

By Barthelemy NGOM

March 22, 2021

TensorFlow Extended (TFX): the components and their functionalities

Categories: Big Data, Data Engineering, Data Science, Learning | Tags: Beam, Data Engineering, Pipeline, CI/CD, Data Science, Deep Learning, Deployment, Machine Learning, MLOps, Open source, Python, TensorFlow

Putting Machine Learning (ML) and Deep Learning (DL) models in production is certainly a difficult task. It has been recognized as more failure-prone and time-consuming than the modeling itself, yet…

By Petra KAFERLE DEVISSCHERE

March 5, 2021

Faster model development with H2O AutoML and Flow

Categories: Data Science, Learning | Tags: Automation, Cloud, H2O, Machine Learning, MLOps, On-premises, Open source, Python

Building Machine Learning (ML) models is a time-consuming process. It requires expertise in statistics, ML algorithms, and programming. On top of that, it also requires the ability to translate a…

By Petra KAFERLE DEVISSCHERE

December 10, 2020

Data versioning and reproducible ML with DVC and MLflow

Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Databricks, Delta Lake, Git, Machine Learning, MLflow, Storage

Our talk on data versioning and reproducible Machine Learning, submitted to the Data + AI Summit (formerly known as Spark+AI), has been accepted. The summit will take place online on the 17-19th of November…

By Petra KAFERLE DEVISSCHERE

September 30, 2020

Experiment tracking with MLflow on Databricks Community Edition

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Databricks, Deep Learning, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn

Introduction to Databricks Community Edition and MLflow Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…

By Petra KAFERLE DEVISSCHERE

September 10, 2020

Version your datasets with Data Version Control (DVC) and Git

Categories: Data Science, DevOps & SRE | Tags: DevOps, Infrastructure, Operation, Git, GitOps, SCM

Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…

By Grégor JOUET

September 3, 2020

Importing data to Databricks: external tables and Delta Lake

Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python

During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…

By Petra KAFERLE DEVISSCHERE

May 21, 2020

MLflow tutorial: an open source Machine Learning (ML) platform

Categories: Data Engineering, Data Science, Learning | Tags: AWS, Azure, Databricks, Deep Learning, Deployment, Machine Learning, MLflow, MLOps, Python, Scikit-learn

Introduction and principles of MLflow With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…

By Petra KAFERLE DEVISSCHERE

March 23, 2020

Introduction to Ludwig and how to deploy a Deep Learning model via Flask

Categories: Data Science, Tech Radar | Tags: Learning and tutorial, Deep Learning, Ludwig Deep Learning Toolbox, Machine Learning, Python

Over the past decade, Machine Learning and deep learning models have proven to be very effective in performing a wide variety of tasks such as fraud detection, product recommendation, autonomous…

By Robert Walid SOARES

March 2, 2020

Internship Data Science & Data Engineer - ML in production and streaming data ingestion

Categories: Data Engineering, Data Science | Tags: DevOps, Flink, Hadoop, HBase, Kafka, Spark, Internship, Kubernetes, Python

Context The exponential evolution of data has turned the industry upside down by redefining data storage, processing and data ingestion pipelines. Mastering these methods considerably facilitates…

By David WORMS

November 26, 2019

Avoid Bottlenecks in distributed Deep Learning pipelines with Horovod

Categories: Data Science | Tags: GPU, Deep Learning, Horovod, Keras, TensorFlow

The Deep Learning training process can be greatly sped up using a cluster of GPUs. When dealing with huge amounts of data, distributed computing quickly becomes a challenge. A common obstacle which…

By Grégor JOUET

November 15, 2019

Innovation, project vs product culture in Data Science

Categories: Data Science, Data Governance | Tags: DevOps, Agile, Scrum

Data Science carries the jobs of tomorrow. It is closely linked to the understanding of the business use cases, the behaviors and the insights that will be extracted from existing data. The stakes are…

By David WORMS

October 8, 2019

Machine Learning model deployment

Categories: Big Data, Data Engineering, Data Science, DevOps & SRE | Tags: DevOps, Operation, AI, Cloud, Machine Learning, MLOps, On-premises, Schema

“Enterprise Machine Learning requires looking at the big picture […] from a data engineering and a data platform perspective,” lectured Justin Norman during the talk on the deployment of Machine…

By Oskar RYNKIEWICZ

September 30, 2019

TensorFlow installation on Docker

Categories: Containers Orchestration, Data Science, Learning | Tags: CPU, Linux, AI, Deep Learning, Docker, Jupyter, TensorFlow

TensorFlow is open-source software from Google for numerical computation using a graph representation: vertices (nodes) represent mathematical operations; edges represent N-dimensional data array…

By Pierre SAUVAGE

August 5, 2019

Spark Streaming part 4: clustering with Spark MLlib

Categories: Data Engineering, Data Science, Learning | Tags: Spark, Apache Spark Streaming, Big Data, Clustering, Machine Learning, Scala, Streaming

Spark MLlib is an Apache Spark library offering scalable implementations of various supervised and unsupervised Machine Learning algorithms. Thus, the Spark framework can serve as a platform for…

By Oskar RYNKIEWICZ

June 27, 2019

Introduction to Cloudera Data Science Workbench

Categories: Data Science | Tags: Azure, Cloudera, Docker, Git, Kubernetes, Machine Learning, MLOps, Notebook

Cloudera Data Science Workbench is a platform that allows Data Scientists to create, manage, run and schedule data science workflows from their browser. Thus it enables them to focus on their main…

By Mehdi ELALAMI

February 28, 2019

Applying Deep Reinforcement Learning to Poker

Categories: Data Science | Tags: Algorithm, Gaming, Q-learning, Deep Learning, Machine Learning, Neural Network, Python

We will cover the subject of Deep Reinforcement Learning, more specifically the Deep Q Learning algorithm introduced by DeepMind, and then we’ll apply a version of this algorithm to the game of Poker…

By Oscar BLAZEJEWSKI

January 9, 2019

CodaLab – Data Science competitions

Categories: Data Science, Adaltas Summit 2018, Learning | Tags: Database, Infrastructure, Machine Learning, MySQL, Node.js, Python

CodaLab Competition is a platform for code execution in the field of Data Science. It is a web interface on which a user can submit code or results and compare themselves to others. Let’s see how it…

By Robert Walid SOARES

December 17, 2018

Nvidia and AI on the edge

Categories: Data Science | Tags: Caffe, GPU, NVIDIA, AI, Deep Learning, Edge computing, Keras, PyTorch, TensorFlow

In the last four years, corporations have been investing a lot in AI and particularly in Deep Learning and Edge Computing. While the theory has taken huge steps forward and new algorithms are invented…

By Yliess HATI

October 10, 2018

Lando: Deep Learning used to summarize conversations

Categories: Data Science, Learning | Tags: Micro Services, Open API, Deep Learning, Internship, Kubernetes, Neural Network, Node.js

Lando is an application to summarize conversations using Speech To Text to translate the written record of a meeting into text and Deep Learning techniques to summarize contents. It allows users to…

By Yliess HATI

September 18, 2018

Deep learning on YARN: running Tensorflow and friends on Hadoop cluster

Categories: Data Science | Tags: GPU, Hadoop, MXNet, Spark, Spark MLlib, YARN, Deep Learning, PyTorch, TensorFlow, XGBoost

With the arrival of Hadoop 3, YARN offers more flexibility in resource management. It is now possible to perform Deep Learning analysis on GPUs with specific development environments, leveraging…

By Louis BIANCHERIN

July 24, 2018

YARN and GPU Distribution for Machine Learning

Categories: Data Science, DataWorks Summit 2018 | Tags: GPU, YARN, Machine Learning, Neural Network, Storage

This article goes over the fundamental principles of Machine Learning and what tools are currently used to run machine learning algorithms. We will then see how a resource manager such as YARN can be…

By Grégor JOUET

May 30, 2018

TensorFlow on Spark 2.3: The Best of Both Worlds

Categories: Data Science, DataWorks Summit 2018 | Tags: Mesos, C++, CPU, GPU, Tuning, Spark, YARN, JavaScript, Keras, Kubernetes, Machine Learning, Python, TensorFlow

The integration of TensorFlow with Spark has a lot of potential and creates new opportunities. This article is based on a conference seen at the DataWorks Summit 2018 in Berlin. It was about the new…

By Yliess HATI

May 29, 2018

Apache Apex: next gen Big Data analytics

Categories: Data Science, Events, Tech Radar | Tags: Apex, Storm, Tools, Flink, Hadoop, Kafka, Data Science, Machine Learning

Below is a compilation of my notes taken during the presentation of Apache Apex by Thomas Weise from DataTorrent, the company behind Apex. Introduction Apache Apex is an in-memory distributed parallel…

By César BEREZOWSKI

July 17, 2016

Apache Apex with Apache SAMOA

Categories: Data Science, Events, Tech Radar | Tags: Apex, Samoa, Storm, Tools, Flink, Hadoop, Machine Learning

Traditional Machine Learning Batch Oriented Supervised - most common Training and Scoring One time model building Data set Training: Model building Holdout: Parameter tuning Test: Accuracy Online…

By Pierre SAUVAGE

July 17, 2016

Definitions of machine learning algorithms present in Apache Mahout

Categories: Data Science | Tags: Algorithm, Classification, Hadoop, Mahout, Clustering, Machine Learning

Apache Mahout is a machine learning library built for scalability. Its core algorithms for clustering, classification and batch-based collaborative filtering are implemented on top of Apache Hadoop…

By David WORMS

March 8, 2013

Hadoop and R with RHadoop

Categories: Business Intelligence, Data Science | Tags: Thrift, Learning and tutorial, R, Hadoop, HBase, HDFS, MapReduce, Data Analytics

RHadoop is a bridge between R, a language and environment to statistically explore data sets, and Hadoop, a framework that allows for the distributed processing of large data sets across clusters of…

By David WORMS

July 19, 2012

Installing and using MADlib with PostgreSQL on OSX

Categories: Data Science | Tags: Database, Greenplum, Statistics, PostgreSQL, SQL

We cover basic installation and usage of PostgreSQL and MADlib on OSX and Ubuntu. Instructions for other environments should be similar. PostgreSQL is an Open Source database with enterprise…

By David WORMS

July 7, 2012
