Databricks
Databricks provides an Apache Spark based analytics platform on two of the largest cloud service providers: Microsoft Azure and Amazon AWS. Founded by Spark developers, Databricks focuses on monetizing the open source Big Data system Apache Spark by providing a unified and simple user experience. It is used for building large Data Lakes, to implement real-time streaming use cases or to replace large ETL processes.
Related articles
Data versioning and reproducible ML with DVC and MLflow
Categories: Data Science, DevOps & SRE, Events | Tags: Data Engineering, Git, Databricks, Delta Lake, Machine Learning, MLflow, Storage
Our talk on data versioning and reproducible Machine Learning proposed to the Data + AI Summit (formerly known as Spark+AI) is accepted. The summit will take place online the 17-19th November…
Sep 30, 2020
Experiment tracking with MLflow on Databricks Community Edition
Categories: Data Engineering, Data Science, Learning | Tags: Spark, Deep Learning, Databricks, Delta Lake, Machine Learning, MLflow, Notebook, Python, Scikit-learn
Introduction to Databricks Community Edition and MLflow Every day the number of tools helping Data Scientists to build models faster increases. Consequently, the need to manage the results and the…
Sep 10, 2020
Version your datasets with Data Version Control (DVC) and Git
Categories: Data Science, DevOps & SRE | Tags: DevOps, Git, Infrastructure, Operation, SCM
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we have a similar approach such…
By Grégor JOUET
Sep 3, 2020
Importing data to Databricks: external tables and Delta Lake
Categories: Data Engineering, Data Science, Learning | Tags: Parquet, AWS, Amazon S3, Azure Data Lake Storage (ADLS), Databricks, Delta Lake, Python
During a Machine Learning project we need to keep track of the training data we are using. This is important for audit purposes and for assessing the performance of the models, developed at a later…
May 21, 2020
MLflow tutorial: an open source Machine Learning (ML) platform
Categories: Data Engineering, Data Science, Learning | Tags: Deep Learning, AWS, Databricks, Deployment, Machine Learning, Azure, MLflow, MLOps, Python, Scikit-learn
Introduction and principles of MLflow With increasingly cheaper computing power and storage and at the same time increasing data collection in all walks of life, many companies integrated Data Science…
Mar 23, 2020
Should you move your Big Data and Data Lake to the Cloud
Categories: Big Data, Cloud Computing | Tags: DevOps, AWS, Cloud, CDP, Databricks, GCP, Azure
Should you follow the trend and migrate your data, workflows and infrastructure to GCP, AWS and Azure? During the Strata Data Conference in New-York, a general focus was put on moving customer’s Big…
Dec 9, 2019