Version your datasets with Data Version Control (DVC) and Git
Using a Version Control System such as Git for source code is a good practice and an industry standard. Considering that projects focus more and more on data, shouldn’t we apply a similar approach to the data itself?
The Data Version Control (DVC) project aims at bringing Git-style versioning to projects that use a lot of data. Such projects often contain a link of some sort to download the data, or sometimes rely on Git LFS. However, doing so has a major drawback: the data itself is not tracked by Git.
This means that another tool is required to track the changes to the data and, more often than not, this tool is a spreadsheet.
Data-intensive projects, such as deep learning projects, rely strongly on the quality of the dataset to produce good results. It is fair to say that the data is often more important than the model processing it. Having a way of tracking the version of your data and the changes made to it, and of exchanging it with your colleagues in a way that does not involve a .tar.gz, sounds like a good idea. And considering that Git is the most widely used versioning system, coupling Git with a data version control system sounds like an even better idea.
This way, when our dataset version is tracked, we can later use MLflow to link the accuracy of a model to the version of the dataset that we used.
DVC usage is very similar to Git, and the commands look almost the same. Basically, DVC lets you choose files in your repository and push them to an alternative remote storage such as S3, HDFS, or a server reachable over SSH, all of which are more suitable for large files than Git. To install DVC, simply use pip with the command sudo pip install dvc.
A DVC repository should first be tracked by an SCM tool. The SCM tool tracks the index files from DVC, which indicate where to get the actual files that are too big to put directly on Git. As a reminder, SCM stands for Source Control Management. It is a family of tools that includes Git, SVN and Mercurial. If you are only using Git, you can replace the term SCM with Git.
So first, initialize a Git repository as usual with git init.
Then, initialize a DVC repository in the same folder with dvc init.
This way your folder is tracked by both Git and DVC. You can still use Git to track the changes to your project as usual. With DVC, however, the files you choose are removed from Git, added to the .gitignore, and managed by DVC.
DVC will create a .dvc file containing information about your file. For example, if you have a file named dataset, the corresponding dataset.dvc would look like:
md5: a33606741514b870f609eb1510d8c6cf
outs:
- md5: b2455b259b1c3b5d86eac7dfbb3bbe6d
  path: dataset
  cache: true
  metric: false
  persist: false
This file describes a file called dataset, which is present on the remote (or in the DVC local cache) and available for checkout, just as in Git. The md5 hash is used for version control: it indicates which version of the file should be pulled and used in the project.
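To make the role of the hash concrete, here is a minimal Python sketch of content-based versioning: it computes the MD5 digest of a file and compares it with a recorded value such as the one listed under outs. The function names file_md5 and matches_recorded_hash are hypothetical and exist only for this illustration. DVC's own hashing code is not reproduced here and may treat some files differently, so take this as a sketch of the idea rather than a description of DVC internals.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded_hash(path, recorded_md5):
    """Tell whether the local file has the same content as the recorded version."""
    return file_md5(path) == recorded_md5.lower()
```

Under the assumption that the recorded hash covers the raw bytes of the file, matches_recorded_hash returns False as soon as a single byte of the local file differs from the recorded version.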
.dvc files should be committed to the SCM, because DVC is not an SCM tool: DVC has commits, but they are not linked to one another. Commits do not have a parent; that relationship is handled by the SCM software. In the example above, the md5 hash is not tied to any other DVC hash. The hashes are independent, and their history is tracked by the SCM software.
The commands to commit, push and pull from a remote storage like S3 or HDFS are very simple:
dvc add: track a file with DVC (and automatically append it to the .gitignore), which creates the associated .dvc file.
dvc commit: commit the changes to the DVC local cache.
dvc pull / dvc push: fetch and check out the files from the remote, or send them to the remote.
So when cloning a Data Science project, simply use the following sequence to clone the project and get the associated large files such as datasets:
git clone <url> project
cd project
dvc pull
Use this sequence to add and push the file dataset.zip to DVC:
dvc add dataset.zip
dvc commit
git add dataset.zip.dvc .gitignore
git commit -m "[DVC] Move dataset to dvc"
dvc push
git push
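If this sequence is repeated often, it can be scripted. The following Python sketch only wraps the commands listed above. The function names publish_commands and run_all are hypothetical, and the script assumes that the git and dvc executables are on the PATH and that remotes are already configured.

```python
import subprocess


def publish_commands(data_file, message):
    """Build the DVC and Git commands, in order, that publish one data file."""
    return [
        ["dvc", "add", data_file],
        ["dvc", "commit"],
        # The .dvc index file and the updated .gitignore are what Git tracks.
        ["git", "add", data_file + ".dvc", ".gitignore"],
        ["git", "commit", "-m", message],
        ["dvc", "push"],
        ["git", "push"],
    ]


def run_all(commands, dry_run=False):
    """Print each command, then run it unless dry_run is set; stop at the first failure."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the command exits with an error.
            subprocess.run(command, check=True)
```

Calling run_all(publish_commands("dataset.zip", "[DVC] Move dataset to dvc"), dry_run=True) prints the commands without executing them, which is a convenient way to review the sequence first.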
DVC can be combined with other Data Science tools to make the most of it. One of those tools is MLflow, which is used to track the performance of Machine Learning models.
MLflow can be used with a lot of Data Science frameworks: TensorFlow, PyTorch, Spark… The interesting thing is that MLflow’s runs can be tagged with the Git commit hash. Before, only the code would go on Git, and the dataset information would typically go in some Excel files passed from department to department. Now, with DVC, the dataset is integrated in Git. You can keep track of the modifications made to it in the Git commit messages and have a branch dedicated to dataset tracking. You can also see the influence of each modification on the model’s performance with MLflow. Because DVC has nothing to do with the code of the project, it is very easy to add to an existing project.
We use MLflow here as an example, but any other framework using an SCM tool would work as well.
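As an illustration of that link between a model's score and the dataset version, the Python sketch below tags an MLflow run with the current Git commit and a dataset hash, then logs an accuracy value. The tag names git_commit and dataset_md5 and the three helper functions are hypothetical and chosen for this example only. mlflow is a third-party package assumed to be installed; it is imported lazily so that the other helpers remain usable without it.

```python
import subprocess


def current_git_commit(repo_dir="."):
    """Return the hash of the commit checked out in repo_dir (needs the git CLI)."""
    output = subprocess.check_output(
        ["git", "rev-parse", "HEAD"], cwd=repo_dir, text=True
    )
    return output.strip()


def build_run_tags(git_commit, dataset_md5):
    """Group the identifiers that pin down the code and the data of a run."""
    # Hypothetical tag names, not an MLflow or DVC convention.
    return {"git_commit": git_commit, "dataset_md5": dataset_md5}


def log_training_run(accuracy, tags):
    """Record one MLflow run carrying the given tags and an accuracy metric."""
    import mlflow  # third-party dependency, assumed to be installed

    with mlflow.start_run():
        mlflow.set_tags(tags)
        mlflow.log_metric("accuracy", accuracy)
```

A training script could call log_training_run(accuracy, build_run_tags(current_git_commit(), md5)) once evaluation is done, where md5 is the value read from the .dvc file. Since the .dvc file is itself committed to Git, the commit hash alone already identifies the dataset version; the extra tag simply makes it visible without checking out the commit.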
The main difficulty with DVC is the initial configuration of the remote, especially with HDFS, which requires some setup to get a usable client. One of the easiest ways to set up DVC is to use an S3 bucket.
dvc remote add myremote s3://bucket/path
But DVC can also be used with many other remotes, such as Google Drive:
dvc remote add myremote gdrive://root/my-dvc-root
dvc remote modify myremote gdrive_client_id my_gdrive_client_id
dvc remote modify myremote gdrive_client_secret gdrive_client_secret
More information on remote storage is available on the DVC website.
Also be aware that DVC keeps a local cache of the files being tracked. The files tracked by DVC can be hard-linked to this cache, so keep that in mind if you decide to remove DVC from your project. More information about the DVC cache structure is available on the DVC website.
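Because hard links are easy to overlook, here is a small Python sketch of what they imply. It does not use DVC at all: the two paths merely stand in for a file in the DVC cache and its counterpart in the workspace, and the helper names are hypothetical.

```python
import os


def link_from_cache(cache_file, workspace_file):
    """Expose a cached file in the workspace through a hard link (no data copy)."""
    os.link(cache_file, workspace_file)


def shares_storage(path_a, path_b):
    """Tell whether two paths refer to the same underlying file."""
    return os.path.samefile(path_a, path_b)


def link_count(path):
    """Return how many directory entries point at the file's data."""
    return os.stat(path).st_nlink
```

After link_from_cache, both paths share the same data: an in-place edit made through the workspace path is also visible through the cache path, and deleting one of the two paths leaves the data reachable through the other. This is why removing a cache directory alone does not necessarily free disk space or remove the data.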