
DVC tutorial for machine learning projects

DVC is a data versioning tool that tracks data linked to a git repository. It can be seen as an add-on to git for large files. DVC also offers additional tools, such as data pipelines, that are not covered in this tutorial.

Initialisation

First, the command:

dvc init

will set up DVC in the repository and connect the DVC commands to git.

A file .dvcignore is also created. We can add the virtualenv directory to it, because tracking the virtual environment that contains DVC with DVC itself can cause problems:

echo '/dvc_env' >> .dvcignore

Then we can add data with DVC:

dvc add data

This creates a file data.dvc that can then be tracked by git. DVC also adds the data/ directory to the .gitignore file. We can stage .gitignore as well to track this change:

git add data.dvc .gitignore
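For reference, data.dvc is a small YAML file that records a content hash of the tracked directory; git versions this tiny file instead of the data itself. A hypothetical example (the hash, size and file count below are made up):

```yaml
# Illustrative contents of data.dvc -- values are not real.
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir
  size: 104857600
  nfiles: 1200
  path: data
```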

We can then work with this dataset, for example to train a model. In our example, running the script train.py creates a file model.pth in the models/ directory, which is the resulting trained model:

python train.py
dvc add models
git add models.dvc
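The tutorial does not show train.py itself; its real version presumably trains a network with a deep-learning framework. The minimal stand-in below only mimics the side effect that matters for DVC, producing models/model.pth (here, a pickled toy linear model fitted by least squares):

```python
# Hypothetical stand-in for train.py: "trains" y = a*x + b on toy data
# and writes the result to models/model.pth, the file DVC will track.
import os
import pickle


def train(out_dir="models"):
    # Toy dataset standing in for the real training data in data/.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for slope a and intercept b.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    b = mean_y - a * mean_x
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, "model.pth")
    with open(path, "wb") as f:
        pickle.dump({"a": a, "b": b}, f)
    return path


if __name__ == "__main__":
    print(train())
```

Any script works here; the only requirement for the workflow is that the trained model lands in a directory (models/) that `dvc add` can snapshot.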

We can then commit the dataset and the model using git:

git commit -m "feat: add dataset and model to the repository"

Making changes to the dataset and switching branches

Let's make some changes to the dataset on a different branch and train a new model:

git branch feature
git checkout feature
rm data/Sentinel2-RGB2NIR/train/ROIs1868_summer_s2_100_p40.tif
dvc add data
git add data.dvc
python train.py
dvc add models
git add models.dvc
git commit -m "fix: delete an image from the dataset and upload the model trained with the new dataset"

Thanks to DVC, switching back to the main branch retrieves both the deleted image and the model trained on the entire dataset:

git checkout main
dvc checkout

Reproducibility with DVC & MLflow: better model versioning

As we saw in the last section, DVC allows model versioning just as it allows data versioning. The problem is that we update the model much more frequently than we update the dataset, and overwriting the model file at each training is not convenient anyway. To version models, it is better to use a logger like MLflow that tracks each training run together with the parameters of the network. However, MLflow alone does not provide full reproducibility, because it does not associate the tracked models with the dataset used to train them. That is why we need DVC in combination with MLflow to retrieve all the information needed to reproduce a training run.

To link DVC tracking with MLflow tracking, we can simply use the hash of the git commit, which is automatically recorded by MLflow when the code is launched from a git repository.
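Concretely, MLflow stores the launching commit as a run tag (to the best of my knowledge, under the key "mlflow.source.git.commit"). A small sketch of how one could turn a run's tags into the commands that restore the matching code and data state:

```python
# Sketch: derive the restore commands for a recorded MLflow run.
# Assumes the run's tags are available as a plain dict; MLflow records
# the launching commit under the tag "mlflow.source.git.commit" when
# the run is started from a git repository.
def restore_commands(run_tags):
    commit = run_tags.get("mlflow.source.git.commit")
    if commit is None:
        raise ValueError("run was not launched from a git repository")
    # git checkout restores the code; dvc checkout then restores the
    # dataset/model files referenced by the .dvc files at that commit.
    return [f"git checkout {commit}", "dvc checkout"]


print(restore_commands({"mlflow.source.git.commit": "a5a353"}))
```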

Let's suppose we want to reproduce a training run performed weeks or months ago, but the dataset has been modified since then. If the run was recorded with MLflow, we have access to all the parameters used to launch the training (learning rate, batch size) and to the git commit associated with the run, shown in the "Version" column. To reproduce the training, we can then check out that commit:

git checkout a5a353

which restores the code used to launch the training. To retrieve the dataset on which the model was trained, we simply use DVC:

dvc checkout

And now that we have the code, dataset and parameters used for this specific training, we can reproduce the run in the exact same conditions.