DVC tutorial for machine learning projects
DVC is a data versioning tool that tracks data linked to a git repository. It can be seen as an add-on to git for large files. DVC also offers additional tools, such as data pipelines, which are not covered in this tutorial.
Initialisation
First, we initialise DVC at the root of the git repository; this connects the DVC commands to git.
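For a standard DVC installation, the initialisation is a single command run at the repository root:
dvc init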
A .dvcignore file is also created. We can add the virtualenv directory to this file, because tracking the virtual environment that contains DVC with DVC itself can cause problems.
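For example, if the virtual environment lives in a .venv/ directory (the directory name here is an assumption; adjust it to your setup), .dvcignore would simply contain:
.venv/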
Then we can add the dataset with DVC.
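Since the dataset lives in the data/ directory, the command is:
dvc add data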
This creates a data.dvc file that can then be tracked by git. DVC also adds the data/ directory to the .gitignore file; the .gitignore can itself be added to git to track these changes.
We can then work with this dataset, for example to train a model. In our example, launching the script train.py creates a model.pth file, which is the resulting trained model.
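For illustration only, a minimal stand-in for train.py could look like the sketch below. The random tensors, the tiny network, and the models/model.pth output path are assumptions made to show where the model file comes from; the real script would load the Sentinel-2 tiles from data/ instead.

# train.py -- minimal sketch: trains a tiny model and saves it under models/
import os
import torch
import torch.nn as nn

def main():
    # Stand-in data: random RGB patches and NIR targets
    # (the real script would load the tiles stored under data/).
    x = torch.rand(16, 3, 64, 64)
    y = torch.rand(16, 1, 64, 64)

    model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(5):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

    os.makedirs("models", exist_ok=True)
    torch.save(model.state_dict(), "models/model.pth")

if __name__ == "__main__":
    main()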
We can then commit the dataset and the model using git.
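One possible sequence, assuming the trained model is also tracked with DVC inside a models/ directory (as the next section suggests), is:
dvc add models
git add data.dvc models.dvc .gitignore
git commit -m "add dataset and first trained model"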
Making changes to the dataset and switching branches
Let's make some changes to the dataset on a different branch and train a new model:
git branch feature
git checkout feature
rm data/Sentinel2-RGB2NIR/train/ROIs1868_summer_s2_100_p40.tif
dvc add data
git add data.dvc
python train.py
dvc add models
git add models.dvc
git commit -m "fix: delete an image from the dataset and upload the model trained with the new dataset"
Thanks to DVC, switching back to the main branch lets us retrieve both the removed image and the model trained on the full dataset.
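This takes two commands: git restores the .dvc files from main, and dvc checkout then syncs the data/ and models/ directories from the DVC cache to match them:
git checkout main
dvc checkout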
Reproducibility with DVC & MLflow: better model versioning
As we saw in the last section, DVC can version models just as it versions data. The problem is that we update the model much more frequently than the dataset, and overwriting the model file at each training is not convenient anyway. For model versioning it is better to use a logger like MLflow, which tracks each training run together with the parameters of the network. However, MLflow alone does not give full reproducibility, because it does not associate the tracked models with the dataset used to train them. That is why we need DVC in combination with MLflow to retrieve all the information needed to reproduce a training.
To link DVC tracking with MLflow tracking, we can simply use the git commit hash, which MLflow records automatically when the code is launched from a git repository.
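For illustration, the logging side of train.py could look like the sketch below. The parameter names and the dummy metric value are assumptions; the point is that a run opened from inside the git repository is automatically tagged by MLflow with the current commit hash (the mlflow.source.git.commit tag, shown as the run's "Version" in the UI).

import mlflow

# Hypothetical hyperparameters -- adjust to the real training script.
params = {"learning_rate": 1e-3, "batch_size": 32}

with mlflow.start_run():
    # MLflow tags the run with the current git commit because the
    # script is launched from inside a git repository.
    mlflow.log_params(params)
    # ... training loop goes here ...
    mlflow.log_metric("val_loss", 0.42)  # dummy value for the sketch
    mlflow.log_artifact("models/model.pth")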
Let's suppose we want to reproduce a training that was performed weeks or months ago, but the dataset has been modified since then. If the training was recorded with MLflow, we have access to all the parameters used to launch the training (learning rate, batch size) and to the git commit associated with the run, shown in the "Version" column. To reproduce the training, we first check out that commit.
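With the hash copied from the MLflow UI (written here as a placeholder):
git checkout <commit_hash>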
This restores the code used to launch the training. To retrieve the dataset on which the model was trained, we simply use DVC again.
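As before, a single command syncs the data directory with the data.dvc file at that commit:
dvc checkout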
Now that we have the code, the dataset, and the parameters used for this specific training, we can reproduce the run under exactly the same conditions.