Versioning large files in git with DVC
DVC stands for Data Version Control and describes itself as an open-source version control system for machine learning projects. It has some ML-related features like running experiments or creating simple data pipelines, but it can also be used solely for referencing and versioning large files in git. In this regard it is an alternative to Git Large File Storage and similar solutions.
This is useful because git is not well suited to storing binary or large files. Cloud git hosting providers also limit the size of repositories or of the files in them; for instance, the limit for GitHub repositories is 100 MB per file.
With DVC we can create a repository of files that are stored outside of our project folder, e.g. in Amazon S3, Azure Blob Storage, Google Drive or other cloud or local storage. Data Version Control can work as a standalone tool, but most people will probably use it together with git. Git is then used to store and version the metadata about the remote repositories and the files stored in them.
The end goal is to be able to work with large files like datasets or ML models inside a project folder by fetching and uploading the files that match a specific git branch or commit. A built-in cache is used automatically for downloaded files, so a version that was already downloaded does not have to be downloaded again when we check it out.
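One way to picture such a cache: file contents are stored under a name derived from their hash, so a version that was seen before can be restored locally. The following is a minimal shell sketch of that idea, not DVC's actual implementation. The cache_put and cache_get helpers and the directory layout are hypothetical, and the sketch assumes that md5sum is available.

```shell
set -eu

work=$(mktemp -d)
cache="$work/cache"

# Hypothetical helper: map a content hash to a path inside the cache folder.
cache_path() {
  prefix=$(printf '%s' "$1" | cut -c1-2)
  rest=$(printf '%s' "$1" | cut -c3-)
  printf '%s/%s/%s' "$cache" "$prefix" "$rest"
}

# Hypothetical helper: store a file under its MD5 hash and print the hash.
cache_put() {
  hash=$(md5sum "$1" | cut -d' ' -f1)
  target=$(cache_path "$hash")
  mkdir -p "$(dirname "$target")"
  # Copy only when this exact content is not cached yet.
  [ -e "$target" ] || cp "$1" "$target"
  printf '%s\n' "$hash"
}

# Hypothetical helper: restore cached content by hash, without any download.
cache_get() {
  cp "$(cache_path "$1")" "$2"
}

# Two versions of the same file end up as two cache entries.
printf 'a,b\n1,2\n' > "$work/dataset.csv"
v1=$(cache_put "$work/dataset.csv")
printf 'a,b\n1,2\n3,4\n' > "$work/dataset.csv"
v2=$(cache_put "$work/dataset.csv")

# Going back to the first version only needs the local cache.
cache_get "$v1" "$work/dataset.csv"
restored=$(cat "$work/dataset.csv")
```

In this picture, the small metadata files tracked in git only need to record the hash, and the hash is enough to find the matching content in the cache or in the remote storage.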
To use it, we will need to:
- Install the DVC client
- Initialize DVC in an existing git repository
- Configure a DVC remote storage
- Place the files that we want to store outside of the git repository inside our project folder, but list them in the .gitignore file
- Add these files to DVC with the command dvc add; this will create associated metadata files
- Upload the files to the remote storage by running dvc push
- Commit the DVC metadata (*.dvc files and the .dvc directory) in git
When this is done, we can use the dvc pull and dvc checkout commands to keep the correct versions of the files in the project folder after operations like git clone or git checkout.
DVC installation
DVC is a Python CLI application, so we will need to have Python installed. Then we can simply install the dvc package with pip, pipx or an alternative installation option:
pip install dvc # global installation might be problematic when multiple projects need different versions of DVC
poetry add dvc # installation inside a Python project with poetry package manager
# other packages might be needed to support the chosen repository, e.g. for Google Drive we need to install:
pip install pydrive2
poetry add pydrive2
Example git + DVC usage
Let’s track a single dataset file with git and DVC from start to finish:
# place a data file inside the folder, e.g. data/dataset.csv
# initialize git repository and dvc inside it:
git init
dvc init
dvc add data/dataset.csv
# dvc add has already created data/.gitignore with an entry for dataset.csv; track it in git:
git add data/.gitignore
# track *.dvc metadata file for dataset.csv in git:
git add data/dataset.csv.dvc
# configure a DVC remote, e.g. a Google Drive folder
dvc remote add -d storage gdrive://15B9PVuVB2hOYapm_oxfRzgv89tD7iD-3
# track the DVC config file in git:
git add .dvc/config
# upload the dataset to the specified remote storage:
dvc push
# commit changes in git:
git commit -m "Add dataset, configure DVC"
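The example above covers the publishing side. A collaborator's side, using dvc pull and dvc checkout after git clone and git checkout, could look roughly like the sketch below. To stay runnable in isolation it first recreates a small project with two versions of the dataset. It assumes a local directory as the DVC remote instead of Google Drive, uses temporary folders, and the commit helper with its throwaway identity is hypothetical.

```shell
set -eu

# Skip the demo when the required tools are missing.
command -v git >/dev/null 2>&1 && command -v dvc >/dev/null 2>&1 || {
  echo "git and dvc are required for this demo" >&2
  exit 0
}

work=$(mktemp -d)
mkdir -p "$work/remote" "$work/project/data"

# Hypothetical helper: commit with a throwaway identity, so no global git config is needed.
commit() {
  git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "$1"
}

# Publisher: create the project and push two versions of the dataset.
cd "$work/project"
git init -q
dvc init -q
dvc remote add -d storage "$work/remote"   # local folder used as the remote (assumption)

printf 'a,b\n1,2\n' > data/dataset.csv
dvc add -q data/dataset.csv
git add data/dataset.csv.dvc data/.gitignore .dvc/config
commit "Add dataset v1"
dvc push -q

printf 'a,b\n1,2\n3,4\n' > data/dataset.csv
dvc add -q data/dataset.csv
git add data/dataset.csv.dvc
commit "Update dataset to v2"
dvc push -q

# Collaborator: clone the git repository, then download the data that matches it.
git clone -q "$work/project" "$work/clone"
cd "$work/clone"
dvc pull -q
latest=$(cat data/dataset.csv)

# Switch to the previous commit; this version is not cached yet, so pull it.
git checkout -q HEAD~1
dvc pull -q
previous=$(cat data/dataset.csv)

# Switch back; the file is already in the local cache, so dvc checkout is enough.
git checkout -q -
dvc checkout -q
restored=$(cat data/dataset.csv)
```

The difference between the two commands shows up in the last two steps: dvc pull downloads from the remote when the version is missing locally, while dvc checkout only syncs the project folder with what is already in the cache.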
Last updated on 21.10.2020.