Versioning large files in git with DVC
DVC stands for Data Version Control and describes itself as an open-source version control system for machine learning projects. It has some ML-related features, such as running experiments and building simple data pipelines, but it can also be used solely for referencing and versioning large files in git. In this regard it is an alternative to Git Large File Storage (Git LFS) and similar solutions.
This is useful because git is not an ideal storage for binary or large files. Cloud git hosting providers also limit the size of repositories or of the files in them; for instance, GitHub limits individual files to 100 MB.
With DVC we can create a repository of files that are stored outside of our project folder, e.g. in Amazon S3, Azure Blob Storage, Google Drive or another cloud or local storage. Data Version Control can work as a standalone tool, but most people will probably use it together with git. Git is then used to store and version metadata about the remote repositories and the files stored in them.
The end goal is to be able to work with large files like datasets or ML models inside a project folder by fetching and uploading the files that match a specific git branch or commit. Downloaded files are kept in a built-in cache, so a version that has already been downloaded is not downloaded again when it is checked out.
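The caching idea can be illustrated without DVC itself. The sketch below is a simplified imitation of a content-addressed cache, not DVC's actual implementation: it stores a file under a name derived from its MD5 checksum and skips the copy when that content is already present. The cache_put helper, the CACHE_DIR variable and the directory layout are assumptions made for illustration, and md5sum from GNU coreutils is assumed to be available.

```shell
#!/bin/sh
# Simplified imitation of a content-addressed cache (NOT DVC's real code).
# cache_put is a hypothetical helper: it copies a file into CACHE_DIR under
# a name derived from its MD5 checksum, unless that content is already there.
CACHE_DIR="${CACHE_DIR:-./cache}"

cache_put() {
    file="$1"
    # md5sum prints "<hash>  <filename>"; keep only the hash
    hash=$(md5sum "$file" | cut -d ' ' -f 1)
    # split the hash into a 2-character directory and the remainder
    dir="$CACHE_DIR/$(printf '%s' "$hash" | cut -c 1-2)"
    target="$dir/$(printf '%s' "$hash" | cut -c 3-)"
    if [ -e "$target" ]; then
        # identical content was stored before: nothing to copy
        echo "cached"
    else
        mkdir -p "$dir"
        cp "$file" "$target"
        echo "stored"
    fi
}
```

With an empty cache, calling cache_put on the same unchanged file twice reports stored the first time and cached the second time, which is the behaviour the paragraph above describes for repeated checkouts.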
To use it, we will need to:
- Install DVC client
- Initialize DVC in an existing git repository
- Configure a DVC remote storage
- Place the files that we want to store outside of the git repository inside our project folder, but list them in .gitignore
- Add these files to DVC with the command dvc add; this will create associated metadata files
- Upload the files to the remote storage by running dvc push
- Commit the DVC metadata files and the .dvc directory in git
When this is done, we can use the dvc pull and dvc checkout commands to keep the correct versions of the files in the project folder after operations like git clone or git checkout.
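Keeping the files in sync can be sketched as two small shell functions. This is a minimal illustration, assuming git and dvc are on the PATH and that a default DVC remote is already configured in the repository; the function names clone_with_data and switch_version and their arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers; they only chain standard git and dvc commands.

# Clone a repository, then download the DVC-tracked files
# that belong to the checked-out commit.
clone_with_data() {
    repo_url="$1"
    target_dir="$2"
    git clone "$repo_url" "$target_dir" || return 1
    (cd "$target_dir" && dvc pull)
}

# Switch to another branch, tag or commit, then sync the tracked files.
# dvc checkout only restores files that are already in the local cache;
# run dvc pull afterwards if some of them have never been downloaded.
switch_version() {
    revision="$1"
    git checkout "$revision" || return 1
    dvc checkout
}
```

For example, switch_version main would run git checkout main followed by dvc checkout, so the data files in the working folder match the metadata committed on that branch.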
DVC is a Python CLI application, so we will need to have Python installed. Then we can simply install the dvc package with pip, pipx or an alternative installation option:
pip install dvc # global installation might be problematic when multiple projects need different versions of DVC
poetry add dvc # installation inside a Python project with poetry package manager
# other packages might be needed to support the chosen repository, e.g. for Google Drive we need to install:
pip install pydrive2
poetry add pydrive2
Example git + DVC usage
Let’s track a single dataset file with git and DVC from start to finish:
# place a data file inside the folder, e.g. data/dataset.csv
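# initialize git repository and dvc inside it:
git init
dvc init
# start tracking the dataset with DVC; this creates data/dataset.csv.dvc
# and adds the dataset to data/.gitignore:
dvc add data/dataset.csv
# track the generated .gitignore, which keeps the dataset out of git:
git add data/.gitignore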
# track *.dvc metadata file for dataset.csv in git:
git add data/dataset.csv.dvc
# configure a DVC remote, e.g. a Google Drive folder
dvc remote add -d storage gdrive://15B9PVuVB2hOYapm_oxfRzgv89tD7iD-3
# track the DVC config file in git:
git add .dvc/config
# upload the dataset to the specified remote storage:
dvc push
# commit changes in git:
git commit -m "Add dataset, configure DVC"
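Once the first version is committed, recording a later version follows the same pattern. The sketch below is an illustration under stated assumptions rather than a prescribed workflow: it assumes the dataset was replaced in place at data/dataset.csv and that the default remote from the example above is configured. The helper names save_new_version and restore_version are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers built from standard git and dvc commands.

# Record the current contents of data/dataset.csv as a new version.
save_new_version() {
    message="$1"
    # refresh data/dataset.csv.dvc with the checksum of the new contents
    dvc add data/dataset.csv || return 1
    git add data/dataset.csv.dvc || return 1
    git commit -m "$message" || return 1
    # upload the new file contents to the default remote
    dvc push
}

# Bring back the dataset as it was at an earlier git revision.
restore_version() {
    revision="$1"
    # restore only the metadata file from that revision
    git checkout "$revision" -- data/dataset.csv.dvc || return 1
    # make the data file match the restored metadata
    # (the contents must be in the local cache; otherwise use dvc pull)
    dvc checkout data/dataset.csv.dvc
}
```

A call such as restore_version HEAD~1 leaves the older metadata file staged in git, so the restored version can be inspected and then either committed or discarded.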
Last updated on 21.10.2020.