Software development and beyond

Versioning large files in git with DVC

DVC stands for Data Version Control and describes itself as an open-source version control system for machine learning projects. It has some ML-related features, like running experiments or creating simple data pipelines, but it can also be used solely for referencing and versioning large files in git. In this regard it is an alternative to Git Large File Storage (Git LFS) and similar solutions.

This is useful because git is not an ideal store for binary or large files. Cloud git hosting providers also limit the size of repositories or of the files in them; for instance, GitHub limits individual files to 100 MB.
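To see which files in an existing project would run into such a limit, a quick check with find can help. This is only a sketch: the helper name find_large_files and its default threshold of 100 MiB are assumptions made for illustration, not something DVC or git provides.

```shell
# Hypothetical helper: list files under a directory that are larger than
# a given size in MiB (default 100), skipping git's own .git folder.
find_large_files() {
    dir="${1:-.}"
    limit_mb="${2:-100}"
    # -size +Nc matches files larger than N bytes
    limit_bytes=$((limit_mb * 1024 * 1024))
    find "$dir" -type f -size +"${limit_bytes}c" ! -path '*/.git/*'
}

# Example usage (prints one path per line):
#   find_large_files . 100
```

Any file printed by this check is a candidate for tracking with DVC instead of committing it to git directly.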

With DVC we can create a repository of files that are stored outside of our project folder, e.g. in Amazon S3, Azure Blob Storage, Google Drive or other cloud or local storage. DVC can work as a standalone tool, but most people will probably use it together with git. Git is then used to store and version metadata about the remote repositories and the files stored in them.
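As a rough sketch of how the storage backend is chosen: a DVC remote is identified by its URL or path. The bucket, container and folder names below are placeholders, and the helper configure_remote is hypothetical. Cloud remotes typically also need extra packages and credentials, which are not shown here.

```shell
# Hypothetical helper: register a default DVC remote named "storage".
# All bucket, container and path names are placeholders.
configure_remote() {
    case "${1:-}" in
        s3)    dvc remote add -d storage s3://my-bucket/dvc-store ;;
        azure) dvc remote add -d storage azure://my-container/dvc-store ;;
        local) dvc remote add -d storage /tmp/dvc-store ;;
        *)     echo "unknown remote type: ${1:-}" >&2; return 1 ;;
    esac
}

# Example usage, inside a project where DVC is already initialized:
#   configure_remote local
```

The -d flag marks the remote as the default one, so later dvc push and dvc pull calls use it without naming it.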

The end goal is to work with large files like datasets or ML models inside a project folder by fetching and uploading the files that match a specific git branch or commit. Downloaded files are kept in a built-in cache, so checking out a version that was already downloaded does not download it again.

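To use it, we will need to install DVC, initialize it inside the project, add the large files to DVC, configure a remote storage, push the files to it, and commit the generated metadata files to git.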

When this is done, we can use the dvc pull and dvc checkout commands to keep the correct versions of the files in the project folder after operations like git clone or git checkout.
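A minimal sketch of that workflow, assuming a repository that already has a DVC remote configured. The helper names clone_with_data and switch_version are hypothetical, and the repository URL and branch name in the usage lines are placeholders.

```shell
# Hypothetical helper: clone a repository and download its DVC-tracked files.
clone_with_data() {
    git clone "$1" "$2" || return 1
    # run dvc inside the clone without changing the caller's directory
    ( cd "$2" && dvc pull )
}

# Hypothetical helper: switch to a git revision and sync the data files.
switch_version() {
    git checkout "$1" || return 1
    # dvc pull downloads whatever is missing from the local cache and then
    # updates the workspace; if everything is already cached,
    # "dvc checkout" alone is enough
    dvc pull
}

# Example usage (placeholder URL and branch name):
#   clone_with_data https://example.com/project.git project
#   switch_version my-branch
```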

DVC installation

DVC is a Python CLI application, so we will need to have Python installed. Then we can simply install the dvc package with pip, pipx or an alternative installation option:

pip install dvc # global installation might be problematic when multiple projects need different versions of DVC
poetry add dvc # installation inside a Python project with poetry package manager

# other packages might be needed to support the chosen repository, e.g. for Google Drive we need to install:
pip install pydrive2
poetry add pydrive2
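The pipx option mentioned above could look like the following sketch. The helper install_dvc is hypothetical. It assumes that remote-specific dependencies can also be requested through the optional extras of the dvc package, such as dvc[gdrive], and it falls back to a per-user pip installation when pipx is not available.

```shell
# Hypothetical helper: install DVC in isolation with pipx when available,
# otherwise into the user's site-packages with pip.
# Optional argument: the name of a dvc extra, e.g. "gdrive" or "s3".
install_dvc() {
    extra="${1:-}"
    if [ -n "$extra" ]; then
        pkg="dvc[$extra]"
    else
        pkg="dvc"
    fi
    if command -v pipx >/dev/null 2>&1; then
        pipx install "$pkg"
    else
        python3 -m pip install --user "$pkg"
    fi
}

# Example usage:
#   install_dvc gdrive
#   dvc --version
```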

Example git + DVC usage

Let’s track a single dataset file with git and DVC from start to finish:

# place a data file inside the folder, e.g. data/dataset.csv

# initialize git repository and dvc inside it:
git init
dvc init
dvc add data/dataset.csv

# dvc add has already listed dataset.csv in data/.gitignore, so git ignores the data file itself;
# track that .gitignore in git:
git add data/.gitignore

# track *.dvc metadata file for dataset.csv in git:
git add data/dataset.csv.dvc

# configure a DVC remote, e.g. a Google Drive folder
dvc remote add -d storage gdrive://15B9PVuVB2hOYapm_oxfRzgv89tD7iD-3

# track the DVC config file in git:
git add .dvc/config

# upload the dataset to the specified remote storage:
dvc push

# commit changes in git:
git commit -m "Add dataset, configure DVC"
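When the dataset changes later, the same steps repeat in a shorter form. The following is a sketch under the assumption that the file is already tracked by DVC and a default remote is configured; the helper name update_dataset and the default commit message are made up for illustration.

```shell
# Hypothetical helper: record a new version of a DVC-tracked file.
# $1: path to the data file, $2: optional git commit message.
update_dataset() {
    file="$1"
    message="${2:-Update $file}"
    dvc add "$file" &&              # refresh the hash stored in the .dvc file
    git add "$file.dvc" &&          # stage the updated metadata
    git commit -m "$message" &&     # version the metadata in git
    dvc push                        # upload the new data to the remote
}

# Example usage:
#   update_dataset data/dataset.csv "Update dataset"
```

Each git commit then points at one specific version of the data through the .dvc file, which is what lets dvc checkout and dvc pull restore the matching file later.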

Last updated on 21.10.2020.
