Using DataLad to version control your data

Before we begin, you should be aware that there is already an amazing DataLad Handbook with super detailed documentation and tutorials. We definitely recommend you start there when learning about DataLad. The goal of this page is to help you apply DataLad commands and principles to your own BIDS-formatted dataset on your institution’s server.

Do this once: Install DataLad in a conda environment

We recommend installing DataLad in a conda environment on the server. This will allow you to use DataLad to track data you are storing and modifying on the server.

If you have already set up a pygers conda environment following the instructions on our conda tip page, you are good to go! The DataLad package was installed as part of the setup. Go ahead and log in to the server, activate your pygers environment, and proceed.

# log in to scotty or spock
# activate pygers environment
$ conda activate pygers
# test it out!
$ datalad --version
$ datalad --help

If you already have Miniconda installed and want to create a new, datalad-only conda environment:

# log in to scotty or spock
# create a new conda environment and name it datalad
$ conda create -n datalad
$ conda activate datalad
# install DataLad
$ conda install -c conda-forge datalad
# test it out!
$ datalad --version
$ datalad --help

If you want to install datalad as part of one of your other, pre-existing conda environments:

# log in to scotty or spock
$ conda activate <myenv>
# install DataLad
$ conda install -c conda-forge datalad
# test it out!
$ datalad --version
$ datalad --help
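
If datalad --version reports that the command cannot be found, the usual cause is that the conda environment is not active in your current shell. Here is a minimal sketch of a check you could run first. The helper name have_cmd is made up for this example and is not part of DataLad or conda.

```shell
# have_cmd is a hypothetical helper, not part of DataLad or conda.
# It succeeds when the named program can be found on your PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd datalad; then
    echo "datalad found at: $(command -v datalad)"
else
    echo "datalad not found; activate your conda environment first"
fi
```

If you see the "not found" message, run conda activate with the name of the environment where you installed DataLad and try again.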

Finally, if you haven’t done this in the past, you should configure your Git identity.

# check if your git identity is already configured
$ git config --list
# if you are already configured, you should see your user.name and user.email listed

# if you still need to configure
# (plain assignment replaces any existing value; --add would append a duplicate entry)
$ git config --global user.name "FirstName LastName"
$ git config --global user.email youremail@blah.com
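
Scanning the full git config --list output for two entries can be tedious. As a rough sketch, you could ask git for each key directly. The helper name check_git_key is an assumption made for this example.

```shell
# check_git_key is a hypothetical helper for this example.
# `git config --get KEY` exits non-zero when KEY has no value.
check_git_key() {
    if git config --get "$1" >/dev/null 2>&1; then
        echo "$1 is set"
    else
        echo "$1 is NOT set"
    fi
}

check_git_key user.name
check_git_key user.email
```

If either key is reported as not set, configure it before you run datalad create so that your commits carry the right author information.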

Great! Now you are ready to start using DataLad on the server!

Start tracking existing data with DataLad

For this demo, we will show you how to apply DataLad commands to the Pygers sample dataset (i.e., a pre-existing dataset). You can also take this workflow and apply it to your own pre-existing, BIDS-formatted dataset. Before jumping straight in to using DataLad on your own dataset, we recommend practicing with the sample dataset to familiarize you with DataLad commands and general principles.

If you are practicing with our sample dataset, make sure you have already worked through converting data to BIDS, quality control with MRIQC, and preprocessing with fMRIPrep. You should have a version of the sample dataset living in your personal directory (e.g., jukebox/YOURLAB/USERNAME/sample_project) with the corresponding derivatives. Alternatively, if you want to cheat and skip these steps :), you can copy our sample project output to your personal directory.

# copy sample output to your personal directory and call it sample_project
$ cp -r /jukebox/norman/pygers/handbook/sample_project_output_v20.2.0 /jukebox/YOURLAB/USERNAME/sample_project

The (abbreviated) structure of the /sample_project directory should be the following:

└── sample_project
    └── code
        └── analysis
        └── preprocessing
        └── task
    └── data
        └── behavioral
        └── bids
            └── sub-001
            └── derivatives
                └── deface
                └── fmriprep
                └── freesurfer
                └── mriqc
        └── dicom
        └── work

We will make /sample_project our highest level dataset (we will refer to this as the “superdataset”). Then we will create a series of “subdatasets”. Subdatasets are really standalone datasets, with their own git log history and .gitignore files. This allows others (or your future self) to clone a single subdataset without cloning your entire dataset. In this demo, we will make the following directories their own (sub)datasets:

  • /code

  • /data/bids

  • /data/bids/derivatives/deface

  • /data/bids/derivatives/fmriprep

  • /data/bids/derivatives/freesurfer

  • /data/bids/derivatives/mriqc

There are also some files and directories that we want to ignore (or leave “untracked”). This is either because we aren’t worried about version controlling certain directories (e.g., /data/work) or because files might contain sensitive information that we don’t want to share or make publicly available (e.g., /data/dicom and anatomical scans that have not been defaced). We will make sure to add directories and files that we do not want to track into the appropriate dataset’s .gitignore file.

Let’s get started! First, activate your datalad conda environment and navigate to /sample_project in your personal directory. Notice that on the command line, (datalad) indicates you are working in your datalad conda environment (this could also say (pygers) if you are working in your pygers conda environment). The working directory is shown in [brackets].

[~]$ conda activate datalad
(datalad) [~]$ cd /jukebox/YOURLAB/USERNAME/sample_project

Step 1: Set up your highest level dataset

# create a new dataset
(datalad) [sample_project]$ datalad create -c text2git --description "Princeton pygers sample dataset" -f .
# don't forget the period at the end to indicate current directory!

# check your commit history
(datalad) [sample_project]$ git log

You should see two commits, each with your name <email> (from git config) as the Author and the Date of your commit. These commits were automatically generated when you ran datalad create. An example is shown below:

commit 1b2cea79ad11d17e1fd44c8047b6fd62da8e7dc1
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 16:51:48 2020 -0400

    Instruct annex to add text files to Git

commit 92f5ddc455310e8af325f0a3843957fd5246af26
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 16:51:31 2020 -0400

    [DATALAD] new dataset

Now we can use datalad status to see which directories and files are “untracked” or “modified”.

(datalad) [sample_project]$ datalad status

You should see 1 untracked file (.DS_Store) and 2 untracked directories (/code and /data).

Before proceeding, you need to add a few entries to a .gitignore file. Warning! You will need to use vim; if you are new to it, see Getting started with vim.

After modifying your .gitignore file, you will commit your modification using a datalad save command.

(datalad) [sample_project]$ vim .gitignore

# Add the following to your gitignore:
# *.DS_Store
# data/dicom
# data/work

# commit .gitignore modification
(datalad) [sample_project]$ datalad save -m "add gitignore" .gitignore
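
If you would rather not open vim, the same three entries can be written from the shell. This is only a sketch: it uses a throwaway directory (demo_dir, a name invented here) as a stand-in for sample_project so that it is safe to try anywhere.

```shell
# Assumption: in real use you would run only the printf line,
# from inside sample_project. A throwaway directory stands in for it here.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# One pattern per line, matching the three entries listed above.
# Note: '>' overwrites an existing .gitignore; use '>>' to append instead.
printf '%s\n' '*.DS_Store' 'data/dicom' 'data/work' > .gitignore

# Show the result so you can confirm the contents.
cat .gitignore
```

After writing the file this way, the datalad save command shown above commits it exactly as it would a file edited in vim.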

You will now see one more commit in this dataset’s history:

# check your commit history
(datalad) [sample_project]$ git log

commit d49a496f1a629f69f627a2707f7171941179eaba
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 16:59:46 2020 -0400

    add gitignore

Now, if you check datalad status you should see that .DS_Store is no longer listed as an untracked file.

Step 2: Set up your /code dataset

Now let’s add a new (sub)dataset. Remember, this will be a standalone dataset with its own commit history (git log) and .gitignore file. However, when we create this dataset, we will “link” to our superdataset, and the time and date that we created this (sub)dataset will be tracked in the superdataset’s git log history.

A couple of notes about the command options we are including below:

  • we are using the --no-annex flag here because this directory only contains code files and not large data files

  • the -d^ flag is what “links” this dataset to the superdataset (^ is DataLad shorthand for the topmost superdataset that contains your working directory)

  • note that the last input (./code) is the path to the directory we want to make a dataset (relative to your working directory)

# create a new dataset
(datalad) [sample_project]$ datalad create --no-annex -c text2git -f -d^ ./code

# check your superdataset commit history
(datalad) [sample_project]$ git log

You will see that there is a new commit that was logged when you ran datalad create. However, the commit message is not very helpful:

commit 8af8f77009abc45880a6adad2cee9576871998b7
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Thu Aug 27 14:18:37 2020 -0400

    [DATALAD] Recorded changes

What we can do is amend the most recent commit and modify the commit message to something more descriptive:

# modify most recent commit message
(datalad) [sample_project]$ git commit --amend

Using vim, edit the commit message to say: [DATALAD] Add code directory dataset

Now your most recent commit should look like this:

# check superdataset commit history
(datalad) [sample_project]$ git log

commit 8af8f77009abc45880a6adad2cee9576871998b7
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Thu Aug 27 14:18:37 2020 -0400

    [DATALAD] Add code directory dataset

Great! Now let’s start looking at the code dataset.

# move into your code directory
(datalad) [sample_project]$ cd code

# check out the status of directories/files and the git history
(datalad) [code]$ datalad status # see which files/directories are untracked
(datalad) [code]$ git log

You should see several commits in the git history, going all the way back to our old, pre-covid lifetime (Feb 20 2020)! This is because the code from the original sample dataset (before you copied it to your own directory) was set up with git tracking. On top of those, you can see the two commits that correspond to creating this new dataset and to the -c text2git option we included when we ran datalad create. Notice that this dataset has its own git history, independent of the superdataset!

commit 8f5e3dbee686ac07edcc2d59a31ee232844d8a71
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 17:06:22 2020 -0400

    Instruct annex to add text files to Git

commit d5d0254c6eedf7fdc09f96c4632334ed80dd2e0f
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 17:06:21 2020 -0400

    [DATALAD] new dataset

# ...many other commits...

commit 7e1fddcd5e17a72b93a570a88475fcf3ded2f30b
Author: Elizabeth McDevitt <eam7@scotty.pni.Princeton.EDU>
Date:   Thu Feb 20 11:09:12 2020 -0500

    Initial commit of handbook preprocessing code

Next we will add a .gitignore file for this code dataset, and commit the directories/files within /code.

# make sure you are in the code directory
(datalad) [code]$ vim .gitignore

# Add the following to your gitignore:
# *.DS_Store

# Commit .gitignore modification
(datalad) [code]$ datalad save -m "add gitignore" .gitignore

# Commit all code files
(datalad) [code]$ datalad save -m "add code files"

# Optional: set up and link to a remote GitHub repository if you haven't done this already
(datalad) [code]$ git remote add origin [github-repo-url]

You should see two new commits in your code dataset git history:

(datalad) [code]$ git log

commit d68b035e7de4e750b0fa0ce10f9d9d15a9d9820e
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 17:50:00 2020 -0400

    add code files

commit 9c0f6245455e0bfda94b577fd3da0e7f9435d188
Author: Lizzie McDevitt <emcdevitt8287@gmail.com>
Date:   Wed Aug 12 17:07:18 2020 -0400

    add gitignore

Now go back up a level to the /sample_project directory and check datalad status. You should see that /code is modified and /data is untracked. Next we will commit the modifications to /code only so that these changes are logged in the superdataset’s git history. IMPORTANT: Make sure you include the -u flag at the end of the save command so that ONLY modifications (and NOT untracked files) are committed (-u stands for “updated”).

# go up a level
(datalad) [code]$ cd ..
(datalad) [sample_project]$ datalad status
(datalad) [sample_project]$ datalad save -m "add files to code dataset" -u
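
To get a feel for what "modifications only" means, here is a rough analogy in plain git, where git add -u likewise stages changes to already-tracked files and leaves untracked files alone. This is a sketch in a throwaway repository with made-up file names; it illustrates the idea and is not a description of how datalad save is implemented.

```shell
# Throwaway repository; file names are made up for this illustration.
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .

echo "v1" > tracked.txt
git add tracked.txt
git -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m "add tracked file"

echo "v2" > tracked.txt        # modify a file git already tracks
echo "new" > untracked.txt     # create a file git has never seen

git add -u                     # stage updates to tracked files only
porcelain=$(git status --porcelain)
echo "$porcelain"
```

In the output, tracked.txt should appear as a staged modification (M) while untracked.txt is still listed with ??, which mirrors how /code gets saved here while /data stays untracked.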

Now your superdataset git history has a commit that tracks what modifications were made to the code (sub)dataset.

(datalad) [sample_project]$ git log --oneline

9933052 add files to code dataset
8af8f77 [DATALAD] Add code directory dataset
d49a496 add gitignore
1b2cea7 Instruct annex to add text files to Git
92f5ddc [DATALAD] new dataset

And let’s compare this to the git history for the code dataset:

(datalad) [sample_project]$ cd code
(datalad) [code]$ git log --oneline

d68b035 add code files
9c0f624 add gitignore
8f5e3db Instruct annex to add text files to Git
d5d0254 [DATALAD] new dataset
7e1fddc Initial commit of handbook preprocessing code

Notice that each dataset has its own git history!
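
You can also compare histories without changing directories by using git -C. The sketch below uses two plain, nested git repositories in a throwaway directory as stand-ins for the superdataset and the code dataset; the directory names and the commit helper are assumptions for this example. A real DataLad subdataset is additionally registered in its superdataset, which this sketch skips.

```shell
# Throwaway layout: "super" stands in for sample_project, "super/code" for code.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# commit is a hypothetical helper: make an empty commit in the given repo.
commit() {
    git -C "$1" -c user.name="Demo" -c user.email="demo@example.com" \
        commit -q --allow-empty -m "$2"
}

git init -q super
git init -q super/code
commit super "superdataset commit"
commit super/code "code commit 1"
commit super/code "code commit 2"

# Count commits in each repository without leaving the current directory.
super_count=$(git -C super rev-list --count HEAD)
code_count=$(git -C super/code rev-list --count HEAD)
echo "super: $super_count commits, code: $code_count commits"
```

On your own data, running git -C code log --oneline from sample_project shows the code dataset's history while you stay in the superdataset.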

Step 3: Set up your /data/bids dataset

Now we will add the /data/bids directory as its own (sub)dataset. Ultimately, we will track all of the raw BIDS-formatted nifti files for each subject in this dataset. However, keep in mind that the /bids directory also contains /derivatives. Since we want the various derivatives directories to become their own (sub)datasets, we will need to follow a slightly different workflow here so that we don’t accidentally commit /derivatives to our /data/bids dataset.

# make sure you are in the sample_project directory
# create a new dataset
(datalad) [sample_project]$ datalad create -c text2git --description "Princeton pygers sample dataset raw BIDS files" -f -d^ ./data/bids

# edit the unhelpful commit message
(datalad) [sample_project]$ git commit --amend

# edit commit message to say: [DATALAD] Add BIDS dataset

# move into the bids directory
(datalad) [sample_project]$ cd data/bids

# add and edit .gitignore file
(datalad) [bids]$ vim .gitignore

# add the following to your gitignore:
# *.DS_Store
# */*/anat/*T1w.nii.gz

# commit .gitignore modification only
(datalad) [bids]$ datalad save -m "add gitignore" .gitignore

# check your bids dataset git log
(datalad) [bids]$ git log --oneline

677ac37 add gitignore
471744c Instruct annex to add text files to Git
c7cddba [DATALAD] new dataset

And that is it for now! Don’t run any other datalad save commands for the bids dataset until you have finished adding derivatives (sub)datasets! You can check your datalad status and you should see that all files and directories are “untracked”.

If you want to check that you set up your .gitignore correctly, you can run a couple of checks.

(datalad) [bids]$ datalad status derivatives/*

You should see /deface, /fmriprep, /freesurfer, and /mriqc listed as untracked directories.

(datalad) [bids]$ datalad status */*/anat/*

You should see the T1w.json file listed as untracked, but NOT the T1w.nii.gz file listed because you added the T1w.nii.gz file to .gitignore.
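
Another way to test an ignore pattern is git check-ignore, which reports whether a given path matches a rule. The sketch below tries the pattern in a throwaway repository. The file names are hypothetical and assume a sub-XXX/ses-XX/anat layout; the helper is_ignored is also made up for this example.

```shell
demo_dir=$(mktemp -d)
cd "$demo_dir"
git init -q .
printf '%s\n' '*.DS_Store' '*/*/anat/*T1w.nii.gz' > .gitignore

# is_ignored is a hypothetical helper. `git check-ignore -q PATH` exits 0
# when PATH matches an ignore rule and 1 when it does not.
is_ignored() {
    if git check-ignore -q "$1"; then
        echo "ignored"
    else
        echo "not ignored"
    fi
}

# Hypothetical file names; the paths do not need to exist for this check.
nifti=$(is_ignored "sub-001/ses-01/anat/sub-001_ses-01_T1w.nii.gz")
sidecar=$(is_ignored "sub-001/ses-01/anat/sub-001_ses-01_T1w.json")
no_session=$(is_ignored "sub-001/anat/sub-001_T1w.nii.gz")
echo "nifti: $nifti | sidecar: $sidecar | no session level: $no_session"
```

Only the first path should come back as ignored. The last case is worth noting: because each * matches a single directory level, the pattern does not cover a dataset without a session directory, so adjust it if your own data is organized that way.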

Step 4: Set up your /derivatives/deface dataset

For the next four datasets, you will follow the same workflow: (1) add a new dataset, (2) edit the commit messages in the higher-level dataset git logs to be useful, (3) add and datalad save a .gitignore file, (4) datalad save the files within the dataset, and (5) datalad save in the higher-level dataset. Here we go!

# from your bids directory, create a new dataset
(datalad) [bids]$ datalad create -c text2git --description "Princeton pygers sample dataset defaced files" -f -d^ ./derivatives/deface

# edit the commit message in your bids dataset log
(datalad) [bids]$ git commit --amend
# edit commit message to say: [DATALAD] Add deface dataset

# edit the commit message in your sample_project dataset log
(datalad) [bids]$ cd ../..
(datalad) [sample_project]$ git commit --amend
# edit commit message to say: [DATALAD] Add deface dataset

# move into deface directory and create a .gitignore
(datalad) [sample_project]$ cd data/bids/derivatives/deface
(datalad) [deface]$ vim .gitignore

# add the following to your gitignore:
# *.DS_Store

# commit .gitignore only
(datalad) [deface]$ datalad save -m "add gitignore" .gitignore

# commit the contents of derivatives/deface
(datalad) [deface]$ datalad save -m "add defaced T1w files"

# go back two levels to the bids directory and check the status
(datalad) [deface]$ cd ../..
(datalad) [bids]$ datalad status

You should see that the /derivatives/deface dataset is “modified”. Everything else is “untracked”. Next you will run a datalad save command from your /bids directory using the -u flag. This will commit the modifications ONLY.

# don't forget the -u flag!!
(datalad) [bids]$ datalad save -m "files added to deface dataset" -u

You have now finished adding the deface dataset, including tracking the contents of the deface dataset. These changes have been logged in the git history of the sample_project (super)dataset, the bids (sub)dataset, and deface (sub)dataset. Go ahead and inspect those git logs.

# from the deface directory
(datalad) [deface]$ git log --oneline

3b56d62 add defaced T1w files
a788851 add gitignore
a0b3191 Instruct annex to add text files to Git
85a9219 [DATALAD] new dataset

# from the bids directory
(datalad) [bids]$ git log --oneline

a5d1b0a files added to deface dataset
f6e0260 [DATALAD] Add deface dataset
677ac37 add gitignore
471744c Instruct annex to add text files to Git
c7cddba [DATALAD] new dataset

# from the sample_project directory
(datalad) [sample_project]$ git log --oneline

0fd4b47 [DATALAD] Add deface dataset
9933052 add files to code dataset
8af8f77 [DATALAD] Add code directory dataset
d49a496 add gitignore
1b2cea7 Instruct annex to add text files to Git
92f5ddc [DATALAD] new dataset
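
The remaining derivatives follow this same sequence. As a memory aid, here is a sketch of a dry-run helper that only prints the commands for each remaining dataset and runs nothing. The function name plan_derivative is invented for this example, and the printed create command leaves out the --description option for brevity.

```shell
# plan_derivative is a hypothetical helper: it only PRINTS commands.
plan_derivative() {
    name="$1"
    echo "# --- $name ---"
    echo "# from data/bids:"
    echo "datalad create -c text2git -f -d^ ./derivatives/$name"
    echo "git commit --amend   # in data/bids, then again in sample_project"
    echo "# from derivatives/$name:"
    echo "datalad save -m \"add gitignore\" .gitignore"
    echo "datalad save -m \"add $name output files\""
    echo "# back in data/bids:"
    echo "datalad save -m \"files added to $name dataset\" -u"
}

for name in fmriprep freesurfer mriqc; do
    plan_derivative "$name"
done
```

Treat the printed list as a checklist rather than a script to pipe into a shell: the git commit --amend steps are interactive, and each .gitignore still needs its own entries.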

Step 5: Set up your /derivatives/fmriprep dataset

# from your bids directory, create a new dataset
(datalad) [bids]$ datalad create -c text2git --description "Princeton pygers sample dataset fmriprep derivatives" -f -d^ ./derivatives/fmriprep

# edit the commit message in your bids dataset log
(datalad) [bids]$ git commit --amend
# edit commit message to say: [DATALAD] Add fmriprep dataset

# edit the commit message in your sample_project dataset log
(datalad) [bids]$ cd ../..
(datalad) [sample_project]$ git commit --amend
# edit commit message to say: [DATALAD] Add fmriprep dataset

# move into fmriprep directory and create a .gitignore
(datalad) [sample_project]$ cd data/bids/derivatives/fmriprep
(datalad) [fmriprep]$ vim .gitignore

# add the following to your gitignore:
# *.DS_Store
# */*/anat/*T1w.nii.gz

# commit .gitignore only
(datalad) [fmriprep]$ datalad save -m "add gitignore" .gitignore

# check that anat files are actually ignored
(datalad) [fmriprep]$ datalad status */*/anat/*

# commit the contents of derivatives/fmriprep
(datalad) [fmriprep]$ datalad save -m "add fmriprep output files"

# go back two levels to the bids directory and commit changes at that level
# don't forget the -u flag!!
(datalad) [fmriprep]$ cd ../..
(datalad) [bids]$ datalad save -m "files added to fmriprep dataset" -u

Step 6: Set up your /derivatives/freesurfer dataset

# from your bids directory, create a new dataset
(datalad) [bids]$ datalad create -c text2git --description "Princeton pygers sample dataset freesurfer derivatives" -f -d^ ./derivatives/freesurfer

# edit the commit message in your bids dataset log
(datalad) [bids]$ git commit --amend
# edit commit message to say: [DATALAD] Add freesurfer dataset

# edit the commit message in your sample_project dataset log
(datalad) [bids]$ cd ../..
(datalad) [sample_project]$ git commit --amend
# edit commit message to say: [DATALAD] Add freesurfer dataset

# move into freesurfer directory and create a .gitignore
(datalad) [sample_project]$ cd data/bids/derivatives/freesurfer
(datalad) [freesurfer]$ vim .gitignore

# add the following to your gitignore:
# *.DS_Store

# commit .gitignore only
(datalad) [freesurfer]$ datalad save -m "add gitignore" .gitignore

# commit the contents of derivatives/freesurfer
(datalad) [freesurfer]$ datalad save -m "add freesurfer output files"

# go back two levels to the bids directory and commit changes at that level
# don't forget the -u flag!!
(datalad) [freesurfer]$ cd ../..
(datalad) [bids]$ datalad save -m "files added to freesurfer dataset" -u

Step 7: Set up your /derivatives/mriqc dataset

# from your bids directory, create a new dataset
(datalad) [bids]$ datalad create -c text2git --description "Princeton pygers sample dataset mriqc derivatives" -f -d^ ./derivatives/mriqc

# edit the commit message in your bids dataset log
(datalad) [bids]$ git commit --amend
# edit commit message to say: [DATALAD] Add mriqc dataset

# edit the commit message in your sample_project dataset log
(datalad) [bids]$ cd ../..
(datalad) [sample_project]$ git commit --amend
# edit commit message to say: [DATALAD] Add mriqc dataset

# move into mriqc directory and create a .gitignore
(datalad) [sample_project]$ cd data/bids/derivatives/mriqc
(datalad) [mriqc]$ vim .gitignore

# add the following to your gitignore:
# *.DS_Store

# commit .gitignore only
(datalad) [mriqc]$ datalad save -m "add gitignore" .gitignore

# commit the contents of derivatives/mriqc
(datalad) [mriqc]$ datalad save -m "add mriqc output files"

# go back two levels to the bids directory and commit changes at that level
# don't forget the -u flag!!
(datalad) [mriqc]$ cd ../..
(datalad) [bids]$ datalad save -m "files added to mriqc dataset" -u

Step 8: Clean up higher-level datasets

Almost there! Now you will save all the untracked files and directories in your bids dataset, which should ONLY be the files and data directories at the bids level since all derivatives were already tracked or ignored. You can check this with datalad status.

(datalad) [bids]$ datalad status derivatives/*

# save all modified and untracked files/directories
(datalad) [bids]$ datalad save -m "add BIDS data created using HeuDiConv"

Finally, go back to the highest level (i.e., the sample_project superdataset) and save modifications.

(datalad) [bids]$ cd ../..
(datalad) [sample_project]$ datalad status # /data/bids should be modified
(datalad) [sample_project]$ datalad save -m "added BIDS files to data/bids dataset"

Congrats! You are all set up to version control all the data and code from the sample project. Make sure you reference the DataLad Handbook to explore all the functionality and incredible things you can do with DataLad as you grow your dataset and analyze data!