Prepare a configuration YAML file for the BIDS App
A BIDS App usually takes several arguments, and different Apps may require different amounts of cluster resources. So that BABS can run each App in a tailored way, you need to prepare a YAML file that defines a few configurations for running the BIDS App container.
YAML is a serialization language that is often used to define configurations. A YAML file for running BABS includes a few "sections". These sections not only define exactly how the BIDS App will be run, but also help filter out unnecessary subjects (and sessions) and make debugging more informative.
Overview of the configuration YAML file structure
Sections in the configuration YAML file
input_datasets: the input datasets to be used in this BABS project
cluster_resources: how much cluster resources are needed to run this BIDS App?
script_preamble: the preamble in the script to run a participant's job;
job_compute_space: where to run the jobs?
singularity_args: the arguments for singularity run;
bids_app_args: the arguments for the BIDS App;
imported_files: the files to be copied into the datalad dataset;
all_results_in_one_zip: whether to zip all results in one zip file;
zip_foldernames: the results foldername(s) to be zipped;
required_files: to only keep subjects (sessions) that have this list of required files in input dataset(s);
alert_log_messages: alert messages in the log files that may be helpful for debugging errors in failed jobs;
Among these sections, the following are optional:
bids_app_args
You may exclude this section from the YAML file only if you are sure that you don't need any arguments besides those handled by BABS.
You must include this section if there is more than one input dataset.
required_files
alert_log_messages
imported_files
Example/prepopulated configuration YAML files
Example/prepopulated configuration YAML files can be found in the notebooks/ folder of the BABS GitHub repository.
See here for a full list and descriptions.
These include example YAML files for:
Different BIDS Apps: fMRIPrep, QSIPrep, XCP-D, as well as a toy BIDS App, etc.
Cases with different input BIDS datasets, including one raw BIDS dataset, one zipped BIDS derivatives dataset, and the combination of these two.
These YAML files can be customized for your cluster.
Terminology when describing a YAML file:
Below is an example "section" in a YAML file:
section_name:
    key: value
In a section, the string before : is called the key, and the string after : is called the value.
Below are the details for each section in this configuration YAML file.
Section input_datasets
This section is required.
It defines the input datasets to be used in this BABS project.
Note that the origin_url is the path to the input dataset on your local machine.
The --datasets argument is no longer allowed in babs init and is replaced by this section.
Example section input_datasets
input_datasets:
    BIDS:
        required_files:
            - "dwi/*_dwi.nii*"
            - "anat/*_T1w.nii*"
        is_zipped: false
        origin_url: "/path/to/BIDS"
        path_in_babs: inputs/data/BIDS
    FreeSurfer:
        required_files:
            - "*freesurfer*.zip"
        is_zipped: true
        origin_url: "/path/to/FreeSurfer"
        unzipped_path_containing_subject_dirs: "freesurfer"
        path_in_babs: inputs/data/freesurfer
This example shows two input datasets: one is a raw BIDS dataset, and the other contains zipped FreeSurfer results from another BABS project. Previously, the command line for a setup like this would have been:
babs init --datasets BIDS=/path/to/BIDS --datasets freesurfer=/path/to/FreeSurfer
In the YAML file, each dataset name (here BIDS and FreeSurfer) is a key,
and the path to the dataset goes in origin_url.
required_files is not implemented yet, but will be soon.
It is defined per input dataset.
Section singularity_args
Singularity/Apptainer are configured differently for different clusters.
The arguments here are specified as a list and are added directly to the singularity run command.
Example section singularity_args
For maximum isolation, you can use --containall and --writable-tmpfs:
singularity_args:
    - --containall
    - --writable-tmpfs
However, this does not work on every cluster.
Add or remove arguments as needed.
If you need to use a GPU, this is where to add the --nv flag.
Section imported_files
This section is optional. If you need to copy files into your datalad dataset, you can specify them here.
These will be copied into the datalad dataset from your local machine. This is particularly useful for
specifying a custom recon_spec.yaml file for qsirecon.
Example section imported_files
imported_files:
    - original_path: "/path/to/recon_spec.yaml"
      analysis_path: "code/recon_spec.yaml"
The analysis_path is the path to the file within your datalad dataset.
In this example, it guarantees that when running qsirecon,
the recon_spec.yaml file will be available at "${PWD}/code/recon_spec.yaml".
This means you can use "${PWD}"/code/recon_spec.yaml in the bids_app_args section.
It also means that the recon_spec.yaml file will be tracked by datalad.
Important: this mechanism will not work if you are importing a large file.
Section bids_app_args
Currently, BABS does not support using configurations of running a BIDS App
that are defined in datalad containers-add --call-fmt.
Instead, users are expected to define these in this section, bids_app_args.
Example bids_app_args
Below is an example section bids_app_args for fMRIPrep:
bids_app_args:
    -w: "$BABS_TMPDIR"   # this is a placeholder for temporary workspace
    --n_cpus: '1'
    --stop-on-first-crash: ""   # argument without value
    --fs-license-file: "/path/to/freesurfer/license.txt"
    --skip-bids-validation: Null   # Null or NULL is also a placeholder for argument without value
    --output-spaces: MNI152NLin6Asym:res-2
    --force-bbr: ""
    --cifti-output: 91k
    -v: '-v'   # this is for double `-v`
This section will be turned into commands (here also showing the Singularity run command) as below:
mkdir -p ${PWD}/.git/tmp/wkdir
singularity run --cleanenv \
    -B ${PWD} \
    -B /test/templateflow_home:/SGLR/TEMPLATEFLOW_HOME \
    -B /path/to/freesurfer/license.txt:/SGLR/FREESURFER_HOME/license.txt \
    --env TEMPLATEFLOW_HOME=/SGLR/TEMPLATEFLOW_HOME \
    containers/.datalad/environments/fmriprep-20-2-3/image \
    inputs/data/BIDS \
    outputs \
    participant \
    -w ${PWD}/.git/tmp/wkdir \
    --n_cpus 1 \
    --stop-on-first-crash \
    --fs-license-file /SGLR/FREESURFER_HOME/license.txt \
    --skip-bids-validation \
    --output-spaces MNI152NLin6Asym:res-2 \
    --force-bbr \
    --cifti-output 91k \
    -v -v \
    --bids-filter-file "${filterfile}" \
    --participant-label "${subid}"
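The translation from key-value pairs to command-line tokens can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not BABS's actual implementation: the helper name render_bids_app_args is hypothetical, and the sketch ignores the special handling BABS applies to values such as "$BABS_TMPDIR" and --fs-license-file.

```python
import shlex

# Values that mean "flag without a value" in this sketch:
# an empty string, a YAML null (None once parsed), or the literal strings Null/NULL.
NO_VALUE = ("", None, "Null", "NULL")


def render_bids_app_args(args):
    """Turn a mapping of BIDS App arguments into a list of command-line tokens.

    Hypothetical helper for illustration only; not part of BABS.
    """
    tokens = []
    for key, value in args.items():
        tokens.append(key)
        if value in NO_VALUE:
            continue  # flag only, e.g. --force-bbr
        # A value may carry several tokens, e.g. '-v' to get a doubled -v.
        tokens.extend(shlex.split(str(value)))
    return tokens


example = {
    "--n_cpus": "1",
    "--stop-on-first-crash": "",
    "--skip-bids-validation": None,
    "--output-spaces": "MNI152NLin6Asym:res-2",
    "-v": "-v",
}
print(" ".join(render_bids_app_args(example)))
```

Because the section is read as a dictionary, each key appears once, which is why a repeated flag has to be carried in the value of a single key.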
Basics - Manual of writing section bids_app_args
What arguments should I provide in this section? All arguments for running the BIDS App?
No, not all arguments. Usually you only need to provide named arguments (i.e., those with flags starting with - or --), but not positional arguments.
Warning: Exception for named arguments. Make sure you do NOT include these named arguments, as they are already handled by BABS:
--participant-label
--bids-filter-file
See below, Advanced - Manual of writing section bids_app_args --> bullet point regarding --bids-filter-file, for explanations.
See babs init: Initialize a BABS project for examples of --list_sub_file/--list-sub-file to filter subjects and sessions.
Warning: If you have more than one input for a BABS project, the first listed input will be used for the positional input argument. We removed $INPUT_PATH from the configuration YAML file.
What's the format I should follow when providing an argument?
Say you want to specify --my_argument its_value; simply write it in one of the following formats:
--my_argument: 'its_value' (value in single quotation marks)
--my_argument: "its_value" (value in double quotation marks)
--my_argument: its_value (value without quotation marks; avoid using this format for values that are numbers)
Can I mix arguments with flags that begin with double dashes (--) and those with a single dash (-)?
Yes you can!
How about arguments without values (e.g., --force-bbr in the above example of fMRIPrep)?
There are several ways to specify arguments without values; just choose one of the following formats:
my_key: "" (empty value string)
my_key: Null (Null is a placeholder recognized by BABS)
my_key: NULL (NULL is a placeholder recognized by BABS)
Then replace my_key with your key, e.g., --force-bbr. Do not forget the dashes (- or --) if needed!
Can I have repeated arguments?
Yes you can. However, you need to follow a specific format.
This is because each YAML section will be read as a dictionary by BABS, so a key before : cannot be repeated, e.g., repeating the key -v on more than one line is not allowed.
If you need to specify repeated arguments, e.g., -v -v, please specify it as -v: '-v' as in the example above;
For triple -v, please specify it as -v: '-v -v'
Can I see the singularity run command that BABS generated?
Yes you can! When you run babs init, it will print out the singularity run command for you to check.
Advanced - Manual of writing section bids_app_args
How to specify a number as a value?
If you want to make sure the number is passed into singularity run in exactly the format you wrote, it is a good idea to quote it, e.g., in QSIPrep:
--output-resolution: "2.0"
This is especially encouraged when the value contains only numbers (without letters). Quoting makes sure that when BABS generates scripts, it keeps the string format of the value and passes the value exactly as it is, without the risk of data type changes (e.g., integers changed to floating-point numbers, or vice versa).
How to specify "path where intermediate results should be stored" (e.g., -w in fMRIPrep or QSIPrep)?
You can use "$BABS_TMPDIR". It is a value placeholder recognized by BABS for a temporary directory that holds intermediate results. For example:
-w: "$BABS_TMPDIR"
By default, BABS will automatically create such a temporary directory if you use $BABS_TMPDIR.
How to provide the FreeSurfer license for the BIDS App's argument --fs-license-file?
Provide it as you normally do when running the BIDS App: just give the path to your FreeSurfer license on the cluster. For example:
--fs-license-file: "/path/to/freesurfer/license.txt"
When the argument --fs-license-file is in the bids_app_args section, BABS will bind the provided license file path to the container in the singularity run command, so that the BIDS App container can directly use that file (which is outside the container, on the "host machine").
Example singularity run generated by babs init:
singularity run ... \
    -B /path/to/freesurfer/license.txt:/SGLR/FREESURFER_HOME/license.txt \
    ... --fs-license-file /SGLR/FREESURFER_HOME/license.txt \
    ...
After binding this license file, BABS changes the value of --fs-license-file to the path within the container.
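The license handling described above amounts to two coordinated changes: a bind mount is added, and the argument value is rewritten to the in-container path. A minimal Python sketch of that idea follows; the function name bind_fs_license is hypothetical and this is not BABS's own code. The container path is the one shown in the generated command above.

```python
# In-container location of the license, as shown in the generated command above.
CONTAINER_LICENSE = "/SGLR/FREESURFER_HOME/license.txt"


def bind_fs_license(host_path):
    """Return (singularity bind tokens, rewritten BIDS App argument tokens).

    Hypothetical helper for illustration only; not part of BABS.
    """
    bind_tokens = ["-B", f"{host_path}:{CONTAINER_LICENSE}"]
    app_tokens = ["--fs-license-file", CONTAINER_LICENSE]
    return bind_tokens, app_tokens


bind_tokens, app_tokens = bind_fs_license("/path/to/freesurfer/license.txt")
print(" ".join(bind_tokens))
print(" ".join(app_tokens))
```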
Can I use a job environment variable, e.g., number of CPUs?
Yes you can! For the number of CPUs (e.g., --n_cpus in QSIPrep), if you also use number_of_cpus in the cluster_resources section (see below), then you can use an environment variable for this Singularity run argument.
For SLURM clusters, you can use the environment variable $SLURM_CPUS_PER_TASK, and you can specify it as:
--n_cpus: "$SLURM_CPUS_PER_TASK"
Not sure how many CPUs or other resources you need? You can run babs submit --count N with the first N (10-20) subjects and then use reportseff (library here) or seff_array to check the resource usage. You can then edit the resources in the <bids_app>_zip.sh and participant_job.sh in the analysis/code folder.
--bids-filter-file: When will BABS automatically add it?
When the BIDS App is fMRIPrep, QSIPrep or ASLPrep, and the input BIDS dataset(s) are multi-session data.
How does BABS determine that it's fMRIPrep, QSIPrep or ASLPrep?
Based on the container_name provided when calling babs init: if container_name contains fMRIPrep, QSIPrep or ASLPrep (not case sensitive).
When BABS adds --bids-filter-file here for Singularity run, BABS will also automatically generate a filter file (JSON format) when running each session's data, so that only data from a specific session will be included for analysis.
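The rule above (a case-insensitive match on the container name, combined with multi-session input) can be expressed as a small predicate. The Python sketch below is illustrative only; the function name adds_bids_filter_file and its parameters are hypothetical, and BABS's own check may be implemented differently.

```python
# App names that trigger the automatic --bids-filter-file, per the text above.
FILTERED_APPS = ("fmriprep", "qsiprep", "aslprep")


def adds_bids_filter_file(container_name, is_multi_session):
    """Return True if --bids-filter-file would be added under the stated rule.

    Hypothetical helper for illustration only; not part of BABS.
    """
    name = container_name.lower()  # the match is not case sensitive
    return is_multi_session and any(app in name for app in FILTERED_APPS)


print(adds_bids_filter_file("fMRIPrep-20-2-3", True))
print(adds_bids_filter_file("fmriprep-20-2-3", False))
```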
Will BABS handle the TemplateFlow environment variable?
Yes, BABS assumes all BIDS Apps use TemplateFlow, and will handle its environment variable $TEMPLATEFLOW_HOME if this environment variable exists in the terminal environment where babs init will be run.
For BIDS Apps that truly depend on TemplateFlow (e.g., fMRIPrep, QSIPrep, XCP-D), before you run babs init, please make sure you:
Find a directory for holding TemplateFlow's templates.
If no (or not all necessary) TemplateFlow templates have been downloaded to this directory, then this directory must be writable, so that when running the BIDS App, necessary templates can be downloaded to this directory;
If all necessary templates have been downloaded to this directory, then this directory should at least be readable.
Export the environment variable $TEMPLATEFLOW_HOME, setting its value to the path of the directory you prepared. This step should be done in the terminal environment where babs init will be used.
If babs init detects the environment variable $TEMPLATEFLOW_HOME, then when generating the singularity run command, babs init will:
Bind the path provided in this environment variable to the container;
Set the corresponding environment variable within the container.
For example, BABS will add these to the singularity run command of the container:
singularity run ... \
    ... \
    -B /path/to/templateflow_home:/SGLR/TEMPLATEFLOW_HOME \
    --env TEMPLATEFLOW_HOME=/SGLR/TEMPLATEFLOW_HOME \
    ...
where /path/to/templateflow_home is the value of the environment variable $TEMPLATEFLOW_HOME.
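As a rough illustration of the TemplateFlow handling described above, the Python sketch below builds the extra singularity run tokens from the environment. It is a simplified sketch, not BABS's implementation: the function name templateflow_args is hypothetical, and the in-container path is taken from the example command.

```python
import os

# In-container TemplateFlow location, as shown in the example command above.
CONTAINER_TEMPLATEFLOW = "/SGLR/TEMPLATEFLOW_HOME"


def templateflow_args(environ=None):
    """Return extra `singularity run` tokens if TEMPLATEFLOW_HOME is set.

    Hypothetical helper for illustration only; not part of BABS.
    """
    environ = os.environ if environ is None else environ
    host_dir = environ.get("TEMPLATEFLOW_HOME")
    if not host_dir:
        return []  # variable not exported: nothing is bound or set
    return [
        "-B", f"{host_dir}:{CONTAINER_TEMPLATEFLOW}",
        "--env", f"TEMPLATEFLOW_HOME={CONTAINER_TEMPLATEFLOW}",
    ]


print(templateflow_args({"TEMPLATEFLOW_HOME": "/path/to/templateflow_home"}))
```

Passing a plain dictionary instead of the real environment, as in the last line, makes it easy to try the function without exporting anything in the shell.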
How to specify multiple spaces in the argument --output-spaces (e.g., in fMRIPrep)?
Just follow the guidelines from fMRIPrep, using a space to separate different output spaces.
For example:
--output-spaces: "MNI152NLin6Asym:res-2 MNI152NLin2009cAsym"
Here, MNI152NLin6Asym:res-2 and MNI152NLin2009cAsym are two example spaces.
We recommend quoting this value if there are multiple spaces (as in this example), because the value of this argument contains a space character. Quoting makes sure that BABS will take the entire value string as a whole and pass it into singularity run.
Section zip_foldernames
This section defines the name(s) of the expected output folder(s). BABS will zip those folder(s) into separate zip file(s).
Here we provide two examples. Example #1
is for regular use cases,
where the BIDS App will generate one or several folders that wrap all derivative files.
Example use cases are fMRIPrep with legacy output layout, as well as QSIPrep and XCP-D.
If the BIDS App won't generate one or several folders that wrap all derivative files,
users should ask BABS to create a folder as an extra layer by specifying all_results_in_one_zip: true.
We explain how to do so in Example #2.
An example use case is fMRIPrep with BIDS output layout.
Example #1: for fMRIPrep legacy output layout
Here we use fMRIPrep (legacy output layout) as an example to show you
how to write this zip_foldernames section. For this case, all derivative files
are wrapped in folders generated by fMRIPrep. Similar use cases are QSIPrep
(e.g., generating a folder called qsiprep), and XCP-D (generating a folder called xcp_d).
Older versions of fMRIPrep (version < 21.0) generate the legacy output layout, which looks like below:
<output_dir>/
    fmriprep/
    freesurfer/
In this case, fMRIPrep generates two folders, fmriprep and freesurfer,
which include all derivatives. Therefore, we can directly tell BABS the expected foldernames,
without asking BABS to create them.
Example section zip_foldernames for fMRIPrep legacy output layout:
zip_foldernames:
    fmriprep: "20-2-3"
    freesurfer: "20-2-3"
Here, we list the expected folders, fmriprep and freesurfer, as keys. For other BIDS Apps, if there is only one expected output folder, simply provide only one.
In addition to the folder name(s), please also add the version of the BIDS App as the value.
The above example means that:
BABS will zip output folder fmriprep into zip file ${sub-id}_${ses-id}_fmriprep-20-2-3.zip;
BABS will zip output folder freesurfer into zip file ${sub-id}_${ses-id}_freesurfer-20-2-3.zip.
Here, ${sub-id} is the subject ID (e.g., sub-01),
and ${ses-id} is the session ID (e.g., ses-A).
In other words, each subject (or session) will have their specific zip file(s).
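The naming pattern above can be sketched in Python as shown below. This is an illustrative helper (zip_filename is a hypothetical name, not a BABS function). The multi-session form follows the pattern given in the text; the single-session form without a session part is an assumption made for this sketch.

```python
def zip_filename(foldername, version, sub_id, ses_id=None):
    """Build a zip file name following ${sub-id}_${ses-id}_<foldername>-<version>.zip.

    Hypothetical helper for illustration only; not part of BABS.
    Omitting the session part when ses_id is None is an assumption.
    """
    parts = [sub_id]
    if ses_id is not None:
        parts.append(ses_id)
    parts.append(f"{foldername}-{version}")
    return "_".join(parts) + ".zip"


# Keys and values as they would appear under zip_foldernames.
zip_foldernames = {"fmriprep": "20-2-3", "freesurfer": "20-2-3"}
for folder, version in zip_foldernames.items():
    print(zip_filename(folder, version, "sub-01", "ses-A"))
```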
Example #2: for fMRIPrep BIDS output layout: asking BABS to create additional output folder
Recent fMRIPrep (version >= 21.0) uses the BIDS output layout, which looks like below:
<output_dir>/
    logs/
    sub-<label>/
    sub-<label>.html
    dataset_description.json
    .bidsignore
As you can see, there are files like sub-<label>.html and dataset_description.json
which do not belong to any folders (except <output_dir>,
which is a standard BIDS output directory).
However, BABS expects there are
one or more folders in <output_dir> that are generated by the BIDS App,
and wrap all derivative files,
so that BABS can directly zip these "wrapper" folders.
Therefore, users need to ask BABS to create an additional folder to wrap all the derivatives.
Example section zip_foldernames for fMRIPrep BIDS output layout:
all_results_in_one_zip: true
zip_foldernames:
    fmriprep: "23-1-3"
The first line, all_results_in_one_zip: true, asks BABS to create an additional folder,
i.e., fmriprep as specified under zip_foldernames, to wrap all derivatives.
In this way, the output will look like below:
<output_dir>/fmriprep/
    logs/
    sub-<label>/
    sub-<label>.html
    dataset_description.json
    .bidsignore
Note that all derivatives will be located in the "wrapper" folder called fmriprep.
BABS will zip this folder into zip file ${sub-id}_${ses-id}_fmriprep-23-1-3.zip.
In addition, when using all_results_in_one_zip: true,
you must provide only one foldername in zip_foldernames.
Other detailed instructions
The version number should be consistent with that in the image NAME used in Step 2. Create a container DataLad dataset.
In example #1, you probably use fmriprep-20-2-3 for image NAME;
In example #2, you probably use fmriprep-23-1-3 for image NAME.
When calling babs init, the argument --container-name should use the same version too, i.e.,
--container-name fmriprep-20-2-3 in example #1;
--container-name fmriprep-23-1-3 in example #2.
Please use dashes - instead of dots . when indicating the version number, e.g., 20-2-3 instead of 20.2.3.
If there are multiple folders to zip, we recommend using a consistent version string across these folders. In example #1, the fMRIPrep BIDS App's version is 20.2.3, so we specify 20-2-3 for both folders fmriprep and freesurfer, although the version of FreeSurfer included in this fMRIPrep may not be 20.2.3.
Section cluster_resources
This section defines the cluster resources each job will use, and the interpreting shell for executing the job script.
Example section cluster_resources
Example section cluster_resources for QSIPrep:
cluster_resources:
    interpreting_shell: /bin/bash
    hard_memory_limit: 32G
    temporary_disk_space: 200G
    number_of_cpus: "6"
These will be turned into directives at the beginning of participant_job.sh.
For example, if a job requires no more than 32 GB of memory
(on SGE clusters, -l h_vmem=32G),
you may simply specify: hard_memory_limit: 32G.
Warning
Make sure you add interpreting_shell!
It is very important.
For SGE, you might need: interpreting_shell: /bin/bash;
For SLURM, you might need: interpreting_shell: /bin/bash -l.
Check what it should be like in the manual of your cluster!
Named cluster resources readily available
The table below lists all the named cluster resources requests that BABS supports.
You may not need all of them.
BABS will replace $VALUE with the value you provide.
The text in parentheses in each cell is an example.
| Section cluster_resources in YAML (example key-value) | Generated directives for SGE clusters (example outcome) | Generated directives for SLURM clusters (example outcome) |
|---|---|---|
| interpreting_shell: $VALUE (interpreting_shell: /bin/bash) | #!$VALUE (#!/bin/bash) | #!$VALUE (#!/bin/bash) |
| hard_memory_limit: $VALUE (hard_memory_limit: 25G) | #$ -l h_vmem=$VALUE (#$ -l h_vmem=25G) | #SBATCH --mem=$VALUE (#SBATCH --mem=25G) |
| soft_memory_limit: $VALUE (soft_memory_limit: 23.5G) | #$ -l s_vmem=$VALUE (#$ -l s_vmem=23.5G) | Not applicable. |
| temporary_disk_space: $VALUE (temporary_disk_space: 200G) | #$ -l tmpfree=$VALUE (#$ -l tmpfree=200G) | #SBATCH --tmp=$VALUE (#SBATCH --tmp=200G) |
| number_of_cpus: "$VALUE" (number_of_cpus: "6") | #$ -pe threaded $VALUE (#$ -pe threaded 6) | #SBATCH --cpus-per-task=$VALUE (#SBATCH --cpus-per-task=6) |
| hard_runtime_limit: "$VALUE" (hard_runtime_limit: "24:00:00") | #$ -l h_rt=$VALUE (#$ -l h_rt=24:00:00) | #SBATCH --time=$VALUE (#SBATCH --time=24:00:00) |
Note the following:
For values with numbers only (without letters), it's recommended to quote the value, e.g.,
number_of_cpus: "6". This is to make sure that when BABS generates scripts, it will keep the string format of the value and pass the value exactly as is, without the risk of data type changes (e.g., integers are changed to float numbers; and vice versa).
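To make the mapping in the table concrete, here is a minimal Python sketch that turns a cluster_resources mapping into directive lines for either scheduler. It mirrors the table only; the names DIRECTIVES and render_directives are hypothetical, and this is not how BABS itself is implemented.

```python
# Directive templates copied from the table above; {} stands for $VALUE.
DIRECTIVES = {
    "sge": {
        "hard_memory_limit": "#$ -l h_vmem={}",
        "soft_memory_limit": "#$ -l s_vmem={}",
        "temporary_disk_space": "#$ -l tmpfree={}",
        "number_of_cpus": "#$ -pe threaded {}",
        "hard_runtime_limit": "#$ -l h_rt={}",
    },
    "slurm": {
        "hard_memory_limit": "#SBATCH --mem={}",
        "temporary_disk_space": "#SBATCH --tmp={}",
        "number_of_cpus": "#SBATCH --cpus-per-task={}",
        "hard_runtime_limit": "#SBATCH --time={}",
    },
}


def render_directives(resources, scheduler):
    """Return the first lines of a job script for 'sge' or 'slurm'.

    Hypothetical helper for illustration only; not part of BABS.
    """
    lines = ["#!" + resources["interpreting_shell"]]
    templates = DIRECTIVES[scheduler]
    for key, value in resources.items():
        if key in templates:
            lines.append(templates[key].format(value))
        # Keys without an equivalent on this scheduler (e.g. soft_memory_limit
        # on SLURM) are skipped in this sketch.
    return lines


qsiprep_resources = {
    "interpreting_shell": "/bin/bash",
    "hard_memory_limit": "32G",
    "temporary_disk_space": "200G",
    "number_of_cpus": "6",
}
print("\n".join(render_directives(qsiprep_resources, "slurm")))
```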
Customized cluster resource requests
If you cannot find the one you want in the above table, you can still add it with customized_text.
Below is an example for SGE clusters:
cluster_resources:
    <here goes keys defined in above table>: <$VALUE>
    customized_text: |
        #$ -abc this_is_an_example_customized_option_to_appear_in_preamble
        #$ -zzz there_can_be_multiple_lines_of_customized_option
Note that:
Some clusters might not allow specific settings (e.g., temporary_disk_space). If you get an error that a setting is not allowed, simply remove the line that causes the issue.
Remember to add | after customized_text:. This is to make sure BABS can read in multiple lines under customized_text.
As customized text will be copied directly to the script participant_job.sh (without translation), please remember to add any necessary prefix before the option:
#$ for SGE clusters
#SBATCH for SLURM clusters
Section script_preamble
This part also goes into the preamble of the script participant_job.sh
(located at: /path/to/my_BABS_project/analysis/code). Unlike cluster_resources,
which provides options for cluster resource requests, the script_preamble section is for
bash commands that the job needs in order to run. An example would be to activate the conda environment;
however, different clusters may require different commands to do so. Therefore, BABS asks the user to
provide it.
Example section script_preamble for a specific cluster:
script_preamble: |
    source "${CONDA_PREFIX}"/bin/activate babs    # Penn Med CUBIC cluster; replace 'babs' with your conda env name
    echo "I am running BABS."   # this is an example command to show how to add another line; not necessary to include.
This will appear as below in the participant_job.sh:
# Script preambles:
source "${CONDA_PREFIX}"/bin/activate babs # Penn Med CUBIC cluster; replace 'babs' with your conda env name
echo "I am running BABS." # this is an example command to show how to add another line; not necessary to include.
Warning
The above command may not apply to your cluster; check how to activate a conda environment on your cluster and replace the above command.
You may also need to add the command module load for some modules (like FreeSurfer).
Warning
Different from other sections, please do NOT quote the commands in this section!
Notes:
Remember to add | after script_preamble:;
You can also add more necessary commands by adding new lines.
You can delete the 2nd line, echo "I am running BABS.", as that's just a demonstration of how to add another line in the preamble.
As you can see, the comments after the commands also show up in the generated script preambles. This is normal and fine.
Section job_compute_space
The jobs will be computed in ephemeral (temporary) compute space. Specifically, this space could be temporary space on a cluster node, or some scratch space. It's totally fine (and recommended!) if the data or the directory in this space is removed after the job finishes: all results will be pushed back to (saved in) the output RIA (i.e., permanent storage) where your BABS project is located.
Why do we recommend space where the data/directory is automatically removed after the job finishes?
If a job fails, and the data or the directory is not automatically removed, data will accumulate and take up space.
We recommend using space that is automatically cleaned after the job finishes, especially for large-scale datasets that involve a large number of jobs.
Example section job_compute_space:
job_compute_space: "/tmp"
Here, "/tmp" is NOT a good choice; check your cluster's documentation for
the correct path,
and use the path that is specific to your cluster:
job_compute_space: "/path/to/some_temporary_compute_space"
You can also use an environment variable recognized by your cluster.
Note
Best to quote ("") the string of the path to the space as shown in the examples above.
Notes:
What's the difference between this section and the argument "path where intermediate results should be stored" in some BIDS Apps (e.g., -w in fMRIPrep or QSIPrep)?
The space specified in this section is for job computing by BABS, and such job computing includes not only singularity run of the BIDS App, but also other necessary data version tracking steps done by BABS.
The "path where intermediate results should be stored" (e.g., -w) is directly used by BIDS Apps. It is also a sub-folder of the space specified in this section.