Step III: Job submission and job status
List of included subjects (and sessions) to process
babs-init
follows the steps below to determine the final list of included subjects (and sessions) to process:
1. Determine an initial list:
   - If --list-sub-file is provided by the user in babs-init, use it as the initial list.
   - If not, BABS lists the subjects and sessions from the first input dataset and uses that as the initial list.
2. Filter the initial list: remove subjects (and sessions) that do not have the required files defined in section required_files of the --container-config-yaml-file provided when running babs-init.
BABS now has the final list of included subjects (and sessions) to process.
It saves this list into a CSV file named sub_final_inclu.csv (for a single-session dataset) or sub_ses_final_inclu.csv (for a multiple-session dataset), which is located at /path/to/my_BABS_project/analysis/code.
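The two steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not BABS's actual implementation: the function name `final_inclusion_list`, the `files_by_job` mapping, the example file names, and the use of `fnmatch` patterns as a stand-in for required_files are all assumptions made for this example.

```python
from fnmatch import fnmatch


def final_inclusion_list(initial_list, files_by_job, required_patterns):
    """Keep only the jobs whose files satisfy every required pattern.

    initial_list:      subject (or subject/session) labels, in order
    files_by_job:      hypothetical mapping from label to its file paths
    required_patterns: glob-style patterns standing in for required_files
    """
    included = []
    for job in initial_list:
        files = files_by_job.get(job, [])
        # A job is kept only if each pattern matches at least one of its files.
        if all(any(fnmatch(f, pat) for f in files) for pat in required_patterns):
            included.append(job)
    return included


# Step 1: the initial list (from --list-sub-file, or from the first input dataset).
initial = ["sub-01", "sub-02", "sub-03"]

# Step 2: filter by required files (made-up file listing for illustration).
files = {
    "sub-01": ["anat/sub-01_T1w.nii.gz", "func/sub-01_task-rest_bold.nii.gz"],
    "sub-02": ["anat/sub-02_T1w.nii.gz"],  # no functional data, so it is dropped
    "sub-03": ["anat/sub-03_T1w.nii.gz", "func/sub-03_task-rest_bold.nii.gz"],
}
required = ["anat/*_T1w.nii*", "func/*_bold.nii*"]

final = final_inclusion_list(initial, files, required)
print(final)
```

In this sketch sub-02 is removed because it has no file matching the second pattern; the order of the remaining subjects follows the initial list.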
Recommended workflow of job submission and status checking
Processing large-scale datasets and handling hundreds or even thousands of jobs can be tough, and BABS is designed to help with this process.
We recommend using babs-submit and babs-status in the following way.
In short, it is an iteration between babs-submit and babs-status:
1. Check how many jobs need to run: run babs-status --project_root /path/to/my_BABS_project. This returns a summary that includes the number of jobs that are expected to complete. See "List of included subjects (and sessions) to process" for how BABS determines this list of jobs.
2. You may submit several exemplar jobs with babs-submit, then check job status with babs-status to see if they finish successfully.
3. If there are failed jobs, you can use babs-status to perform failed job auditing.
4. You may then iteratively call babs-submit and babs-status until all jobs finish. See below for tips on each command.
Tips of babs-submit
You have several choices when running babs-submit
:
- Submit one or several specific jobs with --job;
- Submit N jobs (from the top of the list of jobs that haven't been submitted yet) with --count N;
- If your cluster allows it, and you're confident about running the BIDS App on all remaining subjects (and sessions), you may submit all remaining jobs with --all. After that, the only thing you need to do is run babs-status once in a while until all jobs finish.
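If you drive BABS from a script, the choices above can be assembled as argument lists. The sketch below is a hypothetical helper, not part of BABS: the function names are made up, and it assumes that babs-submit accepts --project-root in the same way babs-status does. It only builds and prints the commands; it does not run them.

```python
import shlex


def babs_submit_cmd(project_root, count=None, submit_all=False):
    """Build a babs-submit command line as a list of arguments."""
    if submit_all and count is not None:
        raise ValueError("choose either count or submit_all, not both")
    # Assumption: babs-submit takes --project-root like babs-status does.
    cmd = ["babs-submit", "--project-root", project_root]
    if submit_all:
        cmd.append("--all")
    elif count is not None:
        cmd += ["--count", str(count)]
    return cmd


def babs_status_cmd(project_root, config_yaml=None):
    """Build a babs-status command line, optionally with the YAML file."""
    cmd = ["babs-status", "--project-root", project_root]
    if config_yaml is not None:
        cmd += ["--container-config-yaml-file", config_yaml]
    return cmd


project = "/path/to/my_BABS_project"
submit_ten = babs_submit_cmd(project, count=10)
submit_rest = babs_submit_cmd(project, submit_all=True)
check = babs_status_cmd(project, "/path/to/my_yaml_file.yaml")

for cmd in (submit_ten, check, submit_rest):
    print(shlex.join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`. Keeping the commands as lists avoids shell-quoting problems with paths that contain spaces.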
Tips of babs-status
Recommended way to check job status (including failed job auditing)
To check job status and perform failed job auditing,
you can use two options of babs-status
here:
To save time, you may run:
babs-status \
    --project-root /path/to/my_BABS_project \
    --container-config-yaml-file /path/to/my_yaml_file.yaml
i.e., using alert_log_messages in the YAML file for failed job auditing. See the section below for how to set up this section alert_log_messages. With the YAML file provided, this may take ~1.5 min for ~2500 jobs.
If time allows, and there are failed jobs without alert messages, you may add --job-account:
babs-status \
    --project-root /path/to/my_BABS_project \
    --container-config-yaml-file /path/to/my_yaml_file.yaml \
    --job-account
This also runs job account for those jobs, so it may take longer (e.g., ~0.5h for ~250 failed jobs without alert messages; this also depends on the speed of the cluster).
Set up section alert_log_messages
for failed job auditing
If there are failed jobs, you may be wondering why they failed.
A direct way to investigate is to check their log files, but it will take a lot of time to go through
all failed jobs' log files. babs-status
supports failed job auditing and summary
by searching for pre-defined alert messages in the failed jobs' log files.
These alert messages are defined by you in the
section alert_log_messages in the container's configuration YAML file.
In this section, please define some alert messages that might be found in the failed jobs' log files. An example alert message could be "Excessive topologic defect encountered". This is helpful for debugging. You may also refer to the example YAML files we provide in folder "notebooks/".
Do not worry if you do not cover all alert messages on the first try; you can add to or change this section alert_log_messages in the YAML file at any time, and simply call:
babs-status \
    --project-root /path/to/my_BABS_project \
    --container-config-yaml-file /path/to/updated_yaml_file.yaml
to ask BABS to search for the updated list of alert messages.
For more details about this section, please refer to Section alert_log_messages.
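To make the idea of failed job auditing concrete, here is a minimal sketch of searching log text for pre-defined alert messages. It is not BABS's own code. The function name `find_alert_message`, the layout of `alert_log_messages` as a dictionary with "stdout" and "stderr" lists, and the sample log text are assumptions for illustration; the returned strings follow the format that babs-status prints in its summary.

```python
def find_alert_message(alert_log_messages, logs):
    """Return the first alert message found in a job's log text.

    alert_log_messages: assumed layout, e.g. {"stdout": [...], "stderr": [...]}
    logs:               log text per stream, e.g. {"stdout": "...", "stderr": "..."}
    """
    for stream, messages in alert_log_messages.items():
        text = logs.get(stream, "")
        for message in messages:
            # Plain substring search; the first hit wins.
            if message in text:
                return f"{stream} file: {message}"
    return "BABS: No alert message found in log files."


alerts = {
    "stdout": ["Excessive topologic defect encountered", "fMRIPrep failed"],
    "stderr": [],
}

# Made-up log contents for two failed jobs.
job_a = {"stdout": "step 12 ...\nfMRIPrep failed\n", "stderr": ""}
job_b = {"stdout": "step 3 ...\n", "stderr": "Killed\n"}

result_a = find_alert_message(alerts, job_a)
result_b = find_alert_message(alerts, job_b)
print(result_a)
print(result_b)
```

A job such as `job_b`, which matches none of the messages, is the kind of job for which --job-account can provide additional information.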
Job resubmission
You can also resubmit jobs that have failed or are pending.
See --resubmit
and --resubmit-job
in babs-status: Check job status for more.
Warning
Do NOT kill babs-submit
or babs-status
(especially with --resubmit*
)
when it's running! Otherwise, new job IDs may not be captured or saved into the job_status.csv
!
Example job status summary from babs-status
1  $ babs-status \
2      --project_root /path/to/my_BABS_project \
3      --container_config_yaml_file /path/to/config.yaml \
4      --job-account
5
6  Did not request resubmit based on job states (no `--resubmit`).
7  `--job-account` was requested; `babs-status` may take longer time...
8
9  Job status:
10  There are in total of 2565 jobs to complete.
11  2565 job(s) have been submitted; 0 job(s) haven't been submitted.
12  Among submitted jobs,
13  697 job(s) are successfully finished;
14  1543 job(s) are pending;
15  260 job(s) are running;
16  65 job(s) are failed.
17
18  Among all failed job(s):
19  1 job(s) have alert message: 'stdout file: Numerical result out of range';
20  56 job(s) have alert message: 'BABS: No alert message found in log files.';
21  1 job(s) have alert message: 'stdout file: fMRIPrep failed';
22  7 job(s) have alert message: 'stdout file: Excessive topologic defect encountered';
23
24  Among job(s) that are failed and don't have alert message in log files:
25  56 job(s) have job account of: 'qacct: failed: 37 : qmaster enforced h_rt, h_cpu, or h_vmem limit';
26
27  All log files are located in folder: /path/to/my_BABS_project/analysis/logs
As you can see, in the summary Job status
, there are multiple sections:
- Line #9-16: Overall summary of the number of jobs to complete, as well as its breakdown: number of jobs submitted/finished/pending/running/failed;
- Line #18-22: Summary of failed jobs. Based on section alert_log_messages in the provided --container-config-yaml-file, BABS tries to find user-defined alert messages in the failed jobs' log files;
- Line #24-25: If there are jobs that failed but don't have a defined alert message, and --job-account is requested, BABS then runs job account and tries to extract and summarize more information. For each of these jobs, BABS runs the job account command and extracts messages from it. In the above case, line #25 tells us that these jobs were killed by the cluster because they exceeded resource limits.
  - For SGE clusters: BABS uses the command qacct for job account, and pulls out the code and message from the failed section in qacct.
  - For Slurm clusters: BABS uses the command sacct for job account, and pulls out the message from the State column.
Finally, you can find the log files (stdout
, stderr
) in the path provided
in the last line of the printed message (line #27).
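For the SGE job account step described above, the extraction can be pictured as pulling one field out of the accounting text. The sketch below is an assumption about how this could be done, not BABS's implementation. The function name is hypothetical, and the sample text is a made-up, heavily abbreviated stand-in for real qacct output, which contains many more fields. Only the `failed` line is taken from the summary shown above.

```python
import re


def failed_field_from_qacct(qacct_output):
    """Return the text of the `failed` field in qacct-style output, or None."""
    for line in qacct_output.splitlines():
        # Field name, then whitespace, then the value (code and message).
        match = re.match(r"^failed\s+(.*\S)\s*$", line)
        if match:
            return match.group(1)
    return None


# Made-up, abbreviated sample; real qacct output has many more fields.
sample = """\
jobname      fmr_sub-xx
jobnumber    11111
failed       37 : qmaster enforced h_rt, h_cpu, or h_vmem limit
"""

field = failed_field_from_qacct(sample)
job_account = "qacct: failed: " + field
print(job_account)
```

The resulting string has the same shape as the job account message on line #25 of the example summary.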
Explanation on job_status.csv
As described above, babs-status provides a summary of all the jobs.
This summary is based on job_status.csv (located at: /path/to/my_BABS_project/analysis/code).
If you want to dig out more information, you may take a look at this CSV file.
Note
This job_status.csv
file won't exist until babs-submit or babs-status has been run for the first time.
Warning
Do NOT make changes to job_status.csv
by yourself!
Changes that are not made by babs-submit or babs-status may cause conflicts or confuse BABS about the job status.
Loading job_status.csv
To take a look at job_status.csv
, you may load it into Python.
Below is an example Python script for reading job_status.csv:
import numpy as np
import pandas as pd

fn_csv = "/path/to/my_BABS_project/analysis/code/job_status.csv"  # change this path
df = pd.read_csv(fn_csv,
                 dtype={"job_id": "int",
                        "has_submitted": "bool",
                        "is_done": "bool"})

# Print the first 5 rows without truncating rows or columns:
with pd.option_context("display.max_rows", None,
                       "display.max_columns", None,
                       "display.width", 120):  # default is 80 characters
    print(df.head())
You can also slice df
and extract only failed jobs, only jobs whose alert_message
matches with a specific string, etc.
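As a minimal sketch of such slicing, the example below builds a tiny stand-in for job_status.csv in memory so that it can run on its own. The rows are made up and only a subset of columns is included. It also assumes that boolean values are stored in the CSV as the text True and False.

```python
import io

import pandas as pd

# A tiny stand-in for job_status.csv (made-up rows, subset of columns).
csv_text = """sub_id,has_submitted,job_id,is_done,is_failed,alert_message
sub-01,True,101,True,False,
sub-02,True,102,False,True,stdout file: fMRIPrep failed
sub-03,True,103,False,True,BABS: No alert message found in log files.
sub-04,False,-1,False,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# is_failed can hold True, False, or NaN, so compare as text to be safe.
failed = df[df["is_failed"].astype(str) == "True"]

# Failed jobs whose alert_message contains a given string.
# na=False treats a missing alert_message as "no match".
fmriprep_failed = failed[
    failed["alert_message"].str.contains("fMRIPrep failed", regex=False, na=False)
]

# Number of failed jobs per alert message.
counts = failed["alert_message"].value_counts()

print(failed["sub_id"].tolist())
print(fmriprep_failed["sub_id"].tolist())
print(counts)
```

With a real project, replace the `io.StringIO(csv_text)` argument with the path to your own job_status.csv.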
Detailed description of job_status.csv
Each row in the job_status.csv
is for a job, i.e., of a subject (single-session dataset),
or of a session of a subject (multiple-session dataset).
Below is a description of each column.
Note: np.nan
means numpy's NaN if loading the CSV file into Python.
- sub_id (and ses_id in multiple-session dataset): string, the subject ID (and session ID) for a job.
- has_submitted: bool (True or False), whether a job has been submitted.
- job_id: integer (usually positive), ID of a job. Before a job is submitted, job_id = -1.
- job_state_category: string or np.nan, the category of a job's state, e.g., "pending", "running", etc. on SGE clusters. Before a job is submitted, job_state_category = np.nan.
- job_state_code: string or np.nan, the code of a job's state, e.g., "qw", "r", etc. on SGE clusters. Before a job is submitted, job_state_code = np.nan.
- duration: string or np.nan, the runtime of a running job since it started running, e.g., 0:00:14.733701 (i.e., 14.733701 sec). If a job is not running (not submitted, pending, finished, etc.), duration = np.nan.
- is_done: bool (True or False), whether a job has successfully finished, i.e., there is a result branch of this job in the output RIA.
- is_failed: bool (True or False) or np.nan, whether a job has failed. If a job has been submitted and has left the job queue, but there is no result branch in the output RIA, this job has failed. Before a job is submitted, is_failed = np.nan.
- log_filename: string or np.nan, the filename of the log file in the format <jobname>.*<jobid>, e.g., fmr_sub-xx.*11111. Replace .* with .o or .e to get the corresponding log filename. The path to the log files is indicated in the last line of the message printed by babs-status. Before a job is submitted, log_filename = np.nan.
  - The log files can be printed in the terminal via cat (printing the entire file), head (printing the first several lines), tail (printing the last several lines), etc.
  - Also note that if a job hasn't started running, although its log_filename is a valid string, the log files won't exist until the job starts running.
- last_line_stdout_file: string or np.nan, the last line of the current stdout file. Before a job is submitted, last_line_stdout_file = np.nan.
- alert_message: string or np.nan, a message from BABS on whether it found any alert messages (defined in alert_log_messages in the YAML file) in the log files.
  - Example alert_message: 'stdout file: fMRIPrep failed' (alert messages found); 'BABS: No alert message found in log files.' (alert messages not found).
  - This column is updated for all submitted jobs every time babs-status is called. It is updated based on the current --container-config-yaml-file (if provided). If --container-config-yaml-file is not provided, column alert_message is reset to np.nan.
  - If a job hasn't been submitted, or --container-config-yaml-file was not specified in babs-status, alert_message = np.nan.
- job_account: string or np.nan, information extracted by running job account. This is designed for failed jobs that don't have an alert message in the log files. A more detailed explanation of how BABS gets this information, and what it contains, can be found in "Example job status summary from babs-status". Other details about this column:
  - This column is only updated when --job-account is requested in babs-status but --resubmit failed is not requested.
  - For other jobs (not failed, or failed jobs for which alert messages were found), job_account = np.nan.
  - If babs-status is called again, but without --job-account, the previous round's job_account column is kept, unless the job was resubmitted. This is because the job ID did not change, so job account information should not change for a finished job.
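Two of the column conventions above, log_filename and duration, can be turned into more convenient values with small helpers. The sketch below is illustrative only. The function names are hypothetical, and `parse_duration` assumes the simple H:MM:SS.ffffff form shown in the example; it does not handle other forms such as durations that include a day count.

```python
from datetime import timedelta
from pathlib import Path


def log_paths(log_dir, log_filename):
    """Expand '<jobname>.*<jobid>' into the stdout (.o) and stderr (.e) paths."""
    # Split on the last '.*' so the job ID is whatever follows it.
    jobname, _, job_id = log_filename.rpartition(".*")
    stdout_path = Path(log_dir) / f"{jobname}.o{job_id}"
    stderr_path = Path(log_dir) / f"{jobname}.e{job_id}"
    return stdout_path, stderr_path


def parse_duration(duration):
    """Convert a duration string such as '0:00:14.733701' into a timedelta."""
    hours, minutes, seconds = duration.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


out_log, err_log = log_paths("/path/to/my_BABS_project/analysis/logs", "fmr_sub-xx.*11111")
runtime = parse_duration("0:00:14.733701")

print(out_log.name, err_log.name)
print(runtime.total_seconds())
```

The resulting paths can be passed to cat, head, or tail, or opened directly in Python once the job has started running and the files exist.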
FAQ for job submission and status checking
Q: In printed messages from babs-status
, what if the number of submitted jobs
does not match with the total number of jobs summarized under "Among submitted jobs"?
A: This should happen infrequently. Those "missing" jobs may be in some uncommon or brief states that BABS does not recognize. Please wait a moment, and rerun babs-status.
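If you would like to see which jobs fall outside the summarized categories, you could tally job_status.csv yourself. The sketch below uses only the standard library and made-up rows. It assumes that booleans are written to the CSV as the text True and False, and its category rules are a simplified reading of the column descriptions in this document, not BABS's own logic.

```python
import csv
import io
from collections import Counter


def tally_submitted(rows):
    """Count submitted jobs per category, plus any that fit no category."""
    tally = Counter()
    for row in rows:
        if row["has_submitted"] != "True":
            continue
        tally["submitted"] += 1
        if row["is_done"] == "True":
            tally["finished"] += 1
        elif row["is_failed"] == "True":
            tally["failed"] += 1
        elif row["job_state_category"] in ("pending", "running"):
            tally[row["job_state_category"]] += 1
        else:
            tally["unrecognized"] += 1
    return tally


# Made-up rows standing in for job_status.csv (subset of columns).
csv_text = """sub_id,has_submitted,job_state_category,is_done,is_failed
sub-01,True,,True,False
sub-02,True,pending,False,False
sub-03,True,running,False,False
sub-04,True,,False,True
sub-05,True,,False,False
sub-06,False,,False,
"""
tally = tally_submitted(csv.DictReader(io.StringIO(csv_text)))

categorized = sum(tally[k] for k in ("finished", "pending", "running", "failed"))
print(dict(tally))
print("uncategorized:", tally["submitted"] - categorized)
```

Here sub-05 is submitted but is neither done, failed, pending, nor running, so it is the one job that would be "missing" from the breakdown.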
Q: In job_status.csv, why is column alert_message updated every time babs-status is called, whereas column job_account is only updated when --job-account is requested?
A:
- alert_message is obtained from log files, which change as the jobs progress; also, alert_log_messages in the YAML file can be changed in each babs-status call. On the other hand, only failed jobs have job_account with actual contents, and job account won't change after a job has finished (even if it failed).
- Updating alert_message is quick, whereas running job account (e.g., calling qacct on SGE clusters) is slow.
Q: A job is done (i.e., is_done = True
in job_status.csv
),
but column last_line_stdout_file
is not SUCCESS
?
A: This should be an edge case. Simply run babs-status
again,
and it might be updated with 'SUCCESS'.