Developer's notes on babs status
Logic flow of the babs_status() method of BABS class
Source code: babs/babs.py -> class BABS() --> def babs_status()
- create job_status.csv if it does not exist yet
- get 'alert_log_messages' configs
- get username (to be used by qacct job accounting)
- get list of branches in output RIA with git branch -a (NOTE: this is quick, even for tons of branches)
- with job_status.csv file opened:
  - get original df_job
  - make new one df_job_updated (copy of original one)
  - request qstat for all jobs: df_all_job_status
  - for each job that has been submitted but is not is_done:
    - get basic information about this job
    - get the last line of the stdout file
    - check if there are any alert messages in the log files (based on 'alert_log_messages')
    - if the job has a branch in the output RIA, the job is done, so we update df_job_updated
    - if not, the job is pending/running/failed/eqw:
      - if the job is in the queue df_all_job_status, i.e., is pending/running/eqw:
        - if r:
          - if --resubmit-job for this job & --reckless: resubmit
          - else: update df_job_updated
        - if qw:
          - resubmit if 'pending' in flags_resubmit, or if requested specifically: resubmit and update df_job_updated
        - if eqw: just update the job state code/category in df_job_updated
          - resubmission is currently not supported here, and won't be until it has been tested
      - else, i.e., not in the queue, so failed:
        - update df_job_updated
        - resubmit if 'failed' in flags_resubmit, or if requested specifically: resubmit and update df_job_updated
  - for each job that was marked as "is_done" in the previous round:
    - if --resubmit-job for this job & --reckless: resubmit
    - else:
      - get the last line of the stdout file. Purpose: when a job is first marked as 'is_done' (it has a branch in the output RIA), it may not have finished yet; it still needs to complete cleanup steps, such as datalad dropping the input data, before echoing 'SUCCESS'. Re-reading the last line makes sure that we get 'SUCCESS' in 'last_line_stdout_file' for 'is_done' jobs.
      - check if there is any alert message in the log files (based on 'alert_log_messages'). Purpose: update it for successful jobs too, in case the user updates the configs in the yaml file
  - for jobs that haven't been submitted yet:
    - if --resubmit-job is requested, check if any requested jobs have not yet been submitted; if so, throw a warning
  - save df_job_updated
  - summarize the job status and report
Summary:
- 'alert_log_messages' are checked for in all submitted jobs, regardless of whether a job was marked 'is_done' in the previous round
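The branching for a job that has been submitted but is not yet marked is_done can be summarized in a small function. This is only a sketch of the flow described above, not the code in babs/babs.py: the function name decide_action, its parameters, and the returned status labels are hypothetical, and state_code=None is an assumed way to say "not found in the qstat output".

```python
def decide_action(has_branch, state_code, flags_resubmit,
                  requested=False, reckless=False):
    """Return (status, resubmit) for one submitted job not yet marked is_done.

    has_branch:     True if the job has a branch in the output RIA
    state_code:     'r', 'qw', 'eqw', or None if the job is not in the queue
    flags_resubmit: collection that may contain 'pending' and/or 'failed'
    requested:      True if this job was named via --resubmit-job
    reckless:       True if --reckless was given
    (All names here are hypothetical, for illustration only.)
    """
    if has_branch:
        # A branch in the output RIA means the job is done.
        return "is_done", False
    if state_code == "r":
        # Running jobs are only resubmitted on explicit, reckless request.
        return "running", bool(requested and reckless)
    if state_code == "qw":
        return "pending", ("pending" in flags_resubmit) or requested
    if state_code == "eqw":
        # Resubmission of eqw jobs is not supported; only record the state.
        return "eqw", False
    # Not in the queue and no branch: treat as failed.
    return "failed", ("failed" in flags_resubmit) or requested


print(decide_action(False, "qw", ["pending"]))
print(decide_action(False, None, []))
```

Keeping the decision separate from the side effects (submitting the job, updating df_job_updated) makes each branch easy to unit test without a cluster.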
Resubmissions based on job's status
Note: currently, babs status CLI does not support --reckless.
| job status | what to do if resubmit is requested | progress of implementation in BABS | tested? |
|---|---|---|---|
| not submitted | warning: | added | edge case, not tested yet? |
| submitted, qw | resubmit | added | tested with session data |
| submitted, running | | added | edge case, not tested yet? |
| submitted, eqw | | added | edge case; not tested yet, as cannot enter eqw... |
| submitted, failed | resubmit | added | tested with session data |
| submitted, is_done | | added, one TODO | edge case, not tested yet? |
Example job_status.csv
When this CSV was just initialized:
sub_id,ses_id,has_submitted,job_id,job_state_category,job_state_code,duration,is_done,is_failed,log_filename,last_line_stdout_file,alert_message
sub-01,ses-A,False,-1,,,,False,,,,
When printed with print(df) in Python:
   sub_id ses_id  has_submitted  job_id  job_state_category  job_state_code  \
0  sub-01  ses-A          False      -1                 NaN             NaN

   duration  is_done  is_failed  log_filename  last_line_stdout_file  alert_message
0       NaN    False        NaN           NaN                    NaN            NaN
Note: the 0 at the beginning of each row is the index of the pd.DataFrame.
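To make the initial state concrete, here is a minimal sketch that writes the same header and first row using only the Python standard library. The helper names initial_row and initial_csv are hypothetical and this is not how BABS itself creates the file; the column names and default values are taken from the example CSV above. The fields left empty here are the ones pandas shows as NaN when the file is read back.

```python
import csv
import io

# Column order copied from the example job_status.csv above.
COLUMNS = [
    "sub_id", "ses_id", "has_submitted", "job_id",
    "job_state_category", "job_state_code", "duration",
    "is_done", "is_failed", "log_filename",
    "last_line_stdout_file", "alert_message",
]


def initial_row(sub_id, ses_id):
    """Build the row for a job that has not been submitted yet (hypothetical helper)."""
    row = dict.fromkeys(COLUMNS, "")  # empty fields are read as NaN by pandas
    row.update(sub_id=sub_id, ses_id=ses_id,
               has_submitted=False, job_id=-1, is_done=False)
    return row


def initial_csv(jobs):
    """Return the CSV text for an iterable of (sub_id, ses_id) pairs."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for sub_id, ses_id in jobs:
        writer.writerow(initial_row(sub_id, ses_id))
    return buffer.getvalue()


csv_text = initial_csv([("sub-01", "ses-A")])
print(csv_text, end="")
```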
How to test out babs status
Create pending or failed jobs
Change/Add these in participant_job.sh:
- failed: see next section
- pending: increase the cluster resources you request, e.g., memory, number of CPUs, temporary disk space, etc.
  - on Slurm clusters: increase #SBATCH --mem, #SBATCH --tmp, etc.
After these changes, run datalad save -m "message" and datalad push --to input
Create failed cases for testing babs status failed job auditing
- Add sleep 3600 to container_zip.sh; make sure you datalad save the changes
- Change the hard runtime limit to 20 min (on SGE: -l h_rt=0:20:00)
- Create failed cases:
  - when the job is pending, manually kill it
    - For a Slurm cluster: you'll see a normal message from the State column of the sacct output
    - For an SGE cluster: you'll see a warning that qacct failed for this job; this is normal. See PR #98 for more details.
  - when the job is running, manually kill it
  - wait until the job runs out of time and is killed by the cluster
    - if you don't want to wait that long, set the hard runtime limit to a very low value, e.g., 20 sec
- Perform job auditing using --container-config:
  - add some messages to alert_log_messages that are known to appear in the "failed" jobs, for testing purposes
    - these can also be normal messages seen in successful jobs
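The auditing step can be pictured as a substring search over the job's log files. The sketch below is an illustration only, not the BABS implementation: the function name find_alert_message is hypothetical, and it assumes that alert_log_messages maps a stream name ('stdout' or 'stderr') to a list of messages, which may differ from the layout your yaml file actually uses.

```python
def find_alert_message(alert_log_messages, log_texts):
    """Return the first configured alert found as 'stream: message', else None.

    alert_log_messages: dict of stream name -> list of messages (assumed layout)
    log_texts:          dict of stream name -> full text of that log file
    """
    for stream, messages in alert_log_messages.items():
        text = log_texts.get(stream, "")
        for message in messages:
            if message in text:
                return f"{stream}: {message}"
    return None


# Made-up messages and log contents, for testing only.
config = {"stdout": ["Killed"], "stderr": ["Traceback"]}
logs = {"stdout": "step 1 ok\nstep 2 ok\n",
        "stderr": "Traceback (most recent call last):\n"}
print(find_alert_message(config, logs))
```

Because the check is a plain substring match, a message that also appears in successful jobs will be reported for them as well, which is why the choice of test messages above matters.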
Terminology
- <jobname>.o<jobid>: standard output stream of the job
- <jobname>.e<jobid>: standard error stream of the job
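As a small illustration of this naming scheme, the sketch below builds both log paths for a job and reads the last non-empty line of the stdout file, which is what 'last_line_stdout_file' records. The helper names log_paths and last_line, the job name, and the job ID are all hypothetical; BABS may locate and read its logs differently.

```python
import tempfile
from pathlib import Path


def log_paths(log_dir, jobname, jobid):
    """Return (stdout_path, stderr_path) following <jobname>.o<jobid> / .e<jobid>."""
    log_dir = Path(log_dir)
    return log_dir / f"{jobname}.o{jobid}", log_dir / f"{jobname}.e{jobid}"


def last_line(path):
    """Return the last non-empty line of a text file, or None if missing or empty."""
    path = Path(path)
    if not path.exists():
        return None
    lines = [line for line in path.read_text().splitlines() if line.strip()]
    return lines[-1] if lines else None


# Demo with a temporary directory and a made-up job name and ID.
with tempfile.TemporaryDirectory() as tmp:
    stdout_path, stderr_path = log_paths(tmp, "toy_sub-01_ses-A", 12345)
    stdout_path.write_text("cleaning up\nSUCCESS\n")
    demo_name = stdout_path.name
    demo_last = last_line(stdout_path)
    demo_missing = last_line(stderr_path)

print(demo_name, demo_last, demo_missing)
```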