Why is my job not running on Argo?

The two most likely causes of a job not running are requesting too many resources and requesting resources that are not available.

Example 1

qsub -V -l nodes=5:ppn=2 -q student_short my_script

The request violates two policies. First, the user is requesting a total of ten processors: two cores (ppn=2) on each of five nodes (nodes=5). The job is headed to the student_short queue where, as explained previously, the maximum number of processors (ncpus) a job may use is eight (the following command and its output confirm this), but the request is for ten:

qmgr -c "list queue student_short" | grep resources_max.ncpus
resources_max.ncpus = 8

Second, the job is requesting more nodes (nodect) than are permitted: the user wants five nodes when four is the maximum:

qmgr -c "list queue student_short" | grep resources_max.nodect
resources_max.nodect = 4

The following message should have appeared after issuing the qsub command:

qsub: Job exceeds queue resource limits MSG=cannot satisfy queue max nodect requirement

The words "max nodect" are key. Translation: you exceeded the maximum number of nodes permitted.
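Before submitting, you can do this arithmetic yourself and compare the totals against the limits qmgr reports. A minimal shell sketch using the numbers from this example (the limit values are the ones shown above):

```shell
# Request from Example 1: -l nodes=5:ppn=2 on student_short
nodes=5; ppn=2
max_ncpus=8    # resources_max.ncpus reported by qmgr
max_nodect=4   # resources_max.nodect reported by qmgr

ncpus=$((nodes * ppn))
if [ "$ncpus" -gt "$max_ncpus" ]; then
    echo "ncpus $ncpus exceeds limit $max_ncpus"
fi
if [ "$nodes" -gt "$max_nodect" ]; then
    echo "nodect $nodes exceeds limit $max_nodect"
fi
```

Here both checks fire: ten CPUs against a limit of eight, and five nodes against a limit of four.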

Example 2

A student issues the following command:

qsub -V -l nodes=argo1-1 my_script

The user requests a particular node to run the job. Since the user is a student and did not specify a queue, the job is routed by default to the student_short queue. But argo1-1 is not assigned to the student_short queue, so the job is requesting a resource the queue does not own and that is therefore unavailable. The job does not run. Never request a node by name. Instead, ask for some number of nodes and some number of processors per node, not a particular node.

Incorrect: 

qsub -V -l nodes=argo1-1+argo1-2 my_script

Correct:   

qsub -V -l nodes=2 my_script

Incorrect: 

qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script

Correct:   

qsub -V -l nodes=2:ppn=2 my_script

Example 3

It is important to note that the limits on nodes and CPUs are cumulative across all of your submitted jobs.

qstat -u jsmith1
JobID Username Queue    Jobname    SessID NDS TSK Memory Time S Time
----- -------- -----    ---------  ------ --- --- ------ ---- - ----
1234  jsmith1  student_ my_script  12345  3   1    --  04:00  R 03:41
argo13-2/1+argo13-2/0+argo7-4/1+argo7-4/0+argo7-3/1+argo7-3
1235  jsmith1  student_ my_script    --   3   1    --  04:00  Q  --
--
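Each node/core entry in the long '+'-separated string under job 1234 corresponds to one CPU assigned to the job. A quick sketch counting the entries in the sample string above (the final entry is truncated in the output, but the count is unaffected):

```shell
# exec-host string for job 1234, copied from the qstat output above
exec_hosts="argo13-2/1+argo13-2/0+argo7-4/1+argo7-4/0+argo7-3/1+argo7-3"

# split on '+' and count the non-empty entries
cores=$(printf '%s\n' "$exec_hosts" | tr '+' '\n' | grep -c .)
echo "CPUs assigned: $cores"
```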

Why is the first job (id 1234) "running" - indicated by an R (for running) in the status (S) column as well as by the names of the nodes assigned to the job - while the second job (id 1235) is queued, awaiting execution? Both jobs were submitted by the student jsmith1 for execution on the student_short queue, the second soon after the first, and both were submitted using the following qsub command invocation:

qsub -l nodes=3:ppn=2 -q student_short my_script

The first job requests six CPUs: two CPUs (ppn=2) on three nodes (nodes=3). The maximum number of CPUs a student may use on the student_short queue is eight:

qmgr -c "list queue student_short" | grep resources_max.ncpus 
resources_max.ncpus = 8

Since jsmith1 held no other resources, the first job is assigned the six CPUs and begins execution. The second job also requests six CPUs. The user already holds six CPUs and is requesting six more, for a total of twelve, four over the limit of eight. The second job will not be assigned the requested resources and will sit, queued, awaiting the release of CPUs from the first job. However, the release of CPUs is an all-or-nothing proposition, so the first job must end before the second can begin.
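The cumulative check can be sketched the same way: CPUs already held plus CPUs requested must stay at or under the queue limit. Using the numbers from this example:

```shell
in_use=6      # CPUs held by running job 1234 (nodes=3:ppn=2)
requested=6   # CPUs requested by job 1235 (nodes=3:ppn=2)
max_ncpus=8   # resources_max.ncpus for student_short

total=$((in_use + requested))
if [ "$total" -le "$max_ncpus" ]; then
    echo "request fits: job can start"
else
    echo "request queued: $total CPUs total, limit is $max_ncpus"
fi
```

One way to run both jobs concurrently, assuming the script can work with fewer cores, is to request four CPUs per job (nodes=2:ppn=2), keeping the total at the eight-CPU limit.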

Unlike on the previous Argo system, users should not request a particular node. A node may be reassigned from one queue to another depending on system load, so there is no guarantee that a particular node remains in a particular queue. Direct a job to a queue, not to nodes:

Incorrect: 

qsub -V -l nodes=argo1-1 my_script

Correct:   

qsub -V -q student_long my_script

Incorrect: 

qsub -V -l nodes=argo1-1+argo1-2 my_script

Correct:   

qsub -V -l nodes=2 -q student_long my_script

Incorrect: 

qsub -V -l nodes=argo1-1:ppn=2+argo1-2:ppn=2 my_script

Correct:   

qsub -V -l nodes=2:ppn=2 -q student_long my_script

tracejob and checkjob

The two most useful commands for diagnosing why a job is not running are tracejob and checkjob. The operand for both commands is a job id. The output of checkjob can be cryptic, but the reason the job is not running is in there. For example, suppose a student issues the following command:

qsub -V -l nodes=4:ppn=3 -q student_short my_script

The job is assigned job id 1277 but remains queued, as indicated by the capital Q in the S (status) column of the output of the qstat 1277 command:

Job id  Name       User    Time Use S Queue
------  ---------- ----    ---- --- - ------- 
1277    my_script  jsmith     0     Q student_short

If the student issues the command:

checkjob 1277

the output will include the following (abbreviated for brevity):

Holds: Batch (hold reason: PolicyViolation)
Messages: procs too high (12 > 8)
PE: 12.00 StartPriority: 7
cannot select job 1277 for partition DEFAULT (job hold active)

Look closely: PolicyViolation: procs too high (12 > 8). The student is asking for twelve processors. Go back and take a look at the qsub command:

qsub -V -l nodes=4:ppn=3 -q student_short my_script

Four nodes multiplied by three processors per node results in twelve processors. But the student is limited to a total of eight processors on the student_short queue:

qmgr -c "list queue student_short" | grep resources_max.ncpus
resources_max.ncpus = 8

Issuing a new qsub command with a slight change (ppn from 3 to 2) will result in a running job:

qsub -V -l nodes=4:ppn=2 -q student_short my_script

Remember to delete the queued job:

qdel 1277

Last updated: August 29, 2012