Problem
My job stays in the PENDING (PD) state and never starts.
Solution
If a job stays pending for an unusually long time, the cause is usually a mistake in the options specified in the submit file, so that no machine satisfies all of the requested criteria. Check the partition characteristics for details about the partitions and their machines.
To see why a specific job is pending, run:
squeue -j <job_id> -O reason
The command prints a reason code that identifies why the job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.
In most cases, jobs will be pending because of Priority or Resources.
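To complement the single-job query, the sketch below lists all pending jobs of one user together with their reason codes. It relies only on standard `squeue` options (`-u`, `-t`, `-o` with the `%i`, `%P`, `%j`, and `%r` format fields). The function name `pending_reasons` is a hypothetical helper introduced for illustration and is not part of Slurm.

```shell
# Hypothetical helper (not part of Slurm): list a user's pending jobs
# with job ID, partition, job name, and pending reason.
# Usage: pending_reasons [username]   (defaults to the current user)
pending_reasons() {
    # -u: restrict the listing to one user
    # -t PENDING: show only jobs in the PENDING state
    # -o: custom output (%i job ID, %P partition, %j job name, %r reason)
    squeue -u "${1:-$USER}" -t PENDING -o "%.18i %.12P %.20j %r"
}
```

For the full record of a single job, including its `Reason=` field, `scontrol show job <job_id>` can be used as well.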
More Pending job reason codes
| Reason code | Description |
|---|---|
| AssociationJobLimit | The job's association has reached its maximum job count. |
| AssociationResourceLimit | The job's association has reached some resource limit. |
| AssociationTimeLimit | The job's association has reached its time limit. |
| BadConstraints | The job's constraints cannot be satisfied. |
| BeginTime | The job's earliest start time has not yet been reached. |
| Cleaning | The job is being re-queued and is still cleaning up from its previous execution. |
| Dependency | This job is waiting for a dependent job to complete. |
| | This job is waiting for a dependent job that will never be completed. |
| InactiveLimit | The job reached the system InactiveLimit. |
| InvalidAccount | The job's account is invalid. |
| InvalidQOS | The job's QOS is invalid. |
| JobHeldAdmin | The job is held by a system administrator. |
| JobHeldUser | The job is held by the user. |
| JobLaunchFailure | The job could not be launched. This may be due to a file system problem, invalid program name, etc. |
| Licenses | The job is waiting for a license. |
| NodeDown | A node required by the job is down. |
| NonZeroExitCode | The job terminated with a non-zero exit code. |
| PartitionDown | The partition required by this job is in a DOWN state. |
| PartitionInactive | The partition required by this job is in an Inactive state and not able to start jobs. |
| PartitionNodeLimit | The number of nodes required by this job is outside its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED. |
| PartitionTimeLimit | The job's time limit exceeds its partition's current time limit. |
| Priority | One or more higher priority jobs exist for this partition or advanced reservation. |
| Prolog | The job's PrologSlurmctld program is still running. |
| QOSJobLimit | The job's QOS has reached its maximum job count. |
| QOSResourceLimit | The job's QOS has reached some resource limit. |
| QOSTimeLimit | The job's QOS has reached its time limit. |
| ReqNodeNotAvail | Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available. |
| Reservation | The job is waiting for its advanced reservation to become available. |
| Resources | The job is waiting for resources to become available. |
| SystemFailure | Failure of the Slurm system, a file system, the network, etc. |
| TimeLimit | The job exhausted its time limit. |
| QOSUsageThreshold | Required QOS threshold has been breached. |
| WaitingForScheduling | No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason. |
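As a rough illustration of how the reason codes can guide the next step, the following sketch maps a code to a suggested action. The function name `triage_reason`, the action labels, and the grouping itself are assumptions made for this example; they are not part of Slurm, and site policy may call for a different response.

```shell
# Hypothetical triage helper (not part of Slurm): map a pending reason code
# to a rough next step. The grouping below is an illustrative assumption.
triage_reason() {
    # Drop any whitespace padding that squeue may add around the code.
    reason=$(printf '%s' "$1" | tr -d '[:space:]')
    case "$reason" in
        Priority|Resources)
            echo "wait" ;;                # normal queueing, no action needed
        BeginTime|Dependency|Reservation|Licenses)
            echo "wait-condition" ;;      # start time, other job, reservation, or license
        Association*Limit|QOS*Limit|QOSUsageThreshold)
            echo "limit-reached" ;;       # an association or QOS limit applies
        BadConstraints|PartitionNodeLimit|PartitionTimeLimit|InvalidAccount|InvalidQOS)
            echo "fix-submit-options" ;;  # review the options in the submit file
        JobHeldUser)
            echo "release-hold" ;;        # e.g. scontrol release <job_id>
        *)
            echo "check-with-admin" ;;    # node, partition, or system issues
    esac
}
```

One possible use is `triage_reason "$(squeue -h -j <job_id> -o %r)"`, where `-h` suppresses the header line and `%r` prints only the reason.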