Why is my job permanently in PENDING mode?

Problem

My job stays in the PENDING state and never starts.

Solution

If a job stays pending for an unusually long time, the usual cause is a mistake in the options specified in the submit file, so that no machine fits all the criteria. Check the partition characteristics for further details about partitions and machines.

To see why a specific job is waiting, run:

squeue -j <job_id> -O reason

A state code will be given, which identifies the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

In most cases, jobs will be pending because of Priority or Resources.
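
The lookup above can also be scripted. The following is a minimal sketch, assuming Python 3.7+ and that squeue is on the PATH. The function names parse_reason and pending_reason are hypothetical, and the short format "-o %r" is used in place of "-O reason" because it prints the same reason field without fixed-width padding. The node name in the comment is a made-up example.

```python
import subprocess


def parse_reason(raw: str) -> str:
    """Return the bare reason code from squeue output (hypothetical helper)."""
    lines = raw.strip().splitlines()
    first = lines[0].strip() if lines else ""
    # Some reasons carry extra detail after a comma, for example
    # "ReqNodeNotAvail, UnavailableNodes:node01"; keep only the code itself.
    return first.split(",", 1)[0].strip()


def pending_reason(job_id: str) -> str:
    """Ask squeue why a job is waiting. Requires Slurm on this machine."""
    result = subprocess.run(
        # -h suppresses the header line, %r prints the reason field.
        ["squeue", "-h", "-j", job_id, "-o", "%r"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_reason(result.stdout)
```

Calling pending_reason("<job_id>") on a login node should return a single code such as Priority or Resources, which can then be looked up in the table below. An empty string means squeue printed nothing for that job.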

More Pending job reason codes

AssociationJobLimit	The job's association has reached its maximum job count.
AssociationResourceLimit	The job's association has reached some resource limit.
AssociationTimeLimit	The job's association has reached its time limit.
BadConstraints	The job's constraints cannot be satisfied.
BeginTime	The job's earliest start time has not yet been reached.
Cleaning	The job is being re-queued and still cleaning up from its previous execution.
Dependency	This job is waiting for a dependent job to complete.
DependencyNeverSatisfied	This job is waiting for a dependent job that will never be completed.
InactiveLimit	The job reached the system InactiveLimit.
InvalidAccount	The job's account is invalid.
InvalidQOS	The job's QOS is invalid.
JobHeldAdmin	The job is held by a system administrator.
JobHeldUser	The job is held by the user.
JobLaunchFailure	The job could not be launched. This may be due to a file system problem, invalid program name, etc.
Licenses	The job is waiting for a license.
NodeDown	A node required by the job is down.
NonZeroExitCode	The job terminated with a non-zero exit code.
PartitionDown	The partition required by this job is in a DOWN state.
PartitionInactive	The partition required by this job is in an Inactive state and not able to start jobs.
PartitionNodeLimit	The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.
PartitionTimeLimit	The job's time limit exceeds its partition's current time limit.
Priority	One or more higher priority jobs exist for this partition or advanced reservation.
Prolog	The job's PrologSlurmctld program is still running.
QOSJobLimit	The job's QOS has reached its maximum job count.
QOSResourceLimit	The job's QOS has reached some resource limit.
QOSTimeLimit	The job's QOS has reached its time limit.
ReqNodeNotAvail	Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
Reservation	The job is waiting for its advanced reservation to become available.
Resources	The job is waiting for resources to become available.
SystemFailure	Failure of the Slurm system, a file system, the network, etc.
TimeLimit	The job exhausted its time limit.
QOSUsageThreshold	Required QOS threshold has been breached.
WaitingForScheduling	No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.
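
As a rough guide to acting on these codes, the sketch below groups them into a few categories. The grouping and the function name suggest_action are assumptions made for illustration. They are not part of Slurm, and your site's policies may call for a different split.

```python
# Hypothetical grouping of the reason codes listed above; not defined by Slurm.
WAIT = frozenset({
    "Priority", "Resources", "BeginTime", "Dependency", "Licenses",
    "Reservation", "Cleaning", "Prolog", "WaitingForScheduling",
})
CHECK_SUBMIT_OPTIONS = frozenset({
    "BadConstraints", "PartitionNodeLimit", "PartitionTimeLimit",
    "InvalidAccount", "InvalidQOS", "DependencyNeverSatisfied",
})
HELD = frozenset({"JobHeldUser", "JobHeldAdmin"})
ASK_ADMIN = frozenset({
    "NodeDown", "PartitionDown", "PartitionInactive",
    "ReqNodeNotAvail", "SystemFailure",
})


def suggest_action(reason: str) -> str:
    """Map a pending reason code to a coarse, assumed action category."""
    if reason in WAIT:
        return "wait"
    if reason in CHECK_SUBMIT_OPTIONS:
        return "check_submit_options"
    if reason in HELD:
        return "held"
    if reason in ASK_ADMIN:
        return "ask_admin"
    # Association* and QOS* codes all describe a limit that has been reached.
    if reason.startswith(("Association", "QOS")):
        return "limit_reached"
    return "unknown"
```

For example, suggest_action("PartitionTimeLimit") returns "check_submit_options", which points back to the time limit requested in the submit file.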
