Problem


My job is permanently in the pending (PD) state.

Solution


If a job stays pending for an unusually long time, the usual cause is a mistake in the options specified in the submit file, so that no machine satisfies all the criteria. Check the partition characteristics for details about the partitions and their machines.

For detailed information about a specific job, run:

squeue -j <job_id> -O reason

A reason code will be given, which identifies why the job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.

In most cases, jobs will be pending because of Priority or Resources.
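To see which reasons dominate across all of your own pending jobs, a small tally can help. The sketch below is only illustrative: the function name summarize_reasons is hypothetical, and it assumes input lines of the form "<job_id> <reason>", as printed by squeue with the format string "%i %r".

```bash
# Tally pending reasons from lines of the form "<job_id> <reason>".
# summarize_reasons is an illustrative helper, not part of Slurm.
summarize_reasons() {
    awk 'NF >= 2 {
        reason = $2
        sub(/,$/, "", reason)   # "ReqNodeNotAvail," -> "ReqNodeNotAvail"
        count[reason]++
    }
    END { for (r in count) print count[r], r }' | sort -k1,1nr -k2,2
}

# Typical use on a cluster (requires Slurm):
#   squeue -u "$USER" -h -t PENDING -o "%i %r" | summarize_reasons
```

Each output line is a count followed by a reason code, most frequent first. If nearly everything is Priority or Resources, the jobs are most likely just queued normally.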


AssociationJobLimit

The job's association has reached its maximum job count.

AssociationResourceLimit

The job's association has reached some resource limit.

AssociationTimeLimit

The job's association has reached its time limit.

BadConstraints

The job's constraints cannot be satisfied.

BeginTime

The job's earliest start time has not yet been reached.

Cleaning

The job is being re-queued and still cleaning up from its previous execution.

Dependency

This job is waiting for a dependent job to complete.

DependencyNeverSatisfied

This job is waiting for a dependent job that will never be completed.

InactiveLimit

The job reached the system InactiveLimit.

InvalidAccount

The job's account is invalid.

InvalidQOS

The job's QOS is invalid.

JobHeldAdmin

The job is held by a system administrator.

JobHeldUser

The job is held by the user.

JobLaunchFailure

The job could not be launched. This may be due to a file system problem, invalid program name, etc.

Licenses

The job is waiting for a license.

NodeDown

A node required by the job is down.

NonZeroExitCode

The job terminated with a non-zero exit code.

PartitionDown

The partition required by this job is in a DOWN state.

PartitionInactive

The partition required by this job is in an Inactive state and not able to start jobs.

PartitionNodeLimit

The number of nodes required by this job is outside of its partition's current limits. Can also indicate that required nodes are DOWN or DRAINED.

PartitionTimeLimit

The job's time limit exceeds its partition's current time limit.
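To confirm this case, compare the job's time limit with the partition's. The sketch below converts Slurm time strings (minutes, MM:SS, HH:MM:SS, D-HH, D-HH:MM, D-HH:MM:SS) to seconds and compares them. The function names are hypothetical helpers, and the handling of "infinite" and "UNLIMITED" assumes those are the strings sinfo and squeue print for a missing limit.

```bash
# Convert a Slurm time string to seconds (illustrative helper, not part of Slurm).
slurm_time_to_seconds() {
    printf '%s\n' "$1" | awk '{
        days = 0; rest = $0; has_days = 0
        if (index(rest, "-") > 0) {          # optional leading "D-"
            split(rest, dp, "-")
            days = dp[1]; rest = dp[2]; has_days = 1
        }
        n = split(rest, t, ":")
        h = 0; m = 0; s = 0
        if (has_days) {                      # D-HH[:MM[:SS]]
            h = t[1]
            if (n >= 2) m = t[2]
            if (n >= 3) s = t[3]
        } else if (n == 3) {                 # HH:MM:SS
            h = t[1]; m = t[2]; s = t[3]
        } else if (n == 2) {                 # MM:SS
            m = t[1]; s = t[2]
        } else {                             # plain minutes
            m = t[1]
        }
        print days * 86400 + h * 3600 + m * 60 + s
    }'
}

# Succeed (exit 0) if the job's limit is longer than the partition's.
job_exceeds_partition_limit() {
    job_limit=$1
    part_limit=$2
    if [ "$part_limit" = "infinite" ]; then return 1; fi   # no partition limit
    if [ "$job_limit" = "UNLIMITED" ]; then return 0; fi   # job asks for no limit
    [ "$(slurm_time_to_seconds "$job_limit")" -gt "$(slurm_time_to_seconds "$part_limit")" ]
}

# Typical use on a cluster (requires Slurm):
#   job_exceeds_partition_limit "$(squeue -j <job_id> -h -o %l)" \
#                               "$(sinfo -p <partition> -h -o %l)" && echo "too long"
```

If the check succeeds, the job's requested time must be lowered or the job moved to a partition with a longer limit before it can start.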

Priority

One or more higher priority jobs exist for this partition or advanced reservation.

Prolog

The job's PrologSlurmctld program is still running.

QOSJobLimit

The job's QOS has reached its maximum job count.

QOSResourceLimit

The job's QOS has reached some resource limit.

QOSTimeLimit

The job's QOS has reached its time limit.

ReqNodeNotAvail

Some node specifically required by the job is not currently available. The node may currently be in use, reserved for another job, in an advanced reservation, DOWN, DRAINED, or not responding. Nodes which are DOWN, DRAINED, or not responding will be identified as part of the job's "reason" field as "UnavailableNodes". Such nodes will typically require the intervention of a system administrator to make available.
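When the reason carries an "UnavailableNodes" part, the node list can be pulled out and checked directly. The sketch below is a minimal example: the function name unavailable_nodes is hypothetical, and it assumes the reason string looks like "ReqNodeNotAvail, UnavailableNodes:<nodelist>", which may vary between Slurm versions and commands.

```bash
# Print the node list that follows "UnavailableNodes:" in a reason string.
# Returns non-zero if the reason does not mention unavailable nodes.
# unavailable_nodes is an illustrative helper, not part of Slurm.
unavailable_nodes() {
    case "$1" in
        *UnavailableNodes:*) printf '%s\n' "${1#*UnavailableNodes:}" ;;
        *) return 1 ;;
    esac
}

# Typical use on a cluster (requires Slurm):
#   nodes=$(unavailable_nodes "$(squeue -j <job_id> -h -o %r)") && sinfo -R -n "$nodes"
```

Here sinfo -R lists the reason recorded for each down or drained node, which is useful information to pass on to the system administrators.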

Reservation

The job is waiting for its advanced reservation to become available.

Resources

The job is waiting for resources to become available.

SystemFailure

Failure of the Slurm system, a file system, the network, etc.

TimeLimit

The job exhausted its time limit.

QOSUsageThreshold

The required QOS threshold has been breached.

WaitingForScheduling

No reason has been set for this job yet. Waiting for the scheduler to determine the appropriate reason.