Schedulers 

Scheduler Module 

Scheduler Plugin Class 

class pavilion.schedulers.SchedulerPlugin(name, description, priority=0)

Bases: IPlugin

The base scheduler plugin class. Scheduler plugins should inherit from this.

ISOLATE_KICKOFF_SUFFIX = '_isolated'

JOB_SHARE_KEY_ATTRS = []

JOB_STATUS_TIMEOUT = 1

KICKOFF_FN = None: If the kickoff script requires a special filename, set it here.

KICKOFF_LOG_DEFAULT_FN = 'kickoff.log'

KICKOFF_SCRIPT_HEADER_CLASS: alias of KickoffScriptHeader

NODE_SELECTION = {'contiguous': <function contiguous>, 'distributed': <function distributed>, 'rand_dist': <function rand_dist>, 'random': <function random>}

PRIO_COMMON = 10

PRIO_CORE = 0

PRIO_USER = 20

SCHED_DATA_FN = 'sched_data.txt': This file holds scheduler data for a test, so that it doesn’t have to be regenerated.

VAR_CLASS: alias of SchedulerVariables

__annotations__ = {'VAR_CLASS': 'Type[SchedulerVariables]', '_job_statuses': 'JobStatusDict'}

__init__(name, description, priority=0): Scheduler plugin that is expected to be overriden by subclasses. The plugin will populate a set of expected ‘sched’ variables.

__module__ = 'pavilion.schedulers.scheduler'

static _add_schedule_script_body(script, test): Add the script body to the given script object. This default simply adds a comment and the test run command.

_available() → bool: Return true if this scheduler is available on this system, false otherwise.

_create_kickoff_script_stub(pav_cfg: PavConfig, job_name: str, sched_config: dict, log_path: Path | None = None, nodes: NodeList | None = None, node_range: Tuple[int, int] | None = None, shebang: str | None = None, isolate: bool = False) → ScriptComposer

Generate the kickoff script essentials preamble common to all scheduled tests.

The ‘nodes’ and ‘node_range’ arguments are mutually exclusive, and one of them is required. See _kickoff().

_get_alloc_nodes(job: Job) → NodeList

Given that this is running on an allocation, return the allocation’s node list.

Parameters: job – The Job is passed in case a scheduler saved node information with the job. Usually not needed.

_get_config_elems() → Tuple[List[ConfigElement], dict, dict]

Return the configuration elements specific to this scheduler, along with a dictionary of validation functions and defaults.

The configuration elements will be configurable under schedule.<plugin_name> in test configs, where <plugin_name> is the name argument passed to the plugin’s __init__ method. This should be a list of yaml_config ConfigElement instances. This additional configuration can have any structure you like, but should not repeat general schedule options.

The second return should be a dict of validators for those elements, the keys being the config element’s name. If a validator is not present, the value will not be added to the dict of validated config values for the scheduler to use. The values should be one of the following:

A tuple of all accepted values. Values not in this tuple are an error.

A function that takes the raw value (including None), and returns a validated/transformed result. Should raise ValueError for invalid input, and return None if no value was given.

A dictionary of additional validators to recursively evaluate. This is used when value is itself a dictionary.

The third returned value should be a dictionary of defaults for each key. Like with the validators, a dictionary can be used to store another level of defaults for when the config key represents a dictionary. Defaults are optional, but keys without a default will have a None value.

These will all be dynamically added and removed from the configuration setup in schedulers.config as plugins are added/removed. For examples, see the slurm plugin or the schedulers.config module itself.
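The three validator forms described above can be illustrated with a small stand-alone sketch. The `validate` helper and the `queue`, `retries`, and `limits` keys are hypothetical and are not part of Pavilion; the sketch only mirrors the rules in the text, and Pavilion's own validation code may differ.

```python
from typing import Any, Callable, Dict, Tuple, Union

Validator = Union[Tuple[Any, ...], Callable[[Any], Any], Dict[str, Any]]


def to_int(raw):
    """Function validator: None passes through, bad input raises ValueError."""
    if raw is None:
        return None
    return int(raw)  # int() raises ValueError on strings like 'abc'


def validate(raw: dict, validators: Dict[str, Validator], defaults: dict) -> dict:
    """Apply validators to raw config values (hypothetical helper)."""
    result = {}
    for key, validator in validators.items():
        value = raw.get(key)
        default = defaults.get(key)
        if isinstance(validator, dict):
            # Nested validators: recurse with the nested defaults.
            result[key] = validate(value or {}, validator, default or {})
            continue
        if value is None:
            value = default
        if isinstance(validator, tuple):
            # Tuple validator: the value must be one of the listed options.
            if value is not None and value not in validator:
                raise ValueError(f"{key}: {value!r} not in {validator}")
            result[key] = value
        else:
            result[key] = validator(value)
    return result


validators = {
    "queue": ("short", "long"),
    "retries": to_int,
    "limits": {"max_jobs": to_int},
}
defaults = {"queue": "short", "limits": {"max_jobs": "4"}}

# 'ignored' has no validator, so it is left out of the validated result.
config = validate({"retries": "3", "ignored": "x"}, validators, defaults)
```

Keys without a validator are dropped and keys without a default end up as None, matching the behaviour the text describes.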

_get_initial_vars(sched_config: dict) → SchedulerVariables: Return the deferred scheduler variable object for the given scheduler config.

_get_kickoff_script_header(job_name: str, sched_config: dict, nodes: None | NodeList, node_range: Tuple[int, int] | None, shebang: str | None = None) → KickoffScriptHeader: Get a script header object for the kickoff script. The nodes are those picked specifically for this job (if empty, the choice is left to the scheduler). The node list may not be comprehensive, but won’t exceed the allocation size.

abstract _job_name(tests: TestRun | List[TestRun]) → str: Given a test, get the name of the job.

_job_status(pav_cfg, job_info: JobInfo) → TestStatusInfo | None

Override this to provide job status information given a job_info dict. The format of the job_info is scheduler dependent, and produced in the kickoff method. This can, optionally, set the job status for all jobs it can at once in the _job_statuses dict, which would greatly reduce the number of calls to the scheduler. It will only be called if a status hasn’t been recently cached.

It should return a TestStatusInfo object with one of these states:

SCHEDULED - The job is still waiting for an allocation.
SCHED_ERROR - The job is dead because of some error.
SCHED_CANCELLED - The job was cancelled.
SCHED_STARTUP - The job has started, but not the test.

The following is deprecated, and will be silently converted into SCHED_STARTUP:

SCHED_RUNNING - The job is running (but not usually the test yet).

Lastly, this may return None when we can’t determine the test state at all. This typically happens when job IDs don’t stick around after a test finishes, so we don’t have any information on it. This isn’t an error, more like a shrug.
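The caching behaviour described above (set every job status you can in one scheduler query, and only query again when nothing recent is cached) might look roughly like the following. This is a simplified sketch: `StatusCache`, `query_all`, and the injected `clock` are hypothetical names, and plain strings stand in for TestStatusInfo objects.

```python
import time
from typing import Callable, Dict, Optional, Tuple

JOB_STATUS_TIMEOUT = 1  # Seconds, matching the class attribute documented above.


class StatusCache:
    """Hypothetical cache: one scheduler query can fill in many job statuses."""

    def __init__(self, query_all: Callable[[], Dict[str, str]],
                 clock: Callable[[], float] = time.monotonic):
        self._query_all = query_all
        self._clock = clock
        self._statuses: Dict[str, Tuple[float, str]] = {}

    def get(self, job_id: str) -> Optional[str]:
        now = self._clock()
        cached = self._statuses.get(job_id)
        if cached is not None and now - cached[0] < JOB_STATUS_TIMEOUT:
            return cached[1]
        # Cache miss or stale entry: ask the scheduler about every job at once.
        for found_id, state in self._query_all().items():
            self._statuses[found_id] = (now, state)
        fresh = self._statuses.get(job_id)
        if fresh is None or fresh[0] != now:
            return None  # The scheduler no longer knows this job: the "shrug".
        return fresh[1]


calls = []


def fake_query() -> Dict[str, str]:
    """Stand-in for a real scheduler query."""
    calls.append(1)
    return {"101": "SCHEDULED", "102": "SCHED_STARTUP"}


cache = StatusCache(fake_query, clock=lambda: 100.0)  # Fixed clock for the demo.
first = cache.get("101")
second = cache.get("102")   # Served from the cache filled by the first call.
missing = cache.get("999")  # Unknown job: triggers a second query, returns None.
```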

_kickoff(pav_cfg, job: Job, sched_config: dict, job_name: str, nodes: NodeList | None = None, node_range: Tuple[int, int] | None = None) → JobInfo

Schedule the test under this scheduler.

Parameters:

job_name – The job should be named this under the scheduler, if possible.
sched_config – The (validated) scheduler config.
nodes – A list of specific nodes to kickoff the test under. This list, when provided, should be comprehensive. A non-comprehensive list of nodes to include can be provided via the config[‘include_nodes’] option.
node_range – A tuple of the (min, max) nodes to request.

Returns:

The job info of the kicked off job.

Exactly one of ‘nodes’ and ‘node_range’ must be provided.

By default, scheduler plugins will produce a bash kickoff script for you to hand to the scheduler to run. You should generally rely on this autogenerated script to do the right thing internally.

If your scheduler uses script header information to figure out the parameters for a job, you can provide that by overriding KICKOFF_SCRIPT_HEADER_CLASS class variable and providing your own KickoffScriptHeader sub-class. Otherwise, you may simply pass the appropriate information to your scheduler through command line or API arguments.

Pavilion expects you to ask for an allocation as specified in the schedule section. You should include the following, assuming your scheduler has such a concept (or an equivalent concept):

‘partition’ - The sub-division of nodes to run on.

‘reservation’ - Set of reserved nodes to run on. (Pavilion assumes that nodes in a reservation are only available if the reservation is specified.)

‘qos’ - Quality of Service run group. Determines allocation limits along with ‘account’.

‘account’ - Account that tracks node sharing information.

‘time_limit’ - Provided in ‘seconds’. Overall job time limit.

‘nodes’ - The number of nodes to request.

‘include_nodes’ - A list of nodes to include. These nodes are guaranteed to be included, but the final set of nodes may include other nodes as well. When ‘nodes’ is provided, these will already be included.

‘exclude_nodes’ - The list of nodes to exclude. When ‘nodes’ is provided, these will already be excluded.

‘across_nodes’ - A complete list of nodes to use. When provided, these nodes, and only these nodes (or potentially a subset of them) will be used for scheduling.

‘node_range’ - From the ‘node_range’ argument. The minimum and maximum number of nodes to request.

‘job_name’ - What to label the job.

Any output from the scheduler should be written to ‘job.sched_log’.
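As an illustration of passing these options on the command line, the sketch below translates a schedule config into sbatch arguments. The `build_sbatch_args` and `seconds_to_hms` helpers are hypothetical and are not how Pavilion's Slurm plugin is implemented (it uses a script header class); only the sbatch flags themselves are real.

```python
from typing import List, Optional, Tuple


def seconds_to_hms(seconds: int) -> str:
    """Format seconds as H:MM:SS, a form sbatch accepts for --time."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def build_sbatch_args(sched_config: dict, job_name: str,
                      nodes: Optional[List[str]] = None,
                      node_range: Optional[Tuple[int, int]] = None) -> List[str]:
    """Translate schedule options into sbatch arguments (illustrative only)."""
    if (nodes is None) == (node_range is None):
        raise ValueError("Provide exactly one of 'nodes' or 'node_range'.")

    args = [f"--job-name={job_name}"]
    for key in ("partition", "reservation", "qos", "account"):
        if sched_config.get(key):
            args.append(f"--{key}={sched_config[key]}")
    if sched_config.get("time_limit") is not None:
        # 'time_limit' is provided in seconds.
        args.append(f"--time={seconds_to_hms(sched_config['time_limit'])}")

    if nodes is not None:
        # A comprehensive node list: includes and excludes are already applied.
        args.append(f"--nodes={len(nodes)}")
        args.append(f"--nodelist={','.join(nodes)}")
    else:
        args.append(f"--nodes={node_range[0]}-{node_range[1]}")
        if sched_config.get("include_nodes"):
            args.append(f"--nodelist={','.join(sched_config['include_nodes'])}")
        if sched_config.get("exclude_nodes"):
            args.append(f"--exclude={','.join(sched_config['exclude_nodes'])}")
    return args


args = build_sbatch_args({"partition": "standard", "time_limit": 5400},
                         "pav_job", node_range=(2, 4))
```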

_make_kickoff_error(orig_err, tests): Convert a generic error to something with more information.

activate(): Add this plugin to the scheduler plugin list.

available() → bool: Returns true if this scheduler is available on this host.

cancel(job_info: JobInfo) → str | None

Do your best to cancel the given job. A return of None denotes success.

Returns: None, or a message stating why the job couldn’t be cancelled.

abstract create_kickoff_script(pav_cfg: PavConfig, tests: TestRun | List[TestRun], log_path: Path | None = None, nodes: Optional = None, isolate: bool = False) → ScriptComposer: Create the kickoff script.

deactivate(): Remove this plugin from the scheduler plugin list.

gen_job_share_key(sched_config, min_nodes, max_nodes) → Tuple

Generate a job sharing key. Tests with the same key are considered eligible to share a job with each other.

This always depends on the min and max nodes (which will always be the first and second values), plus whatever keys the scheduler plugin gives in JOB_SHARE_KEY_ATTRS.
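A minimal sketch of how such a key could be assembled follows. The `lookup` and `freeze` helpers are hypothetical, and resolving dotted names such as 'slurm.features' as nested config keys is an assumption made for the example, not confirmed Pavilion behaviour.

```python
from typing import Any, Tuple

JOB_SHARE_KEY_ATTRS = ["partition", "reservation", "account", "qos", "slurm.features"]


def lookup(config: dict, dotted_key: str) -> Any:
    """Resolve a key such as 'slurm.features' in a nested dict (assumed format)."""
    value: Any = config
    for part in dotted_key.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(part)
    return value


def freeze(value: Any) -> Any:
    """Make lists hashable so the key can be used in a dict or set."""
    if isinstance(value, list):
        return tuple(freeze(item) for item in value)
    return value


def job_share_key(sched_config: dict, min_nodes: int, max_nodes: int) -> Tuple:
    """Min and max nodes come first, then one entry per configured attribute."""
    attrs = [freeze(lookup(sched_config, attr)) for attr in JOB_SHARE_KEY_ATTRS]
    return (min_nodes, max_nodes, *attrs)


cfg_a = {"partition": "standard", "qos": "normal", "slurm": {"features": ["gpu"]}}
cfg_b = dict(cfg_a, time_limit=600)       # Differs only in a non-key option.
cfg_c = dict(cfg_a, partition="debug")    # Differs in a key attribute.

key_a = job_share_key(cfg_a, 1, 4)
key_b = job_share_key(cfg_b, 1, 4)
key_c = job_share_key(cfg_c, 1, 4)
```

Here `key_a` and `key_b` are equal, so those two tests could share a job, while `key_c` differs because the partition is part of the key.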

get_conf() → Tuple[KeyedElem | None, dict, dict]: Return the configuration object suitable for adding scheduler specific keys under ‘scheduler.<scheduler_name>’ in the test configuration.

get_final_vars(test: TestRun) → SchedulerVariables: Get the final, non-deferred scheduler variables for a test. This should only be called within an allocation.

get_initial_vars(raw_sched_config: dict) → SchedulerVariables

Queries the scheduler to auto-detect its current state, and returns the dictionary of scheduler variables for that test given its config.

Parameters: raw_sched_config – The raw scheduler config for a given test.
Returns: A tuple of the scheduler variables object and the node_list_id, which should be saved as part of the test config.

job_status(pav_cfg, test) → TestStatusInfo

Get the job state from the scheduler, and map it to one of the following states: SCHEDULED, SCHED_ERROR, SCHED_CANCELLED, SCHED_STARTUP. This should only be called if the current recorded test state is ‘SCHEDULED’.

The first SCHED_ERROR or SCHED_CANCELLED status encountered will be saved to the test status file. Other statuses are never saved. The test will also be set as complete in this case.

Parameters:

pav_cfg – The pavilion configuration.
test (pavilion.test_run.TestRun) – The test we’re checking on.

Returns:

A StatusInfo object representing the status.

refresh(): Clear gathered scheduler information, generally to force the scheduler to re-gather info and node lists.

register_core_plugins(): Find and activate all builtin plugins.

schedule_tests(pav_cfg, tests: List[TestRun]): Schedule each test using this scheduler.

Scheduler Plugin Basic Class 

class pavilion.schedulers.SchedulerPluginBasic(name, description, priority=0)

Bases: SchedulerPlugin, ABC

A Scheduler plugin that does not support automatic node inventories. It relies on manually set parameters in ‘schedule.cluster_info’.

IS_CONCURRENT = True

__abstractmethods__ = frozenset({})

__annotations__ = {'VAR_CLASS': 'Type[SchedulerVariables]', '_job_statuses': 'JobStatusDict'}

__module__ = 'pavilion.schedulers.basic'

_abc_impl = <_abc._abc_data object>

_get_alloc_node_info(node_name) → NodeInfo: Given that this is running on an allocation, get information about the given node. While this is completely optional, it can help pavilion better populate variables like ‘test_min_cpus’ and ‘test_min_mem’.

_get_initial_vars(sched_config: dict) → SchedulerVariables: Get the initial variables for the basic scheduler.

_job_name(tests: List[TestRun]) → str: Given a test, get the name of the job.

create_kickoff_script(pav_cfg: PavConfig, tests: TestRun | List[TestRun], log_path: Path | None = None, nodes: NodeSet | None = None, isolate: bool = False) → ScriptComposer: Create the kickoff script.

get_final_vars(test: TestRun) → SchedulerVariables: Gather node information from within the allocation.

schedule_tests(pav_cfg, tests: List[TestRun]) → List[SchedulerPluginError]: Schedule all test tests in a single job kickoff script.

Scheduler Plugin Advanced Class 

class pavilion.schedulers.SchedulerPluginAdvanced(name, description, priority=10)

Bases: SchedulerPlugin, ABC

A scheduler plugin that supports automatic node inventories, and as a consequence chunking and other advanced features.

JOB_SHARE_KEY_ATTRS = ['partition', 'reservation', 'account', 'qos']

__abstractmethods__ = frozenset({})

__annotations__ = {'VAR_CLASS': 'Type[SchedulerVariables]', '_job_statuses': 'JobStatusDict'}

__init__(name, description, priority=10): Initialize tracking of node info and chunks, in addition to the basics.

__module__ = 'pavilion.schedulers.advanced'

_abc_impl = <_abc._abc_data object>

_filter_custom(sched_config: dict, node_name: str, node: NodeInfo) → None | str: Apply scheduler specific filters to the node list. Returns a reason why the node should be filtered out, or None if it shouldn’t be.

_filter_nodes(sched_config: Dict[str, Any]) → Tuple[NodeList, Dict[str, List[str]]]

Filter the system nodes down to just those we can use. This should check to make sure the nodes available are compatible with the test. The arguments for this function will vary by scheduler.

Returns: A list of compatible node names.
Return type: list

_get_chunks(node_list_id, sched_config) → List[NodeSet]

Chunking is specific to the node list, chunk size, and node selection settings of a job. The actual chunk used by a test_run won’t be known until after the test is at least partially resolved, however. Until then, it only knows what chunks are available.

This method retrieves or creates a list of ChunkInfo objects, and returns it.
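To make the idea of chunking concrete, here is a small sketch of two ways a filtered node list could be divided into fixed-size chunks. The function names echo the ‘contiguous’ and ‘distributed’ entries in NODE_SELECTION, but the exact semantics shown (including discarding leftover nodes) are an interpretation for illustration, not Pavilion's actual implementation.

```python
from typing import List


def contiguous_chunks(nodes: List[str], chunk_size: int) -> List[List[str]]:
    """Each chunk is a run of adjacent nodes from the list."""
    return [nodes[i:i + chunk_size]
            for i in range(0, len(nodes) - chunk_size + 1, chunk_size)]


def distributed_chunks(nodes: List[str], chunk_size: int) -> List[List[str]]:
    """Each chunk takes every Nth node, spreading it across the list."""
    chunk_count = len(nodes) // chunk_size
    usable = nodes[:chunk_count * chunk_size]
    return [usable[i::chunk_count] for i in range(chunk_count)]


nodes = [f"node{n:02d}" for n in range(1, 8)]  # node01 .. node07

contig = contiguous_chunks(nodes, 3)
spread = distributed_chunks(nodes, 3)
chunk_ids = [str(i) for i in range(len(contig))]  # Compare the 'chunk_ids' variable.
```

With seven nodes and a chunk size of three, both functions yield two chunks and leave one node unused in this sketch.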

_get_initial_vars(sched_config: dict) → SchedulerVariables: Get initial variables (and chunks) for this scheduler.

_get_raw_node_data(sched_config) → Tuple[List[Any], Any]

Get the raw data for the nodes on the current cluster/host.

Returns: A list of raw data for each node (to be processed by _transform_raw_node_data), and an object (of any type) of data that applies to every node.

_get_system_inventory(sched_config: dict) → Nodes | None: Returns a dictionary of node data, or None if the scheduler does not support node data acquisition.

_job_name(tests: TestRun | List[TestRun]) → str: Given a test, get the name of the job.

_make_chunk_group_id(node_list_id, sched_config): Generate a ‘chunk_group_id’ - a tuple of values that denote a unique type of chunk.

_schedule_chunk(pav_cfg, chunk: NodeSet, tests: List[TestRun], sched_configs: Dict[str, dict]) → List[SchedulerPluginError]

Schedule all the tests that belong to a given chunk. Group tests that can be scheduled in a shared allocation together.

Returns: A list of encountered errors.

_schedule_indi_chunk(pav_cfg, tests: List[TestRun], sched_configs: Dict[str, dict], chunk: NodeSet): Schedule tests individually under the given chunk. These are not flex scheduled.

_schedule_indi_flex(pav_cfg, tests: List[TestRun], sched_configs: Dict[str, dict], chunk: NodeSet) → List[SchedulerPluginError]: Schedule tests individually in ‘flexible’ allocations, where the scheduler picks the nodes.

_schedule_shared(pav_cfg, tests: List[TestRun], node_range: NodeRange, sched_configs: Dict[str, dict], chunk: NodeSet) → List[SchedulerPluginError]: Schedule tests in a shared allocation. This allocation will use chunking when enabled, or allow the scheduler to pick the nodes otherwise.

_transform_raw_node_data(sched_config, node_data, extra) → NodeInfo

Transform the raw node data into a node info dictionary. Not all keys are required, but you must provide enough information to filter out nodes that can’t be used or to differentiate nodes that can’t be used together. You may return additional keys, typically to use with scheduler specific filter parameters.

Base supported keys:

# Node Name (required)

name - The name of the node (from the scheduler’s perspective)

# Node Status (required)

up (bool) - Whether the node is up (allocatable).
available (bool) - Whether the node is allocatable and unallocated.

# Informational

cpus - The number of CPUs on the node.
mem - The node memory in GB

# Partitions - this information is used to separate nodes into groups that can be allocated together. If this information is lacking, Pavilion will attempt to create allocations that aren’t possible on a system, such as across partitions.

partitions (List) - The cluster partitions on which the node resides.
reservations (List) - List of reservations to which the node belongs.
features (List[str]) - A list of feature tags that differentiate nodes, typically on heterogeneous systems.
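The following sketch shows what a transform into the base keys might look like. The raw record format (`hostname`, `state`, `cpu_count`, `mem_mb`, and so on) is invented for the example, and a plain dict stands in for NodeInfo; a real plugin would work from whatever its scheduler reports.

```python
from typing import Any, Dict, List


def split_list(value: str) -> List[str]:
    """Split a comma separated string, dropping empty entries."""
    return [item for item in value.split(",") if item]


def transform_raw_node(raw: Dict[str, str]) -> Dict[str, Any]:
    """Build a node info dict with the base keys from a made-up raw record."""
    state = raw.get("state", "").lower()
    return {
        "name": raw["hostname"],
        "up": state in ("idle", "allocated"),
        "available": state == "idle",
        "cpus": int(raw["cpu_count"]),
        "mem": int(raw["mem_mb"]) / 1024,  # MB -> GB
        "partitions": split_list(raw.get("partitions", "")),
        "reservations": split_list(raw.get("reservations", "")),
        "features": split_list(raw.get("features", "")),
    }


raw = {
    "hostname": "node01",
    "state": "IDLE",
    "cpu_count": "36",
    "mem_mb": "131072",
    "partitions": "standard,debug",
    "features": "gpu",
}
info = transform_raw_node(raw)
```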

create_kickoff_script(pav_cfg: PavConfig, tests: TestRun | List[TestRun], log_path: Path | None = None, nodes: NodeSet | None = None, isolate: bool = False) → ScriptComposer: Create the kickoff script.

get_final_vars(test: TestRun) → SchedulerVariables: Load our saved node data from kickoff time, and compute the final scheduler variables from that.

refresh(): Clear all internal state variables.

schedule_tests(pav_cfg, tests: List[TestRun]) → List[SchedulerPluginError]

Schedule each of the given tests using this scheduler using a separate allocation (if applicable) for each.

Parameters:

pav_cfg – The pavilion config
tests ([pavilion.test_run.TestRun]) – A list of pavilion tests to schedule.

Returns:

A list of Scheduler errors encountered when starting tests.

Scheduler Variables 

class pavilion.schedulers.SchedulerVariables(sched_config: dict, nodes: Nodes | None = None, chunks: List[NodeSet] | None = None, node_list_id: int | None = None, deferred=True)

Bases: VarDict

The base scheduler variables class. Each scheduler should have a child class of this that contains all the variable functions it provides.

To add a scheduler variable, create a method and decorate it with either @var_method or @dfr_var_method. Only methods tagged with either of those will be given as variables, so you’re free to create any support methods as needed.

Variables should be given lower case names, with words separated with underscores. Variables that are prefixed with ‘test_*’ are specific to a given test (or job). These are often deferred. All other variables should not be deferred.

This class is meant to be inherited from - each scheduler can provide its own set of variables in addition to these defaults, and may also provide different implementations of each variable. Most schedulers can get away with overriding one variable - the ‘test_cmd’ method. See the documentation for that method below for more information.

Return values of all variables should be the same format as those allowed by regular test variables: a string, a list of strings, a dict (with string keys and values), or a list of such dicts.

Scheduler variables are requested once per test run by Pavilion when it is created, and again for each test right before it runs on an allocation in order to un-defer values.
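The tagging idea behind @var_method can be shown with a toy re-implementation. Nothing below is Pavilion code: the `var_method` decorator and `ToyVars` class are simplified stand-ins written for this example, and the real VarDict machinery (including deferred variables) does considerably more.

```python
def var_method(func):
    """Toy stand-in for Pavilion's decorator: tag the method as a variable."""
    func.is_var = True
    return func


class ToyVars:
    """Minimal stand-in for a variables class; not Pavilion's VarDict."""

    def __init__(self, sched_config: dict):
        self._sched_config = sched_config

    def keys(self):
        """Only tagged methods are exposed as variables."""
        return sorted(
            name for name, member in vars(type(self)).items()
            if getattr(member, "is_var", False)
        )

    def __getitem__(self, key: str):
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)()

    @var_method
    def partition(self) -> str:
        return self._sched_config.get("partition") or ""

    @var_method
    def tasks_per_node(self) -> str:
        # Variable values are strings, like regular test variables.
        return str(self._sched_config.get("tasks_per_node", 1))

    def _helper(self):
        """Untagged support method: never listed as a variable."""
        return None


toy_vars = ToyVars({"partition": "standard"})
names = toy_vars.keys()
```

Only the two decorated methods appear in `names`; the untagged `_helper` method stays private to the class.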

DEFER_ERRORS = True: Each scheduler variable class should provide an example set of values for itself to display when using ‘pav show’ to list the variables. These are easily obtained by running a test under the scheduler, and then harvesting the results of the test run.

EXAMPLE = {'chunk_ids': ['0', '1', '2', '3'], 'errors': ['oh no, there was an error.'], 'node_list': ['node01', 'node03', 'node04'], 'srun_args': '--account=myaccount --partition=mypart --qos=myqos ...', 'status_info': '', 'tasks': '35', 'tasks_per_node': '5', 'tasks_total': '180', 'test_min_cpus': '4', 'test_min_mem': '32', 'test_node_list': ['node02', 'node04'], 'test_nodes': '45'}

NO_EXAMPLE = '<no example>'

__abstractmethods__ = frozenset({})

__annotations__ = {}

__init__(sched_config: dict, nodes: Nodes | None = None, chunks: List[NodeSet] | None = None, node_list_id: int | None = None, deferred=True)

Initialize the scheduler var dictionary. This is initialized differently when preliminary variables are gathered than when the variables are no longer deferred. Initial variables are based on the full node list and the given list of chunks. For deferred variables, however, the nodes only contain those nodes that are part of the actual allocation. ‘chunks’ is not given in this case.

Parameters:

nodes – The dict of node names to node data. If None, will default to an empty dict.
sched_config – The scheduler configuration for the corresponding test.
chunks – The list of chunks, each of which is a list of node names. If None, will default to an empty list.
node_list_id – Should always be included when chunks is included. Provides the scheduler with a way to recover the original node list that was chunked without having to store it.
deferred – Whether the variables are deferred.

__module__ = 'pavilion.schedulers.vars'

__repr__(): Return repr(self).

_abc_impl = <_abc._abc_data object>

_get_min(nodes: List[NodeInfo], attr: str, default: int): Get the minimum of the given attribute across the list of nodes, settling for the cluster_info value, and then the default.

_test_cmd()

The command to prepend to a line to kick it off under the scheduler.

This should return the command needed to start one or more MPI processes within an existing allocation. This is often mpirun, but may be something more specific. This command should be given options, as appropriate, such that the MPI process is started with the options specified in self._sched_config. In most cases, these options won’t be necessary, as the MPI command will simply inherit what was provided when the job was created.

Tests may share jobs. While node selection and other high level settings will be identical for each test, Pavilion reserves the option to allow tests to run with modified node-lists within an allocation. This means you should always specify that

account(): The scheduler account as defined in the scheduler configs.

chunk_ids(): A list of indices of the available chunks.

chunk_size(): The size of each chunk.

concurrent_default(): The default level of concurrency for the scheduler.

info(key): Get the info dict for the given key, and add the example to it.

min_cpus(): Get a minimum number of cpus available on each (filtered) noded. Defaults to 1 if unknown.

min_mem(): Get a minimum for any node across each (filtered) nodes. Returns a value in bytes (4 GB if unknown).

mpirun_opts(): Sets up mpirun command with user-defined options.

node_list() → NodeList: The list of node names that the test could run on, after filtering, as per the ‘nodes’ variable.

node_list_id(): Return the node list id, if available. This is meaningless to test configs, but is used internally by Pavilion.

nodes() → int: The number of nodes that a test may run on, after filtering according to the test’s ‘schedule’ section. The actual nodes selected for the test, whether selected by Pavilion or the scheduler itself, will be in ‘test_nodes’.

partition(): This variable provides extra status info for a test. It is particularly meant to be overridden by plugins.

qos() → str: Return the QOS as defined in the scheduler config.

requested_nodes(): Number of requested nodes.

reservation() → str: Return the reservation as defined in the scheduler config.

srun_args() → str: A composed list of arguments for srun, based on the test’s scheduler configuration. These are generally meant for use with the ‘raw’ scheduler to make allocations. It only accounts for nodes, account, partition, qos, and reservation.

tasks_per_node() → int: The number of tasks to create per node. If the scheduler does not support node info, just returns 1.

tasks_total() → int: The total number of tasks for the job, either as defined by ‘tasks’ or by the tasks_per_node and number of nodes.

test_cmd(): Calls the actual test command and then wraps the result with the wrapper provided in the schedule section of the configuration.

test_min_cpus(): The min cpus for each node in the chunk. Defaults to 1 if no info is available.

test_min_mem(): The min memory for each node in the chunk in bytes. Defaults to 4 GB if no info is available.

test_node_list() → NodeList: The list of nodes by name allocated for this test. Note that more nodes than this may exist in the allocation.

test_nodes() → int: The number of nodes for this specific test, determined once the test has an allocation. Note that the allocation size may be larger than this number.

Scheduler Plugins 

Slurm 

Slurm Variables 

class pavilion.schedulers.plugins.slurm.SlurmVars(sched_config: dict, nodes: Nodes | None = None, chunks: List[NodeSet] | None = None, node_list_id: int | None = None, deferred=True)

Bases: SchedulerVariables

Scheduler variables for the Slurm scheduler.

EXAMPLE = {'chunk_ids': ['0', '1', '2', '3'], 'errors': ['oh no, there was an error.'], 'node_list': ['node01', 'node03', 'node04'], 'srun_args': '--account=myaccount --partition=mypart --qos=myqos ...', 'status_info': '', 'tasks': '35', 'tasks_per_node': '5', 'tasks_total': '180', 'test_cmd': 'srun -N 5 -w node[05-10],node23 -n 20', 'test_min_cpus': '4', 'test_min_mem': '32', 'test_node_list': ['node02', 'node04'], 'test_nodes': '45'}

test_cmd(): Calls the actual test command and then wraps the return with the wrapper provided in the schedule section of the configuration.

Slurm Scheduler Plugin 

class pavilion.schedulers.plugins.slurm.SbatchHeader(job_name: str, sched_config: Dict[str, Any], sched_vars: SchedulerVariables, nodes: NodeList | None = None, node_range: Tuple[int, int] | None = None, shebang: str | None = None)

Bases: KickoffScriptHeader

Provides header information specific to sbatch files for the slurm kickoff script.

__annotations__ = {}

__module__ = 'pavilion.schedulers.plugins.slurm'

_kickoff_lines() → List[str]: Get the sbatch header lines.

class pavilion.schedulers.plugins.slurm.Slurm

Bases: SchedulerPluginAdvanced

Schedule tests with Slurm!

JOB_SHARE_KEY_ATTRS = ['partition', 'reservation', 'account', 'qos', 'slurm.sbatch_extra', 'slurm.features']

KICKOFF_SCRIPT_HEADER_CLASS: alias of SbatchHeader

MPIRUN_BIND_OPTS = ('slot', 'hwthread', 'core', 'L1cache', 'L2cache', 'L3cache', 'socket', 'numa', 'board', 'node')

MPI_CMD_MPIRUN = 'mpirun'

MPI_CMD_OPTIONS = ('srun', 'mpirun')

MPI_CMD_SRUN = 'srun'

NODE_BRACKET_FORMAT_RE = re.compile('([a-zA-Z](?:[a-zA-Z0-9_-]*[a-zA-Z_-])?\\d*)\\[(.*)]')

NODE_LIST_RE = re.compile('[a-zA-Z](?:[a-zA-Z0-9_-]*[a-zA-Z_-])?(?:\\d*|\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\]))(?:,[a-zA-Z](?:[a-zA-Z0-9_-]*[a-zA-Z_-])?(?:\\d*|\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\)

NODE_SEQ_REGEX_STR = '[a-zA-Z](?:[a-zA-Z0-9_-]*[a-zA-Z_-])?(?:\\d*|\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\]))'

SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']

SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']

SCHED_OTHER = ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED', 'RESIZING', 'SIGNALING', 'SUSPENDED']

SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']

SCHED_WAITING = ['CONFIGURING', 'PENDING']
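The state lists above suggest a mapping from Slurm job states to Pavilion scheduler states along the following lines. This is a sketch under assumptions: the order of the checks, and returning None for anything unlisted, are choices made for the example rather than the plugin's confirmed behaviour.

```python
from typing import Optional

SCHED_WAITING = ['CONFIGURING', 'PENDING']
SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']
SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']
SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'PREEMPTED',
               'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']


def map_slurm_state(slurm_state: str) -> Optional[str]:
    """Map a Slurm job state to a Pavilion scheduler state name."""
    if slurm_state in SCHED_WAITING:
        return 'SCHEDULED'
    if slurm_state in SCHED_RUN:
        return 'SCHED_STARTUP'
    # 'DEADLINE' and 'PREEMPTED' are in both lists; checking the cancelled
    # list first is an arbitrary choice made for this sketch.
    if slurm_state in SCHED_CANCELLED:
        return 'SCHED_CANCELLED'
    if slurm_state in SCHED_ERROR:
        return 'SCHED_ERROR'
    return None  # Assumed: states without a clear mapping are left undetermined.


pending = map_slurm_state('PENDING')
running = map_slurm_state('RUNNING')
timed_out = map_slurm_state('TIMEOUT')
```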

SCONTROL_KEY_RE = re.compile('(?:^|\\s+)([A-Z][a-zA-Z0-9:/]*)=')

SCONTROL_WS_RE = re.compile('\\s+')

VAR_CLASS: alias of SlurmVars

__abstractmethods__ = frozenset({})

__annotations__ = {'VAR_CLASS': 'Type[SchedulerVariables]', '_job_statuses': 'JobStatusDict'}

__init__(): Initialize tracking of node info and chunks, in addition to the basics.

__module__ = 'pavilion.schedulers.plugins.slurm'

_abc_impl = <_abc._abc_data object>

_available() → bool: Looks for several slurm commands, and tests slurm can talk to the slurm db.

_filter_custom(sched_config: dict, node_name: str, node: NodeInfo) → str | None: Filter nodes by features. Returns why a node should be filtered out, or None if it shouldn’t be.
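A feature filter that follows this return convention could be sketched as below. The `filter_by_features` function is hypothetical and takes the required features directly rather than reading them from the scheduler config; the plugin's real matching rules may be different.

```python
from typing import List, Optional


def filter_by_features(required: List[str], node_name: str, node: dict) -> Optional[str]:
    """Return a reason to filter the node out, or None to keep it."""
    node_features = node.get("features", [])
    missing = [feat for feat in required if feat not in node_features]
    if missing:
        return f"{node_name} lacks required features: {', '.join(missing)}"
    return None


kept = filter_by_features(["gpu"], "node01", {"features": ["gpu", "bigmem"]})
reason = filter_by_features(["gpu", "nvme"], "node02", {"features": ["nvme"]})
```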

_get_alloc_nodes(job) → NodeList: Get the list of allocated nodes.

_get_config_elems()

Return the configuration elements specific to this scheduler, along with a dictionary of validation functions and defaults.

The configuration elements will be configurable under schedule.<plugin_name> in test configs, where <plugin_name> is the name argument passed to the plugin’s __init__ method. This should be a list of yaml_config ConfigElement instances. This additional configuration can have any structure you like, but should not repeat general schedule options.

The second return should be a dict of validators for those elements, the keys being the config element’s name. If a validator is not present, the value will not be added to the dict of validated config values for the scheduler to use. The values should be one of the following:

A tuple of all accepted values. Values not in this tuple are an error.

A function that takes the raw value (including None), and returns a validated/transformed result. Should raise ValueError for invalid input, and return None if no value was given.

A dictionary of additional validators to recursively evaluate. This is used when value is itself a dictionary.

The third returned value should be a dictionary of defaults for each key. Like with the validators, a dictionary can be used to store another level of defaults for when the config key represents a dictionary. Defaults are optional, but keys without a default will have a None value.

These will all be dynamically added and removed from the configuration setup in schedulers.config as plugins are added/removed. For examples, see the slurm plugin or the schedulers.config module itself.

_get_raw_node_data(sched_config) → Tuple[List[Any] | None, Any]: Use the scontrol show node command to collect data on nodes. Types are converted according to self.FIELD_TYPES.

_job_status(pav_cfg, job_info: JobInfo) → TestStatusInfo: Get the current status of the slurm job for the given test.

_kickoff(pav_cfg, job: Job, sched_config: dict, job_name: str, nodes: NodeList | None = None, node_range: Tuple[int, int] | None = None) → JobInfo: Submit the kick off script using sbatch.

_scontrol_parse(section: str) → Dict[str, str]

_scontrol_show(*args, timeout=30) → List[Dict]

Run scontrol show and return the parsed output.

Parameters:

args (list(str)) – Additional args to scontrol.
timeout (int) – How long to wait for results.
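To show how SCONTROL_KEY_RE can split scontrol's ‘Key=value’ output, here is a simplified stand-alone parser. It reuses the pattern listed above, but the `scontrol_parse` function and the sample text are written for this example and may not match the plugin's own parsing.

```python
import re
from typing import Dict

SCONTROL_KEY_RE = re.compile(r'(?:^|\s+)([A-Z][a-zA-Z0-9:/]*)=')


def scontrol_parse(section: str) -> Dict[str, str]:
    """Split 'Key=value Key2=value2' text into a dict (simplified sketch)."""
    matches = list(SCONTROL_KEY_RE.finditer(section))
    result = {}
    for idx, match in enumerate(matches):
        # A value runs from the end of 'Key=' to the start of the next key.
        end = matches[idx + 1].start() if idx + 1 < len(matches) else len(section)
        result[match.group(1)] = section[match.end():end].strip()
    return result


sample = "NodeName=node01 Arch=x86_64 CPUTot=36\n   State=IDLE Partitions=standard,debug"
parsed = scontrol_parse(sample)
```

Because a value extends up to the next key match, values that contain spaces are kept whole as long as they do not themselves contain a capitalised ‘Word=’ sequence.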

_transform_raw_node_data(sched_config, node_data, extra) → NodeInfo: Translate the gathered data into a NodeInfo dict.

cancel(job_info: JobInfo) → str | None: Scancel the job attached to the given test.

classmethod parse_node_list(node_list) → NodeList: Convert a slurm format node list into a list of nodes, and throw errors that help the user identify their exact mistake.
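For readers unfamiliar with the Slurm node list format, the sketch below expands a string such as ‘node[05-07,12],node23’ into individual names. It is a simplified illustration with hypothetical helper names: it handles a single bracket group per section and is not the plugin's parse_node_list implementation.

```python
import re
from typing import List

BRACKET_RE = re.compile(r'^([^\[\]]+)\[([^\[\]]+)\]$')


def split_top_level(node_list: str) -> List[str]:
    """Split on commas that are not inside square brackets."""
    parts, depth, current = [], 0, ''
    for char in node_list:
        if char == ',' and depth == 0:
            parts.append(current)
            current = ''
            continue
        if char == '[':
            depth += 1
        elif char == ']':
            depth -= 1
        current += char
    parts.append(current)
    return parts


def expand_node_list(node_list: str) -> List[str]:
    """Expand e.g. 'node[05-07],node23' into individual node names."""
    nodes = []
    for part in split_top_level(node_list.strip()):
        match = BRACKET_RE.match(part)
        if match is None:
            if not part or '[' in part or ']' in part:
                raise ValueError(f"Malformed node list section: {part!r}")
            nodes.append(part)
            continue
        prefix, ranges = match.groups()
        for item in ranges.split(','):
            start, sep, end = item.partition('-')
            if not sep:
                end = start  # A single number such as '12'.
            if not (start.isdigit() and end.isdigit()) or int(end) < int(start):
                raise ValueError(f"Bad range {item!r} in {part!r}")
            width = len(start)  # Preserve zero padding, as in 'node05'.
            nodes.extend(f"{prefix}{num:0{width}d}"
                         for num in range(int(start), int(end) + 1))
    return nodes


expanded = expand_node_list('node[05-07,12],node23')
```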

Raw

Raw Scheduler

The Raw (local system) scheduler.

CANCEL_TIMEOUT = 1

KICKOFF_SCRIPT_HEADER_CLASS: alias of RawKickoffHeader

UNIQ_ID_LEN = 10

VAR_CLASS: alias of RawSchedulerVariables

__abstractmethods__ = frozenset({})

__annotations__ = {'VAR_CLASS': 'Type[SchedulerVariables]', '_job_statuses': 'JobStatusDict'}

__init__(): Scheduler plugin that is expected to be overriden by subclasses. The plugin will populate a set of expected ‘sched’ variables.

__module__ = 'pavilion.schedulers.plugins.raw'

_abc_impl = <_abc._abc_data object>

_available() → bool: The raw scheduler is always available.

_get_alloc_node_info(node_name) → NodeInfo: Return mem and cpu info for this host.

_get_alloc_nodes(job) → NodeList: Return just the hostname of this host.

_job_status(pav_cfg, job_info: JobInfo) → TestStatusInfo | None: Raw jobs will either be scheduled (waiting on a concurrency lock), or in an unknown state (as there aren’t records of dead jobs).

_kickoff(pav_cfg, job: Job, sched_config: dict, job_name: str, nodes: NodeList | None = None, node_range: Tuple[int, int] | None = None) → JobInfo: Run the kickoff script in a separate process. The job id a combination of the hostname and pid.

static _pid_running(job_info: JobInfo) → bool

Verify that the test is running under the given pid. Note that this may change before, after, or during this call.

Returns: True - If the given pid is for the given test_id (False otherwise)
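A hostname-plus-pid job id and a liveness check can be sketched as follows on a POSIX system. The underscore separator and the helper names are assumptions for this example. The sketch only checks that the pid exists on this host; confirming that the pid still belongs to the given test (rather than a reused pid) needs extra information that is not shown here.

```python
import os
import socket


def make_job_id(pid: int) -> str:
    """Combine hostname and pid, e.g. 'myhost_12345' (separator is assumed)."""
    return f"{socket.gethostname()}_{pid}"


def pid_exists(pid: int) -> bool:
    """Check whether a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # Signal 0 performs error checking only.
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # The process exists but belongs to another user.
    return True


def job_running_here(job_id: str) -> bool:
    """True if the job id names this host and a live pid."""
    host, _, pid = job_id.rpartition('_')
    return host == socket.gethostname() and pid.isdigit() and pid_exists(int(pid))


own_job_id = make_job_id(os.getpid())
running = job_running_here(own_job_id)
```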

available(): The raw scheduler is always available.

cancel(job_info: JobInfo) → None | str: Try to kill the given job_id (if it is the right pid).
