Schedulers¶
Scheduler Module¶
Scheduler plugins give you the ability to (fairly) easily add new scheduling mechanisms to Pavilion.
Scheduler Plugin Class¶
-
class
pavilion.schedulers.
SchedulerVariables
(scheduler, sched_config)¶ Bases:
pavilion.var_dict.VarDict
The base scheduler variables class. Each scheduler should have a child class of this that contains all the variable functions it provides.
To add a scheduler variable, create a method and decorate it with either ‘@sched_var’ or ‘@dfr_sched_var()’. The method name will be the variable name, and the method will be called to resolve the variable value. Methods that start with ‘_’ are ignored.
Naming Conventions:
- ‘alloc_*’
- Variable names should be prefixed with ‘alloc_’ if they are deferred.
- ‘test_*’
- Variable names prefixed with test denote that the variable is specific to a test. These also tend to be deferred.
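The pattern described above can be sketched as follows. This is a self-contained illustration, not Pavilion's implementation: the `sched_var` decorator and the `ToyVarDict` base class below are simplified stand-ins written for this example, and only the behavior (decorated methods become variables, names starting with ‘_’ are ignored) is taken from the text.

```python
def sched_var(func):
    """Stand-in decorator (hypothetical): mark a method as a variable."""
    func.is_sched_var = True
    return func


class ToyVarDict:
    """Simplified substitute for a VarDict-style base class (assumption)."""

    def keys(self):
        """Return the names of all decorated, non-underscore methods."""
        cls = type(self)
        names = []
        for name in dir(cls):
            if name.startswith('_'):
                continue  # Methods starting with '_' are never variables.
            if getattr(getattr(cls, name), 'is_sched_var', False):
                names.append(name)
        return sorted(names)

    def __getitem__(self, key):
        """Resolve a variable by calling the method of the same name."""
        if key not in self.keys():
            raise KeyError(key)
        return str(getattr(self, key)())


class ToySchedVars(ToyVarDict):
    """Example child class with two variables and one ignored helper."""

    @sched_var
    def nodes(self):
        return 4

    @sched_var
    def alloc_nodes(self):
        # 'alloc_' prefix: in a real scheduler this would be deferred.
        return 2

    def _helper(self):
        return 'not a variable'
```

Here `ToySchedVars()['nodes']` calls the `nodes` method to resolve the value, while `_helper` never shows up as a variable.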
-
BYTE_SIZE_UNITS
= {'': 1, 'B': 1, 'GB': 1000000000, 'GiB': 1073741824, 'KiB': 1024, 'MB': 1000000, 'MiB': 1048576, 'kB': 1000}¶
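One plausible use of a unit table like this is turning a size string into a byte count. The sketch below copies the table from above; the `parse_byte_size` helper and its regular expression are hypothetical names introduced for illustration and are not part of the documented class.

```python
import re

# Copied from the documented BYTE_SIZE_UNITS attribute.
BYTE_SIZE_UNITS = {
    '': 1, 'B': 1,
    'kB': 1000, 'KiB': 1024,
    'MB': 1000000, 'MiB': 1048576,
    'GB': 1000000000, 'GiB': 1073741824,
}

# Hypothetical format: an integer, optional whitespace, optional unit.
_SIZE_RE = re.compile(r'^\s*(\d+)\s*([a-zA-Z]*)\s*$')


def parse_byte_size(text):
    """Convert a string such as '4GiB' or '2 kB' into a number of bytes."""
    match = _SIZE_RE.match(text)
    if match is None:
        raise ValueError("Invalid size: {!r}".format(text))
    value, unit = match.groups()
    if unit not in BYTE_SIZE_UNITS:
        raise ValueError("Unknown unit: {!r}".format(unit))
    return int(value) * BYTE_SIZE_UNITS[unit]
```

Note that the table is case sensitive: ‘kB’ and ‘KiB’ are accepted, but ‘KB’ is not.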
-
__abstractmethods__
= frozenset()¶
-
__init__
(scheduler, sched_config)¶ Initialize the scheduler var dictionary.
Parameters: - scheduler (SchedulerPlugin) – The scheduler for this set of variables.
- sched_config (dict) – The test object for which this set of variables is relevant.
-
__module__
= 'pavilion.schedulers'¶
-
__repr__
()¶ Return repr(self).
-
_abc_impl
= <_abc_data object>¶
-
min_cpus
()¶ Get the minimum number of CPUs available on the local system. Defaults to 1 on error (and logs the error).
-
min_mem
()¶ Get the minimum amount of memory for the system, in gibibytes (GiB). Returns 1 on error (and logs the error).
-
sched_data
¶ A convenience function for getting data from the scheduler.
Scheduler Plugins¶
Slurm¶
Slurm Variables¶
-
class
pavilion.plugins.sched.slurm.
SlurmVars
(scheduler, sched_config)¶ Bases:
pavilion.schedulers.SchedulerVariables
Scheduler variables for the Slurm scheduler.
-
alloc_cpu_total
()¶ Total CPUs across all nodes in this allocation.
-
alloc_max_mem
()¶ Max mem per node for this allocation. (in MiB)
-
alloc_max_ppn
()¶ Max ppn for this allocation.
-
alloc_min_mem
()¶ Min mem per node for this allocation. (in MiB)
-
alloc_min_ppn
()¶ Min ppn for this allocation.
-
alloc_node_list
()¶ A space-separated list of nodes in this allocation.
-
alloc_nodes
()¶ The number of nodes in this allocation.
-
max_mem
()¶ The maximum memory per node across all nodes (in MiB).
-
max_ppn
()¶ The maximum processors per node across all nodes.
-
min_mem
()¶ The minimum memory per node across all nodes (in MiB).
-
min_ppn
()¶ The minimum processors per node across all nodes.
-
node_avail_list
()¶ List of nodes that are in a state considered available. Warning: Tests that use this will fail to start if no nodes are available.
-
node_list
()¶ List of nodes on the system.
-
node_up_list
()¶ List of nodes that are in a state considered available.
-
nodes
()¶ Number of nodes on the system.
-
nodes_avail
()¶ Number of nodes in an ‘avail’ state.
-
nodes_up
()¶ Number of nodes in an ‘avail’ state.
-
test_cmd
()¶ Construct a cmd to run a process under this scheduler, with the criteria specified by this test.
-
test_node_list
()¶ A list of nodes dedicated to this test run.
-
test_nodes
()¶ The number of nodes allocated for this test (may be less than the total in this allocation).
-
test_procs
()¶ The number of processors to request for this test.
-
Slurm Scheduler Plugin¶
-
class
pavilion.plugins.sched.slurm.
SbatchHeader
(sched_config, nodes, test_id, slurm_vars)¶ Bases:
pavilion.scriptcomposer.ScriptHeader
Provides header information specific to sbatch files for the slurm kickoff script.
-
__init__
(sched_config, nodes, test_id, slurm_vars)¶ Build a header for an sbatch file.
Parameters: - sched_config (dict) – The slurm section of the test config.
- nodes (str) – The node list
- test_id (int) – The test’s id.
- slurm_vars (dict) – The test variables.
-
__module__
= 'pavilion.plugins.sched.slurm'¶
-
get_lines
()¶ Get the sbatch header lines.
-
-
class
pavilion.plugins.sched.slurm.
Slurm
¶ Bases:
pavilion.schedulers.SchedulerPlugin
Schedule tests with Slurm!
-
KICKOFF_SCRIPT_EXT
= '.sbatch'¶
-
NODE_FIELD_TYPES
= {'ActiveFeatures': <function Slurm.<lambda>>, 'AllocMemory': <class 'int'>, 'AvailableFeatures': <function Slurm.<lambda>>, 'CPUAlloc': <class 'int'>, 'CPULoad': <function slurm_float>, 'CPUTot': <class 'int'>, 'FreeMemory': <class 'int'>, 'Partitions': <function Slurm.<lambda>>, 'RealMemory': <class 'int'>, 'State': <function slurm_states>}¶
-
NODE_LIST_RE
= re.compile('[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?(?:,[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?)*$')¶
-
NODE_SEQ_REGEX_STR
= '[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?'¶
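To see what these two patterns accept, the sketch below rebuilds them from the values shown above and checks a few strings. The `looks_like_node_list` helper is a hypothetical name used only for this illustration.

```python
import re

# Same pattern strings as the documented attributes above.
NODE_SEQ_REGEX_STR = r'[a-zA-Z][a-zA-Z_-]*\d*(?:\[\d+-\d+\])?'
NODE_LIST_RE = re.compile(
    NODE_SEQ_REGEX_STR + r'(?:,' + NODE_SEQ_REGEX_STR + r')*$')


def looks_like_node_list(text):
    """Return True if text matches the node list pattern."""
    return NODE_LIST_RE.match(text) is not None
```

With this pattern, ‘n00[20-99],n0101’ and ‘node1,node2’ match, while a name that starts with a digit or a bracket holding a comma-separated set (such as ‘n[1-3,5]’) does not.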
-
SCHED_CANCELLED
= ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']¶
-
SCHED_ERROR
= ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']¶
-
SCHED_OTHER
= ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED', 'RESIZING', 'SIGNALING', 'SUSPENDED']¶
-
SCHED_RUN
= ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']¶
-
SCHED_WAITING
= ['CONFIGURING', 'PENDING']¶
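A rough sketch of how state lists like these could be used to bucket a Slurm job state. The lists are copied from the attributes above; the `classify_state` function, its labels, and the order of the checks are assumptions made for this example. The order matters because ‘DEADLINE’ and ‘PREEMPTED’ appear in both SCHED_CANCELLED and SCHED_ERROR.

```python
SCHED_WAITING = ['CONFIGURING', 'PENDING']
SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']
SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY',
               'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']
SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']
SCHED_OTHER = ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED',
               'RESIZING', 'SIGNALING', 'SUSPENDED']


def classify_state(state):
    """Map a Slurm job state to a coarse label (labels are hypothetical)."""
    # Checked in this order; error wins over cancelled for shared states.
    buckets = (
        ('waiting', SCHED_WAITING),
        ('running', SCHED_RUN),
        ('error', SCHED_ERROR),
        ('cancelled', SCHED_CANCELLED),
        ('other', SCHED_OTHER),
    )
    for label, states in buckets:
        if state in states:
            return label
    return 'unknown'
```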
-
SCONTROL_KEY_RE
= re.compile('(?:^|\\s+)([A-Z][a-zA-Z0-9:/]*)=')¶
-
SCONTROL_WS_RE
= re.compile('\\s+')¶
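The two scontrol patterns above suggest a key/value parse of `scontrol show` output. The following is a minimal sketch under that assumption: `parse_section` is a hypothetical helper, and the sample text in the usage note is made up for illustration rather than captured from a real system.

```python
import re

# Same patterns as the documented attributes above.
SCONTROL_KEY_RE = re.compile(r'(?:^|\s+)([A-Z][a-zA-Z0-9:/]*)=')
SCONTROL_WS_RE = re.compile(r'\s+')


def parse_section(section):
    """Split 'Key=value Key2=value with spaces' text into a dict."""
    # Collapse newlines and runs of spaces into single spaces first.
    section = SCONTROL_WS_RE.sub(' ', section.strip())
    matches = list(SCONTROL_KEY_RE.finditer(section))
    result = {}
    for idx, match in enumerate(matches):
        # A value runs from the end of its key to the start of the next key.
        if idx + 1 < len(matches):
            end = matches[idx + 1].start()
        else:
            end = len(section)
        result[match.group(1)] = section[match.end():end]
    return result
```

For example, `parse_section('NodeName=n001 CPUTot=36\n   OS=Linux 4.14 #1 SMP State=IDLE')` keeps the multi-word OS value together, because a new key only starts where a capitalized word is followed directly by ‘=’.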
-
__init__
()¶ Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.
-
__module__
= 'pavilion.plugins.sched.slurm'¶
-
_cancel_job
(test)¶ Scancel the job attached to the given test.
Parameters: test (pavilion.test_run.TestRun) – The test to cancel.
Returns: A StatusInfo object with the latest scheduler state.
Return type: StatusInfo
-
_collect_node_data
(nodes=None)¶ Use the scontrol show node command to collect data on nodes. Types are converted according to self.NODE_FIELD_TYPES.
Parameters: nodes (str) – The nodes to collect data on. If None, collect data on all nodes. The format is the standard Slurm node list, which can include compressed series, e.g. ‘n00[20-99],n0101’.
Returns: A dict of node dictionaries.
Return type: dict
-
_filter_nodes
(min_nodes, config, nodes)¶ Filter the system nodes down to just those we can use. For each step, we check to make sure we still have the minimum nodes needed in order to give more relevant errors.
Parameters: - min_nodes (int) – The minimum number of nodes desired. This will
- config (dict) – The scheduler config for a test.
- nodes ([list]) – Nodes (as defined by collect node data)
Returns: A list of node names that are compatible with the given config.
Return type: list
-
_get_data
()¶ Get the slurm node state information.
Returns: A dict with individual node and summary information.
Return type: dict
-
_get_kickoff_script_header
(test)¶ Get the kickoff header. Most of the work here
-
_get_node_range
(sched_config, nodes)¶ Translate user requests for a number of nodes into a numerical range based on the number of nodes on the actual system.
Parameters: - sched_config (dict) – The scheduler config for a particular test.
- nodes (list) – A list of nodes.
Return type: str
Returns: A range suitable for the num_nodes argument of slurm.
-
_in_alloc
()¶ Check if we’re in an allocation.
-
static
_make_summary
(nodes)¶ Get aggregate data about the given nodes. This includes:
- min_ppn - min procs per node
- max_ppn - max procs per node
- min_mem - min mem per node (in MiB)
- max_mem - max mem per node (in MiB)
- total_cpu - total CPUs on these nodes.
Parameters: nodes (typing.Iterable) – Node dictionaries as returned by _collect_node_data.
Return type: dict
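The aggregation listed above can be sketched as below. This is an illustration only: it assumes each node dictionary carries integer ‘CPUTot’ and ‘RealMemory’ fields (two of the keys named in NODE_FIELD_TYPES) and that ‘RealMemory’ is already in MiB. The function name `make_summary` is a stand-in for the documented method.

```python
def make_summary(nodes):
    """Aggregate per-node CPU and memory figures into summary values."""
    nodes = list(nodes)  # Accept any iterable, including generators.
    if not nodes:
        raise ValueError("Cannot summarize an empty set of nodes.")

    # Assumed field names; see the lead-in above.
    cpus = [node['CPUTot'] for node in nodes]
    mems = [node['RealMemory'] for node in nodes]

    return {
        'min_ppn': min(cpus),
        'max_ppn': max(cpus),
        'min_mem': min(mems),
        'max_mem': max(mems),
        'total_cpu': sum(cpus),
    }
```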
-
classmethod
_parse_node_list
(node_list)¶ Convert a slurm format node list into a list of nodes, and throw errors that help the user identify their exact mistake.
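A minimal sketch of this kind of expansion, assuming the node list follows the shape accepted by NODE_SEQ_REGEX_STR (a name, optional digits, and an optional ‘[start-end]’ range). The `expand_node_list` function and its error messages are hypothetical, and keeping the zero padding of the range start is a choice made for this example.

```python
import re

# One node sequence: base name plus an optional [start-end] range.
_SEQ_RE = re.compile(r'^([a-zA-Z][a-zA-Z_-]*\d*)(?:\[(\d+)-(\d+)\])?$')


def expand_node_list(node_list):
    """Expand 'n00[20-22],n0101' into a list of individual node names."""
    nodes = []
    for part in node_list.split(','):
        match = _SEQ_RE.match(part)
        if match is None:
            raise ValueError("Invalid node sequence: {!r}".format(part))

        base, start, end = match.groups()
        if start is None:
            nodes.append(base)  # A plain node name with no range.
            continue

        if int(start) > int(end):
            raise ValueError("Range start is after its end: {!r}".format(part))

        width = len(start)  # Preserve zero padding, e.g. '[08-10]'.
        for num in range(int(start), int(end) + 1):
            nodes.append('{}{:0{}d}'.format(base, num, width))
    return nodes
```

Raising an error that names the offending piece, rather than the whole string, is one way to help the user identify their exact mistake.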
-
_schedule
(test, kickoff_path)¶ Submit the kick off script using sbatch.
Parameters: - test (TestRun) – The TestRun we’re kicking off.
- kickoff_path (Path) – The kickoff script path.
-
_scontrol_parse
(section)¶
-
_scontrol_show
(*args, timeout=10)¶ Run scontrol show and return the parsed output.
Parameters: - args (list(str)) – Additional args to scontrol.
- timeout (int) – How long to wait for results.
-
get_conf
()¶ Set up the Slurm configuration attributes.
-
job_status
(pav_cfg, test)¶ Get the current status of the slurm job for the given test.
-
Raw¶
Raw Variables¶
-
class
pavilion.plugins.sched.raw.
RawVars
(scheduler, sched_config)¶ Bases:
pavilion.schedulers.SchedulerVariables
Variables for running tests locally on a system.
-
MEM_UNITS
= {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}¶
-
avail_mem
()¶ Available memory in MiB to the nearest MiB.
-
cpus
()¶ Total CPUs (includes hyperthreading cpus).
-
free_mem
()¶ Free memory in MiB to the nearest MiB.
-
mem_to_mib
(key)¶ Get a meminfo value from the meminfo dict, and convert it to a standard unit (MiB).
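As a rough illustration of the conversion, the sketch below parses text in the style of /proc/meminfo and converts a value to MiB using the MEM_UNITS table shown above. The `parse_meminfo` helper, the shape of the meminfo dict, and the standalone `mem_to_mib` signature are assumptions for this example; the multipliers are taken from the table as documented.

```python
# Copied from the documented MEM_UNITS attribute.
MEM_UNITS = {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}


def parse_meminfo(text):
    """Parse 'Key:   value unit' lines into {key: (value, unit)}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(':')
        parts = rest.split()
        if not sep or not parts:
            continue  # Skip blank or malformed lines.
        unit = parts[1].lower() if len(parts) > 1 else None
        info[key.strip()] = (int(parts[0]), unit)
    return info


def mem_to_mib(info, key):
    """Convert one meminfo entry to MiB, rounded to the nearest MiB."""
    value, unit = info[key]
    return round(value * MEM_UNITS[unit] / 1024 ** 2)
```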
-
total_mem
()¶ Total memory in MiB to the nearest MiB.
-
Raw Scheduler¶
-
class
pavilion.plugins.sched.raw.
Raw
¶ Bases:
pavilion.schedulers.SchedulerPlugin
-
CANCEL_TIMEOUT
= 1¶
-
__init__
()¶ Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.
-
__module__
= 'pavilion.plugins.sched.raw'¶
-
_cancel_job
(test)¶ Try to kill the given test’s pid (if it is the right pid).
Parameters: test (pavilion.test_run.TestRun) – The test to cancel.
-
_filter_nodes
()¶ Do nothing, and like it.
-
_get_data
()¶ Mostly we need the number of cpus and memory information.
-
_in_alloc
()¶ In raw mode, we’re always in an allocation.
-
_schedule
(test_obj, kickoff_path)¶ Run the kickoff script in a separate process. The job id is a combination of the hostname and pid.
Parameters: - test_obj (pavilion.test_config.TestRun) – The test to schedule.
- kickoff_path (Path) – Path to the submission script.
Returns: ‘<host>_<pid>’
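The job id format can be illustrated with a small pair of helpers. Both `make_job_id` and `split_job_id` are hypothetical names for this sketch; only the ‘<host>_<pid>’ layout comes from the text above.

```python
import socket


def make_job_id(pid, host=None):
    """Build a '<host>_<pid>' job id, defaulting to the local hostname."""
    if host is None:
        host = socket.gethostname()
    return '{}_{}'.format(host, pid)


def split_job_id(job_id):
    """Split a job id back into (host, pid)."""
    # Split on the last underscore, since hostnames may contain one.
    host, sep, pid = job_id.rpartition('_')
    if not sep or not host or not pid.isdigit():
        raise ValueError("Invalid job id: {!r}".format(job_id))
    return host, int(pid)
```

Including the hostname means the pid can be checked on the machine that started the test, rather than being mistaken for an unrelated process on another host.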
-
static
_verify_pid
(pid, test_id)¶ Verify that the test is running under the given pid. Note that this may change before, after, or during this call.
Parameters: - pid (str) – The pid to search for.
- test_id (int) – The id of the test started under that pid.
Returns: True - If the given pid is for the given test_id (False otherwise)
-
get_conf
()¶ Define the configuration attributes.
-
job_status
(pav_cfg, test)¶ Raw jobs will either be scheduled (waiting on a concurrency lock), or in an unknown state (as there aren’t records of dead jobs).
Return type: StatusInfo
-
lock_concurrency
(pav_cfg, test)¶ Acquire the concurrency lock for this scheduler, if necessary.
Parameters: - pav_cfg – The pavilion configuration.
- test (pavilion.pav_config.test.TestRun) – The pavilion test to lock concurrency for.
-