Schedulers¶
Scheduler Module¶
Scheduler Plugin Class¶
-
class
pavilion.schedulers.
SchedulerPlugin
(name, description, priority=0)¶ Bases:
yapsy.IPlugin.IPlugin
The base scheduler plugin class. Scheduler plugins should inherit from this.
-
KICKOFF_SCRIPT_EXT
= '.sh'¶ The extension for the kickoff script.
-
PRIO_COMMON
= 10¶
-
PRIO_CORE
= 0¶
-
PRIO_USER
= 20¶
-
VAR_CLASS
¶ The scheduler’s variable class.
alias of
SchedulerVariables
-
__init__
(name, description, priority=0)¶ Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.
-
__module__
= 'pavilion.schedulers'¶
-
static
_add_schedule_script_body
(script, test)¶ Add the script body to the given script object. This default simply adds a comment and the test run command.
-
_cancel_job
(test)¶ Override in scheduler plugins to handle cancelling a job.
Parameters: test (pavilion.test_run.TestRun) – The test to cancel. Returns: Whether we’re confident the job was canceled, and an explanation. Return type: StatusInfo
-
_create_kickoff_script
(pav_cfg, test_obj)¶ Function to accept a list of lines and generate a script that is then submitted to the scheduler.
Parameters: test_obj (pavilion.test_config.TestRun) –
-
_do_lock_concurrency
(pav_cfg, test)¶ Acquire the concurrency lock for this scheduler, if necessary.
Parameters: - pav_cfg – The pavilion configuration.
- test (pavilion.pav_config.test.TestRun) – The pavilion test to lock concurrency for.
-
_filter_nodes
(*args, **kwargs)¶ Filter the system nodes down to just those we can use. This should check to make sure the nodes available are compatible with the test. The arguments for this function will vary by scheduler.
Returns: A list of compatible node names. Return type: list
-
_get_data
()¶ Child classes should override this and use it as a way to gather broad amounts of data about the scheduling system. The resulting data structure is generally expected to be a dictionary, though that’s entirely up to the scheduler plugin.
Return type: dict
-
_get_kickoff_script_header
(test)¶
-
_in_alloc
()¶ The plugin specific implementation of ‘in_alloc’.
-
_kickoff_script_path
(test)¶
-
static
_now
()¶ Convenience method for getting a reasonable current time object.
-
_schedule
(test_obj, kickoff_path)¶ Run the kickoff script at script path with this scheduler.
Parameters: - test_obj (pavilion.test_config.TestRun) – The test to schedule.
- kickoff_path (Path) – Path to the submission script.
Return str: Job ID number.
-
activate
()¶ Add this plugin to the scheduler plugin list.
-
available
()¶ Returns true if this scheduler is available on this host.
Return type: bool
-
cancel_job
(test)¶ Tell the scheduler to cancel the given test, if it can. This should simply try its best for the test given, and note in the test status (with a SCHED_ERROR) if there were problems. Update the test status to SCHED_CANCELLED if it succeeds.
Parameters: test (pavilion.test_run.TestRun) – The test to cancel. Returns: A status info object describing the state. If we actually cancel the job the test status will be set to SCHED_CANCELLED. This should return SCHED_ERROR when something goes wrong. Return type: StatusInfo
-
deactivate
()¶ Remove this plugin from the scheduler plugin list.
-
get_conf
()¶ Return the configuration object suitable for adding to the test configuration.
-
get_data
(refresh=False)¶ Get data relevant to this scheduler. This is a wrapper method; child classes should override _get_data instead. This simply ensures we only gather the data once.
Returns: A dictionary of gathered scheduler data. Return type: dict
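The wrapper pattern described for get_data can be sketched as follows. This is a minimal stand-in written for illustration, not the Pavilion class itself: the class name CachingScheduler, the _data attribute, and the gather_count counter are assumptions, and the returned dictionary is made-up sample data.

```python
class CachingScheduler:
    """Stand-in for a scheduler plugin (hypothetical, not Pavilion code)."""

    def __init__(self):
        self._data = None       # Cached result of the last _get_data() call.
        self.gather_count = 0   # Present only to make the caching visible.

    def _get_data(self):
        """Child classes would query the scheduling system here."""
        self.gather_count += 1
        return {'nodes': {'node001': {'CPUTot': 36}}}  # Sample data only.

    def get_data(self, refresh=False):
        """Gather data on first use, or again when refresh is True."""
        if self._data is None or refresh:
            self._data = self._get_data()
        return self._data


sched = CachingScheduler()
sched.get_data()
sched.get_data()                # Served from the cache.
sched.get_data(refresh=True)    # Forces a second gather.
```

Under this sketch, only the first call and the refresh=True call reach _get_data.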
-
get_vars
(sched_config)¶ Returns the dictionary of scheduler variables.
Parameters: sched_config (dict) – The scheduler config for a given test.
-
in_alloc
¶ Determines whether we’re on a scheduled node.
-
job_status
(pav_cfg, test)¶ Get the job state from the scheduler, and map it to one of the following states: SCHEDULED, SCHED_ERROR, SCHED_CANCELLED. This may also simply re-fetch the latest state from the state file, and return that if necessary.
Parameters: - pav_cfg – The pavilion configuration.
- test (pavilion.test_run.TestRun) – The test we’re checking on.
Returns: A StatusInfo object representing the status.
Return type: StatusInfo
-
lock_concurrency
(pav_cfg, test)¶ Acquire the concurrency lock for this scheduler, if necessary.
Parameters: - pav_cfg – The pavilion config.
- test – A test object
-
run_suite
(tests)¶ Run each of the given tests using a single allocation. This is effectively a placeholder.
-
schedule_test
(pav_cfg, test_obj)¶ Create the test script and schedule the job.
Parameters: - pav_cfg – The pavilion cfg.
- test_obj (pavilion.test_run.TestRun) – The pavilion test to start.
-
schedule_tests
(pav_cfg, tests)¶ Schedule each of the given tests using this scheduler using a separate allocation (if applicable) for each.
Parameters: - pav_cfg – The pavilion config
- tests ([pavilion.test_run.TestRun]) – A list of pavilion tests to schedule.
-
static
unlock_concurrency
(lock)¶ Unlock the concurrency lock, if one exists.
Parameters: lock (Union(Lockfile, None)) –
-
Scheduler Variables¶
-
class
pavilion.schedulers.
SchedulerVariables
(scheduler, sched_config)¶ Bases:
pavilion.var_dict.VarDict
The base scheduler variables class. Each scheduler should have a child class of this that contains all the variable functions it provides.
To add a scheduler variable, create a method and decorate it with either @sched_var or @dfr_sched_var(). The method name will be the variable name, and the method will be called to resolve the variable value. Methods that start with ‘_’ are ignored.
Naming Conventions:
- ‘alloc_*’
- Variable names should be prefixed with ‘alloc_’ if they are deferred.
- ‘test_*’
- Variable names prefixed with test denote that the variable is specific to a test. These also tend to be deferred.
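The decorator convention above can be illustrated with a self-contained stand-in. The decorators and the DemoVars class below are hypothetical re-creations written for this example; they only mimic the documented behavior (method name becomes variable name, methods starting with ‘_’ are ignored) and are not the implementations in pavilion.schedulers.

```python
def sched_var(func):
    """Stand-in decorator: mark a method as a scheduler variable."""
    func.is_sched_var = True
    func.is_deferred = False
    return func


def dfr_sched_var():
    """Stand-in decorator factory: mark a method as a deferred variable."""
    def decorator(func):
        func.is_sched_var = True
        func.is_deferred = True
        return func
    return decorator


class DemoVars:
    """Hypothetical variable class; it does not inherit from VarDict."""

    @sched_var
    def nodes(self):
        return '5'

    @dfr_sched_var()
    def alloc_nodes(self):
        # Deferred, so it follows the 'alloc_' naming convention.
        return '2'

    def _helper(self):
        # Starts with '_' and is undecorated, so it is never a variable.
        return 'ignored'

    def keys(self):
        """Names of decorated methods that do not start with '_'."""
        return sorted(
            name for name in dir(self)
            if not name.startswith('_')
            and getattr(getattr(self, name), 'is_sched_var', False))

    def __getitem__(self, key):
        """Resolve a variable by calling the method of the same name."""
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)()


demo = DemoVars()
print(demo.keys(), demo['nodes'])
```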
-
BYTE_SIZE_UNITS
= {'': 1, 'B': 1, 'GB': 1000000000, 'GiB': 1073741824, 'KiB': 1024, 'MB': 1000000, 'MiB': 1048576, 'kB': 1000}¶
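One plausible use of a table like BYTE_SIZE_UNITS is converting a size string into bytes. The sketch below copies the table from above; the to_bytes function and the SIZE_RE pattern are assumptions made for illustration and may not match how Pavilion consumes the table.

```python
import re

BYTE_SIZE_UNITS = {
    '': 1, 'B': 1,
    'kB': 1000, 'MB': 1000000, 'GB': 1000000000,
    'KiB': 1024, 'MiB': 1048576, 'GiB': 1073741824,
}

# Hypothetical pattern: an integer followed by an optional unit suffix.
SIZE_RE = re.compile(r'^\s*(\d+)\s*([a-zA-Z]*)\s*$')


def to_bytes(size):
    """Convert a string such as '4GiB' or '512' to a number of bytes."""
    match = SIZE_RE.match(size)
    if match is None:
        raise ValueError("Invalid size: '{}'".format(size))
    number, unit = match.groups()
    if unit not in BYTE_SIZE_UNITS:
        raise ValueError("Unknown unit '{}' in '{}'".format(unit, size))
    return int(number) * BYTE_SIZE_UNITS[unit]


print(to_bytes('2 kB'))
```

Note that the table is case sensitive: ‘kB’ and ‘KiB’ are valid keys, while ‘KB’ and ‘kib’ are not.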
-
EXAMPLE
= {'min_cpus': '3', 'min_mem': '123412'}¶ Each scheduler variable class should provide an example set of values for itself to display when using ‘pav show’ to list the variables. These are easily obtained by running a test under the scheduler, and then harvesting the results of the test run.
-
NO_EXAMPLE
= '<no example>'¶
-
__abstractmethods__
= frozenset()¶
-
__init__
(scheduler, sched_config)¶ Initialize the scheduler var dictionary.
Parameters: - scheduler (SchedulerPlugin) – The scheduler for this set of variables.
- sched_config (dict) – The test object for which this set of variables is relevant.
-
__module__
= 'pavilion.schedulers'¶
-
__repr__
()¶ Return repr(self).
-
_abc_impl
= <_abc_data object>¶
-
info
(key)¶ Get the info dict for the given key, and add the example to it.
-
min_cpus
()¶ Get a minimum number of cpus we have available on the local system. Defaults to 1 on error (and logs the error).
-
min_mem
()¶ Get a minimum amount of memory for the system, in Gibibytes. Returns 1 on error (and logs the error).
-
sched_data
¶ A convenience function for getting data from the scheduler.
-
test_cmd
()¶ The command to prepend to a line to kick it off under the scheduler. This is blank by default, but most schedulers will want to define something that utilizes relevant scheduler parameters.
Scheduler Plugins¶
Slurm¶
Slurm Variables¶
-
class
pavilion.plugins.sched.slurm.
SlurmVars
(scheduler, sched_config)¶ Bases:
pavilion.schedulers.SchedulerVariables
Scheduler variables for the Slurm scheduler.
-
EXAMPLE
= {'alloc_cpu_total': '36', 'alloc_max_mem': '128842', 'alloc_max_ppn': '36', 'alloc_min_mem': '128842', 'alloc_min_ppn': '36', 'alloc_node_list': ['node004', 'node005'], 'alloc_nodes': '2', 'max_mem': '128842', 'max_ppn': '36', 'min_mem': '128842', 'min_ppn': '36', 'node_avail_list': ['node003', 'node004', 'node005'], 'node_list': ['node001', 'node002', 'node003', 'node004', 'node005'], 'node_up_list': ['node002', 'node003', 'node004', 'node005'], 'nodes': '371', 'nodes_avail': '3', 'nodes_up': '350', 'test_cmd': 'srun -N 2 -n 2', 'test_node_list': ['node004', 'node005'], 'test_nodes': '2', 'test_procs': '2'}¶
-
alloc_cpu_total
()¶ Total CPUs across all nodes in this allocation.
-
alloc_max_mem
()¶ Max mem per node for this allocation. (in MiB)
-
alloc_max_ppn
()¶ Max ppn for this allocation.
-
alloc_min_mem
()¶ Min mem per node for this allocation. (in MiB)
-
alloc_min_ppn
()¶ Min ppn for this allocation.
-
alloc_node_list
()¶ A space-separated list of nodes in this allocation.
-
alloc_nodes
()¶ The number of nodes in this allocation.
-
max_mem
()¶ The maximum memory per node across all nodes (in MiB).
-
max_ppn
()¶ The maximum processors per node across all nodes.
-
min_mem
()¶ The minimum memory per node across all nodes (in MiB).
-
min_ppn
()¶ The minimum processors per node across all nodes.
-
node_avail_list
()¶ List of nodes that are in a state that is considered available. Warning: Tests that use this will fail to start if no nodes are available.
-
node_list
()¶ List of nodes on the system.
-
node_up_list
()¶ List of nodes that are in a state that is considered available.
-
nodes
()¶ Number of nodes on the system.
-
nodes_avail
()¶ Number of nodes in an ‘avail’ state.
-
nodes_up
()¶ Number of nodes in an ‘avail’ state.
-
test_cmd
()¶ Construct a cmd to run a process under this scheduler, with the criteria specified by this test.
-
test_node_list
()¶ A list of nodes dedicated to this test run.
-
test_nodes
()¶ The number of nodes allocated for this test (may be less than the total in this allocation).
-
test_procs
()¶ The number of processors to request for this test.
-
Slurm Scheduler Plugin¶
-
class
pavilion.plugins.sched.slurm.
SbatchHeader
(sched_config, nodes, test_id, slurm_vars)¶ Bases:
pavilion.scriptcomposer.ScriptHeader
Provides header information specific to sbatch files for the slurm kickoff script.
-
__init__
(sched_config, nodes, test_id, slurm_vars)¶ Build a header for an sbatch file.
Parameters: - sched_config (dict) – The slurm section of the test config.
- nodes (str) – The node list
- test_id (int) – The test’s id.
- slurm_vars (dict) – The test variables.
-
__module__
= 'pavilion.plugins.sched.slurm'¶
-
get_lines
()¶ Get the sbatch header lines.
-
-
class
pavilion.plugins.sched.slurm.
Slurm
¶ Bases:
pavilion.schedulers.SchedulerPlugin
Schedule tests with Slurm!
-
KICKOFF_SCRIPT_EXT
= '.sbatch'¶
-
NODE_FIELD_TYPES
= {'ActiveFeatures': <function Slurm.<lambda>>, 'AllocMemory': <class 'int'>, 'AvailableFeatures': <function Slurm.<lambda>>, 'CPUAlloc': <class 'int'>, 'CPULoad': <function slurm_float>, 'CPUTot': <class 'int'>, 'FreeMemory': <class 'int'>, 'Partitions': <function Slurm.<lambda>>, 'RealMemory': <class 'int'>, 'State': <function slurm_states>}¶
-
NODE_LIST_RE
= re.compile('[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?(?:,[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?)*$')¶
-
NODE_SEQ_REGEX_STR
= '[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[\\d+-\\d+\\])?'¶
-
NUM_NODES_REGEX
= re.compile('^(\\d+|all)(-(\\d+|all))?$')¶
-
SCHED_CANCELLED
= ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']¶
-
SCHED_ERROR
= ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']¶
-
SCHED_OTHER
= ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED', 'RESIZING', 'SIGNALING', 'SUSPENDED']¶
-
SCHED_RUN
= ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']¶
-
SCHED_WAITING
= ['CONFIGURING', 'PENDING']¶
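The state lists above can be used to bucket a raw Slurm job state. The sketch below copies the lists as documented; the classify function and, importantly, the order in which the lists are checked are assumptions. The order matters because DEADLINE and PREEMPTED appear in both SCHED_CANCELLED and SCHED_ERROR.

```python
SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']
SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY',
               'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']
SCHED_OTHER = ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED',
               'RESIZING', 'SIGNALING', 'SUSPENDED']
SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']
SCHED_WAITING = ['CONFIGURING', 'PENDING']

# Assumed check order: the first list containing the state wins.
CHECK_ORDER = (
    ('SCHED_WAITING', SCHED_WAITING),
    ('SCHED_RUN', SCHED_RUN),
    ('SCHED_CANCELLED', SCHED_CANCELLED),
    ('SCHED_ERROR', SCHED_ERROR),
    ('SCHED_OTHER', SCHED_OTHER),
)


def classify(state):
    """Return the name of the first list containing state, or None."""
    for label, states in CHECK_ORDER:
        if state in states:
            return label
    return None


print(classify('PENDING'), classify('TIMEOUT'), classify('PREEMPTED'))
```

With this ordering, a PREEMPTED job is reported as cancelled rather than as an error; swapping the two entries in CHECK_ORDER would reverse that.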
-
SCONTROL_KEY_RE
= re.compile('(?:^|\\s+)([A-Z][a-zA-Z0-9:/]*)=')¶
-
SCONTROL_WS_RE
= re.compile('\\s+')¶
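The two scontrol regular expressions suggest a key=value parser in which a value runs until the next key. The following is a rough sketch of that idea, not the plugin's _scontrol_parse: the parse_section function, the reduced FIELD_TYPES table, and the sample text are all assumptions made for illustration.

```python
import re

SCONTROL_KEY_RE = re.compile(r'(?:^|\s+)([A-Z][a-zA-Z0-9:/]*)=')
SCONTROL_WS_RE = re.compile(r'\s+')

# Illustrative subset of per-field converters (see NODE_FIELD_TYPES).
FIELD_TYPES = {'CPUTot': int, 'RealMemory': int}


def parse_section(section):
    """Parse one 'scontrol show' style record into a dict."""
    # Collapse newlines and indentation into single spaces.
    flat = SCONTROL_WS_RE.sub(' ', section.strip())
    matches = list(SCONTROL_KEY_RE.finditer(flat))
    values = {}
    for idx, match in enumerate(matches):
        # A value extends to the start of the next key (or end of text),
        # which allows values that contain spaces.
        if idx + 1 < len(matches):
            end = matches[idx + 1].start()
        else:
            end = len(flat)
        key = match.group(1)
        raw = flat[match.end():end].strip()
        values[key] = FIELD_TYPES.get(key, str)(raw)
    return values


sample = """NodeName=node004 CPUTot=36
   RealMemory=128842 State=IDLE
   Reason=Not responding"""
print(parse_section(sample))
```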
-
__init__
()¶ Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.
-
__module__
= 'pavilion.plugins.sched.slurm'¶
-
_cancel_job
(test)¶ Scancel the job attached to the given test.
Parameters: test (pavilion.test_run.TestRun) – The test to cancel. Returns: A statusInfo object with the latest scheduler state. Return type: StatusInfo
-
_collect_node_data
(nodes=None)¶ Use the scontrol show node command to collect data on nodes. Types are converted according to self.NODE_FIELD_TYPES.
Parameters: nodes (str) – The nodes to collect data on. If None, collect data on all nodes. The format is the standard slurm node list, which can include compressed series, e.g. ‘n00[20-99],n0101’ Return type: dict Returns: A dict of node dictionaries.
-
_filter_nodes
(min_nodes, config, nodes)¶ Filter the system nodes down to just those we can use. For each step, we check to make sure we still have the minimum nodes needed in order to give more relevant errors.
Parameters: - min_nodes (int) – The minimum number of nodes desired. This will
- config (dict) – The scheduler config for a test.
- nodes ([list]) – Nodes (as defined by collect node data)
Returns: A list of node names that are compatible with the given config.
Return type: list
-
static
_get_config_elems
()¶
-
_get_data
()¶ Get the slurm node state information.
Returns: A dict with individual node and summary information. Return type: dict
-
_get_kickoff_script_header
(test)¶ Get the kickoff header. Most of the work here
-
_get_node_range
(sched_config, nodes)¶ Translate user requests for a number of nodes into a numerical range based on the number of nodes on the actual system.
Parameters: - sched_config (dict) – The scheduler config for a particular test.
- nodes (list) – A list of nodes.
Return type: str
Returns: A range suitable for the num_nodes argument of slurm.
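A translation of this kind might look like the sketch below. It reuses the documented NUM_NODES_REGEX, but the get_node_range function, its signature, the meaning of ‘all’ (taken here to be every node in the list), and the ‘min-max’ output form are assumptions rather than the plugin's actual behavior.

```python
import re

NUM_NODES_REGEX = re.compile(r'^(\d+|all)(-(\d+|all))?$')


def get_node_range(num_nodes, total_nodes):
    """Turn a request like '2-all' into a numeric 'min-max' string."""
    match = NUM_NODES_REGEX.match(num_nodes)
    if match is None:
        raise ValueError("Invalid num_nodes value: '{}'".format(num_nodes))
    low, _, high = match.groups()

    def resolve(value):
        # Assumption: 'all' means every node we were given.
        return total_nodes if value == 'all' else int(value)

    min_nodes = resolve(low)
    # A single value such as '3' is treated as the range '3-3'.
    max_nodes = min_nodes if high is None else resolve(high)
    if min_nodes > max_nodes:
        raise ValueError("Minimum exceeds maximum in '{}'".format(num_nodes))
    if min_nodes > total_nodes:
        raise ValueError("Requested {} nodes, but only {} exist".format(
            min_nodes, total_nodes))
    return '{}-{}'.format(min_nodes, max_nodes)


print(get_node_range('2-all', 5))
```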
-
_in_alloc
()¶ Check if we’re in an allocation.
-
static
_make_summary
(nodes)¶ Get aggregate data about the given nodes. This includes:
- min_ppn - min procs per node
- max_ppn - max procs per node
- min_mem - min mem per node (in MiB)
- max_mem - max mem per node (in MiB)
- total_cpu - Total CPUs on these nodes.
Parameters: nodes (typing.Iterable) – Node dictionaries as returned by _collect_node_data. Return type: dict
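The aggregation can be sketched in a few lines. The field names CPUTot and RealMemory come from NODE_FIELD_TYPES, but which fields the real _make_summary reads is an assumption here, as are the function name and the choice to report 0 for an empty node set.

```python
def make_summary(nodes):
    """Aggregate per-node dicts into min/max/total values."""
    nodes = list(nodes)  # Accept any iterable, including generators.
    cpus = [node['CPUTot'] for node in nodes]
    mems = [node['RealMemory'] for node in nodes]
    return {
        'min_ppn': min(cpus, default=0),
        'max_ppn': max(cpus, default=0),
        'min_mem': min(mems, default=0),
        'max_mem': max(mems, default=0),
        'total_cpu': sum(cpus),
    }


# Made-up node data for illustration.
sample_nodes = [
    {'CPUTot': 36, 'RealMemory': 128842},
    {'CPUTot': 32, 'RealMemory': 64000},
]
print(make_summary(sample_nodes))
```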
-
classmethod
_parse_node_list
(node_list)¶ Convert a slurm format node list into a list of nodes, and throw errors that help the user identify their exact mistake.
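As a hedged illustration of this kind of parsing, the sketch below validates a node list with the documented patterns and then expands bracketed ranges. The parse_node_list function, the RANGE_RE helper, the error messages, and the decision to preserve zero padding from the range start are assumptions, not the plugin's implementation.

```python
import re

NODE_SEQ_REGEX_STR = r'[a-zA-Z][a-zA-Z_-]*\d*(?:\[\d+-\d+\])?'
NODE_LIST_RE = re.compile(
    NODE_SEQ_REGEX_STR + r'(?:,' + NODE_SEQ_REGEX_STR + r')*$')

# Hypothetical helper: split 'n00[20-22]' into prefix, start, and end.
RANGE_RE = re.compile(r'^(.*)\[(\d+)-(\d+)\]$')


def parse_node_list(node_list):
    """Expand a list such as 'n00[20-22],n0101' into node names."""
    if NODE_LIST_RE.match(node_list) is None:
        # Re-check each part so the error points at the exact mistake.
        part_re = re.compile(NODE_SEQ_REGEX_STR + r'$')
        for part in node_list.split(','):
            if part_re.match(part) is None:
                raise ValueError(
                    "Invalid node list part: '{}'".format(part))
        raise ValueError("Invalid node list: '{}'".format(node_list))

    nodes = []
    for part in node_list.split(','):
        match = RANGE_RE.match(part)
        if match is None:
            nodes.append(part)  # A plain node name.
            continue
        prefix, start, end = match.groups()
        if int(start) > int(end):
            raise ValueError("Backwards range in '{}'".format(part))
        width = len(start)  # Keep the zero padding of the range start.
        for num in range(int(start), int(end) + 1):
            nodes.append('{}{:0{}d}'.format(prefix, num, width))
    return nodes


print(parse_node_list('n00[20-22],n0101'))
```

The documented pattern accepts only a single start-end range inside brackets, so an input such as ‘node[1,3]’ is rejected rather than expanded.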
-
_schedule
(test, kickoff_path)¶ Submit the kick off script using sbatch.
Parameters: - test (TestRun) – The TestRun we’re kicking off.
- kickoff_path (Path) – The kickoff script path.
-
_scontrol_parse
(section)¶
-
_scontrol_show
(*args, timeout=10)¶ Run scontrol show and return the parsed output.
Parameters: - args (list(str)) – Additional args to scontrol.
- timeout (int) – How long to wait for results.
-
available
()¶ Looks for several slurm commands, and tests that slurm can talk to the slurm db.
-
get_conf
()¶ Set up the Slurm configuration attributes.
-
job_status
(pav_cfg, test)¶ Get the current status of the slurm job for the given test.
-
Raw¶
Raw Variables¶
-
class
pavilion.plugins.sched.raw.
RawVars
(scheduler, sched_config)¶ Bases:
pavilion.schedulers.SchedulerVariables
Variables for running tests locally on a system.
-
EXAMPLE
= {'avail_mem': '54171', 'cpus': '8', 'free_mem': '49365', 'total_mem': '62522'}¶
-
MEM_UNITS
= {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}¶
-
avail_mem
()¶ Available memory in MiB to the nearest MiB.
-
cpus
()¶ Total CPUs (includes hyperthreading cpus).
-
free_mem
()¶ Free memory in MiB to the nearest MiB.
-
mem_to_mib
(key)¶ Get a meminfo value from the meminfo dict, and convert it to a standard unit (MiB).
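A conversion along these lines is sketched below, assuming the meminfo dict is built from text in the /proc/meminfo format. The parse_meminfo helper, the (value, unit) tuple layout, and the standalone mem_to_mib signature are assumptions. The multipliers are taken from MEM_UNITS as documented, so ‘kb’ is treated as 1000 bytes.

```python
MEM_UNITS = {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}


def parse_meminfo(text):
    """Parse '/proc/meminfo' style text into {key: (value, unit)}."""
    info = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, value = line.split(':', 1)
        parts = value.split()
        if not parts:
            continue
        # Some lines (e.g. HugePages_Total) have no unit at all.
        unit = parts[1].lower() if len(parts) > 1 else None
        info[key.strip()] = (int(parts[0]), unit)
    return info


def mem_to_mib(meminfo, key):
    """Convert one meminfo entry to MiB, rounded to the nearest MiB."""
    value, unit = meminfo[key]
    return round(value * MEM_UNITS[unit] / 1024 ** 2)


# Made-up sample text in the /proc/meminfo layout.
sample = "MemTotal:        1048576 kB\nHugePages_Total:       0\n"
print(mem_to_mib(parse_meminfo(sample), 'MemTotal'))
```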
-
total_mem
()¶ Total memory in MiB to the nearest MiB.
-
Raw Scheduler¶
-
class
pavilion.plugins.sched.raw.
Raw
¶ Bases:
pavilion.schedulers.SchedulerPlugin
-
CANCEL_TIMEOUT
= 1¶
-
__init__
()¶ Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.
-
__module__
= 'pavilion.plugins.sched.raw'¶
-
_cancel_job
(test)¶ Try to kill the given test’s pid (if it is the right pid).
Parameters: test (pavilion.test_run.TestRun) – The test to cancel.
-
_filter_nodes
()¶ Do nothing, and like it.
-
_get_data
()¶ Mostly we need the number of cpus and memory information.
-
_in_alloc
()¶ In raw mode, we’re always in an allocation.
-
_schedule
(test_obj, kickoff_path)¶ Run the kickoff script in a separate process. The job id is a combination of the hostname and pid.
Parameters: - test_obj (pavilion.test_config.TestRun) – The test to schedule.
- kickoff_path (Path) – Path to the submission script.
Returns: ‘<host>_<pid>’
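The ‘<host>_<pid>’ job id format can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the Raw plugin's code: start_detached and split_job_id are hypothetical helpers, and a trivial Python command stands in for the kickoff script.

```python
import socket
import subprocess
import sys


def start_detached(cmd):
    """Start cmd in a separate process and build a '<host>_<pid>' id."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    job_id = '{}_{}'.format(socket.gethostname(), proc.pid)
    return proc, job_id


def split_job_id(job_id):
    """Split a job id into (host, pid)."""
    # Split from the right so underscores in the hostname are kept.
    host, pid = job_id.rsplit('_', 1)
    return host, int(pid)


# A no-op Python command stands in for the kickoff script.
proc, job_id = start_detached([sys.executable, '-c', 'pass'])
proc.wait()
print(split_job_id(job_id))
```

Keeping the hostname in the id lets a later cancel or status check confirm it is running on the machine where the pid is meaningful.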
-
static
_verify_pid
(pid, test_id)¶ Verify that the test is running under the given pid. Note that this may change before, after, or during this call.
Parameters: - pid (str) – The pid to search for.
- test_id (int) – The id of the test started under that pid.
Returns: True - If the given pid is for the given test_id (False otherwise)
-
available
()¶ The raw scheduler is always available.
-
get_conf
()¶ Define the configuration attributes.
-
job_status
(pav_cfg, test)¶ Raw jobs will either be scheduled (waiting on a concurrency lock), or in an unknown state (as there aren’t records of dead jobs).
Return type: StatusInfo
-
lock_concurrency
(pav_cfg, test)¶ Acquire the concurrency lock for this scheduler, if necessary.
Parameters: - pav_cfg – The pavilion configuration.
- test (pavilion.pav_config.test.TestRun) – The pavilion test to lock concurrency for.
-