Schedulers

Scheduler Module

Scheduler Plugin Class

class pavilion.schedulers.SchedulerPlugin(name, description, priority=0)

Bases: yapsy.IPlugin.IPlugin

The base scheduler plugin class. Scheduler plugins should inherit from this.

KICKOFF_SCRIPT_EXT = '.sh'

The extension for the kickoff script.

PRIO_COMMON = 10
PRIO_CORE = 0
PRIO_USER = 20
VAR_CLASS

The scheduler’s variable class.

alias of SchedulerVariables

__init__(name, description, priority=0)

Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.

__module__ = 'pavilion.schedulers'
static _add_schedule_script_body(script, test)

Add the script body to the given script object. This default simply adds a comment and the test run command.

_cancel_job(test)

Override in scheduler plugins to handle cancelling a job.

Parameters:test (pavilion.test_run.TestRun) – The test to cancel.
Returns:Whether we’re confident the job was canceled, and an explanation.
Return type:StatusInfo
_create_kickoff_script(pav_cfg, test_obj: pavilion.test_run.TestRun)

Function to accept a list of lines and generate a script that is then submitted to the scheduler.

Parameters:test_obj (pavilion.test_config.TestRun) –
_do_lock_concurrency(pav_cfg, test)

Acquire the concurrency lock for this scheduler, if necessary.

Parameters:
  • pav_cfg – The pavilion configuration.
  • test (pavilion.pav_config.test.TestRun) – The pavilion test to lock concurrency for.
_filter_nodes(*args, **kwargs)

Filter the system nodes down to just those we can use. This should check to make sure the nodes available are compatible with the test. The arguments for this function will vary by scheduler.

Returns:A list of compatible node names.
Return type:list
_get_data()

Child classes should override this and use it as a way to gather broad amounts of data about the scheduling system. The resulting data structure is generally expected to be a dictionary, though that’s entirely up to the scheduler plugin.

Return type:dict
_get_kickoff_script_header(test)
_in_alloc()

The plugin specific implementation of ‘in_alloc’.

_kickoff_script_path(test)
static _now()

Convenience method for getting a reasonable current time object.

_schedule(test_obj, kickoff_path)

Run the kickoff script at script path with this scheduler.

Parameters:
  • test_obj (pavilion.test_config.TestRun) – The test to schedule.
  • kickoff_path (Path) – Path to the submission script.
Return str:

Job ID number.

activate()

Add this plugin to the scheduler plugin list.

available()

Returns true if this scheduler is available on this host.

Return type:bool
cancel_job(test)

Tell the scheduler to cancel the given test, if it can. This should simply try its best for the test given, and note in the test status (with a SCHED_ERROR) if there were problems. Update the test status to SCHED_CANCELLED if it succeeds.

Parameters:test (pavilion.test_run.TestRun) – The test to cancel.
Returns:A status info object describing the state. If we actually cancel the job the test status will be set to SCHED_CANCELLED. This should return SCHED_ERROR when something goes wrong.
Return type:StatusInfo
deactivate()

Remove this plugin from the scheduler plugin list.

get_conf()

Return the configuration object suitable for adding to the test configuration.

get_data(refresh=False)

Get data relevant to this scheduler. This is a wrapper method; child classes should override _get_data instead. This simply ensures we only gather the data once.

Returns:A dictionary of gathered scheduler data.
Return type:dict
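The wrapper relationship between get_data and _get_data can be sketched as follows. This is a minimal illustration, not Pavilion's code: the CachingScheduler class, its _data attribute, and the gather_count counter are hypothetical, and the exact effect of refresh=True is assumed from the parameter name.

```python
class CachingScheduler:
    """Hypothetical stand-in for a scheduler plugin (not Pavilion's class)."""

    def __init__(self):
        self._data = None      # cached result of _get_data()
        self.gather_count = 0  # only here to show how often data is gathered

    def _get_data(self):
        # A real plugin would query the scheduling system here.
        self.gather_count += 1
        return {'nodes': ['node001', 'node002']}

    def get_data(self, refresh=False):
        # Gather on first use, or again when the caller asks for a refresh
        # (assumed behavior).
        if self._data is None or refresh:
            self._data = self._get_data()
        return self._data


sched = CachingScheduler()
sched.get_data()
sched.get_data()              # served from the cache, no second gather
sched.get_data(refresh=True)  # forces a second gather
```

With this shape, a child class only overrides _get_data and inherits the caching for free.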
get_vars(sched_config)

Returns the dictionary of scheduler variables.

Parameters:sched_config (dict) – The scheduler config for a given test.
in_alloc

Determines whether we’re on a scheduled node.

job_status(pav_cfg, test) → pavilion.status_file.StatusInfo

Get the job state from the scheduler, and map it to one of the following states: SCHEDULED, SCHED_ERROR, SCHED_CANCELLED. This may also simply re-fetch the latest state from the state file, and return that if necessary.

Parameters:
Returns:

A StatusInfo object representing the status.

lock_concurrency(pav_cfg, test)

Acquire the concurrency lock for this scheduler, if necessary.

Parameters:
  • pav_cfg – The pavilion config.
  • test – A test object
run_suite(tests)

Run each of the given tests using a single allocation. This is effectively a placeholder.

schedule_test(pav_cfg, test_obj)

Create the test script and schedule the job.

Parameters:
schedule_tests(pav_cfg, tests)

Schedule each of the given tests with this scheduler, using a separate allocation (if applicable) for each.

Parameters:
static unlock_concurrency(lock)

Unlock the concurrency lock, if one exists.

Parameters:lock (Union(Lockfile, None)) –

Scheduler Variables

class pavilion.schedulers.SchedulerVariables(scheduler, sched_config)

Bases: pavilion.var_dict.VarDict

The base scheduler variables class. Each scheduler should have a child class of this that contains all the variable functions it provides.

To add a scheduler variable, create a method and decorate it with either @sched_var or @dfr_sched_var(). The method name will be the variable name, and the method will be called to resolve the variable value. Methods that start with ‘_’ are ignored.

Naming Conventions:

‘alloc_*’
Variable names should be prefixed with ‘alloc_’ if they are deferred.
‘test_*’
Variable names prefixed with test denote that the variable is specific to a test. These also tend to be deferred.
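The rule that decorated method names become variable names, while names starting with ‘_’ are ignored, can be illustrated with a self-contained stand-in. Nothing below is Pavilion's implementation: this sched_var decorator, the DemoVars class, and the is_sched_var marker attribute are all hypothetical and only mimic the described behavior.

```python
def sched_var(func):
    """Hypothetical stand-in decorator: mark a method as a variable."""
    func.is_sched_var = True
    return func


class DemoVars:
    """Hypothetical stand-in for a SchedulerVariables child class."""

    @sched_var
    def test_nodes(self):
        return '2'

    @sched_var
    def _hidden(self):
        # Starts with '_', so it is skipped when listing variables.
        return 'ignored'

    def helper(self):
        # Not decorated, so it is not a variable either.
        return 'also ignored'

    def keys(self):
        # Collect the names of decorated, non-underscore methods.
        names = []
        for name in dir(type(self)):
            if name.startswith('_'):
                continue
            attr = getattr(type(self), name)
            if getattr(attr, 'is_sched_var', False):
                names.append(name)
        return sorted(names)

    def __getitem__(self, key):
        # Resolve a variable by calling the method of the same name.
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)()


demo_vars = DemoVars()
```

Here demo_vars.keys() lists only test_nodes, and demo_vars['test_nodes'] calls the method to resolve the value.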
BYTE_SIZE_UNITS = {'': 1, 'B': 1, 'GB': 1000000000, 'GiB': 1073741824, 'KiB': 1024, 'MB': 1000000, 'MiB': 1048576, 'kB': 1000}
EXAMPLE = {'min_cpus': '3', 'min_mem': '123412'}

Each scheduler variable class should provide an example set of values for itself to display when using ‘pav show’ to list the variables. These are easily obtained by running a test under the scheduler, and then harvesting the results of the test run.

NO_EXAMPLE = '<no example>'
__abstractmethods__ = frozenset()
__init__(scheduler, sched_config)

Initialize the scheduler var dictionary.

Parameters:
  • scheduler (SchedulerPlugin) – The scheduler for this set of variables.
  • sched_config (dict) – The test object for which this set of variables is relevant.
__module__ = 'pavilion.schedulers'
__repr__()

Return repr(self).

_abc_impl = <_abc_data object>
info(key)

Get the info dict for the given key, and add the example to it.

min_cpus()

Get a minimum number of cpus we have available on the local system. Defaults to 1 on error (and logs the error).

min_mem()

Get a minimum amount of memory for the system, in Gibibytes. Returns 1 on error (and logs the error).

sched_data

A convenience function for getting data from the scheduler.

test_cmd()

The command to prepend to a line to kick it off under the scheduler. This is blank by default, but most schedulers will want to define something that utilizes relevant scheduler parameters.

Scheduler Plugins

Slurm

Slurm Variables

class pavilion.plugins.sched.slurm.SlurmVars(scheduler, sched_config)

Bases: pavilion.schedulers.SchedulerVariables

Scheduler variables for the Slurm scheduler.

EXAMPLE = {'alloc_cpu_total': '36', 'alloc_max_mem': '128842', 'alloc_max_ppn': '36', 'alloc_min_mem': '128842', 'alloc_min_ppn': '36', 'alloc_node_list': 'node004 node005', 'alloc_nodes': '2', 'job_name': 'pav', 'max_mem': '128842', 'max_ppn': '36', 'min_mem': '128842', 'min_ppn': '36', 'node_avail_list': ['node003', 'node004', 'node005'], 'node_list': ['node001', 'node002', 'node003', 'node004', 'node005'], 'node_up_list': ['node002', 'node003', 'node004', 'node005'], 'nodes': '371', 'nodes_avail': '3', 'nodes_up': '350', 'test_cmd': 'srun -N 2 -n 2', 'test_node_list': 'node004 node005', 'test_node_list_short': 'node00[4-5]', 'test_nodes': '2', 'test_procs': '2'}
alloc_cpu_total()

Total CPUs across all nodes in this allocation.

alloc_max_mem()

Max mem per node for this allocation. (in MiB)

alloc_max_ppn()

Max ppn for this allocation.

alloc_min_mem()

Min mem per node for this allocation. (in MiB)

alloc_min_ppn()

Min ppn for this allocation.

alloc_node_list()

A space separated list of nodes in this allocation.

alloc_nodes()

The number of nodes in this allocation.

max_mem()

The maximum memory per node across all nodes (in MiB).

max_ppn()

The maximum processors per node across all nodes.

min_mem()

The minimum memory per node across all nodes (in MiB).

min_ppn()

The minimum processors per node across all nodes.

node_avail_list()

List of nodes that are in a state that is considered available. Warning: Tests that use this will fail to start if no nodes are available.

node_list()

List of nodes on the system.

node_up_list()

List of nodes that are in a state that is considered up.

nodes()

Number of nodes on the system.

nodes_avail()

Number of nodes in an ‘avail’ state.

nodes_up()

Number of nodes in an ‘up’ state.

test_cmd()

Construct a cmd to run a process under this scheduler, with the criteria specified by this test.
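As a rough sketch of what such a command might look like, the helper below reproduces the 'srun -N 2 -n 2' value shown in EXAMPLE from a node count and a process count. The build_test_cmd function is hypothetical; the real method presumably draws on more of the scheduler config than these two values.

```python
def build_test_cmd(test_nodes, test_procs):
    """Hypothetical helper: build an srun prefix for a test.

    -N is srun's node count option and -n is its task count option.
    """
    return ' '.join(['srun', '-N', str(test_nodes), '-n', str(test_procs)])


cmd = build_test_cmd(2, 2)
```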

test_node_list()

A list of nodes dedicated to this test run.

test_node_list_short()

Node list, compressed in a slurm compatible way.

test_nodes()

The number of nodes allocated for this test (may be less than the total in this allocation).

test_procs()

The number of processors to request for this test.

Slurm Scheduler Plugin

class pavilion.plugins.sched.slurm.SbatchHeader(sched_config, nodes, test_id, slurm_vars)

Bases: pavilion.scriptcomposer.ScriptHeader

Provides header information specific to sbatch files for the slurm kickoff script.

__init__(sched_config, nodes, test_id, slurm_vars)

Build a header for an sbatch file.

Parameters:
  • sched_config (dict) – The slurm section of the test config.
  • nodes (str) – The node list
  • test_id (int) – The test’s id.
  • slurm_vars (dict) – The test variables.
__module__ = 'pavilion.plugins.sched.slurm'
get_lines()

Get the sbatch header lines.

class pavilion.plugins.sched.slurm.Slurm

Bases: pavilion.schedulers.SchedulerPlugin

Schedule tests with Slurm!

KICKOFF_SCRIPT_EXT = '.sbatch'
NODE_BRACKET_FORMAT_RE = re.compile('([a-zA-Z][a-zA-Z_-]*\\d*)\\[(.*)]')
NODE_FIELD_TYPES = {'ActiveFeatures': <function Slurm.<lambda>>, 'AllocMemory': <class 'int'>, 'AvailableFeatures': <function Slurm.<lambda>>, 'CPUAlloc': <class 'int'>, 'CPULoad': <function slurm_float>, 'CPUTot': <class 'int'>, 'FreeMemory': <class 'int'>, 'Partitions': <function Slurm.<lambda>>, 'RealMemory': <class 'int'>, 'State': <function slurm_states>}
NODE_LIST_RE = re.compile('[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\])?(?:,[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\])?)*$')
NODE_SEQ_REGEX_STR = '[a-zA-Z][a-zA-Z_-]*\\d*(?:\\[(?:\\d+|\\d+-\\d+)(?:,\\d+|,\\d+-\\d+)*\\])?'
NUM_NODES_REGEX = re.compile('^(\\d+|all)(-(\\d+|all))?$')
SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']
SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY', 'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']
SCHED_OTHER = ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED', 'RESIZING', 'SIGNALING', 'SUSPENDED']
SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']
SCHED_WAITING = ['CONFIGURING', 'PENDING']
SCONTROL_KEY_RE = re.compile('(?:^|\\s+)([A-Z][a-zA-Z0-9:/]*)=')
SCONTROL_WS_RE = re.compile('\\s+')
VAR_CLASS

alias of SlurmVars

__init__()

Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.

__module__ = 'pavilion.plugins.sched.slurm'
_cancel_job(test)

Scancel the job attached to the given test.

Parameters:test (pavilion.test_run.TestRun) – The test to cancel.
Returns:A StatusInfo object with the latest scheduler state.
Return type:StatusInfo
_collect_node_data(nodes=None)

Use the scontrol show node command to collect data on nodes. Types are converted according to self.NODE_FIELD_TYPES.

Parameters:nodes (str) – The nodes to collect data on. If None, collect data on all nodes. The format is the standard slurm node list, which can include compressed series, e.g. ‘n00[20-99],n0101’.
Return type:dict
Returns:A dict of node dictionaries.
_filter_nodes(min_nodes, config, nodes)

Filter the system nodes down to just those we can use. For each step, we check to make sure we still have the minimum nodes needed in order to give more relevant errors.

Parameters:
  • min_nodes (int) – The minimum number of nodes desired. This will
  • config (dict) – The scheduler config for a test.
  • nodes ([list]) – Nodes (as defined by collect node data)
Returns:

A list of node names that are compatible with the given config.

Return type:

list

static _get_config_elems()
_get_data()

Get the slurm node state information.

Returns:A dict with individual node and summary information.
Return type:dict
_get_kickoff_script_header(test)

Get the kickoff header. Most of the work here

_get_node_range(sched_config, nodes)

Translate user requests for a number of nodes into a numerical range based on the number of nodes on the actual system.

Parameters:
  • sched_config (dict) – The scheduler config for a particular test.
  • nodes (list) – A list of nodes.
Return type:

str

Returns:

A range suitable for the num_nodes argument of slurm.
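One way such a translation could work is sketched below, using the NUM_NODES_REGEX pattern listed above. The node_range function is hypothetical, and both the meaning of 'all' (every node in the given list) and the 'min-max' output format are assumptions rather than confirmed behavior.

```python
import re

# Same pattern as the NUM_NODES_REGEX class attribute listed above.
NUM_NODES_REGEX = re.compile(r'^(\d+|all)(-(\d+|all))?$')


def node_range(num_nodes, nodes):
    """Hypothetical sketch: turn '2', 'all' or '2-all' into 'min-max'."""
    match = NUM_NODES_REGEX.match(num_nodes)
    if match is None:
        raise ValueError("Invalid num_nodes value: {!r}".format(num_nodes))

    min_nodes, _, max_nodes = match.groups()
    if max_nodes is None:
        # A single value is treated as both the minimum and the maximum.
        max_nodes = min_nodes

    total = len(nodes)
    low = total if min_nodes == 'all' else int(min_nodes)
    high = total if max_nodes == 'all' else int(max_nodes)
    if low > high:
        raise ValueError("Minimum nodes ({}) exceeds maximum ({})."
                         .format(low, high))
    return '{}-{}'.format(low, high)


node_names = ['node003', 'node004', 'node005']
```

For example, node_range('2-all', node_names) resolves 'all' against the three listed nodes.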

_in_alloc()

Check if we’re in an allocation.

static _make_summary(nodes)

Get aggregate data about the given nodes. This includes:

  • min_ppn - min procs per node
  • max_ppn - max procs per node
  • min_mem - min mem per node (in MiB)
  • max_mem - max mem per node (in MiB)
  • total_cpu - Total CPUs on these nodes.
Parameters:nodes (typing.Iterable) – Node dictionaries as returned by _collect_node_data.
Return type:dict
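The aggregate keys listed above could be computed along these lines. This is a sketch only: the make_summary function is hypothetical, and the choice of the CPUTot and RealMemory fields (both present in NODE_FIELD_TYPES) as the inputs is an assumption.

```python
def make_summary(nodes):
    """Hypothetical sketch of the summary described above.

    Assumes each node dict has integer 'CPUTot' and 'RealMemory' fields.
    Raises ValueError if no nodes are given, since min() and max() need
    at least one value.
    """
    nodes = list(nodes)
    cpus = [node['CPUTot'] for node in nodes]
    mems = [node['RealMemory'] for node in nodes]
    return {
        'min_ppn': min(cpus),
        'max_ppn': max(cpus),
        'min_mem': min(mems),
        'max_mem': max(mems),
        'total_cpu': sum(cpus),
    }


summary = make_summary([
    {'CPUTot': 36, 'RealMemory': 128842},
    {'CPUTot': 32, 'RealMemory': 64000},
])
```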
_schedule(test, kickoff_path)

Submit the kick off script using sbatch.

Parameters:
  • test (TestRun) – The TestRun we’re kicking off.
  • kickoff_path (Path) – The kickoff script path.
_scontrol_parse(section)
_scontrol_show(*args, timeout=10)

Run scontrol show and return the parsed output.

Parameters:
  • args (list(str)) – Additional args to scontrol.
  • timeout (int) – How long to wait for results.
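To illustrate how key=value output of this kind might be parsed, the sketch below uses the SCONTROL_KEY_RE pattern listed above to locate keys, and treats everything up to the next key as the value. The parse_section function, the sample text, and the two-entry FIELD_TYPES table (a subset in the spirit of NODE_FIELD_TYPES) are hypothetical.

```python
import re

# Same pattern as the SCONTROL_KEY_RE class attribute listed above.
SCONTROL_KEY_RE = re.compile(r'(?:^|\s+)([A-Z][a-zA-Z0-9:/]*)=')

# Hypothetical subset of a field type table; unlisted keys stay strings.
FIELD_TYPES = {'CPUTot': int, 'RealMemory': int}


def parse_section(section):
    """Hypothetical sketch: split one scontrol section into a dict."""
    matches = list(SCONTROL_KEY_RE.finditer(section))
    results = {}
    for i, match in enumerate(matches):
        # The value runs from the end of 'Key=' to the start of the next key.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(section)
        key = match.group(1)
        value = section[match.end():end].strip()
        results[key] = FIELD_TYPES.get(key, str)(value)
    return results


sample = ('NodeName=node001 CPUTot=36 RealMemory=128842\n'
          '   State=IDLE Reason=Not responding')
node = parse_section(sample)
```

Because a key must start with a capital letter and be followed directly by ‘=’, a value such as ‘Not responding’ is kept whole even though it contains a space.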
available()

Looks for several slurm commands, and tests that slurm can talk to the slurm db.

get_conf()

Set up the Slurm configuration attributes.

job_status(pav_cfg, test)

Get the current status of the slurm job for the given test.
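The state lists defined on this class suggest a lookup from a slurm job state to a group. The sketch below shows one such lookup; the classify_state function and STATE_GROUPS table are hypothetical. Note that DEADLINE and PREEMPTED appear in both SCHED_CANCELLED and SCHED_ERROR, so the result for those states depends on the order the groups are checked in, and the order used here is an assumption.

```python
# The lists below are copied from the class attributes listed above.
SCHED_WAITING = ['CONFIGURING', 'PENDING']
SCHED_RUN = ['COMPLETED', 'COMPLETING', 'RUNNING', 'STAGE_OUT']
SCHED_CANCELLED = ['CANCELLED', 'DEADLINE', 'PREEMPTED', 'BOOT_FAIL']
SCHED_ERROR = ['DEADLINE', 'FAILED', 'NODE_FAIL', 'OUT_OF_MEMORY',
               'PREEMPTED', 'REVOKED', 'SPECIAL_EXIT', 'TIMEOUT']
SCHED_OTHER = ['RESV_DEL_HOLD', 'REQUEUE_FED', 'REQUEUE_HOLD', 'REQUEUED',
               'RESIZING', 'SIGNALING', 'SUSPENDED']

# Assumed check order; the first group containing the state wins.
STATE_GROUPS = [
    ('SCHED_WAITING', SCHED_WAITING),
    ('SCHED_RUN', SCHED_RUN),
    ('SCHED_CANCELLED', SCHED_CANCELLED),
    ('SCHED_ERROR', SCHED_ERROR),
    ('SCHED_OTHER', SCHED_OTHER),
]


def classify_state(state):
    """Hypothetical sketch: name the first group containing the state."""
    for name, states in STATE_GROUPS:
        if state in states:
            return name
    return None
```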

classmethod parse_node_list(node_list)

Convert a slurm format node list into a list of nodes, and throw errors that help the user identify their exact mistake.
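A simplified version of this expansion is sketched below, using the NODE_BRACKET_FORMAT_RE pattern listed above. The expand_node_list and split_outside_brackets functions are hypothetical, and the sketch skips the validation and user-facing error messages that the real method is described as providing.

```python
import re

# Same pattern as the NODE_BRACKET_FORMAT_RE class attribute listed above.
NODE_BRACKET_FORMAT_RE = re.compile(r'([a-zA-Z][a-zA-Z_-]*\d*)\[(.*)]')


def split_outside_brackets(node_list):
    """Split on commas, except for commas inside [...]."""
    parts, depth, current = [], 0, ''
    for char in node_list:
        if char == '[':
            depth += 1
        elif char == ']':
            depth -= 1
        if char == ',' and depth == 0:
            parts.append(current)
            current = ''
        else:
            current += char
    parts.append(current)
    return parts


def expand_node_list(node_list):
    """Hypothetical sketch: expand 'n00[20-22],n0101' into node names."""
    nodes = []
    for part in split_outside_brackets(node_list):
        match = NODE_BRACKET_FORMAT_RE.fullmatch(part)
        if match is None:
            # No bracketed series; treat the part as a plain node name.
            nodes.append(part)
            continue
        base, series = match.groups()
        for item in series.split(','):
            if '-' in item:
                start, end = item.split('-')
                # Preserve the zero padding used by the start of the range.
                for num in range(int(start), int(end) + 1):
                    nodes.append(base + str(num).zfill(len(start)))
            else:
                nodes.append(base + item)
    return nodes


expanded = expand_node_list('n00[20-22],n0101')
```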

classmethod short_node_list(nodes: List[str], logger)

Convert a list of nodes into an abbreviated node list that slurm should understand.

Raw

Raw Variables

class pavilion.plugins.sched.raw.RawVars(scheduler, sched_config)

Bases: pavilion.schedulers.SchedulerVariables

Variables for running tests locally on a system.

EXAMPLE = {'avail_mem': '54171', 'cpus': '8', 'free_mem': '49365', 'total_mem': '62522'}
MEM_UNITS = {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}
avail_mem()

Available memory in MiB to the nearest MiB.

cpus()

Total CPUs (includes hyperthreading cpus).

free_mem()

Free memory in MiB to the nearest MiB.

mem_to_mib(key)

Get a meminfo value from the meminfo dict, and convert it to a standard unit (MiB).
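The conversion might look roughly like the sketch below, which uses the MEM_UNITS table exactly as listed above. The parse_meminfo_line helper, the (value, unit) layout of the meminfo dict, and this standalone mem_to_mib signature are assumptions for illustration; rounding follows the "to the nearest MiB" wording used by the variables in this class.

```python
# Same table as the MEM_UNITS class attribute listed above.
MEM_UNITS = {None: 1, 'b': 1, 'kb': 1000, 'mb': 1000000}
MIB = 1024 ** 2


def parse_meminfo_line(line):
    """Hypothetical helper: parse 'MemTotal:  1048576 kB' style lines."""
    key, rest = line.split(':', 1)
    parts = rest.split()
    value = int(parts[0])
    # Lines without a unit (plain counts) get a unit of None.
    unit = parts[1].lower() if len(parts) > 1 else None
    return key, (value, unit)


def mem_to_mib(meminfo, key):
    """Hypothetical sketch: convert a meminfo entry to the nearest MiB."""
    value, unit = meminfo[key]
    return round(value * MEM_UNITS[unit] / MIB)


meminfo = dict(parse_meminfo_line(line) for line in [
    'MemTotal:        1048576 kB',
    'HugePages_Total:       0',
])
```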

total_mem()

Total memory in MiB to the nearest MiB.

Raw Scheduler

class pavilion.plugins.sched.raw.Raw

Bases: pavilion.schedulers.SchedulerPlugin

CANCEL_TIMEOUT = 1
VAR_CLASS

alias of RawVars

__init__()

Scheduler plugin that is expected to be overridden by subclasses. The plugin will populate a set of expected ‘sched’ variables.

__module__ = 'pavilion.plugins.sched.raw'
_cancel_job(test)

Try to kill the given test’s pid (if it is the right pid).

Parameters:test (pavilion.test_run.TestRun) – The test to cancel.
_filter_nodes()

Do nothing, and like it.

_get_data()

Mostly we need the number of cpus and memory information.

_in_alloc()

In raw mode, we’re always in an allocation.

_schedule(test_obj, kickoff_path)

Run the kickoff script in a separate process. The job id is a combination of the hostname and pid.

Parameters:
  • test_obj (pavilion.test_config.TestRun) – The test to schedule.
  • kickoff_path (Path) – Path to the submission script.
Returns:

‘<host>_<pid>’
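Starting a script in a separate process and deriving a ‘<host>_<pid>’ job id could be done as in the sketch below. The run_kickoff function is hypothetical, and a trivial Python command stands in for the kickoff script; how the real plugin launches and detaches the process is not shown here.

```python
import socket
import subprocess
import sys


def run_kickoff(cmd):
    """Hypothetical sketch: start cmd and build a '<host>_<pid>' job id."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    job_id = '{}_{}'.format(socket.gethostname(), proc.pid)
    return job_id, proc


# A do-nothing command stands in for the kickoff script.
job_id, proc = run_kickoff([sys.executable, '-c', 'pass'])
proc.wait()
```

Since hostnames may themselves contain underscores, splitting a job id like this from the right (for example with str.rpartition('_')) is the safer way to recover the pid.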

static _verify_pid(pid, test_id)

Verify that the test is running under the given pid. Note that this may change before, after, or during this call.

Parameters:
  • pid (str) – The pid to search for.
  • test_id (int) – The id of the test started under that pid.
Returns:

True - If the given pid is for the given test_id (False otherwise)

available()

The raw scheduler is always available.

get_conf()

Define the configuration attributes.

job_status(pav_cfg, test)

Raw jobs will either be scheduled (waiting on a concurrency lock), or in an unknown state (as there aren’t records of dead jobs).

Return type:StatusInfo
lock_concurrency(pav_cfg, test)

Acquire the concurrency lock for this scheduler, if necessary.

Parameters:
  • pav_cfg – The pavilion configuration.
  • test (pavilion.pav_config.test.TestRun) – The pavilion test to lock concurrency for.