Scheduling Tests

Tests are scheduled according to which scheduler they specify. This page covers the basics of how scheduler plugins operate.

Included Scheduler Plugins

Pavilion comes with two scheduler plugins:

$ pav show sched

 Available Scheduler Plugins
-----------+------------------------------------------------------
 Name      | Description
-----------+------------------------------------------------------
 raw       | Schedules tests as local processes.
 slurm     | Schedules tests via the Slurm scheduler.

Scheduler Configuration

The configuration options for each scheduler are documented in its config file format, which you can view with the pav show sched --conf command.

$ pav show sched --conf raw

The listed options all go in the schedule section of a test config. You may also notice scheduler-specific sections among the listed options; these allow for custom configuration specific to a particular scheduler - options that are not generally applicable.

Note that not all options are expected to apply to every scheduler either. We may, in the future, add a scheduler with no concept of a QOS setting, for instance. Such settings are simply ignored in those cases.

mytest:
    scheduler: slurm

    schedule:
        nodes: 5

    run:
        cmds:
            - echo "I'm a slurm test!"

Scheduler Plugin Basics

Scheduler plugins are responsible for the following:

  • providing test runs with scheduler variables

  • (optionally) writing a kickoff script

  • using that kickoff script (or another mechanism) to run pav _run <test_run_id> on an allocation with a reasonable environment

  • generating a unique scheduler job_id for the test run

  • providing a mechanism to cancel tests

  • providing a mechanism to check test status

Scheduler Variables

Each scheduler must provide a set of scheduler variables. Most of these are generic and available across all schedulers, and some will be Deferred Variables. The best way to see what scheduler variables are available is to use the pav show sched --vars <sched_name> command.

$ pav show sched --vars slurm

 Variables for the slurm scheduler plugin.
----------------+----------+-----------------+------------------------------------------------------
 Name           | Deferred | Example         | Help
----------------+----------+-----------------+------------------------------------------------------
 chunk_ids      | False    | []              | A list of indices of the available chunks.
 errors         | False    | []              | Return the list of retrieval errors encountered when
                |          |                 | using this var_dict. Key errors are not included.
 min_cpus       | False    | 1               | Get the minimum number of cpus available on each
                |          |                 | (filtered) node. Defaults to 1 if unknown.
 min_mem        | False    | 4294967296      | Get the minimum memory for any node across the
                |          |                 | (filtered) nodes, in bytes (4 GB if unknown).
 node_list      | False    | []              | The list of node names on the system. If the
                |          |                 | scheduler supports auto-detection, will be the
                |          |                 | filtered list. This list will otherwise be empty.
 node_list_id   | False    |                 | Return the node list id, if available. This is
                |          |                 | meaningless to test configs, but is used internally
                |          |                 | by Pavilion.
 nodes          | False    | 1               | The number of nodes available on the system. If the
                |          |                 | scheduler supports auto-detection, this will be the
                |          |                 | filtered count of nodes. Otherwise, this will be the
                |          |                 | 'cluster_info.node_count' value, or 1 if that isn't
                |          |                 | set.
 tasks_per_node | True     | 5               | The number of tasks to create per node. If the
                |          |                 | scheduler does not support node info, just returns
                |          |                 | 1.
 tasks_total    | True     | 180             | Total tasks to create, based on number of nodes
                |          |                 | actually acquired.
 test_cmd       | True     | srun -N 5 -w no | Construct a cmd to run a process under this
                |          | de[05-10],node2 | scheduler, with the criteria specified by this test.
                |          | 3 -n 20         |
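
Scheduler variables are referenced in test configs under the sched namespace. A minimal sketch using two of the variables listed above (./my_program is just a placeholder; tasks_total is deferred, so it only resolves once the test is running on its allocation):

mytest:
    scheduler: slurm

    schedule:
        nodes: 2

    run:
        cmds:
            # test_cmd expands to an srun invocation sized for this test.
            - '{{sched.test_cmd}} ./my_program'
            - 'echo "total tasks: {{sched.tasks_total}}"'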

Jobs

When Pavilion schedules a test, it also creates a job. Jobs organize all the information used to kick off a test (or tests!), including the kickoff script, kickoff log, job id, and symlinks back to each test that’s part of the job. Each job is named by a random hash and located in the <working_dir>/jobs directory. Tests also refer back to their job through a symlink in each test run directory.
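
A job directory, then, looks roughly like the following sketch (the kickoff script's exact name and extension vary by scheduler and are an assumption here):

<working_dir>/jobs/<job_hash>/
    kickoff.sh     # The kickoff script (name/extension varies by scheduler)
    kickoff.log    # Output log from the kickoff script
    job_id         # The scheduler's id for this job
    ...            # Symlinks back to each test run in the job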

The Kickoff Script

The kickoff script’s job is to have Pavilion run specific test run instances under an allocation. This is generally expected to be a shell script of some sort that will both define the allocation (if possible) and run pav _run <test_run_id> within that allocation under an environment that can find Pavilion and its libraries.

For slurm, the kickoff script would look something like this:

#!/bin/bash
#SBATCH --job-name "pav test #18697"
#SBATCH -p standard
#SBATCH -N 3-3
#SBATCH --tasks-per-node=1

# Redirect all output to kickoff.log
exec >/usr/local/pav/working_dir/test_runs/0018697/kickoff.log 2>&1
export PATH=/usr/local/pav/src/bin:${PATH}
export PAV_CONFIG_FILE=/usr/local/pav/config/pavilion.yaml
export PAV_CONFIG_DIR=/usr/local/pav/config

pav _run 18697

job_id

The plugin must assign the test run a job id. This will generally be used by the scheduler plugin to cancel or check the status of tests. It’s saved in the job’s ‘job_id’ file, and also as part of the test results.

Cancel Mechanisms

Pavilion scheduler plugins are required to provide a mechanism to cancel jobs managed by that scheduler, whether they’re currently running or queued under the scheduler. Generally this means just using the test_run’s job id to cancel the test. Cancelled tests will be given the ‘SCHED_CANCELLED’ status.
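
This is the mechanism behind the pav cancel command. For example, cancelling the test run from the kickoff script example above:

$ pav cancel 18697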

Status Mechanisms

Similarly, Pavilion scheduler plugins must be able to query the status of jobs, and give useful feedback on their state in the scheduler. As long as the test is in the ‘SCHEDULED’ or ‘RUNNING’ states from the test run’s perspective (in the run’s status file), Pavilion will use the scheduler to look up the scheduler’s status for the job, in order to provide more up-to-date test status information.
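
This is the machinery behind the pav status command. For example, checking on the same test run:

$ pav status 18697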

Scheduler Plugin Types

Scheduler plugins come in two varieties: Basic and Advanced.

Basic

The only included ‘basic’ scheduler is ‘raw’, which only ever has one node, so most of this section applies only to user-added schedulers.

Basic Schedulers don’t know anything about the system that isn’t manually configured. This information is given via the schedule.cluster_info section (see pav show sched --config). This information should generally be set in the host config for a particular system.

Asking for ‘all’ nodes on a basic scheduler will result in an allocation for the configured number of nodes, regardless of the state of those nodes.

mytest:
  schedule:
    # Tell the scheduler that this system has 60 nodes (at peak)
    cluster_info:
      node_count: 60
    # Ask for between 90% (54 nodes) and all 60 nodes
    # This gives some flexibility in case some nodes are down.
    min_nodes: '90%'
    nodes: all

Advanced

Advanced scheduler plugins are plugins that can get an inventory of nodes and node state from the system. Such schedulers are able to dynamically determine how many nodes are up or available, and create allocations based on that. As a result, asking for ‘all’ nodes via an advanced scheduler will get you an allocation request for all nodes that are currently up and not otherwise filtered out by partition or other scheduler settings.

Advanced schedulers also enable chunking and job sharing.

Job Sharing

On an advanced scheduler, when two tests have the same job parameters, they are automatically scheduled together in the same job allocation. The kickoff script for that job will start the tests serially, and the result of each test run does not affect the others.

Job sharing makes the most sense for short tests that cover a wide range of nodes - such tests often take longer to set up the allocation than they do to run.

This is enabled by default. It can be disabled through the schedule.share_allocation option.
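
A minimal sketch of disabling job sharing for a single test (assuming a simple false value is accepted for schedule.share_allocation):

mytest:
  scheduler: slurm
  schedule:
    nodes: all
    # Give this test its own allocation, even if other tests share its job parameters.
    share_allocation: false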

Chunking

On an advanced scheduler, the chunking section of the schedule configuration enables powerful tools for dividing up a system to test it piece by piece. Chunking is effectively disabled when the chunk size equals the total number of nodes on the system (the default); enable it by selecting a smaller, specific chunk size.

mytest:
  schedule:
    # When using chunking, this is relative to the chunk and not the whole system.
    nodes: all

    # Get 500 node chunks
    chunking:
      size: 500

When using chunking, nodes are selected for each job entirely in advance by Pavilion. This can make tests a bit more fragile than usual - the failure of a single node can keep a test from running even if there are ‘spare’ nodes outside of the chunk.

Chunk Selection

By default, Pavilion will assign each test to the least used chunk for a given set of tests. This will distribute your tests evenly across the entire system.

You can, however, specify a specific chunk for each test, or even create permutations of a test such that it will run once on each chunk. The sched.chunk_ids scheduler variable contains a list of all available chunk ids for a test, and can be used in combination with the chunk setting to specify a chunk.

Note: It is not generally safe to specify chunks other than chunk ‘0’, as chunks above zero aren’t guaranteed to exist.

# This will create an instance of this test for every chunk available, giving
# full coverage of the system.
mytest:
  permute_on: chunk_ids

  chunk: '{{chunk_ids}}'
  schedule:
    # When using chunking, this is relative to the chunk and not the whole system.
    nodes: all

    # Get 500 node chunks
    chunking:
      size: 500

Node Selection

By default, Pavilion selects (near) contiguous blocks of nodes for each chunk, but this is customizable. Instead, you can select nodes randomly for each chunk (random), distributed across the system (dist), or semi-randomly distributed (rand-dist). Regardless of the node selection method, the number of chunks will be the same and they (mostly) won’t overlap.

It is very likely that the chunk size won’t divide evenly into the number of nodes being chunked. These ‘extra’ nodes may either be excluded or backfilled with nodes from another chunk (these always come from the second-to-last chunk). The default is to ‘backfill’.

These are set via the schedule.chunking.node_selection and schedule.chunking.extra options.

# This test runs over a random selection of 25% of the nodes on the system.
mytest:
  schedule:
    # When using chunking, this is relative to the chunk and not the whole system.
    nodes: all

    # Make each chunk 25% of the system's nodes
    chunking:
      size: 25%
      node_selection: random
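
The selection method and extra-node handling can be combined. A sketch using the 'dist' method with the default 'backfill' behavior described above:

mytest:
  schedule:
    nodes: all
    chunking:
      size: 25%
      # Spread each chunk's nodes across the system rather than using contiguous blocks.
      node_selection: dist
      # Top up the short final chunk with nodes from the second-to-last chunk (the default).
      extra: backfill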

Wrapper

You can use the wrapper option with any scheduler to wrap the scheduler's test command; the wrapper command is inserted ahead of the command you actually intend to run.

basic:
    scheduler: slurm
    schedule:
        wrapper: valgrind
        partition: standard
        nodes: 1

    run:
        cmds:
            # The run command will be `srun -N1 -p standard valgrind ./supermagic -a`
            # It will run `valgrind ./supermagic -a` on the allocation
            - '{{sched.test_cmd}} ./supermagic -a'

When using the raw scheduler, {{sched.test_cmd}} normally expands to an empty string, so you can use the wrapper setting to control a different scheduler (or launcher, as below) directly.

shoot_yourself_in_the_foot_mode:
    scheduler: raw
    schedule:
        # Note that it's generally MUCH better to use Pavilion's scheduling options,
        # but this allows you to, for example, test the scheduler itself.
        # Other note - You can use mpirun under slurm by setting ``schedule.slurm.mpi_cmd=mpirun``.
        wrapper: 'mpirun -np 2'
    run:
        cmds:
            # With the schedule wrapper, this will be `mpirun -np 2 ./supermagic -a`
            - '{{sched.test_cmd}} ./supermagic -a'