.. _tests.scheduling:

Scheduling Tests
================

Tests are scheduled according to which ``scheduler`` they specify. This page
covers the basics of how scheduler plugins operate.

.. contents:: Table of Contents

Included Scheduler Plugins
--------------------------

Pavilion comes with three scheduler plugins:

.. code-block:: bash

    pav show sched

     Available Scheduler Plugins
    -----------+------------------------------------------------------
     Name      | Description
    -----------+------------------------------------------------------
     raw       | Schedules tests as local processes.
     slurm     | Schedules tests via the Slurm scheduler.
     flux      | Schedules tests via the Flux Framework scheduler.

Scheduler Configuration
~~~~~~~~~~~~~~~~~~~~~~~

The configuration options for schedulers are documented in their config
file format. This is viewable by using the ``pav show sched --conf`` command.

.. code-block:: bash

    $ pav show sched --conf raw

The listed options all go in the ``schedule`` section of a test config.
You may also notice scheduler-specific sections among the listed options. Those
hold settings that only make sense for a particular scheduler and are
not generally applicable.

Note that not all options are expected to be generally applicable either. We may, in the future,
add a scheduler with the concept of a QOS setting, for instance. When a setting is not applicable, it
is simply ignored.

.. code-block:: yaml

    mytest:
        scheduler: slurm

        schedule:
            nodes: 5

        run:
            cmds:
                - echo "I'm a slurm test!"


Scheduler Plugin Basics
-----------------------

Scheduler plugins are responsible for the following:

- Providing test runs with *scheduler* variables
- (Optionally) writing kickoff scripts
- Using kickoff scripts (or other mechanisms) to then run
  ``pav _run <test_run_id>`` on allocations with reasonable environments
- Generating a unique scheduler ``job_id`` for each test run
- Providing mechanisms for canceling tests
- Providing mechanisms for checking test statuses

.. _tests.scheduling.variables:

Scheduler Variables
~~~~~~~~~~~~~~~~~~~

Each scheduler must provide a set of scheduler variables. Most of these are
generic and available across all schedulers. Some of
these will be :ref:`tests.variables.deferred`. The best way to see which
scheduler variables are available is to use the ``pav show sched --vars <sched_name>``
command.

.. code-block::

    $ pav show sched --vars slurm

     Variables for the slurm scheduler plugin.
    ----------------+----------+-----------------+------------------------------------------------------
     Name           | Deferred | Example         | Help
    ----------------+----------+-----------------+------------------------------------------------------
     chunk_ids      | False    | []              | A list of indices of the available chunks.
     errors         | False    | []              | Return the list of retrieval errors encountered when
                    |          |                 | using this var_dict. Key errors are not included.
     min_cpus       | False    | 1               | Get a minimum number of cpus available on each
                    |          |                 | (filtered) noded. Defaults to 1 if unknown.
     min_mem        | False    | 4294967296      | Get a minimum for any node across each (filtered)
                    |          |                 | nodes. Returns a value in bytes (4 GB if unknown).
     node_list      | False    | []              | The list of node names on the system. If the
                    |          |                 | scheduler supports auto-detection, will be the
                    |          |                 | filtered list. This list will otherwise be empty.
     node_list_id   | False    |                 | Return the node list id, if available. This is
                    |          |                 | meaningless to test configs, but is used internally
                    |          |                 | by Pavilion.
     nodes          | False    | 1               | The number of nodes available on the system. If the
                    |          |                 | scheduler supports auto-detection, this will be the
                    |          |                 | filtered count of nodes. Otherwise, this will be the
                    |          |                 | 'cluster_info.node_count' value, or 1 if that isn't
                    |          |                 | set.
     tasks_per_node | True     | 5               | The number of tasks to create per node. If the
                    |          |                 | scheduler does not support node info, just returns
                    |          |                 | 1.
     tasks_total    | True     | 180             | Total tasks to create, based on number of nodes
                    |          |                 | actually acquired.
     test_cmd       | True     | srun -N 5 -w no | Construct a cmd to run a process under this
                    |          | de[05-10],node2 | scheduler, with the criteria specified by this test.
                    |          | 3 -n 20         |
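
As a rough illustration, scheduler variables are referenced in a test config through the
``sched`` namespace, as in ``{{sched.test_cmd}}``. The sketch below uses only variables from the
table above; the test name ``var_demo`` and the ``./my_app`` binary are hypothetical.

.. code-block:: yaml

    var_demo:
        scheduler: slurm

        schedule:
            nodes: 2

        run:
            cmds:
                # nodes and min_cpus are listed as non-deferred in the table above.
                - 'echo "nodes={{sched.nodes}} min_cpus={{sched.min_cpus}}"'
                # tasks_total and test_cmd are listed as deferred in the table above.
                - 'echo "total tasks: {{sched.tasks_total}}"'
                - '{{sched.test_cmd}} ./my_app'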

.. _tests.scheduling.jobs:

Jobs
----

When Pavilion schedules a test, it also creates a job. Jobs organize all the information used
to kick off a test (or tests!), including the kickoff script, kickoff log, job id, and symlinks
back to each test that's part of the job. Each job is named by a random hash and stored in the
``<working_dir>/jobs`` directory. Tests also refer back to their job through a symlink in each
test run directory.

The Kickoff Script
~~~~~~~~~~~~~~~~~~

The kickoff script's job is to have Pavilion run specific test run instances under an
allocation. This is generally expected to be a shell script of some sort that
will both define the allocation (if possible) and run ``pav _run <test_run_id>``
within that allocation under an environment that can find Pavilion and its
libraries.

For Slurm, the kickoff script looks something like this:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name "pav test #18697"
    #SBATCH -p standard
    #SBATCH -N 3-3
    #SBATCH --tasks-per-node=1

    # Redirect all output to kickoff.log
    exec >/usr/local/pav/working_dir/test_runs/0018697/kickoff.log 2>&1
    export PATH=/usr/local/pav/src/bin:${PATH}
    export PAV_CONFIG_FILE=/usr/local/pav/config/pavilion.yaml
    export PAV_CONFIG_DIR=/usr/local/pav/config

    pav _run 18697

job_id
~~~~~~

The plugin must assign the test run a job id. This will generally be used by
the scheduler plugin to cancel or check the status of tests. It's saved in
the job's 'job_id' file, and also as part of the test results.

Cancel Mechanisms
~~~~~~~~~~~~~~~~~

Pavilion scheduler plugins are required to provide a mechanism to cancel jobs
managed by that scheduler, whether they're currently running or queued under
the scheduler. Generally this means just using the test run's job id to
cancel the test. Cancelled tests will be given the 'SCHED_CANCELLED' status.

Status Mechanisms
~~~~~~~~~~~~~~~~~

Similarly, Pavilion scheduler plugins must be able to query the status of
jobs, and give useful feedback on their state in the scheduler. As long as
the test is in the 'SCHEDULED' or 'RUNNING' state from the test run's perspective (in the
run's status file), Pavilion will ask the scheduler for the job's
status in order to provide more up-to-date test status
information.

.. _tests.scheduling.types:

Scheduler Plugin Types
----------------------

Scheduler plugins come in two varieties: Basic and Advanced.

Basic
~~~~~

**The only 'basic' scheduler is 'raw', which only ever has one node. Most of this doesn't apply
except to user-added schedulers.**

Basic Schedulers don't know anything about the system that isn't manually configured. This
information is given via the ``schedule.cluster_info`` section (see ``pav show sched --config``).
This information should generally be set in the host config for a particular system.

Asking for 'all' nodes on a basic scheduler will result in an allocation for the
configured number of nodes, regardless of the state of those nodes.

.. code-block:: yaml

    mytest:
      schedule:
        # Tell the scheduler that this system has 60 nodes (at peak)
        cluster_info:
          node_count: 60
        # Ask for between 90% (54 nodes) and all 60 nodes
        # This gives some flexibility in case some nodes are down.
        min_nodes: '90%'
        nodes: all

Advanced
~~~~~~~~

Advanced scheduler plugins are plugins that can get an inventory of nodes and node state
from the system. Such schedulers are able to dynamically determine how many nodes are up or
available, and create allocations based on that. As a result, asking for 'all' nodes via an
advanced scheduler will get you an allocation request for all nodes that are currently up and not
otherwise filtered out by ``partition`` or other scheduler settings.
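
A minimal sketch of such a request, assuming a Slurm system with a partition named ``standard``
(both the test name and the partition are placeholders):

.. code-block:: yaml

    all_up_nodes:
        scheduler: slurm
        schedule:
            # Only nodes in this partition are considered.
            partition: standard
            # 'all' resolves to every node that is up and passes the filters.
            nodes: all
        run:
            cmds:
                - '{{sched.test_cmd}} hostname'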

Advanced schedulers also enable chunking and job sharing.

.. _tests.scheduling.job_sharing:

Job Sharing
-----------

On an advanced scheduler, when two tests have the same job parameters, they are automatically
scheduled together in the same job allocation. The kickoff script for that job will run the tests
concurrently up to the limit set by ``run.concurrent`` for each test.

Job sharing makes the most sense for short tests that cover a wide range of nodes - such tests
often take longer to set up the allocation than they do to run. By default (when
``schedule.share_allocation`` is set to ``true``), Pavilion will try to balance the sharing with
the number of nodes available - effectively distributing the test runs into jobs that span the
nodes.

With ``schedule.share_allocation`` set to ``max``, Pavilion forces as many test runs into the same
job as possible.
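
As a sketch, the two hypothetical tests below use identical ``schedule`` settings, so an advanced
scheduler could place them in the same job. The test names are made up; only options that appear
elsewhere on this page are used.

.. code-block:: yaml

    quick_check_a:
        scheduler: slurm
        schedule:
            nodes: all
            # Pack as many test runs as possible into one job.
            share_allocation: max
        run:
            cmds:
                - '{{sched.test_cmd}} hostname'

    quick_check_b:
        scheduler: slurm
        schedule:
            nodes: all
            share_allocation: max
        run:
            cmds:
                - '{{sched.test_cmd}} uptime'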

.. _tests.scheduling.chunking:

Node Filtering Exceptions
-------------------------

Advanced schedulers filter nodes down to only those which are currently usable. Pavilion offers
several mechanisms for providing exceptions to filtering rules. There are three scheduler options
that control this behavior:

1. ``include_nodes``: specifies a set of nodes to be included in every chunk. Other nodes may be
   used as well, but those specified are guaranteed to be among the final set of
   nodes on which the test is scheduled (provided they are in an 'available'
   state).
2. ``exclude_nodes``: specifies a set of nodes to be excluded when scheduling tests.
3. ``across_nodes``: specifies a set of nodes to be considered exclusively when scheduling tests. No
   nodes beyond those requested will be scheduled. The final set of nodes on which
   the test is scheduled may be a subset of those specified.

The syntax for specifying nodes is identical to that used with Slurm's ``--nodelist`` option; it can
combine full names of nodes (e.g. ``nid001``) with node ranges (e.g. ``nid[007-023]``), which can in
turn be combined with commas.

For example, the following test excludes nodes 1, 3, and 7-23 from being scheduled:

.. code-block:: yaml

    mytest:
      schedule:
        exclude_nodes: 'nid001,nid003,nid[007-023]'

To accomplish the same thing via a command-line override:

.. code-block:: bash

  pav run mytest -c schedule.exclude_nodes='nid001,nid003,nid[007-023]'
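
The other two options take the same node syntax. The sketch below is illustrative only: the node
names and test names are placeholders, and each hypothetical test uses a single option.

.. code-block:: yaml

    pinned_test:
      schedule:
        nodes: 4
        # nid004 will be among the nodes used, provided it is available.
        include_nodes: 'nid004'

    subset_test:
      schedule:
        nodes: 4
        # Only nodes nid001 through nid016 are candidates for this test.
        across_nodes: 'nid[001-016]'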

Chunking
--------

On an advanced scheduler, the ``chunking`` section of the ``schedule`` configuration enables
powerful tools for dividing up a system to test it piece by piece. It is disabled when the chunk
size is equal to all nodes on the system (the default), but can be enabled by selecting a
specific chunk size.

.. code-block:: yaml

    mytest:
      schedule:
        # When using chunking, this is relative to the chunk and not the whole system.
        nodes: all

        # Get 500 node chunks
        chunking:
          size: 500

When using chunking, Pavilion selects nodes for each job entirely in advance. This can lead
to the tests being a bit more fragile than usual: the failure of a single node can keep a test
from running even if there are 'spare' nodes outside of the chunk.

Chunk Selection
~~~~~~~~~~~~~~~

By default, Pavilion will assign each test to the least used chunk for a given set of tests. This
will distribute your tests evenly across the entire system.

It is also possible to choose a specific chunk on which each test will run, or even to create
permutations of a test such that it will run once on each chunk. The ``sched.chunk_ids`` scheduler
variable contains a list of all available chunk IDs. A common idiom is to permute over
this list (via ``permute_on``) so that each test instance's ``chunk`` setting holds the ID
of the chunk it runs on.

.. code-block:: yaml

    # This will create an instance of this test for every chunk available, giving
    # full coverage of the system.
    mytest:
      # Creates separate instance of the test for each chunk.
      permute_on: sched.chunk_ids

      # Each test instance stores the ID of the chunk it belongs to
      chunk: '{{sched.chunk_ids}}'
      schedule:
        # When using chunking, 'all' refers to all nodes in the chunk
        # rather than on the whole system.
        nodes: all

        # Get 500 node chunks
        chunking:
          size: 500
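
To pin a test to one chunk instead of permuting, a literal chunk ID can be given. This is a sketch
that assumes ``chunk`` accepts a plain ID in the same way it accepts a value from
``sched.chunk_ids``; the test name is hypothetical.

.. code-block:: yaml

    first_chunk_only:
      # Chunk 0 is used here because it is the one ID that can be expected to exist.
      chunk: '0'
      schedule:
        nodes: all
        chunking:
          size: 500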

**Note: It is not generally safe to specify chunks other than chunk '0', as chunks with indices
greater than 0 aren't guaranteed to exist.**

Node Selection
~~~~~~~~~~~~~~

By default, Pavilion selects (near) contiguous blocks of nodes for each chunk, but this is
customizable. Instead, you can select nodes randomly for each chunk (``random``), distributed across the
system (``dist``), or semi-randomly distributed (``rand-dist``). Regardless of the node selection method,
the number of chunks will be the same and they (mostly) won't overlap.

It is likely that the chunk size won't divide evenly into the total number of nodes. Nodes which make
up the remainder may be excluded or back-filled with nodes from another chunk (these nodes are always
drawn from the second to last chunk). The default behavior is to 'backfill'.

Chunking behavior is set via the ``schedule.chunking.node_selection`` and ``schedule.chunking.extra``
options.

.. code-block:: yaml

    # This test runs over a random selection of 25% of the nodes on the system.
    mytest:
      schedule:
        # When using chunking, 'all' refers to all nodes in the chunk
        # rather than on the whole system.
        nodes: all

        # Each chunk is 25% of the nodes on the system
        chunking:
          size: 25%
          node_selection: random
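
Both options can be set together. The sketch below spreads 500-node chunks across the system and
spells out the default ``backfill`` behavior explicitly. The value used to exclude remainder nodes
is not shown here and should be checked with ``pav show sched --conf``.

.. code-block:: yaml

    mytest:
      schedule:
        nodes: all
        chunking:
          size: 500
          # Draw each chunk's nodes from across the system.
          node_selection: dist
          # Fill out the final, partial chunk with nodes from another chunk.
          extra: backfill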

.. _tests.scheduling.wrapper:

Wrapper
-------

The ``schedule.wrapper`` option works with any scheduler. It appends a wrapper command to
``{{sched.test_cmd}}``, so the wrapper runs on the allocation and in turn launches the intended
command.

.. code-block:: yaml

    basic:
        scheduler: slurm
        schedule:
            wrapper: valgrind
            partition: standard
            nodes: 1

        run:
            cmds:
                # The run command will be `srun -N1 -p standard valgrind ./supermagic -a`
                # It will run `valgrind ./supermagic -a` on the allocation
                - '{{sched.test_cmd}} ./supermagic -a'

When using the ``raw`` scheduler, ``{{sched.test_cmd}}`` normally evaluates to an empty string. In
that case the wrapper becomes the entire launch command, which lets you drive a different
scheduler or launcher directly.

.. code-block:: yaml

    shoot_yourself_in_the_foot_mode:
        scheduler: raw
        schedule:
            # Note that generally it's MUCH better to use Pavilion's scheduling options,
            # but this allows you to, for example, test the scheduler itself.
            # Also note: you can use mpirun under Slurm by setting ``schedule.slurm.mpi_cmd=mpirun``.
            wrapper: 'mpirun -np 2'
        run:
            cmds:
                # With the schedule wrapper, this will be `mpirun -np 2 ./supermagic -a`
                - '{{sched.test_cmd}} ./supermagic -a'
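
For comparison, the comment in the example above mentions ``schedule.slurm.mpi_cmd``. A sketch of
that approach, assuming the option is set exactly as the comment describes (the test name is
hypothetical):

.. code-block:: yaml

    mpirun_under_slurm:
        scheduler: slurm
        schedule:
            nodes: 2
            slurm:
                # Ask the Slurm plugin to use mpirun for the test command.
                mpi_cmd: mpirun
        run:
            cmds:
                - '{{sched.test_cmd}} ./supermagic -a'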