mirror of https://github.com/m-labs/artiq.git

doc: Add section on management system communications (#2564)

Co-authored-by: architeuthidae <am@m-labs.hk>

parent 58ea3b5bcc
commit 02235b2d80
@@ -152,6 +152,10 @@ nitpick_ignore_regex = [
     (r'py:.*', r'artiq.gateware.*'),
     ('py:mod', r'artiq.test.*'),
     ('py:mod', r'artiq.applets.*'),
+    # we can't use artiq.master.* because we shouldn't ignore the scheduler
+    ('py:class', r'artiq.master.experiments.*'),
+    ('py:class', r'artiq.master.databases.*'),
+    ('py:.*', r'artiq.master.worker.*'),
     ('py:class', 'dac34H84'),
     ('py:class', 'trf372017'),
     ('py:class', r'list(.*)'),
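
For context, these entries feed Sphinx's "nitpicky" reference checking. A minimal sketch of how such a ``conf.py`` fragment behaves (the surrounding lines here are an assumption for illustration, not copied from the repository)::

    # With nitpicky mode on, Sphinx warns about every cross-reference it
    # cannot resolve; nitpick_ignore_regex silences warnings whose
    # (domain:role, target) pair matches one of these regex tuples.
    nitpicky = True

    nitpick_ignore_regex = [
        # any role (py:class, py:meth, ...) pointing into the worker code
        (r'py:.*', r'artiq.master.worker.*'),
        # unresolvable class references into the master's databases
        ('py:class', r'artiq.master.databases.*'),
    ]
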
@@ -66,30 +66,30 @@ Experiment scheduling
 Basics
 ^^^^^^
 
-To make more efficient use of hardware resources, experiments are generally split into three phases and pipelined, such that potentially compute-intensive pre-computation or analysis phases may be executed in parallel with the bodies of other experiments, which access hardware.
+To make more efficient use of resources, experiments are generally split into three phases and pipelined. While one experiment has control of the specialized hardware, others may carry out pre-computation or post-analysis in parallel. There are three stages of a standard experiment users may write code for:
+
+1. The **preparation** stage, which pre-fetches and pre-computes any data that is necessary to run the experiment. Users may implement this stage by overloading the :meth:`~artiq.language.environment.Experiment.prepare` method. It is not permitted to access hardware in this stage.
+
+2. The **run** stage, which corresponds to the body of the experiment. Users *must* implement this stage and overload the :meth:`~artiq.language.environment.Experiment.run` method. In this stage, the experiment has the right to run kernels and access hardware.
+
+3. The **analysis** stage, where raw results collected in the running stage can be post-processed and/or saved. This stage may be implemented by overloading the :meth:`~artiq.language.environment.Experiment.analyze` method. It is not permitted to access hardware in this stage.
 
 .. seealso::
-    These steps are implemented in :class:`~artiq.language.environment.Experiment`. However, user-written experiments should usually derive from (sub-class) :class:`artiq.language.environment.EnvExperiment`, which additionally provides access to the methods of :class:`artiq.language.environment.HasEnvironment`.
+    These steps are implemented in :class:`artiq.language.environment.Experiment`. User-written experiments should usually derive from (sub-class) :class:`artiq.language.environment.EnvExperiment`, which additionally provides access to the methods of :class:`~artiq.language.environment.HasEnvironment`.
 
-There are three stages of a standard experiment users may write code in:
+Only the :meth:`~artiq.language.environment.Experiment.run` method implementation is mandatory; if the experiment does not fit into the pipelined scheduling model, it can leave one or both of the other methods empty (which is the default). Preparation and analysis stages are forbidden from accessing hardware so as not to interfere with a potential concurrent run stage. Note that they are not *prevented* from doing so, and it is up to the programmer to respect these guidelines.
 
-1. The **preparation** stage, which pre-fetches and pre-computes any data that necessary to run the experiment. Users may implement this stage by overloading the :meth:`~artiq.language.environment.Experiment.prepare` method. It is not permitted to access hardware in this stage, as doing so may conflict with other experiments using the same devices.
-2. The **run** stage, which corresponds to the body of the experiment and generally accesses hardware. Users must implement this stage and overload the :meth:`~artiq.language.environment.Experiment.run` method.
-3. The **analysis** stage, where raw results collected in the running stage are post-processed and may lead to updates of the parameter database. This stage may be implemented by overloading the :meth:`~artiq.language.environment.Experiment.analyze` method.
-
-Only the :meth:`~artiq.language.environment.Experiment.run` method implementation is mandatory; if the experiment does not fit into the pipelined scheduling model, it can leave one or both of the other methods empty (which is the default).
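
To make the three-stage split concrete, here is a minimal sketch of an experiment implementing all three methods (the class name and the dummy payload are invented for illustration; only ``run()`` is mandatory)::

    from artiq.experiment import EnvExperiment, kernel, delay, ms

    class StagedExample(EnvExperiment):
        def build(self):
            self.setattr_device("core")

        def prepare(self):
            # preparation stage: host-side pre-computation, no hardware access
            self.delays = [1.0 * ms] * 10

        @kernel
        def run(self):
            # run stage: the only mandatory method; may run kernels
            # and access hardware
            self.core.reset()
            for d in self.delays:
                delay(d)

        def analyze(self):
            # analysis stage: post-processing, again without hardware access
            print("post-processing of collected results happens here")
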
 
-Consecutive experiments are then executed in a pipelined manner by the ARTIQ master's scheduler: first experiment A runs its preparation stage, than experiment A executes its running stage while experiment B executes its preparation stage, and so on.
+Consecutive experiments are automatically pipelined by the ARTIQ master's scheduler: first experiment A executes its preparation stage, then experiment A executes its running stage while experiment B executes its preparation stage, and so on.
 
 .. note::
-    The next experiment (B) may start its :meth:`~artiq.language.environment.Experiment.run` before all events placed into (core device) RTIO buffers by the previous experiment (A) have been executed. These events may then execute while experiment B's :meth:`~artiq.language.environment.Experiment.run` is already in progress. Using :meth:`~artiq.coredevice.core.Core.reset` in experiment B will clear the RTIO buffers, discarding pending events, including those left over from A.
+    An experiment A can exit its :meth:`~artiq.language.environment.Experiment.run` method before all its RTIO events have been executed, i.e., while those events are still 'waiting' in the RTIO core buffers. If the next experiment entering the running stage uses :meth:`~artiq.coredevice.core.Core.reset`, those buffers will be cleared, and any remaining events discarded, potentially including those scheduled by A.
 
-    Interactions between events of different experiments can be avoided by preventing the :meth:`~artiq.language.environment.Experiment.run` method of experiment A from returning until all events have been executed. This is discussed in the section on RTIO :ref:`rtio-handover-synchronization`.
+    This is a deliberate feature of seamless handover, but can cause problems if the events scheduled by A were important and should not have been skipped. In those cases, it is recommended to ensure the :meth:`~artiq.language.environment.Experiment.run` method of experiment A does not return until *all* its scheduled events have been executed, or that it is followed only by experiments which do not perform a core reset. See also :ref:`RTIO Synchronization<rtio-handover-synchronization>`.
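
As a sketch of that recommendation, experiment A can block until its timeline has fully executed before returning (``ttl0`` is an assumed device database entry)::

    from artiq.experiment import EnvExperiment, kernel, now_mu, us

    class FlushBeforeHandover(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("ttl0")

        @kernel
        def run(self):
            self.core.reset()
            self.ttl0.pulse(10 * us)
            # block until the wall clock catches up with the timeline cursor,
            # so that a core reset by the next experiment cannot discard
            # events still pending in the RTIO buffers
            self.core.wait_until_mu(now_mu())
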
 
 Priorities and timed runs
 ^^^^^^^^^^^^^^^^^^^^^^^^^
 
-When determining what experiment should begin executing next (i.e. enter the preparation stage), the scheduling looks at the following factors, by decreasing order of precedence:
+When determining what experiment should begin executing next (i.e. enter its preparation stage), the scheduling looks at the following factors, by decreasing order of precedence:
 
 1. Experiments may be scheduled with a due date. This is considered the *earliest possible* time of their execution (rather than a deadline, or latest possible -- ARTIQ makes no guarantees about experiments being started or completed before any specified time). If a due date is set and it has not yet been reached, the experiment is not eligible for preparation.
 2. The integer priority value specified by the user.
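
By way of example, both values can be set at submission time. A sketch using the ``scheduler`` virtual device from inside another experiment (the file and class names are hypothetical; due dates are given as UNIX timestamps)::

    import time

    from artiq.experiment import EnvExperiment

    class SubmitTimed(EnvExperiment):
        def build(self):
            self.setattr_device("scheduler")

        def run(self):
            expid = {
                "log_level": 30,               # logging.WARNING
                "file": "repository/flop.py",  # hypothetical experiment file
                "class_name": "Flop",          # hypothetical experiment class
                "arguments": {},
            }
            # eligible for preparation no earlier than one hour from now,
            # above the default priority of 0
            self.scheduler.submit("main", expid, priority=1,
                                  due_date=time.time() + 3600, flush=False)
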
@@ -106,11 +106,36 @@ When using multiple pipelines it is the responsibility of the user to ensure tha
 Pauses
 ^^^^^^
 
-In the run stage, an experiment may yield to the scheduler by calling the :meth:`pause` method of the scheduler.
-If there are other experiments with higher priority (e.g. a high-priority experiment has been newly submitted, or reached its due date and become eligible for execution), the higher-priority experiments are executed first, and then :meth:`pause` returns. If there are no such experiments, :meth:`pause` returns immediately. To check whether :meth:`pause` would in fact *not* return immediately, use :meth:`artiq.master.scheduler.Scheduler.check_pause`.
+In the run stage, an experiment may yield to the scheduler by calling the :meth:`pause` method of the scheduler. If there are other experiments with higher priority (e.g. a high-priority experiment has been newly submitted, or reached its due date and become eligible for execution), the higher-priority experiments are executed first, and then :meth:`pause` returns. If there are no such experiments, :meth:`pause` returns immediately. To check whether :meth:`pause` would in fact *not* return immediately, use :meth:`~artiq.master.scheduler.Scheduler.check_pause`.
 
-The experiment must place the hardware in a safe state and disconnect from the core device (typically, by calling ``self.core.comm.close()`` from the kernel, which is equivalent to :meth:`artiq.coredevice.core.Core.close`) before calling :meth:`pause`.
+The experiment must place the hardware in a safe state and disconnect from the core device before calling :meth:`pause` - typically by calling ``self.core.comm.close()``, which is equivalent to :meth:`~artiq.coredevice.core.Core.close`, from the host after completion of the kernel.
 
-Accessing the :meth:`pause` and :meth:`~artiq.master.scheduler.Scheduler.check_pause` methods is done through a virtual device called ``scheduler`` that is accessible to all experiments. The scheduler virtual device is requested like regular devices using :meth:`~artiq.language.environment.HasEnvironment.get_device` (``self.get_device()``) or :meth:`~artiq.language.environment.HasEnvironment.setattr_device` (``self.setattr_device()``).
+Accessing the :meth:`pause` and :meth:`~artiq.master.scheduler.Scheduler.check_pause` methods is done through a virtual device called ``scheduler`` that is accessible to all experiments. The scheduler virtual device is requested like any other device, with :meth:`~artiq.language.environment.HasEnvironment.get_device` or :meth:`~artiq.language.environment.HasEnvironment.setattr_device`. See also the detailed reference on the :doc:`mgmt_system_reference` page.
 
-:meth:`~artiq.master.scheduler.Scheduler.check_pause` can be called (via RPC) from a kernel, but :meth:`pause` must not be.
+.. note::
+    For maximum compatibility, the ``scheduler`` virtual device can also be accessed when running experiments with :mod:`~artiq.frontend.artiq_run`. However, since there is no :mod:`~artiq.master.scheduler.Scheduler` backend, the methods are replaced by simple dummies, e.g. :meth:`~artiq.master.scheduler.Scheduler.check_pause` simply returns false, and requests are printed into the console. Much the same is true of client control broadcasts (see again :doc:`mgmt_system_reference`).
+
+:meth:`~artiq.master.scheduler.Scheduler.check_pause` can be called (via RPC) from a kernel, but :meth:`pause` cannot be.
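
Put together, a long-running experiment typically alternates bounded kernel batches with calls to :meth:`pause`. A sketch of the pattern (``ttl0`` and the batch body are assumptions for illustration)::

    from artiq.experiment import EnvExperiment, kernel, delay, ms

    class PausingLoop(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("scheduler")
            self.setattr_device("ttl0")

        @kernel
        def run_batch(self):
            self.core.reset()
            while not self.scheduler.check_pause():  # RPC to the master
                self.ttl0.pulse(1 * ms)
                delay(10 * ms)

        def run(self):
            while True:
                self.run_batch()
                # the kernel has returned and the hardware is idle:
                # disconnect from the core device, then yield
                self.core.comm.close()
                self.scheduler.pause()  # raises TerminationRequested if the
                                        # experiment is deleted while paused
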
+
+Internal details
+----------------
+
+Internally, the ARTIQ management system uses Simple Python Communications, or `SiPyCo <https://github.com/m-labs/sipyco>`_, which was originally written as part of ARTIQ and later split away as a generic communications library. The SiPyCo manual is hosted `here <https://m-labs.hk/artiq/sipyco-manual/>`_. The core of the management system is largely contained within ``artiq.master``, which contains the :class:`~artiq.master.scheduler.Scheduler`, the various environment and filesystem databases, and the worker processes that execute the experiments themselves.
+
+By default, the master communicates with other processes over four network ports, see :doc:`default_network_ports`, for logging, broadcasts, notifications, and control. All four of these can be customized by using the ``--port`` flags, see :ref:`the front-end reference<frontend-artiq-master>`.
+
+- The logging port is occupied by a :class:`sipyco.logging_tools.Server`, and used only by the worker processes to transmit exceptions and other information to the master.
+- The broadcast port is occupied by a :class:`sipyco.broadcast.Broadcaster`, which inherits from :class:`sipyco.pc_rpc.AsyncioServer`. Both the dashboard and the client automatically connect to this port, using :class:`sipyco.broadcast.Receiver` to receive logs and CCB messages.
+- The notification port is occupied by a :class:`sipyco.sync_struct.Publisher`. The dashboard and client automatically connect to this port, using :class:`sipyco.sync_struct.Subscriber`. Several objects are given to the :class:`~sipyco.sync_struct.Publisher` to monitor, among them the experiment schedule, the device database, the dataset database, and the experiment list. It notifies the subscribers whenever these objects are modified.
+- The control port is occupied by a :class:`sipyco.pc_rpc.Server`, which when running can be queried with :mod:`sipyco.sipyco_rpctool` like any other source of RPC targets. Multiple concurrent calls to target methods are supported. Through this server, the clients are provided with access to control methods to access the various databases and repositories the master handles, through classes like :class:`artiq.master.databases.DeviceDB`, :class:`artiq.master.databases.DatasetDB`, and :class:`artiq.master.experiments.ExperimentDB`.
+
+The experiment database is supported by :class:`artiq.master.experiments.GitBackend` when Git integration is active, and :class:`artiq.master.experiments.FilesystemBackend` if not.
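
As an illustration of the notification port, an external script can mirror the experiment schedule with :class:`sipyco.sync_struct.Subscriber`. A sketch, assuming a master on localhost and the default notification port 3250 (see :doc:`default_network_ports`)::

    import asyncio

    from sipyco.sync_struct import Subscriber

    def init_schedule(value):
        # called once with the initial state of the published object
        return dict(value)

    def on_change(mod):
        print("schedule modified:", mod)

    async def main():
        sub = Subscriber("schedule", init_schedule, notify_cb=on_change)
        await sub.connect("::1", 3250)
        try:
            await asyncio.sleep(30)  # watch the schedule for 30 seconds
        finally:
            await sub.close()

    asyncio.run(main())

In the same spirit, the control port can be explored from the command line; with the default port 3251, ``sipyco_rpctool ::1 3251 list-targets`` should list the master's RPC targets.
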
+
+Experiment workers
+^^^^^^^^^^^^^^^^^^
+
+The :mod:`~artiq.frontend.artiq_run` tool makes use of many of the same databases and handlers as the master (whereas the scheduler and CCB manager are replaced by dummies, as mentioned above), but also directly runs the build, run, and analyze stages of the experiments. On the other hand, within the management system, the master's :class:`~artiq.master.scheduler.Scheduler` spawns a new worker process for each experiment. This allows for the parallelization of stages and pipelines described above in :ref:`experiment-scheduling`.
+
+The master and the worker processes communicate through IPC, Inter Process Communication, implemented with :mod:`sipyco.pipe_ipc`. Specifically, it is :mod:`artiq.master.worker_impl` which is spawned as a new process for each experiment, and the class :class:`artiq.master.worker.Worker` which manages the IPC requests of the workers, including access to :class:`~artiq.master.scheduler.Scheduler` but also to devices, datasets, arguments, and CCBs. This allows the worker to support experiment :meth:`~artiq.language.environment.HasEnvironment.build` methods and the :doc:`management system interfaces <mgmt_system_reference>`.
+
+The worker process also executes the experiment code itself. Within the experiment, kernel decorators -- :class:`~artiq.language.core.kernel`, :class:`~artiq.language.core.subkernel`, etc. -- call the ARTIQ compiler as necessary and trigger core device execution.