doc: Add "Using Drtio" and overhaul DRTIO page

2025-02-19 05:49:45 +08:00 · 2024-06-12 18:22:46 +08:00 · 2024-06-12 18:22:46 +08:00 · 1ca7a3c6a3
commit 1ca7a3c6a3
parent 102506d603
5 changed files with 238 additions and 215 deletions
--- a/doc/manual/drtio.rst
+++ b/doc/manual/drtio.rst
@ -1,17 +1,11 @@
-Distributed Real Time Input/Output (DRTIO)
-==========================================
+DRTIO system 
+============

-DRTIO is a time and data transfer system that allows ARTIQ RTIO channels to be distributed among several satellite devices synchronized and controlled by a central core device.
+DRTIO is the time and data transfer system that allows ARTIQ RTIO channels to be distributed among several satellite devices, synchronized and controlled by a central core device. The main source of DRTIO traffic is the remote control of RTIO output and input channels. The protocol is optimized to maximize throughput and minimize latency, and handles flow control and error conditions (underflows, overflows, etc.).

-The link is a high speed duplex serial line operating at 1Gbps or more, over copper or optical fiber.
+The DRTIO protocol also supports auxiliary traffic, which is low-priority and non-realtime, e.g., to override and monitor TTL I/Os. Auxiliary traffic never interrupts or delays the main traffic, so it cannot cause unexpected latencies or exceptions (e.g. RTIO underflows).

-The main source of DRTIO traffic is the remote control of RTIO output and input channels. The protocol is optimized to maximize throughput and minimize latency, and handles flow control and error conditions (underflows, overflows, etc.)
-
-The DRTIO protocol also supports auxiliary, low-priority and non-realtime traffic. The auxiliary channel supports overriding and monitoring TTL I/Os. Auxiliary traffic never interrupts or delays the main traffic, so that it cannot cause unexpected poor performance (e.g. RTIO underflows).
-
-Time transfer and clock syntonization is typically done over the serial link alone. The DRTIO code is organized as much as possible to support porting to different types of transceivers (Xilinx MGTs, Altera MGTs, soft transceivers running off regular FPGA IOs, etc.) and different synchronization mechanisms.
-
-The lower layers of DRTIO are similar to White Rabbit, with the following main differences:
+The lower layers of DRTIO are similar to `White Rabbit <https://white-rabbit.web.cern.ch/>`_, with the following main differences:

 * lower latency
 * deterministic latency
@ -20,80 +14,31 @@ The lower layers of DRTIO are similar to White Rabbit, with the following main d
 * no Ethernet compatibility
 * only star or tree topologies are supported

-From ARTIQ kernels, DRTIO channels are used in the same way as local RTIO channels.
-
-.. _using-drtio:
-
-Using DRTIO
-----------
+Time transfer and clock syntonization are typically done over the serial link alone. The DRTIO code is written, as far as possible, to support porting to different types of transceivers (Xilinx MGTs, Altera MGTs, soft transceivers running off regular FPGA IOs, etc.) and different synchronization mechanisms.

 Terminology
-+++++++++++
+-----------

-In a system of interconnected DRTIO devices, each RTIO core (driving RTIO PHYs; for example a RTIO core would connect to a large bank of TTL signals) is assigned a number and is called a *destination*. One DRTIO device normally contains one RTIO core.
+In a system of interconnected DRTIO devices, each RTIO core is assigned a number and is called a *destination*. One DRTIO device normally contains one RTIO core.

 On one DRTIO device, the immediate path that a RTIO request must take is called a *hop*: the request can be sent to the local RTIO core, or to another device downstream. Each possible hop is assigned a number. Hop 0 is normally the local RTIO core, and hops 1 and above correspond to the respective downstream ports of the device.

 DRTIO devices are arranged in a tree topology, with the core device at the root. For each device, its distance from the root (in number of devices that are crossed) is called its *rank*. The root has rank 0, the devices immediately connected to it have rank 1, and so on.

 The routing table
-+++++++++++++++++
+-----------------

 The routing table defines, for each destination, the list of hops ("route") that must be taken from the root in order to reach it.

-It is stored in a binary format that can be manipulated with the :ref:`artiq_route utility <routing-table-tool>`. The binary file is then programmed into the flash storage of the core device under the ``routing_table`` key. It is automatically distributed to downstream devices when the connections are established. Modifying the routing table requires rebooting the core device for the new table to be taken into account.
-
-All routes must end with the local RTIO core of the last device (0).
-
-The local RTIO core of the core device is a destination like any other, and it needs to be explicitly part of the routing table for kernels to be able to access it.
-
-If no routing table is programmed, the core device takes a default routing table for a star topology (i.e. with no devices of rank 2 or above), with destination 0 being the core device's local RTIO core and destinations 1 and above corresponding to devices on the respective downstream ports.
-
-Here is an example of creating and programming a routing table for a chain of 3 devices: ::
-
-    # create an empty routing table
-    $ artiq_route rt.bin init
-
-    # set destination 0 to the local RTIO core
-    $ artiq_route rt.bin set 0 0
-
-    # for destination 1, first use hop 1 (the first downstream port)
-    # then use the local RTIO core of that second device.
-    $ artiq_route rt.bin set 1 1 0
-
-    # for destination 2, use hop 1 and reach the second device as
-    # before, then use hop 1 on that device to reach the third
-    # device, and finally use the local RTIO core (hop 0) of the
-    # third device.
-    $ artiq_route rt.bin set 2 1 1 0
-
-    $ artiq_route rt.bin show
-      0:   0
-      1:   1   0
-      2:   1   1   0
-
-    $ artiq_coremgmt config write -f routing_table rt.bin
-
-Addressing distributed RTIO cores from kernels
-++++++++++++++++++++++++++++++++++++++++++++++
-
-Remote RTIO channels are accessed in the same way as local ones. Bits 16-24 of the RTIO channel number define the destination. Bits 0-15 of the RTIO channel number select the channel within the destination.
-
-Link establishment
-++++++++++++++++++
-
-After devices have booted, it takes several seconds for all links in a DRTIO system to become established (especially with the long locking times of low-bandwidth PLLs that are used for jitter reduction purposes). Kernels should not attempt to access destinations until all required links are up (when this happens, the ``RTIODestinationUnreachable`` exception is raised). ARTIQ provides the method :meth:`~artiq.coredevice.core.Core.get_rtio_destination_status` that determines whether a destination can be reached. We recommend calling it in a loop in your startup kernel for each important destination, to delay startup until they all can be reached.
-
-Latency
-+++++++
-
-Each hop increases the RTIO latency of a destination by a significant amount; that latency is however constant and can be compensated for in kernels. To limit latency in a system, fully utilize the downstream ports of devices to reduce the depth of the tree, instead of creating chains.
+It is stored in a binary format that can be generated and manipulated with the :ref:`artiq_route utility <routing-table-tool>`; see :ref:`drtio-routing`. The binary file is programmed into the flash storage of the core device under the ``routing_table`` key. It is automatically distributed to downstream devices when the connections are established. Modifying the routing table requires rebooting the core device for the new table to be taken into account.

 Internal details
 ----------------

+Bits 16-24 of the RTIO channel number (assigned to a respective device in the initial system description JSON, and specified again for use of the ARTIQ front-end in the device database) define the destination. Bits 0-15 of the RTIO channel number select the channel within the destination.
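The channel number layout described above can be illustrated with a minimal host-side sketch. The helper names ``split_channel`` and ``join_channel`` are hypothetical and are not part of ARTIQ; the only thing taken from the text is that the destination field starts at bit 16 and the local channel occupies bits 0-15.

```python
# Sketch only: split_channel/join_channel are hypothetical helper names.
# Assumption from the text: destination field starts at bit 16,
# local channel number occupies bits 0-15.


def split_channel(channel: int) -> tuple[int, int]:
    """Return (destination, local_channel) for a full RTIO channel number."""
    destination = channel >> 16         # upper bits select the destination
    local_channel = channel & 0xFFFF    # bits 0-15 select the channel there
    return destination, local_channel


def join_channel(destination: int, local_channel: int) -> int:
    """Build a full RTIO channel number from its two fields."""
    if not 0 <= local_channel <= 0xFFFF:
        raise ValueError("local channel must fit in 16 bits")
    return (destination << 16) | local_channel


print(hex(join_channel(1, 8)))   # 0x10008
print(split_channel(0x020003))   # (2, 3)
```

Under this layout, channel 8 on destination 1 is written as ``0x010008``, which is the kind of value that would appear for a remote device in the device database.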
+
 Real-time and auxiliary packets
-+++++++++++++++++++++++++++++++
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 DRTIO is a packet-based protocol that uses two types of packets:

@ -101,9 +46,9 @@ DRTIO is a packet-based protocol that uses two types of packets:
 * auxiliary packets, which are lower-bandwidth and are used for ancillary tasks such as housekeeping and monitoring/injection. Auxiliary packets are low-priority and their transmission has no impact on the timing of real-time packets (however, transmission of real-time packets slows down the transmission of auxiliary packets). In the ARTIQ DRTIO implementation, the contents of the auxiliary packets are read and written directly by the firmware, with the gateware simply handling the transmission of the raw data.

 Link layer
-++++++++++
+^^^^^^^^^^

-The lower layer of the DRTIO protocol stack is the link layer, which is responsible for delimiting real-time and auxiliary packets, and assisting with the establishment of a fixed-latency high speed serial transceiver link.
+The lower layer of the DRTIO protocol stack is the link layer, which is responsible for delimiting real-time and auxiliary packets and assisting with the establishment of a fixed-latency high speed serial transceiver link.

 DRTIO uses the IBM (Widmer and Franaszek) 8b/10b encoding. D characters (the encoded 8b symbols) always transmit real-time packet data, whereas K characters are used for idling and transmitting auxiliary packet data.

@ -113,7 +58,8 @@ A real-time packet is defined by a series of D characters containing the packet'

 K characters, which are transmitted whenever there is no real-time data to transmit and to delimit real-time packets, are chosen using a 3-bit K selection word. If this K character is the first character in the set of N characters processed by the transceiver in the logic clock cycle, the mapping between the K selection word and the 8b/10b K space contains commas. If the K character is any of the subsequent characters processed by the transceiver, a different mapping is used that does not contain any commas. This scheme allows the receiver to align its logic clock with that of the transmitter, simply by shifting its logic clock so that commas are received into the first character position.

-.. note:: Due to the shoddy design of transceiver hardware, this simple process of clock and comma alignment is difficult to perform in practice. The paper "High-speed, fixed-latency serial links with Xilinx FPGAs" (by Xue LIU, Qing-xu DENG, Bo-ning HOU and Ze-ke WANG) discusses techniques that can be used. The ARTIQ implementation simply keeps resetting the receiver until the comma is aligned, since relatively long lock times are acceptable.
+.. note:: 
+    Due to the shoddy design of transceiver hardware, this simple process of clock and comma alignment is difficult to perform in practice. The paper "High-speed, fixed-latency serial links with Xilinx FPGAs" (by Xue LIU, Qing-xu DENG, Bo-ning HOU and Ze-ke WANG) discusses techniques that can be used. The ARTIQ implementation simply keeps resetting the receiver until the comma is aligned, since relatively long lock times are acceptable.

 The series of K selection words is then used to form auxiliary packets and the idle pattern. When there is no auxiliary packet to transfer or to delimitate auxiliary packets, the K selection word ``100`` is used. To transfer data from an auxiliary packet, the K selection word ``0ab`` is used, with ``ab`` containing two bits of data from the packet. An auxiliary packet is delimited by at least one ``100`` K selection word.
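As a rough illustration of the framing just described, the following sketch turns an auxiliary packet into a sequence of K selection words and back. It is not the gateware implementation: the function names are hypothetical, scrambling is ignored, and the bit order (most significant pair first) is an assumption that the text does not specify.

```python
# Sketch only, not the ARTIQ gateware. Assumptions: MSB-first pair order,
# no scrambling, packets contain a whole number of bytes.

IDLE = 0b100  # K selection word used for idling and for delimiting packets


def aux_packet_to_k_words(payload: bytes) -> list[int]:
    """Encode one auxiliary packet as K selection words, delimited by IDLE."""
    words = [IDLE]
    for byte in payload:
        for shift in (6, 4, 2, 0):
            # Data words have the form 0ab, carrying two payload bits.
            words.append((byte >> shift) & 0b11)
    words.append(IDLE)
    return words


def _pairs_to_bytes(pairs: list[int]) -> bytes:
    """Reassemble groups of four 2-bit values into bytes."""
    out = bytearray()
    for i in range(0, len(pairs) - len(pairs) % 4, 4):
        a, b, c, d = pairs[i:i + 4]
        out.append((a << 6) | (b << 4) | (c << 2) | d)
    return bytes(out)


def k_words_to_aux_packets(words: list[int]) -> list[bytes]:
    """Recover the auxiliary packets terminated by an IDLE word."""
    packets = []
    pairs = []
    for word in words:
        if word == IDLE:
            if pairs:  # an IDLE word closes the packet in progress
                packets.append(_pairs_to_bytes(pairs))
                pairs = []
        else:
            pairs.append(word & 0b11)
    return packets


print([format(w, "03b") for w in aux_packet_to_k_words(b"\xa5")])
# ['100', '010', '010', '001', '001', '100']
```

Each K character carries only two payload bits, which is why auxiliary traffic is lower-bandwidth than real-time traffic on the same link.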

@ -122,23 +68,23 @@ Both real-time traffic and K selection words are scrambled in order to make the
 Due to the use of K characters both as delimiters for real-time packets and as information carrier for auxiliary packets, auxiliary traffic is guaranteed a minimum bandwidth simply by having a maximum size limit on real-time packets.

 Clocking
-++++++++
+^^^^^^^^

 At the DRTIO satellite device, the recovered and aligned transceiver clock is used for clocking RTIO channels, after appropriate jitter filtering using devices such as the Si5324. The same clock is also used for clocking the DRTIO transmitter (loop timing), which simplifies clock domain transfers and allows for precise round-trip-time measurements to be done.

 RTIO clock synchronization
-++++++++++++++++++++++++++
+^^^^^^^^^^^^^^^^^^^^^^^^^^

 As part of the DRTIO link initialization, a real-time packet is sent by the core device to each satellite device to make them load their respective timestamp counters with the timestamp values from their respective packets.

 RTIO outputs
-++++++++++++
+^^^^^^^^^^^^

 Controlling a remote RTIO output involves placing the RTIO event into the buffer of the destination. The core device maintains a cache of the buffer space available in each destination. If, according to the cache, there is space available, then a packet containing the event information (timestamp, address, channel, data) is sent immediately and the cached value is decremented by one. If, according to the cache, no space is available, then the core device sends a request for the space available in the destination and updates the cache. The process repeats until at least one remote buffer entry is available for the event, at which point a packet containing the event information is sent as before.
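The buffer-space caching described above can be sketched in a few lines. This is a simplified model rather than the actual gateware or firmware: the class name and the two callables ``query_space`` and ``send_event`` are hypothetical, and the cache is assumed to start empty so that the first event forces a space request.

```python
# Sketch only: OutputSpaceCache, query_space and send_event are hypothetical.
# Assumption: the cached space for an unknown destination starts at 0.


class OutputSpaceCache:
    def __init__(self, query_space, send_event):
        self._query_space = query_space  # callable(destination) -> free entries
        self._send_event = send_event    # callable(destination, event)
        self._space = {}                 # destination -> cached free entries

    def submit(self, destination, event):
        # While the cache says the remote buffer is full, ask the destination
        # for its available space and refresh the cache.
        while self._space.get(destination, 0) == 0:
            self._space[destination] = self._query_space(destination)
        # Space is available: send the event and account for the used entry.
        self._send_event(destination, event)
        self._space[destination] -= 1


queries = []
sent = []
cache = OutputSpaceCache(
    query_space=lambda dest: (queries.append(dest) or 2),  # always 2 free
    send_event=lambda dest, ev: sent.append((dest, ev)),
)
for timestamp in (100, 200, 300):
    cache.submit(1, timestamp)
print(len(queries), len(sent))  # 2 3
```

In this model only two space requests are needed for three events, which reflects the point of the cache: most events are sent immediately without a round trip to the destination.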

 Detecting underflow conditions is the responsibility of the core device; should an underflow occur then no DRTIO packet is transmitted. Sequence errors are handled similarly.

 RTIO inputs
-+++++++++++
+^^^^^^^^^^^

 The core device sends a request to the satellite for reading data from one of its channels. The request contains a timeout, which is the RTIO timestamp to wait for until an input event appears. The satellite then replies with either an input event (containing timestamp and data), a timeout, or an overflow error.
--- a/doc/manual/getting_started_core.rst
+++ b/doc/manual/getting_started_core.rst
@ -213,6 +213,8 @@ The core device records the real-time I/O waveforms into a circular buffer. It i

 Afterwards, the recorded data can be extracted and written to a VCD file using ``artiq_coreanalyzer -w rtio.vcd`` (see :ref:`core-device-rtio-analyzer-tool`). VCD files can be viewed using third-party tools such as GtkWave.

+.. _getting-started-dma: 
+
 Direct Memory Access (DMA)
 --------------------------

@ -252,136 +254,3 @@ Try this: ::
                self.core_dma.playback_handle(pulses_handle)

 For more documentation on the methods used, see the :mod:`artiq.coredevice.dma` reference.
-
-Distributed Direct Memory Access (DDMA)
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-By default on DRTIO systems, all events recorded by the DMA core are kept and played back on the master.
-
-With distributed DMA, RTIO events that should be played back on remote destinations are distributed to the corresponding satellites. In some cases (typically, large buffers on several satellites with high event throughput), it allows for better performance and higher bandwidth, as the RTIO events do not have to be sent over the DRTIO link(s) during playback.
-
-To enable distributed DMA, simply provide an ``enable_ddma=True`` argument for the :meth:`~artiq.coredevice.dma.CoreDMA.record` method - taking a snippet from the previous example: ::
-
-        @kernel
-        def record(self):
-            with self.core_dma.record("pulses", enable_ddma=True):
-                # all RTIO operations now go to the "pulses"
-                # DMA buffer, instead of being executed immediately.
-                for i in range(50):
-                    self.ttl0.pulse(100*ns)
-                    delay(100*ns)
-
-In standalone systems this argument is ignored and has no effect.  
-
-Enabling DDMA on a purely local sequence on a DRTIO system introduces an overhead during trace recording which comes from additional processing done on the record, so careful use is advised. Due to the extra time that communicating with relevant satellites takes, an additional delay before playback may be necessary to prevent a :exc:`~artiq.coredevice.exceptions.RTIOUnderflow` when playing back a DDMA-enabled sequence.
-
-Subkernels
----------
-
-Subkernels refers to kernels running on a satellite device. This allows offloading some processing and control over remote RTIO devices, freeing up resources on the master.
-
-Subkernels behave for the most part like regular kernels; they accept arguments and can return values. However, there are few caveats:
-
-   - they do not support RPCs,
-   - they do not support DRTIO,
-   - their return value must be fully annotated with an ARTIQ type,
-   - their arguments should be annotated, and only basic ARTIQ types are supported,
-   - while ``self`` is allowed, there is no attribute writeback - any changes will be discarded when the subkernel is completed,
-   - they can raise exceptions, but the exceptions cannot be caught by the master (rather, they are propagated directly to the host),
-   - they begin execution as soon as possible when called, and can be awaited.
-
-To define a subkernel, use the subkernel decorator (``@subkernel(destination=X)``). The destination is the satellite number as defined in the routing table, and must be between 1 and 255. To call a subkernel, call it like a normal function; and to await its result, use ``subkernel_await(function, [timeout])``.
-
-For example, a subkernel performing integer addition: ::
-
-    from artiq.experiment import *
-
-
-    @subkernel(destination=1)
-    def subkernel_add(a: TInt32, b: TInt32) -> TInt32:
-        return a + b
-
-    class SubkernelExperiment(EnvExperiment):
-        def build(self):
-            self.setattr_device("core")
-
-        @kernel
-        def run(self):
-            subkernel_add(2, 2)
-            result = subkernel_await(subkernel_add)
-            assert result == 4
-
-Sometimes subkernel execution may take large amounts of time. By default, the await function will wait as long as necessary. If a timeout is needed, it can be set using the optional argument of ``subkernel_await()``. The value given is interpreted in milliseconds. If a negative value is given, timeout is disabled. 
-
-Subkernels are compiled after the main kernel and immediately uploaded to satellites. When called, the master instructs the appropriate satellite to load the subkernel into their kernel core and run it. If the subkernel is complex, and its binary relatively large, the delay between the call and actually running the subkernel may be substantial; if it's necessary to minimize this delay, ``subkernel_preload(function)`` should be used before the call.
-
-While ``self`` is accepted as an argument for subkernels, it is embedded into the compiled data. Any changes made by the main kernel or other subkernels will not be available.
-
-Subkernels can call other kernels and subkernels. For a more complex example: ::
-
-    from artiq.experiment import *
-
-    class SubkernelExperiment(EnvExperiment):
-        def build(self):
-            self.setattr_device("core")
-            self.setattr_device("ttl0")
-            self.setattr_device("ttl8")  # assuming it's on satellite
-
-        @subkernel(destination=1)
-        def add_and_pulse(self, a: TInt32, b: TInt32) -> TInt32:
-            c = a + b
-            self.pulse_ttl(c)
-            return c
-
-        @subkernel(destination=1)
-        def pulse_ttl(self, delay: TInt32) -> TNone:
-            self.ttl8.pulse(delay*us)
-
-        @kernel
-        def run(self):
-            subkernel_preload(self.add_and_pulse)
-            self.core.reset()
-            delay(10*ms)
-            self.add_and_pulse(2, 2)
-            self.ttl0.pulse(15*us)
-            result = subkernel_await(self.add_and_pulse)
-            assert result == 4
-            self.pulse_ttl(20)
-
-Without the preload, the delay after the core reset would need to be longer. The operation may still take some time, depending on the connection. Notice that the method ``pulse_ttl()`` can be called both within a subkernel and on its own. 
-
-It is not necessary for subkernels to always be awaited, but awaiting is required to retrieve returned values and exceptions.
-
-.. note::
-    While a subkernel is running, regardless of what devices it makes use of, none of the RTIO devices on that satellite (or on any satellites downstream) will be available to the master. Control is returned to master after the subkernel completes - to be certain a device is usable, await the subkernel before performing any RTIO operations on the affected satellites. 
-
-Message passing
-^^^^^^^^^^^^^^^
-
-Apart from arguments and returns, subkernels can also pass messages between each other or the master with built-in ``subkernel_send()`` and ``subkernel_recv()`` functions. This can be used for communication between subkernels, to pass additional data, or to send partially computed data. Consider the following example: ::
-
-    from artiq.experiment import *
-
-    @subkernel(destination=1)
-    def simple_message() -> TInt32:
-        data = subkernel_recv("message", TInt32)
-        return data + 20
-
-    class MessagePassing(EnvExperiment):
-        def build(self):
-            self.setattr_device("core")
-
-        @kernel
-        def run(self):
-            simple_self()
-            subkernel_send(1, "message", 150)
-            result = subkernel_await(simple_self)
-            assert result == 170
-
-The ``subkernel_send(destination, name, value)`` function takes three arguments: a destination, a name for the message (to be used for identification in the corresponding ``subkernel_recv()``), and the passed value.
-
-The ``subkernel_recv(name, type, [timeout])`` function requires two arguments: message name (matching exactly the name provided in ``subkernel_send``) and expected type. Optionally, it accepts a third argument, a timeout for the operation in milliseconds. If this value is negative, timeout is disabled. By default, it waits as long as necessary.
-
-To avoid misinterpretation of the data the compiler type-checks the value sent by ``subkernel_send`` against the type declared in ``subkernel_recv``. To guard against common errors, it also checks that all message names are used in both a sending and receiving function.
-
-A message can only be received while a subkernel is running, and is placed into a buffer to be retrieved when required; therefore send executes independently of any receive and never deadlocks. However, a receive function may timeout or wait forever if no message with the correct name and destination is ever sent. 
--- a/doc/manual/index.rst
+++ b/doc/manual/index.rst
@ -12,6 +12,7 @@ ARTIQ documentation
    rtio
    getting_started_core
    getting_started_mgmt
+    using_drtio_subkernels
    environment
    compiler
    management_system
--- a/doc/manual/installing.rst
+++ b/doc/manual/installing.rst
@ -245,13 +245,13 @@ Reflashing core device gateware and firmware
 --------------------------------------------

 .. note::
-  If you have purchased a pre-assembled system from M-Labs or QUARTIQ, the gateware and firmware of your device will already be flashed to the newest version of ARTIQ. These steps are only necessary if you obtained your hardware in a different way, or if you want to change or upgrade your ARTIQ version after purchase.  
+  If you have purchased a pre-assembled system from M-Labs or QUARTIQ, the gateware and firmware of your devices will already be flashed to the newest version of ARTIQ. These steps are only necessary if you obtained your hardware in a different way, or if you want to change your system configuration or upgrade your ARTIQ version after the original purchase.  


 Obtaining the board binaries
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

-If you have an active firmware subscription with M-Labs or QUARTIQ, you can obtain firmware that corresponds to your currently installed version of ARTIQ using AFWS (ARTIQ firmware service). One year of subscription is included with most hardware purchases. You may purchase or extend firmware subscriptions by writing to the sales@ email.
+If you have an active firmware subscription with M-Labs or QUARTIQ, you can obtain firmware for your system that corresponds to your currently installed version of ARTIQ using AFWS (ARTIQ firmware service). One year of subscription is included with most hardware purchases. You may purchase or extend firmware subscriptions by writing to the sales@ email.

 Run the command::

@ -397,11 +397,6 @@ The startup kernel is the kernel executed once immediately whenever the core dev

 For DRTIO systems, the startup kernel should wait until the desired destinations (including local RTIO) are up, using :meth:`artiq.coredevice.Core.get_rtio_destination_status`.

-Load the DRTIO routing table
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you are using DRTIO and the default routing table (for a star topology) is not suitable to your needs, prepare and load a different routing table. See :ref:`Using DRTIO <using-drtio>`.
-
 Select the RTIO clock source
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@ -424,3 +419,9 @@ This feature allows you to print the channels' respective names alongside with t

 .. note:: More information on the ``artiq_rtiomap`` utility can be found on the :ref:`Utilities <rtiomap-tool>` page.

+Load the DRTIO routing table
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+If you are using DRTIO and the default routing table (for a star topology) is not suitable for your needs, you will first need to prepare and load a different routing table. See :ref:`Using DRTIO <drtio-routing>`.
+
+
--- a/doc/manual/using_drtio_subkernels.rst
+++ b/doc/manual/using_drtio_subkernels.rst
@ -0,0 +1,206 @@
+.. _drtio-and-subkernels: 
+
+Using DRTIO and subkernels 
+========================== 
+
+In larger or more spread-out systems, a single core device might not be suited to managing all the RTIO operations or channels necessary. For these situations ARTIQ supplies Distributed Real-Time IO, or DRTIO. This allows systems to be configured with some or all of their RTIO channels distributed to one or several *satellite* core devices, which are linked to the *master* core device. These remote channels are then accessible in kernels on the master device exactly like local channels. 
+
+The specific topology of core and satellite links is flexible and can be changed at will. It is supplied to the core device by means of a routing table. Links should be high-speed duplex serial lines operating at 1 Gbps or more.
+
+.. note:: 
+    As with other configuration changes (e.g. adding new hardware), a non-distributed ARTIQ system can be expanded into a DRTIO setup, but both master and satellite must be (re)flashed with this in mind. As usual, if you obtained your hardware from M-Labs, you will normally be supplied with all the binaries you need, through ``afws_client`` or otherwise.
+
+.. note:: 
+    Do not confuse the DRTIO *master device* (the central controlling core device of a distributed system) with the *ARTIQ master* (the central piece of software of ARTIQ's management system, which interacts with ``artiq_client`` and the dashboard). ``artiq_run`` can be used to run experiments on DRTIO systems just as easily as on non-distributed ones, and the ARTIQ master interacts with the central core device regardless of whether it is configured as a DRTIO master or standalone.
+
+Using DRTIO
+-----------
+
+.. _drtio-routing:
+
+Configuring the routing table
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, DRTIO assumes a routing table for a star topology (i.e. all satellites directly connected to the master), with destination 0 being the master device's local RTIO core and destinations 1 and above corresponding to devices on the master's respective downstream ports. To use any other topology, it is necessary to supply a corresponding routing table in the form of a binary file, written to flash storage under the key ``routing_table``. The binary file is easily generated in the correct format using ``artiq_route``. This example is for a chain of 3 devices: ::   
+
+    # create an empty routing table
+    $ artiq_route rt.bin init
+
+    # set destination 0 to the local RTIO core
+    $ artiq_route rt.bin set 0 0
+
+    # for destination 1, first use hop 1 (the first downstream port)
+    # then use the local RTIO core of that second device.
+    $ artiq_route rt.bin set 1 1 0
+
+    # for destination 2, use hop 1 and reach the second device as
+    # before, then use hop 1 on that device to reach the third
+    # device, and finally use the local RTIO core (hop 0) of the
+    # third device.
+    $ artiq_route rt.bin set 2 1 1 0
+
+    $ artiq_route rt.bin show
+      0:   0
+      1:   1   0
+      2:   1   1   0
+
+    $ artiq_coremgmt config write -f routing_table rt.bin
+
+The local RTIO core of the master device is considered a destination like any other; it must be explicitly listed in the routing table to be accessible to kernels. 
+
+All routes must end with the local RTIO core of the last device on the route (hop 0). Incorrect routing tables will cause ``RTIODestinationUnreachable`` exceptions.
+
+As with other configuration changes, the core device should be restarted (``artiq_flash start``, power cycle, etc.) for changes to take effect. 
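Before writing a table to flash, it can help to sanity-check the routes on the host. The sketch below mirrors the chain-of-three example as plain Python lists. It does not read or write the ``artiq_route`` binary format, the function names are hypothetical, and the rule that intermediate hops must be nonzero is an assumption derived from hop 0 meaning "local RTIO core".

```python
# Sketch only: plain-Python model of the example table, not the binary format.
# Assumption: hop 0 (local RTIO core) may only appear as the last hop.

routes = {
    0: [0],        # master's local RTIO core
    1: [1, 0],     # first downstream port, then that device's local core
    2: [1, 1, 0],  # through the second device to the third device's core
}


def route_is_valid(route: list[int]) -> bool:
    """A route is non-empty, ends with hop 0, and has no earlier hop 0."""
    return (
        len(route) > 0
        and route[-1] == 0
        and all(hop > 0 for hop in route[:-1])
    )


def rank(route: list[int]) -> int:
    """Number of links crossed from the root to reach the destination."""
    return len(route) - 1


for destination, route in routes.items():
    print(destination, route_is_valid(route), rank(route))
```

For the example table, all three routes are valid and the ranks are 0, 1 and 2, matching the chain described in the comments of the ``artiq_route`` session.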
+
+Using the core language with DRTIO 
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Remote channels are accessed in precisely the same way as local channels: by calling ``self.setattr_device()`` and then referencing the channel by name. 
+
+Link establishment
+^^^^^^^^^^^^^^^^^^
+After devices have booted, it takes several seconds for all links in a DRTIO system to become established (especially with the long locking times of low-bandwidth PLLs that are used for jitter reduction purposes). Kernels should not attempt to access destinations until all required links are up; if they do, the ``RTIODestinationUnreachable`` exception is raised. ARTIQ provides the method :meth:`~artiq.coredevice.core.Core.get_rtio_destination_status`, which determines whether a destination can be reached. We recommend calling it in a loop in your startup kernel for each important destination in order to delay startup until they can all be reached.
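The recommended polling loop has a simple shape, sketched here as host-side Python so that it can be run without hardware. ``wait_for_destinations`` and ``FakeLinks`` are hypothetical names; in a real startup kernel, the ``get_status`` callable would correspond to ``self.core.get_rtio_destination_status`` called from a ``@kernel`` method.

```python
# Sketch only: simulates link establishment on the host.
# wait_for_destinations and FakeLinks are hypothetical, not ARTIQ APIs.


def wait_for_destinations(get_status, destinations):
    """Return once get_status(d) is true for every destination in turn."""
    for destination in destinations:
        while not get_status(destination):
            pass  # keep polling until this link is up


class FakeLinks:
    """Stand-in for hardware: each link comes up after a set number of polls."""

    def __init__(self, polls_until_up):
        self.remaining = dict(polls_until_up)

    def status(self, destination):
        if self.remaining[destination] > 0:
            self.remaining[destination] -= 1
            return False
        return True


links = FakeLinks({0: 0, 1: 3, 2: 1})
wait_for_destinations(links.status, [0, 1, 2])
print("all destinations reachable")
```

Polling the destinations one after another is sufficient here because the loop only returns once the last one is reachable; the order does not affect the total wait.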
+
+Latency
+^^^^^^^
+Each hop (link traversed) increases the RTIO latency of a destination by a significant amount; however, this latency is constant and can be compensated for in kernels. To limit latency in a system, fully utilize the downstream ports of devices to reduce the depth of the tree, instead of creating chains. In some situations, the use of subkernels (see below) may also bypass potential latency issues.
+
+Distributed Direct Memory Access (DDMA)
+---------------------------------------
+
+By default on DRTIO systems, all events recorded by the master's DMA core are kept and played back on the master. With distributed DMA, RTIO events that should be played back on remote destinations are distributed to the corresponding satellites. In some cases (typically, large buffers on several satellites with high event throughput), this allows for better performance and higher bandwidth, as the RTIO events do not have to be sent over the DRTIO link(s) during playback.
+
+To enable distributed DMA for the master, simply provide an ``enable_ddma=True`` argument for the :meth:`~artiq.coredevice.dma.CoreDMA.record` method - taking a snippet from the non-distributed example in the :ref:`core language tutorial <getting-started-dma>`: ::
+
+        @kernel
+        def record(self):
+            with self.core_dma.record("pulses", enable_ddma=True):
+                # all RTIO operations now go to the "pulses"
+                # DMA buffer, instead of being executed immediately.
+                for i in range(50):
+                    self.ttl0.pulse(100*ns)
+                    delay(100*ns)
+
+In standalone systems, as well as in subkernels (see below), this argument is ignored: in standalone systems it is meaningless, and in subkernels DDMA is always enabled for structural reasons.
+
+Enabling DDMA on a purely local sequence on a DRTIO system introduces an overhead during trace recording which comes from additional processing done on the record, so careful use is advised. Due to the extra time that communicating with relevant satellites takes, an additional delay before playback may be necessary to prevent a :exc:`~artiq.coredevice.exceptions.RTIOUnderflow` when playing back a DDMA-enabled sequence.
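+
+As a sketch of the second point, playback of the buffer recorded above might add slack before the handle is played back. The 10 ms figure is an illustrative assumption, not a measured value; the slack actually needed depends on the system: ::
+
+    @kernel
+    def run(self):
+        self.core.reset()
+        self.record()
+        # Fetch the handle once, outside any timing-critical section.
+        pulses_handle = self.core_dma.get_handle("pulses")
+        self.core.break_realtime()
+        # Assumed extra slack for communication with the satellites.
+        delay(10*ms)
+        self.core_dma.playback_handle(pulses_handle)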
+
+Subkernels
+----------
+
+Rather than only offloading the RTIO channels to satellites and limiting all processing to the master core device, it is also possible to run kernels directly on satellite devices. These are referred to as *subkernels*. Using subkernels to process and control remote RTIO channels can free up resources on the core device.
+
+Subkernels behave for the most part like regular kernels; they accept arguments, can return values, and are marked by the decorator ``@subkernel(destination=i)``, where ``i`` is the satellite's destination number as used in the routing table. To call a subkernel, call it like any other function. There are however a few caveats: 
+
+   - subkernels do not support RPCs,
+   - subkernels do not support (recursive) DRTIO (but they can call other subkernels and send messages to each other, see below),
+   - they support DMA, for which DDMA is considered always enabled,  
+   - their return values must be fully annotated with an ARTIQ type,
+   - their arguments should be annotated, and only basic ARTIQ types are supported,
+   - they can raise exceptions, but the exceptions cannot be caught by the master (they can only be caught locally or propagated directly to the host), 
+   - while ``self`` is allowed as an argument, it is retrieved at compile time and is a purely local object afterwards -- changes made by other kernels will not be visible to the subkernel, and changes made locally will not be visible anywhere else.
+
+Subkernels in practice
+^^^^^^^^^^^^^^^^^^^^^^
+
+Subkernels begin execution as soon as possible when called. By default, they are not awaited, but awaiting is necessary to receive results or exceptions. The await function ``subkernel_await(function, [timeout])`` takes as argument the subkernel to be awaited and, optionally, a timeout value in milliseconds. If the timeout is reached without response from the subkernel, a :exc:`~artiq.coredevice.exceptions.SubkernelError` is raised. If no timeout value is supplied the function waits indefinitely. Negative timeout values are ignored. 
+
+For example, a subkernel performing integer addition: ::
+
+    from artiq.experiment import *
+
+
+    @subkernel(destination=1)
+    def subkernel_add(a: TInt32, b: TInt32) -> TInt32:
+        return a + b
+
+    class SubkernelExperiment(EnvExperiment):
+        def build(self):
+            self.setattr_device("core")
+
+        @kernel
+        def run(self):
+            subkernel_add(2, 2)
+            result = subkernel_await(subkernel_add)
+            assert result == 4
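+
+A variant of the same ``run`` method with a timeout is sketched below. The 1000 ms value is an arbitrary assumption: ::
+
+        @kernel
+        def run(self):
+            subkernel_add(2, 2)
+            # Raises SubkernelError if no response arrives within 1000 ms.
+            result = subkernel_await(subkernel_add, 1000)
+            assert result == 4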
+
+Subkernels are compiled after the main kernel and immediately sent to the designated satellite. When they are called, the master simply instructs the satellite to load and run the corresponding subkernel. When ``self`` is used in subkernels, it is embedded into the compiled and uploaded data; this is why changes made to it do not propagate between kernels.
+
+If a subkernel is called on a satellite where a kernel is already running, the newer kernel silently overrides it, and the previous kernel will not be completed.
+
+.. note::
+    Be careful with use of ``self.core.reset()`` around subkernels. Since ``self`` in subkernels is purely local, calling ``self.core.reset()`` in a subkernel will only affect that specific satellite and its own FIFOs. On the other hand, calling ``self.core.reset()`` in the master kernel will clear FIFOs in all satellites, regardless of whether a subkernel is running, but will not stop the subkernel. As a result, any event currently in a FIFO queue will be cleared, but the subkernels may continue to queue events. This is likely to result in odd behavior; it's best to avoid using ``self.core.reset()`` during the lifetime of any subkernels.  
+
+.. note:: 
+    Subkernels do not exit automatically if a master kernel exits. It is generally the responsibility of a given experiment to ensure that all its subkernels complete before exiting, by awaiting them or otherwise. If this cannot be guaranteed, it is possible to sanitize by calling trivial kernels in each satellite -- since newer kernels override, any kernels still running will be automatically cancelled. Much like RTIO events still in FIFO queues, the nature of seamless transition means subkernels left running after the end of an experiment cannot be guaranteed to complete. 
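+
+One way to sketch the sanitizing step described in the note above, assuming satellites at destinations ``1`` and ``2`` (the destination numbers, function names and class name are hypothetical): ::
+
+    from artiq.experiment import *
+
+    # Trivial subkernels; calling one overrides whatever is still
+    # running on that satellite.
+    @subkernel(destination=1)
+    def clear_satellite_1() -> TNone:
+        pass
+
+    @subkernel(destination=2)
+    def clear_satellite_2() -> TNone:
+        pass
+
+    class Cleanup(EnvExperiment):
+        def build(self):
+            self.setattr_device("core")
+
+        @kernel
+        def run(self):
+            clear_satellite_1()
+            clear_satellite_2()
+            # Await both so that each satellite is idle on exit.
+            subkernel_await(clear_satellite_1)
+            subkernel_await(clear_satellite_2)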
+
+If a subkernel is complex and its binary relatively large, the delay between the call and actually running the subkernel may be substantial. If it's necessary to minimize this delay, ``subkernel_preload(function)`` should be used before the call. 
+
+While a subkernel is running, the satellite is disconnected from the RTIO interface of the master. As a result, regardless of what devices the subkernel itself uses, none of the RTIO devices on that satellite will be available to the master, nor will messages be passed on to any further satellites downstream. Control is returned to the master when no subkernel is running -- to be sure that a device will be accessible, await before performing any RTIO operations on the affected satellite.
+
+Calling other kernels
+^^^^^^^^^^^^^^^^^^^^^
+
+Subkernels can call other kernels and subkernels. For a more complex example: ::
+
+    from artiq.experiment import *
+
+    class SubkernelExperiment(EnvExperiment):
+        def build(self):
+            self.setattr_device("core")
+            self.setattr_device("ttl0")
+            self.setattr_device("ttl8")  # assuming it's on satellite
+
+        @subkernel(destination=1)
+        def add_and_pulse(self, a: TInt32, b: TInt32) -> TInt32:
+            c = a + b
+            self.pulse_ttl(c)
+            return c
+
+        @subkernel(destination=1)
+        def pulse_ttl(self, duration: TInt32) -> TNone:
+            self.ttl8.pulse(duration*us)
+
+        @kernel
+        def run(self):
+            subkernel_preload(self.add_and_pulse)
+            self.core.reset()
+            delay(10*ms)
+            self.add_and_pulse(2, 2)
+            self.ttl0.pulse(15*us)
+            result = subkernel_await(self.add_and_pulse)
+            assert result == 4
+            self.pulse_ttl(20)
+
+In this case, without the preload, the delay after the core reset would need to be longer. Depending on the connection, the call may still take some time in itself. Notice that the method ``pulse_ttl()`` can be called both within a subkernel and on its own. 
+
+.. note:: 
+    Subkernels can call subkernels on any other satellite, not only their own. Care should however be taken that different kernels do not call subkernels on the same satellite, or only very cautiously. If, e.g., a newer call overrides a subkernel that another caller is awaiting, unpredictable timeouts or locks may result, as the original subkernel will never return. There is not currently any mechanism to check whether a particular satellite is 'busy'; it is up to the programmer to handle this correctly. 
+
+Message passing
+^^^^^^^^^^^^^^^
+
+Apart from arguments and returns, subkernels can also pass messages to each other or to the master with the built-in ``subkernel_send()`` and ``subkernel_recv()`` functions. This can be used for communication between subkernels, to pass additional data, or to send partially computed data. Consider the following example: ::
+
+    from artiq.experiment import *
+
+    @subkernel(destination=1)
+    def simple_message() -> TInt32:
+        data = subkernel_recv("message", TInt32)
+        return data + 20
+
+    class MessagePassing(EnvExperiment):
+        def build(self):
+            self.setattr_device("core")
+
+        @kernel
+        def run(self):
+            simple_message()
+            subkernel_send(1, "message", 150)
+            result = subkernel_await(simple_message)
+            assert result == 170
+
+The ``subkernel_send(destination, name, value)`` function requires three arguments: a destination, a name for the message (to be used for identification in the corresponding ``subkernel_recv()``), and the passed value.
+
+The ``subkernel_recv(name, type, [timeout])`` function requires two arguments: message name (matching exactly the name provided in ``subkernel_send``) and expected type. Optionally, it also accepts a third argument, a timeout for the operation in milliseconds. As with ``subkernel_await``, the default behavior is to wait as long as necessary, and a negative argument is ignored. 
+
+A message can only be received while a subkernel is running, and is placed into a buffer to be retrieved when required. As a result, ``subkernel_send()`` executes independently of any receive and never deadlocks. However, ``subkernel_recv()`` may time out or lock (wait forever) if no message with the correct name and destination is ever sent.
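+
+To bound the wait, the receiving subkernel from the example above could pass the optional timeout. The 500 ms value is an arbitrary assumption: ::
+
+    @subkernel(destination=1)
+    def simple_message() -> TInt32:
+        # Wait at most 500 ms for a message named "message".
+        data = subkernel_recv("message", TInt32, 500)
+        return data + 20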