From 48cdf420160c66776fe702a46ab5bc88f6775354 Mon Sep 17 00:00:00 2001
From: architeuthis <am@m-labs.hk>
Date: Thu, 20 Jun 2024 16:13:55 +0800
Subject: [PATCH] docs: Add details of shallow 'with parallel' to manual

---
 doc/manual/getting_started_core.rst | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/doc/manual/getting_started_core.rst b/doc/manual/getting_started_core.rst
index c5e02b505..e55956a00 100644
--- a/doc/manual/getting_started_core.rst
+++ b/doc/manual/getting_started_core.rst
@@ -145,7 +145,7 @@ Try reducing the period of the generated waveform until the CPU cannot keep up w
 Parallel and sequential blocks
 ------------------------------
 
-It is often necessary for several pulses to overlap one another. This can be expressed through the use of ``with parallel`` constructs, in which the events generated by the individual statements are executed at the same time. The duration of the ``parallel`` block is the duration of its longest statement.
+It is often necessary for several pulses to overlap one another. This can be expressed through the use of the ``with parallel`` construct, in which the events generated by individual statements are scheduled to execute at the same time, rather than sequentially. The duration of the ``parallel`` block is the duration of its longest statement.
 
 Try the following code and observe the generated pulses on a 2-channel oscilloscope or logic analyzer: ::
 
@@ -167,9 +167,12 @@ Try the following code and observe the generated pulses on a 2-channel oscillosc
                 delay(4*us)
 
 ARTIQ can implement ``with parallel`` blocks without having to resort to any of the typical parallel processing approaches.
-It simply remembers the position on the timeline when entering the ``parallel`` block and then seeks back to that position after submitting the events generated by each statement.
-In other words, the statements in the ``parallel`` block are actually executed sequentially, only the RTIO events generated by them are scheduled to be executed in parallel.
-Note that accordingly if a statement takes a lot of CPU time to execute (which is different from -- and has nothing to do with! -- the events *scheduled* by the statement taking a long time), it may cause a subsequent statement to miss the deadline for timely submission of its events (and raise :exc:`~artiq.coredevice.exceptions.RTIOUnderflow`), while earlier statements in the parallel block would have submitted their events without problems.   
+It simply records the position of the timeline cursor (``now``) when entering the ``parallel`` block and resets the cursor to that position after each individual statement.
+At the end of the block, the cursor is advanced to the furthest position it reached during the block.
+In other words, the statements in a ``parallel`` block are actually executed sequentially.
+Only the RTIO events generated by the statements are *scheduled* in parallel.
+
+Remember that while ``now`` resets at the beginning of each statement in a ``parallel`` block, the wall clock advances regardless. If a particular statement takes a long time to execute (which is different from -- and unrelated to! -- the events *scheduled* by the statement taking a long time), the wall clock may advance past the reset value, leaving any subsequent statements in the block with negative slack and causing them to raise :exc:`~artiq.coredevice.exceptions.RTIOUnderflow`. Sometimes underflows can be avoided simply by reordering statements within the ``parallel`` block. This applies especially to input methods, which generally must block CPU progress until the wall clock has caught up to or overtaken the cursor.
 
 Within a parallel block, some statements can be scheduled sequentially again using a ``with sequential`` block. Observe the pulses generated by this code: ::
 
@@ -182,6 +185,21 @@ Within a parallel block, some statements can be scheduled sequentially again usi
             self.ttl1.pulse(4*us)
         delay(4*us)
 
+.. warning::
+    ``with parallel`` specifically 'parallelizes' the *top-level* statements inside a block. Consider as an example: ::
+
+            for i in range(1000000):
+                with parallel:
+                    self.ttl0.pulse(2*us)       # 1
+                    if True:                    # 2
+                        self.ttl1.pulse(2*us)   # 3
+                        self.ttl2.pulse(2*us)   # 4
+                delay(4*us)
+
+    This code will not schedule the three pulses to ``ttl0``, ``ttl1``, and ``ttl2`` in parallel. Rather, the pulse to ``ttl0`` (#1) is 'parallelized' *with the* ``if`` *statement* (#2) as a whole. The timeline cursor resets once, at the beginning of statement #2; it does not reset again at the deeper indentation level for #3 or #4.
+
+    In practice, the pulses to ``ttl0`` and ``ttl1`` will execute simultaneously, and the pulse to ``ttl2`` will execute after the pulse to ``ttl1``, bringing the total duration of the ``parallel`` block to 4 us. Internally, statements #3 and #4, contained within the top-level ``if`` statement, are treated as an atomic sequence and executed within an implicit ``with sequential``. To execute #3 and #4 in parallel, place them inside a second, nested ``parallel`` block within the ``if`` statement.
+
 Particular care needs to be taken when working with ``parallel`` blocks which generate large numbers of RTIO events, as it is possible to create sequence errors. A sequence error is caused when the scalable event dispatcher (SED) cannot queue an RTIO event due to its timestamp being the same as or earlier than another event in its queue. By default, the SED has 8 lanes, which suffice in most cases to avoid sequence errors; however, if many (>8) events are queued with interlaced timestamps the problem can still surface. See :ref:`sequence-errors`. 
 
 Note that for performance reasons sequence errors do not halt execution of the kernel. Instead, they are reported in the core log. If the ``aqctl_corelog`` process has been started with ``artiq_ctlmgr``, then these errors will be posted to the master log. If an experiment is executed through ``artiq_run``, the errors will only be visible in the core log.