running kernel in a loop freezes #40

Closed
opened 2020-07-07 16:58:11 +08:00 by sb10q · 14 comments
```python
from artiq.experiment import *

class Repeat(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def run(self):
        for _ in range(100):
            print("A")
            self.dummy_kernel()
            print("B")

    @kernel
    def dummy_kernel(self):
        pass
```

This prints:

```text
> artiq_run repeat.py        artiq-zynq-env
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
```

...and does not complete

Nothing suspicious in the Zynq log.

Poster
Owner

Note that the Zynq does not crash from this; interrupting the experiment with Ctrl-C and running it again without rebooting the Zynq will give the same result.

I've added some logs:

```text
[    15.426989s] DEBUG(runtime::kernel): kernel loaded
[    15.432699s]  INFO(runtime::kernel): kernel starting
[    15.437648s] TRACE(runtime::rpc): send<1>(None)->None
[    15.442692s]  INFO(runtime::kernel): kernel finished
[    15.498375s]  INFO(runtime::comms): Trying to load the kernel
[    15.510266s]  INFO(runtime::comms): core1 restart
[    15.515223s] DEBUG(runtime::kernel): Core1 started
[    15.520003s] DEBUG(runtime::kernel): FPU enabled on Core1
[    15.526446s] DEBUG(dyld): ELF target: 69656 bytes, align to 10000, allocated at 00170000
[    15.534517s] TRACE(dyld): Program header: 00000368+00000010 to 00170000
[    15.541113s] TRACE(dyld): Program header: 00000000+00000378 to 00170000
[    15.547724s] TRACE(dyld): Program header: 00000F80+00000094 to 00170000
[    15.554326s] TRACE(dyld): Program header: 00000F80+00000080 to 00170000
[    15.560921s] TRACE(dyld): Program header: 00000000+00000000 to 00170000
[    15.567517s] TRACE(dyld): Program header: 00000F80+00000080 to 00170000
[    15.574119s] DEBUG(dyld): Relocating 0 rela, 0 rel, 16 pltrel
[    15.579868s] TRACE(dyld::reloc): resolved symbol "__artiq_personality"
[    15.586387s] TRACE(dyld::reloc): resolved symbol "rpc_send_async"
[    15.592474s] DEBUG(runtime::kernel): kernel loaded
[    15.597912s]  INFO(runtime::kernel): kernel starting
[    15.602863s] TRACE(runtime::rpc): send<1>(None)->None
[    15.607907s]  INFO(runtime::kernel): kernel finished
```

It seems that the comms module did not receive a LoadKernel request here...
Interrupting the experiment with Ctrl-C and re-running it did not produce any log entries about the connection shutting down, etc.

Poster
Owner

Looks like a smoltcp/async problem. When this occurs, in the `proto_async.rs` `expect` function, `stream.recv` keeps calling the closure with `buf == [90, 90, 90]` even though the client has sent much more data (as seen in a Wireshark trace).

I had removed the RPC code when testing this (as RPCs tend to cause corruption and insane behavior), so this is probably an unrelated, ordinary bug.

astro was assigned by sb10q 2020-07-14 16:47:22 +08:00
Poster
Owner

Deleted the kernel loading and execution code completely from core1; the behavior is exactly the same. This continues to look like a Bohrbug.

sb10q added the priority:critical label 2020-07-15 18:04:36 +08:00

Seeing this repeatedly after the kernel gets stuck, with the `log` feature enabled for smoltcp:

```text
[   385.035827s] DEBUG(smoltcp::socket::tcp): #3:192.168.1.52:1381:192.168.1.1:44132: received a keep-alive or window probe packet, will send an ACK
[   385.048804s] DEBUG(smoltcp::socket::tcp): sending sACK option with current assembler ranges
```

Two more observations:

- Outgoing TCP packets (both replies and egress) have `window_len: 0` when the connection gets stuck (see the sketch after this list)
- Increasing the `TcpSocket` RX buffer size is a workaround
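
Here is a minimal sketch of the suspected failure mode, assuming the libasync `TcpStream::recv` callback API used in `proto_async.rs` (it appears in the patch further down); `wait_for_full_pattern` is a hypothetical name and this is not the actual runtime code:

```rust
use core::task::Poll;
use libasync::smoltcp::TcpStream;

// Distilled version of the pre-fix `expect` logic: wait until the whole
// pattern is visible at once in the receive callback.
async fn wait_for_full_pattern(stream: &TcpStream, pattern: &[u8])
    -> core::result::Result<bool, smoltcp::Error>
{
    stream.recv(|buf| {
        if buf.len() < pattern.len() {
            // The pattern was split across segments, so we return Pending
            // and consume nothing. The bytes stay in the RX ring buffer;
            // once it is full, smoltcp advertises window_len: 0, the peer
            // cannot send the rest, and every poll sees the same short
            // fragment (e.g. [90, 90, 90]) forever.
            return Poll::Pending;
        }
        Poll::Ready((pattern.len(), &buf[..pattern.len()] == pattern))
    }).await
}
```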

The issue happens when TCP RX packets are fragmented in an unfortunate way: we may only get a tiny fragment of the RX buffer in a `recv()` callback.

I am considering a revision of our `recv()` API and how we use it. Returning `Poll::Pending` from the callback is not useful given this behavior.

To demonstrate, this resolves the problem for me:

```patch
diff --git a/src/runtime/src/proto_async.rs b/src/runtime/src/proto_async.rs
index 953c6f2..46250a8 100644
--- a/src/runtime/src/proto_async.rs
+++ b/src/runtime/src/proto_async.rs
@@ -10,18 +10,27 @@ use libasync::smoltcp::TcpStream;
 type Result<T> = core::result::Result<T, smoltcp::Error>;
 
 pub async fn expect(stream: &TcpStream, pattern: &[u8]) -> Result<bool> {
-    stream.recv(|buf| {
-        for (i, b) in buf.iter().enumerate() {
-            if *b == pattern[i] {
-                if i + 1 == pattern.len() {
-                    return Poll::Ready((i + 1, Ok(true)));
+    let mut offset = 0;
+    while offset < pattern.len() {
+        let new_offset = stream.recv(|buf| {
+            let len = buf.len().min(pattern.len() - offset);
+            for (i, b) in buf[..len].iter().enumerate() {
+                if *b != pattern[offset + i] {
+                    // mismatch
+                    return Poll::Ready((i + 1, None));
                 }
-            } else {
-                return Poll::Ready((i + 1, Ok(false)));
             }
+            Poll::Ready((len, Some(offset + len)))
+        }).await?;
+        match new_offset {
+            Some(new_offset) =>
+                offset = new_offset,
+            None =>
+                // mismatch
+                return Ok(false),
         }
-        Poll::Pending
-    }).await?
+    }
+    Ok(true)
 }
```

On the other hand, let's have another look at whether smoltcp shouldn't be able to merge these fragments...

Ah, doesn't `TcpSocket::recv()` only return a single contiguous slice from the ring buffer? That is, don't you need to either keep a buffer in `TcpStream` to make the `TcpStream::recv` API work, or change clients to be able to accept a list of buffers?
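
As a toy illustration (hypothetical types, not smoltcp's actual implementation), a ring buffer that only hands out contiguous slices can return a short tail fragment even when more data is buffered, which matches the 3-byte `[90, 90, 90]` seen above:

```rust
struct Ring {
    storage: Vec<u8>,
    read_at: usize, // index of the oldest unread byte
    length: usize,  // number of unread bytes
}

impl Ring {
    // Longest contiguous readable slice, as a ring-buffer-backed recv()
    // would expose it to its caller.
    fn contiguous(&self) -> &[u8] {
        let until_wrap = self.storage.len() - self.read_at;
        &self.storage[self.read_at..self.read_at + self.length.min(until_wrap)]
    }
}

fn main() {
    // An 8-byte buffer holding 5 unread bytes starting at index 5: they
    // occupy indices 5..8 and 0..2, but only 5..8 is contiguous.
    let ring = Ring { storage: vec![0x5a; 8], read_at: 5, length: 5 };
    assert_eq!(ring.contiguous(), &[0x5a, 0x5a, 0x5a]); // a 3-byte fragment
}
```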

In other words, it seems like you'll always need to expect smoltcp's `recv()` to return incomplete data.

See e.g. https://github.com/smoltcp-rs/smoltcp/issues/359 for a discussion about changing the buffer data structure to avoid the extra copy in client code.

@astro: Do you think you'll get around to finishing this soon? If not, I can have a look.

Poster
Owner

One way of solving this would be to mandate that consumers take at least one byte when there's any data available. It shouldn't be too hard to fix the runtime's socket I/O functions to ensure that AFAICT.
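
A rough sketch of that approach, again assuming the libasync `TcpStream::recv` callback API from `proto_async.rs`; `read_exact` is a hypothetical helper, not the actual runtime code:

```rust
use core::task::Poll;
use libasync::smoltcp::TcpStream;

// Same alias as in proto_async.rs.
type Result<T> = core::result::Result<T, smoltcp::Error>;

// Consume whatever bytes smoltcp hands us in each callback (at least one
// whenever any data is available), so the RX buffer drains and the
// advertised receive window reopens.
async fn read_exact(stream: &TcpStream, out: &mut [u8]) -> Result<()> {
    let mut done = 0;
    while done < out.len() {
        let n = stream.recv(|buf| {
            let len = buf.len().min(out.len() - done);
            out[done..done + len].copy_from_slice(&buf[..len]);
            // Report `len` bytes as consumed and hand the same count back
            // to the caller of recv().
            Poll::Ready((len, len))
        }).await?;
        done += n;
    }
    Ok(())
}
```

Because every callback consumes what is available, the connection can no longer get stuck with data parked in the RX buffer and a zero window.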

Poster
Owner

@astro @pca006132 @dpn Thanks for finding the root cause of the problem. I fixed it; now I can run 1000 iterations of a kernel in a loop without problems. If you think you have a better solution, let me know.

sb10q closed this issue 2020-07-19 16:18:50 +08:00

Even if smoltcp were able to merge subsequent TCP segments, we may still end up receiving at the tail end of the ring buffer. The current approach seems best.

Reference: M-Labs/artiq-zynq#40