```python
from artiq.experiment import *


class Repeat(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def run(self):
        for _ in range(100):
            print("A")
            self.dummy_kernel()
            print("B")

    @kernel
    def dummy_kernel(self):
        pass
```
This prints:
```text
> artiq_run repeat.py artiq-zynq-env
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
B
A
```
...and does not complete
Nothing suspicious in the Zynq log.
Note that the Zynq does not crash from this; interrupting the experiment with Ctrl-C and running it again without rebooting the Zynq will give the same result.
I've added some logs:
```
[ 15.426989s] DEBUG(runtime::kernel): kernel loaded
[ 15.432699s] INFO(runtime::kernel): kernel starting
[ 15.437648s] TRACE(runtime::rpc): send<1>(None)->None
[ 15.442692s] INFO(runtime::kernel): kernel finished
[ 15.498375s] INFO(runtime::comms): Trying to load the kernel
[ 15.510266s] INFO(runtime::comms): core1 restart
[ 15.515223s] DEBUG(runtime::kernel): Core1 started
[ 15.520003s] DEBUG(runtime::kernel): FPU enabled on Core1
[ 15.526446s] DEBUG(dyld): ELF target: 69656 bytes, align to 10000, allocated at 00170000
[ 15.534517s] TRACE(dyld): Program header: 00000368+00000010 to 00170000
[ 15.541113s] TRACE(dyld): Program header: 00000000+00000378 to 00170000
[ 15.547724s] TRACE(dyld): Program header: 00000F80+00000094 to 00170000
[ 15.554326s] TRACE(dyld): Program header: 00000F80+00000080 to 00170000
[ 15.560921s] TRACE(dyld): Program header: 00000000+00000000 to 00170000
[ 15.567517s] TRACE(dyld): Program header: 00000F80+00000080 to 00170000
[ 15.574119s] DEBUG(dyld): Relocating 0 rela, 0 rel, 16 pltrel
[ 15.579868s] TRACE(dyld::reloc): resolved symbol "__artiq_personality"
[ 15.586387s] TRACE(dyld::reloc): resolved symbol "rpc_send_async"
[ 15.592474s] DEBUG(runtime::kernel): kernel loaded
[ 15.597912s] INFO(runtime::kernel): kernel starting
[ 15.602863s] TRACE(runtime::rpc): send<1>(None)->None
[ 15.607907s] INFO(runtime::kernel): kernel finished
```
It seems that the comms module did not receive a LoadKernel request here...
Interrupting the experiment with Ctrl-C and re-running it did not produce any log related to shutting down the connection, etc.
Looks like a smoltcp/async problem. When this occurs, in the `expect` function in `proto_async.rs`, `stream.recv` keeps calling the closure with `buf == [90, 90, 90]` even though the client has sent much more data (as seen in the Wireshark trace).
I had removed the RPC code when testing this (as RPCs tend to cause corruption and insane behavior), so this is probably an unrelated, regular bug.
Deleted the kernel loading and execution code completely from core1; the behavior is exactly the same. This continues to look like a Bohrbug.
Seeing this repeatedly after the kernel gets stuck when enabling feature `log` for smoltcp:
```
[ 385.035827s] DEBUG(smoltcp::socket::tcp): #3:192.168.1.52:1381:192.168.1.1:44132: received a keep-alive or window probe packet, will send an ACK
[ 385.048804s] DEBUG(smoltcp::socket::tcp): sending sACK option with current assembler ranges
```
Two more observations:
- Outgoing TCP packets (both replies + egress) have `window_len: 0` when the connection gets stuck
- Increasing the `TcpSocket` RX buffer size is a workaround (see the sketch below for what that means concretely)
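To make the workaround concrete, here is a minimal sketch at the smoltcp level, assuming the smoltcp 0.6-style `TcpSocket`/`TcpSocketBuffer` API (the helper name and buffer sizes are made up for illustration; the runtime's sockets are actually wrapped by libasync's `TcpStream`):
```rust
use smoltcp::socket::{TcpSocket, TcpSocketBuffer};

// Hypothetical helper: allocate a socket with a larger RX ring. The RX ring
// is where received segments are queued until a recv() closure dequeues them;
// once it fills up, the advertised TCP window drops to 0 (as observed above).
fn make_socket() -> TcpSocket<'static> {
    let rx_buffer = TcpSocketBuffer::new(vec![0u8; 64 * 1024]); // sizes are arbitrary
    let tx_buffer = TcpSocketBuffer::new(vec![0u8; 8 * 1024]);
    TcpSocket::new(rx_buffer, tx_buffer)
}
```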
The issue happens when TCP RX packets were fragmented in an unfortunate way. We may only get a tiny fragment of the RX buffer in a `recv()` callback.
I am considering a revision of our `recv()` API and how we use it. Returning `Poll::Pending` from the callback is not useful due to this behavior.
To demonstrate, this resolves the problem for me:
```patch
diff --git a/src/runtime/src/proto_async.rs b/src/runtime/src/proto_async.rs
index 953c6f2..46250a8 100644
--- a/src/runtime/src/proto_async.rs
+++ b/src/runtime/src/proto_async.rs
@@ -10,18 +10,27 @@ use libasync::smoltcp::TcpStream;
 type Result<T> = core::result::Result<T, smoltcp::Error>;
 pub async fn expect(stream: &TcpStream, pattern: &[u8]) -> Result<bool> {
-    stream.recv(|buf| {
-        for (i, b) in buf.iter().enumerate() {
-            if *b == pattern[i] {
-                if i + 1 == pattern.len() {
-                    return Poll::Ready((i + 1, Ok(true)));
+    let mut offset = 0;
+    while offset < pattern.len() {
+        let new_offset = stream.recv(|buf| {
+            let len = buf.len().min(pattern.len() - offset);
+            for (i, b) in buf[..len].iter().enumerate() {
+                if *b != pattern[offset + i] {
+                    // mismatch
+                    return Poll::Ready((i + 1, None));
                 }
-            } else {
-                return Poll::Ready((i + 1, Ok(false)));
             }
+            Poll::Ready((len, Some(offset + len)))
+        }).await?;
+        match new_offset {
+            Some(new_offset) =>
+                offset = new_offset,
+            None =>
+                // mismatch
+                return Ok(false),
         }
-        Poll::Pending
-    }).await?
+    }
+    Ok(true)
 }
```
On the other hand, let's have another look at whether smoltcp shouldn't be able to merge these fragments...
Ah, doesn't `TcpSocket::recv()` only return a single contiguous slice from the ring buffer? That is, don't you need to either keep a buffer in `TcpStream` to make the `TcpStream::recv` API work, or change clients to be able to accept a list of buffers?
In other words, it seems like you'll always need to expect smoltcp's `recv()` to return incomplete data.
See e.g. https://github.com/smoltcp-rs/smoltcp/issues/359 for a discussion about changing the buffer data structure to avoid the extra copy in client code.
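For context, this is roughly the shape of the underlying smoltcp call (a sketch against the 0.6-era closure-based `TcpSocket::recv`; the `drain_some` helper is hypothetical, only to illustrate that the closure sees a single contiguous slice of the ring):
```rust
use smoltcp::socket::TcpSocket;

// Hypothetical helper illustrating the recv() contract: the closure gets one
// contiguous slice of the RX ring and reports how many bytes it consumed.
fn drain_some(socket: &mut TcpSocket, dst: &mut [u8]) -> smoltcp::Result<usize> {
    socket.recv(|buf| {
        // `buf` is only the contiguous readable part of the ring buffer.
        // If the queued data wraps around the end of the ring, the remainder
        // is only visible on a later call, even though it was already ACKed.
        let n = buf.len().min(dst.len());
        dst[..n].copy_from_slice(&buf[..n]);
        (n, n) // (bytes dequeued from the ring, value returned to the caller)
    })
}
```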
@astro: Do you think you'll get around to finishing this soon? If not, I can have a look.
One way of solving this would be to mandate that consumers take at least one byte when there's any data available. It shouldn't be too hard to fix the runtime's socket I/O functions to ensure that, AFAICT.
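A sketch of that rule applied to a generic read helper, assuming `TcpStream::recv` has the closure-returning-`Poll<(usize, R)>` shape used in the patch above (the helper name and signature are made up; `Result` is the alias from `proto_async.rs`):
```rust
use core::task::Poll;
use libasync::smoltcp::TcpStream;

// Hypothetical helper: consume at least one byte whenever the closure is
// handed a non-empty buffer, so the RX ring drains and the window can reopen,
// and keep awaiting recv() until the destination is full.
// `Result<T>` is proto_async's alias for core::result::Result<T, smoltcp::Error>.
async fn read_into(stream: &TcpStream, dst: &mut [u8]) -> Result<()> {
    let mut filled = 0;
    while filled < dst.len() {
        let remaining = &mut dst[filled..];
        let n = stream.recv(|buf| {
            let n = buf.len().min(remaining.len());
            remaining[..n].copy_from_slice(&buf[..n]);
            Poll::Ready((n, n))
        }).await?;
        filled += n;
    }
    Ok(())
}
```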
@astro @pca006132 @dpn Thanks for finding out the root cause of the problem. I fixed it; now I can run 1000 iterations of a kernel in a loop without problems. If you think you have a better solution, let me know.
Even if smoltcp were able to merge subsequent TCP segments, we may still end up receiving at the tail end of the ring buffer. The current approach seems best.
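To illustrate that last point with a toy example (nothing below is from the runtime; it is only a stand-alone sketch of ring-buffer wrap-around): even when all queued bytes arrived in one segment, the first contiguous read can stop at the physical end of the storage.
```rust
// Toy ring buffer, only to show why a contiguous read can return fewer bytes
// than are queued: the readable region may wrap around the end of the storage.
struct Ring {
    buf: [u8; 8],
    read: usize, // index of the oldest queued byte
    len: usize,  // number of bytes currently queued
}

impl Ring {
    // Return the readable bytes up to the physical end of the storage only;
    // the wrapped-around remainder needs a second read.
    fn contiguous_read(&self) -> &[u8] {
        let end = (self.read + self.len).min(self.buf.len());
        &self.buf[self.read..end]
    }
}

fn main() {
    // 4 bytes ("AB" followed by "CD") are queued, but the read position is
    // two slots from the end of the 8-byte storage, so the first contiguous
    // slice is only "AB"; "CD" shows up on the next read.
    let ring = Ring { buf: *b"CDxxxxAB", read: 6, len: 4 };
    assert_eq!(ring.contiguous_read(), &b"AB"[..]);
}
```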