Performance degradation for smoltcp 0.7 #127

Open
opened 2021-01-26 13:21:30 +08:00 by pca006132 · 4 comments
Contributor

I've tried the latest release for smoltcp, and got some regression:

Previously we have:

Test Mean (MiB/s) std (MiB/s)
I32 Array (1MB) H2D 47.00 5.21
I32 Array (1MB) D2H 63.64 2.36
I32 Array (1KB) H2D 2.23 0.53
I32 Array (1KB) D2H 2.27 0.35
Bytes List (1MB) H2D 34.96 3.25
Bytes List (1MB) D2H 45.57 1.89
Bytes List (1KB) H2D 2.54 0.46
Bytes List (1KB) D2H 2.44 0.35
Bytes (1MB) H2D 53.61 3.64
Bytes (1MB) D2H 63.76 2.32
Bytes (1KB) H2D 1.98 0.38
Bytes (1KB) D2H 2.26 0.40
I32 List (1MB) H2D 39.56 5.82
I32 List (1MB) D2H 51.25 5.51
I32 List (1KB) H2D 1.98 0.21
I32 List (1KB) D2H 2.27 0.34

After updating smoltcp:

Test Mean (MiB/s) std (MiB/s)
I32 Array (1MB) H2D 29.42 9.25
I32 Array (1MB) D2H 64.06 2.48
I32 Array (1KB) H2D 2.15 0.30
I32 Array (1KB) D2H 1.93 0.14
Bytes List (1MB) H2D 24.26 5.21
Bytes List (1MB) D2H 39.51 4.78
Bytes List (1KB) H2D 2.11 0.31
Bytes List (1KB) D2H 2.04 0.23
Bytes (1MB) H2D 37.31 9.97
Bytes (1MB) D2H 63.79 2.24
Bytes (1KB) H2D 1.97 0.08
Bytes (1KB) D2H 2.01 0.21
I32 List (1MB) H2D 22.40 8.79
I32 List (1MB) D2H 43.16 6.78
I32 List (1KB) H2D 1.88 0.07
I32 List (1KB) D2H 2.09 0.17

Could this be related to smoltcp/smoltcp#408? I haven't check it with wireshark.

I've tried the latest release for smoltcp, and got some regression: Previously we have: | Test | Mean (MiB/s) | std (MiB/s) | | -------------------- | ------------ | ------------ | | I32 Array (1MB) H2D | 47.00 | 5.21 | | I32 Array (1MB) D2H | 63.64 | 2.36 | | I32 Array (1KB) H2D | 2.23 | 0.53 | | I32 Array (1KB) D2H | 2.27 | 0.35 | | Bytes List (1MB) H2D | 34.96 | 3.25 | | Bytes List (1MB) D2H | 45.57 | 1.89 | | Bytes List (1KB) H2D | 2.54 | 0.46 | | Bytes List (1KB) D2H | 2.44 | 0.35 | | Bytes (1MB) H2D | 53.61 | 3.64 | | Bytes (1MB) D2H | 63.76 | 2.32 | | Bytes (1KB) H2D | 1.98 | 0.38 | | Bytes (1KB) D2H | 2.26 | 0.40 | | I32 List (1MB) H2D | 39.56 | 5.82 | | I32 List (1MB) D2H | 51.25 | 5.51 | | I32 List (1KB) H2D | 1.98 | 0.21 | | I32 List (1KB) D2H | 2.27 | 0.34 | After updating smoltcp: | Test | Mean (MiB/s) | std (MiB/s) | | -------------------- | ------------ | ------------ | | I32 Array (1MB) H2D | 29.42 | 9.25 | | I32 Array (1MB) D2H | 64.06 | 2.48 | | I32 Array (1KB) H2D | 2.15 | 0.30 | | I32 Array (1KB) D2H | 1.93 | 0.14 | | Bytes List (1MB) H2D | 24.26 | 5.21 | | Bytes List (1MB) D2H | 39.51 | 4.78 | | Bytes List (1KB) H2D | 2.11 | 0.31 | | Bytes List (1KB) D2H | 2.04 | 0.23 | | Bytes (1MB) H2D | 37.31 | 9.97 | | Bytes (1MB) D2H | 63.79 | 2.24 | | Bytes (1KB) H2D | 1.97 | 0.08 | | Bytes (1KB) D2H | 2.01 | 0.21 | | I32 List (1MB) H2D | 22.40 | 8.79 | | I32 List (1MB) D2H | 43.16 | 6.78 | | I32 List (1KB) H2D | 1.88 | 0.07 | | I32 List (1KB) D2H | 2.09 | 0.17 | Could this be related to [smoltcp/smoltcp#408](https://github.com/smoltcp-rs/smoltcp/issues/408)? I haven't check it with wireshark.
Contributor

It's actually due to https://github.com/smoltcp-rs/smoltcp/issues/404

TCP delayed ack is now disabled with fcb38fae6c.

It's actually due to https://github.com/smoltcp-rs/smoltcp/issues/404 TCP delayed ack is now disabled with fcb38fae6cb3d57ec406e83827b2a65558543bb3.
Owner

OK, thanks for tracking this down!

OK, thanks for tracking this down!
Owner

Now:

Test Mean (MiB/s) std (MiB/s)
I32 Array (1MB) H2D 83.80 2.30
I32 Array (1MB) D2H 65.47 2.90
I32 Array (1KB) H2D 6.21 1.39
I32 Array (1KB) D2H 5.86 0.85
Bytes List (1MB) H2D 50.97 5.73
Bytes List (1MB) D2H 45.05 9.73
Bytes List (1KB) H2D 5.89 1.38
Bytes List (1KB) D2H 6.30 0.93
Bytes (1MB) H2D 83.27 4.06
Bytes (1MB) D2H 63.14 2.38
Bytes (1KB) H2D 4.41 2.17
Bytes (1KB) D2H 4.70 2.16
I32 List (1MB) H2D 54.89 12.67
I32 List (1MB) D2H 52.10 8.48
I32 List (1KB) H2D 3.69 1.76
I32 List (1KB) D2H 3.93 1.46
Now: | Test | Mean (MiB/s) | std (MiB/s) | | -------------------- | ------------ | ------------ | | I32 Array (1MB) H2D | 83.80 | 2.30 | | I32 Array (1MB) D2H | 65.47 | 2.90 | | I32 Array (1KB) H2D | 6.21 | 1.39 | | I32 Array (1KB) D2H | 5.86 | 0.85 | | Bytes List (1MB) H2D | 50.97 | 5.73 | | Bytes List (1MB) D2H | 45.05 | 9.73 | | Bytes List (1KB) H2D | 5.89 | 1.38 | | Bytes List (1KB) D2H | 6.30 | 0.93 | | Bytes (1MB) H2D | 83.27 | 4.06 | | Bytes (1MB) D2H | 63.14 | 2.38 | | Bytes (1KB) H2D | 4.41 | 2.17 | | Bytes (1KB) D2H | 4.70 | 2.16 | | I32 List (1MB) H2D | 54.89 | 12.67 | | I32 List (1MB) D2H | 52.10 | 8.48 | | I32 List (1KB) H2D | 3.69 | 1.76 | | I32 List (1KB) D2H | 3.93 | 1.46 |
Contributor

The more I think about delayed acks and read up on them, the more I'm getting the impression that this is the expected downside. We bet the delayed ack timeout for adding window updates and response data to acking received data. There are gains if we're decide to optimize settings instead of disabling delayed acks.

Cora Z7 baseline with delayed acks disabled

Test Mean (MiB/s) std (MiB/s)
I32 Array (1MB) H2D 59.64 2.24
I32 Array (1MB) D2H 51.85 0.38
I32 Array (1KB) H2D 4.51 1.01
I32 Array (1KB) D2H 4.52 0.83
Bytes List (1MB) H2D 42.36 1.65
Bytes List (1MB) D2H 41.80 1.37
Bytes List (1KB) H2D 3.79 0.79
Bytes List (1KB) D2H 4.17 0.67
Bytes (1MB) H2D 64.36 1.97
Bytes (1MB) D2H 51.98 0.63
Bytes (1KB) H2D 4.03 1.18
Bytes (1KB) D2H 4.23 1.02
I32 List (1MB) H2D 51.02 1.57
I32 List (1MB) D2H 46.88 0.24
I32 List (1KB) H2D 3.78 0.45
I32 List (1KB) D2H 4.41 0.69

Cora Z7 with 256KB TCP RX buffers and delayed acks at 1ms

Test Mean (MiB/s) std (MiB/s)
I32 Array (1MB) H2D 79.13 1.52
I32 Array (1MB) D2H 50.82 0.55
I32 Array (1KB) H2D 4.42 0.61
I32 Array (1KB) D2H 3.71 0.20
Bytes List (1MB) H2D 42.33 2.11
Bytes List (1MB) D2H 38.87 1.26
Bytes List (1KB) H2D 5.05 1.18
Bytes List (1KB) D2H 5.00 1.04
Bytes (1MB) H2D 89.84 3.08
Bytes (1MB) D2H 51.21 0.68
Bytes (1KB) H2D 4.86 1.09
Bytes (1KB) D2H 5.07 0.83
I32 List (1MB) H2D 59.27 1.28
I32 List (1MB) D2H 44.35 0.25
I32 List (1KB) H2D 4.98 1.27
I32 List (1KB) D2H 5.08 1.23

Larger RX buffers let the window grow larger. Delayed acks at 1ms instead of smoltcp's default 10ms leaves opportunity to combine acks while not wasting too much time.

There's not an improvement in every case but in some. YMMV.

diff --git src/runtime/src/comms.rs src/runtime/src/comms.rs
index da9d8ac..ab14407 100644
--- src/runtime/src/comms.rs
+++ src/runtime/src/comms.rs
@@ -13,7 +13,7 @@ use libboard_zynq::{
         self,
         wire::IpCidr,
         iface::{NeighborCache, EthernetInterfaceBuilder},
-        time::Instant,
+        time::{Duration, Instant},
     },
     timer::GlobalTimer,
 };
@@ -314,7 +314,7 @@ async fn load_kernel(buffer: &Vec<u8>, control: &Rc<RefCell<kernel::Control>>, s
 }
 
 async fn handle_connection(stream: &mut TcpStream, control: Rc<RefCell<kernel::Control>>) -> Result<()> {
-    stream.set_ack_delay(None);
+    stream.set_ack_delay(Some(Duration::from_millis(1)));
 
     if !expect(stream, b"ARTIQ coredev\n").await? {
         return Err(Error::UnexpectedPattern);
@@ -409,7 +409,7 @@ pub fn main(timer: GlobalTimer, cfg: Config) {
         let connection = Rc::new(Semaphore::new(1, 1));
         let terminate = Rc::new(Semaphore::new(0, 1));
         loop {
-            let mut stream = TcpStream::accept(1381, 0x10_000, 0x10_000).await.unwrap();
+            let mut stream = TcpStream::accept(1381, 0x4_0000, 0x1_0000).await.unwrap();
 
             if connection.try_wait().is_none() {
                 // there is an existing connection
The more I think about delayed acks and read up on them, the more I'm getting the impression that this is the expected downside. We bet the delayed ack timeout for adding window updates and response data to acking received data. There are gains if we're decide to optimize settings instead of disabling delayed acks. ## Cora Z7 baseline with delayed acks disabled | Test | Mean (MiB/s) | std (MiB/s) | | -------------------- | ------------ | ------------ | | I32 Array (1MB) H2D | 59.64 | 2.24 | | I32 Array (1MB) D2H | 51.85 | 0.38 | | I32 Array (1KB) H2D | 4.51 | 1.01 | | I32 Array (1KB) D2H | 4.52 | 0.83 | | Bytes List (1MB) H2D | 42.36 | 1.65 | | Bytes List (1MB) D2H | 41.80 | 1.37 | | Bytes List (1KB) H2D | 3.79 | 0.79 | | Bytes List (1KB) D2H | 4.17 | 0.67 | | Bytes (1MB) H2D | 64.36 | 1.97 | | Bytes (1MB) D2H | 51.98 | 0.63 | | Bytes (1KB) H2D | 4.03 | 1.18 | | Bytes (1KB) D2H | 4.23 | 1.02 | | I32 List (1MB) H2D | 51.02 | 1.57 | | I32 List (1MB) D2H | 46.88 | 0.24 | | I32 List (1KB) H2D | 3.78 | 0.45 | | I32 List (1KB) D2H | 4.41 | 0.69 | ## Cora Z7 with 256KB TCP RX buffers and delayed acks at 1ms | Test | Mean (MiB/s) | std (MiB/s) | | -------------------- | ------------ | ------------ | | I32 Array (1MB) H2D | 79.13 | 1.52 | | I32 Array (1MB) D2H | 50.82 | 0.55 | | I32 Array (1KB) H2D | 4.42 | 0.61 | | I32 Array (1KB) D2H | 3.71 | 0.20 | | Bytes List (1MB) H2D | 42.33 | 2.11 | | Bytes List (1MB) D2H | 38.87 | 1.26 | | Bytes List (1KB) H2D | 5.05 | 1.18 | | Bytes List (1KB) D2H | 5.00 | 1.04 | | Bytes (1MB) H2D | 89.84 | 3.08 | | Bytes (1MB) D2H | 51.21 | 0.68 | | Bytes (1KB) H2D | 4.86 | 1.09 | | Bytes (1KB) D2H | 5.07 | 0.83 | | I32 List (1MB) H2D | 59.27 | 1.28 | | I32 List (1MB) D2H | 44.35 | 0.25 | | I32 List (1KB) H2D | 4.98 | 1.27 | | I32 List (1KB) D2H | 5.08 | 1.23 | Larger RX buffers let the window grow larger. Delayed acks at 1ms instead of smoltcp's default 10ms leaves opportunity to combine acks while not wasting too much time. There's not an improvement in every case but in some. YMMV. ```diff diff --git src/runtime/src/comms.rs src/runtime/src/comms.rs index da9d8ac..ab14407 100644 --- src/runtime/src/comms.rs +++ src/runtime/src/comms.rs @@ -13,7 +13,7 @@ use libboard_zynq::{ self, wire::IpCidr, iface::{NeighborCache, EthernetInterfaceBuilder}, - time::Instant, + time::{Duration, Instant}, }, timer::GlobalTimer, }; @@ -314,7 +314,7 @@ async fn load_kernel(buffer: &Vec<u8>, control: &Rc<RefCell<kernel::Control>>, s } async fn handle_connection(stream: &mut TcpStream, control: Rc<RefCell<kernel::Control>>) -> Result<()> { - stream.set_ack_delay(None); + stream.set_ack_delay(Some(Duration::from_millis(1))); if !expect(stream, b"ARTIQ coredev\n").await? { return Err(Error::UnexpectedPattern); @@ -409,7 +409,7 @@ pub fn main(timer: GlobalTimer, cfg: Config) { let connection = Rc::new(Semaphore::new(1, 1)); let terminate = Rc::new(Semaphore::new(0, 1)); loop { - let mut stream = TcpStream::accept(1381, 0x10_000, 0x10_000).await.unwrap(); + let mut stream = TcpStream::accept(1381, 0x4_0000, 0x1_0000).await.unwrap(); if connection.try_wait().is_none() { // there is an existing connection ```
Sign in to join this conversation.
No Milestone
No Assignees
3 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: M-Labs/artiq-zynq#127
No description provided.