poor Ethernet performance #30

Closed
opened 2020-05-25 17:44:41 +08:00 by sb10q · 8 comments

IIRC just a couple MB/s from @astro's tests.

2 MB/s

First approach: set the DDR pages bufferable (writeback) in the MMU. Unfortunately, the eth descriptors (not the buffers) need to be placed into non-bufferable pages, as they're smaller than a cache line. That means they cannot be invalidated from L1 individually.

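To illustrate why the descriptors are the problem (a minimal sketch with assumed names, not the zynq-rs code): each GEM descriptor is two 32-bit words, and the hardware walks them as a packed ring, so several descriptors share one 32-byte Cortex-A9 L1 line and cannot be invalidated individually. The `.uncached` section name below is hypothetical.

```rust
/// Two-word GEM descriptor (assumed layout; 8 bytes, i.e. a quarter of a
/// 32-byte L1 cache line).
#[derive(Clone, Copy)]
#[repr(C)]
struct EthDescriptor {
    addr: u32,   // buffer address plus ownership/wrap bits
    status: u32, // frame length and status bits
}

// Padding the ring out to one descriptor per cache line is not an option,
// because the DMA expects consecutive 8-byte entries. The ring therefore has
// to live in a non-bufferable MMU page; `.uncached` is an assumed section
// name that the linker script would have to provide.
#[link_section = ".uncached"]
static mut RX_RING: [EthDescriptor; 8] = [EthDescriptor { addr: 0, status: 0 }; 8];
```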

What is the Xilinx code doing?

The Xilinx embeddedsw doesn't contain cache/barrier instructions. It probably runs w/o bufferable pages.

The Linux driver uses only barriers, which I have so far been unable to adapt for reliable behavior.

One non-bufferable MMU page just for the descriptors seems to be a promising solution...

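For comparison, a rough sketch of the barrier-only discipline (in the spirit of what the Linux driver does) rather than uncached descriptors; the descriptor layout and `submit_tx` here are simplified placeholders, not driver code:

```rust
use core::ptr::{addr_of_mut, write_volatile};
use core::sync::atomic::{fence, Ordering};

#[repr(C)]
struct TxDescriptor {
    addr: u32,   // word 0: buffer address
    status: u32, // word 1: control bits; bit 31 is the ownership bit (assumed layout)
}

/// Hand one TX descriptor to the DMA; `desc` points into the descriptor ring.
unsafe fn submit_tx(desc: *mut TxDescriptor, buf_addr: u32, len: u32) {
    unsafe {
        write_volatile(addr_of_mut!((*desc).addr), buf_addr);
        // The buffer address must be observable before ownership is released,
        // or the GEM could fetch a half-written descriptor.
        fence(Ordering::SeqCst); // stands in for a `dmb` on the Cortex-A9
        // Writing the length with the ownership bit clear passes the
        // descriptor to the DMA.
        write_volatile(addr_of_mut!((*desc).status), len & 0x3fff);
    }
}
```

Getting this right on every rx/tx path (plus cache maintenance once the pages are bufferable) is exactly what has been hard to make reliable, which is why a dedicated non-bufferable page for the descriptors looks simpler.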

Enabling MMU bufferable pages has doubled throughput.

M-Labs/artiq-zynq#55

Enabling the L2 cache completely breaks the Ethernet driver. That is probably related to a cache invalidation issue, as a lot of the polling returns buffers with 0 length. There are several seconds of delay.

By adding `dcci_slice` for the rx buffer and returning `Err(None)` from `recv_next` when the length is 0, the delay is reduced, but there are still occasional latency spikes of up to ~500 ms which do not occur with the L2 cache turned off.

Also, there is still a problem with the RTIO analyzer buffer transmission: the stream is not closed even if I flush the data at the end of the transmission. The stream does close if I open another connection, though.

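A rough sketch of those two changes (`dcci_slice` and `recv_next` are the names from above; everything else is a placeholder so the snippet stands alone, not the actual driver code):

```rust
struct RxDescriptor {
    status: u32, // low bits hold the received frame length (placeholder layout)
}

impl RxDescriptor {
    fn frame_len(&self) -> usize {
        (self.status & 0x1fff) as usize
    }
}

enum EthError {} // real driver errors elided

/// Stub for the D-cache clean+invalidate-by-slice helper; on the real target
/// this walks the slice's cache lines.
fn dcci_slice(_data: &[u8]) {}

/// `Err(None)` means "no complete frame yet": the caller polls again instead
/// of being handed a zero-length buffer.
fn recv_next<'a>(desc: &RxDescriptor, buf: &'a [u8]) -> Result<&'a [u8], Option<EthError>> {
    let len = desc.frame_len();
    if len == 0 {
        return Err(None);
    }
    // Invalidate the rx buffer so the CPU reads what the DMA wrote to DDR,
    // not a stale cached copy.
    dcci_slice(&buf[..len]);
    Ok(&buf[..len])
}
```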

The L2 cache problem is fixed by modifying the MMU setting for the uncached slice. However, the rx speed is still pretty slow compared to the tx speed. There are also a few other changes, like modifying the cache flush for rx and preventing duplicates in the waker queue.

Branches:

* https://git.m-labs.hk/pca006132/zynq-rs/src/branch/l2-cache (note that the L2 cache is not enabled by default)
* https://git.m-labs.hk/pca006132/artiq-zynq/src/branch/l2-cache

```
> python -m unittest test_performance -v
test_kernel_overhead (test_performance.KernelOverheadTest) ... 0.017314874350558965 s
ok
test_device_to_host (test_performance.TransferTest) ... 31.725000159885067 MiB/s
ok
test_device_to_host_array (test_performance.TransferTest) ... 5.5078113984377195 MiB/s
ok
test_host_to_device (test_performance.TransferTest) ... 3.6111348770332103 MiB/s
ok
```
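
One way to prevent duplicates in the waker queue, assuming a simple Vec-backed queue (a sketch, not the actual branch code): `Waker::will_wake` lets the driver skip registrations that already target the same task, so repeated polls do not grow the queue or wake the task more than once.

```rust
use core::task::Waker;

struct WakerQueue {
    wakers: Vec<Waker>,
}

impl WakerQueue {
    /// Called from poll paths; registering the same task twice is a no-op.
    fn register(&mut self, waker: &Waker) {
        if !self.wakers.iter().any(|w| w.will_wake(waker)) {
            self.wakers.push(waker.clone());
        }
    }

    /// Called when new frames arrive or tx descriptors free up.
    fn wake_all(&mut self) {
        for waker in self.wakers.drain(..) {
            waker.wake();
        }
    }
}
```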

We get good performance by changing the memcpy implementation and the optimization level to 's'/2. Those changes will be done in artiq-zynq; closing now.
