Low memcpy throughput #75

Open
opened 2021-01-22 14:54:20 +08:00 by pca006132 · 4 comments
Contributor

Tested with the following code in `artiq-zynq`:

```diff
--- a/src/runtime/src/main.rs
+++ b/src/runtime/src/main.rs
@@ -179,6 +179,16 @@ pub fn main_core0() {
     init_gateware();
     info!("detected gateware: {}", identifier_read(&mut [0; 64]));
 
+    use alloc::vec;
+    let mut v1 = vec![123u8; 1024*1024*16];
+    let mut v2 = vec![0u8; 1024*1024*16];
+    v1[0] = 45u8;
+    let t1 = timer.get_us().0 as f64;
+    v2.copy_from_slice(v1.as_slice());
+    let t2 = timer.get_us().0 as f64;
+    info!("memcpy throughput: {}MiB/s", 16.0 * 1_000_000.0 / (t2 - t1));
+    info!("v2: {}", v2[0]);
+
     i2c::init();
```
This gives ~630 MiB/s throughput. Same when using `u32`, so it should not be an alignment issue.

Seems a bit low considering branch prediction, all caches turned on, etc.
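For comparison off-target, the same measurement can be sketched as a host-side Rust program, with `std::time::Instant` standing in for the board timer (an illustrative sketch, not artiq-zynq code):

```rust
use std::time::Instant;

// Measure copy throughput for a buffer of `size` bytes; returns MiB/s.
// Illustrative host-side sketch of the benchmark in the diff above.
fn memcpy_throughput(size: usize) -> f64 {
    let src = vec![123u8; size];
    let mut dst = vec![0u8; size];
    let t0 = Instant::now();
    dst.copy_from_slice(&src); // compiles down to a memcpy
    let secs = t0.elapsed().as_secs_f64();
    assert_eq!(dst[size - 1], 123u8); // check the copy actually happened
    (size as f64 / (1024.0 * 1024.0)) / secs
}

fn main() {
    // 16 MiB, matching the on-target test
    let mib_s = memcpy_throughput(16 * 1024 * 1024);
    println!("memcpy throughput: {:.0} MiB/s", mib_s);
}
```

On a desktop CPU this typically reports several GiB/s, which is the baseline expectation being compared against here.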

Author
Contributor

The throughput is not much higher even when the data resides in cache.
Tested with 10 KB arrays: the throughput is roughly 2000 MiB/s. Definitely not normal.

Owner

If the cache is an SRAM with a 64-bit bus and 800MHz frequency, the maximum total throughput is `800e6*64/(8*1024*1024) = 6103 MiB/s`. And you need to read and write at the same time. It's not *that* far off...

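The back-of-envelope number works out as follows (assuming the stated 64-bit bus at 800 MHz; halving it for the simultaneous read and write is the memcpy case):

```rust
// Theoretical peak in MiB/s for a `bus_bits`-wide interface at `freq_hz`.
fn peak_mib_s(bus_bits: f64, freq_hz: f64) -> f64 {
    freq_hz * bus_bits / 8.0 / (1024.0 * 1024.0)
}

fn main() {
    let total = peak_mib_s(64.0, 800e6); // ≈ 6103.5 MiB/s raw bus throughput
    // memcpy both reads and writes, so the copy rate is at most half of that
    let copy_ceiling = total / 2.0;      // ≈ 3052 MiB/s
    println!("total bus: {:.0} MiB/s, memcpy ceiling: {:.0} MiB/s", total, copy_ceiling);
}
```

Against a ~3052 MiB/s ceiling, the measured ~2000 MiB/s cache-resident figure is indeed "not that far off".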
Author
Contributor

> If the cache is an SRAM with a 64-bit bus and 800MHz frequency, the maximum total throughput is `800e6*64/(8*1024*1024) = 6103 MiB/s`. And you need to read and write at the same time. It's not *that* far off...

OK, forgot that the frequency is lower than a typical PC CPU...

Author
Contributor

By tweaking the prefetch offset and `alloc_one_way` to avoid cache pollution for `memcpy`, the throughput is increased to about 780 MiB/s. Since `memcpy` both reads and writes, that figure should be doubled, so the real memory throughput is about 1560 MiB/s.

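The doubling is just bus-traffic accounting: a memcpy of N bytes moves 2N bytes over the memory interface (N read plus N written). As a sketch:

```rust
// A memcpy reported at `copy_rate` MiB/s actually moves twice that much
// data over the memory bus (every byte is both read and written).
fn bus_traffic_mib_s(copy_rate: f64) -> f64 {
    2.0 * copy_rate
}

fn main() {
    // 780 MiB/s measured copy rate → 1560 MiB/s of actual bus traffic
    println!("effective memory traffic: {:.0} MiB/s", bus_traffic_mib_s(780.0));
}
```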
Reference: M-Labs/zynq-rs#75