Break when using custom compiler-builtins + opt-level change #93

Closed
opened 2020-08-21 17:29:28 +08:00 by pca006132 · 7 comments

The current master breaks when the optimization level is changed to 's' together with our custom `compiler-builtins` that provides a faster memcpy. We have to use a higher optimization level to get more speed out of the ethernet driver, as shown in the zynq-rs experiments.

Behavior:

  • Ethernet has high latency (10 s for an `artiq_coremgmt log` response; hangs during RPC transfer).
  • Panic during RPC transfer (not on this master, but on my own branch).

Reproduce:

  1. Change the optimization level; modify the linker script for szl if needed to get more space.
  2. Substitute compiler-builtins with our own version that provides a faster memcpy (see the Cargo.toml sketch after this list):
    1. Patch cargo-xbuild: https://git.m-labs.hk/pca006132/artiq-zynq/src/branch/l2-cache/xbuild.patch
    2. Patch the crates.io source in `Cargo.toml`.
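
For concreteness, here is a minimal sketch of the kind of `Cargo.toml` changes involved. The opt-level value and the local path to the forked compiler-builtins are illustrative assumptions, not the exact configuration used in this repository:

```toml
# Hypothetical Cargo.toml excerpt -- path and opt-level are illustrative.
[profile.release]
opt-level = "s"   # changed from "z"; this is the setting that triggers the breakage

[patch.crates-io]
# Point compiler_builtins at a local fork that provides a faster memcpy.
compiler_builtins = { path = "../compiler-builtins" }
```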

The behavior is much worse on my branch, which adds some RPC performance improvements and upgrades to the latest zynq-rs dependency. Note that my branch modifies the RPC protocol for better performance, so you have to use my artiq fork (https://github.com/pca006132/artiq/tree/rpc) if you need to do list/array RPC.

Things I've found:

  • It is perfectly fine when the optimization level is 'z'.
  • Which memcpy is used does not matter (the implementation itself is fine); I have tried changing the memcpy implementation and it does not help.
  • It is not related to the L2 cache, as that is not enabled in the current master.

NOTE: If you use the patched cargo-xbuild, you have to remove the sysroot manually when a dependency changes, otherwise the rebuild will not pick up the change.

For the current master, we can override selected crates (smoltcp, libboard_zynq, libasync) with opt-level s/2 without triggering the bug (I have only done a few tests, so I am not sure it is perfectly fine), but this still breaks on my branch.
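
For reference, such per-crate overrides use Cargo's profile override syntax. A minimal sketch follows; the crate names and level are taken from the crates mentioned above, but the exact entries used here may differ:

```toml
# Hypothetical excerpt: raise the opt-level of only the networking-related
# crates, leaving the rest of the firmware at the default level.
[profile.release.package.smoltcp]
opt-level = "s"

[profile.release.package.libboard_zynq]
opt-level = "s"

[profile.release.package.libasync]
opt-level = "s"
```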

This sounds like some classic race condition somewhere, e.g. missing barriers in the Ethernet driver…
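
To illustrate the hypothesis (a generic sketch, not the actual zynq-rs driver code): a DMA descriptor check that is not ordered against the subsequent buffer accesses can observe stale data, and higher optimization levels make the compiler more likely to reorder or hoist the surrounding loads.

```rust
// Generic sketch of the suspected failure mode, not the zynq-rs Ethernet driver:
// descriptors written by the DMA engine need volatile accesses plus a barrier,
// so the ownership check is ordered before any read of the frame buffer.
use core::ptr;
use core::sync::atomic::{fence, Ordering};

#[repr(C)]
struct RxDesc {
    word0: u32, // bit 0 assumed to be the "owned by software" flag
    word1: u32, // assumed to carry frame length / status bits
}

/// Returns true once the DMA engine has handed this descriptor back to software.
unsafe fn frame_ready(desc: *const RxDesc) -> bool {
    // Volatile read: this memory is written by hardware, not by this program.
    let owned = ptr::read_volatile(ptr::addr_of!((*desc).word0)) & 1 != 0;
    // Acquire fence (a dmb on ARMv7-A): without it, later reads of the frame
    // buffer may be reordered before the ownership check.
    fence(Ordering::Acquire);
    owned
}
```

Whether the actual fix took this form is determined by the commit linked below; the sketch only shows the general pattern being discussed.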

Fixed in https://git.m-labs.hk/M-Labs/zynq-rs/commit/671968bac32ad4e98b1cbd6a117f41461cd44971

Well done!
Why does the comment say "start tcp transfer"? The ethernet layer doesn't know about TCP. Maybe it should say "start packet transfer"?

nice! Out of curiosity, what are the current ethernet performance numbers?

> Well done!
> Why does the comment say "start tcp transfer"? The ethernet layer doesn't know about TCP. Maybe it should say "start packet transfer"?

Yes, will fix that later.

> nice! Out of curiosity, what are the current ethernet performance numbers?

I have some further optimizations planned, including changing the optimization level, the RPC protocol, and parts of the RPC implementation. Currently the numbers are:

```
test_device_to_host (test_performance.TransferTest) ... 39.544179232226405 MiB/s
ok
test_device_to_host_list (test_performance.TransferTest) ... 1.0995738157449257 MiB/s
ok
test_host_to_device (test_performance.TransferTest) ... 40.07660003132631 MiB/s
ok
test_host_to_device_byte_list (test_performance.TransferTest) ... 0.1314385411215297 MiB/s
FAIL
test_host_to_device_list (test_performance.TransferTest) ... 0.5194979638173408 MiB/s
FAIL
```

With my RPC patch:

```
test_device_to_host (test_performance.TransferTest) ... 41.473910194683 MiB/s
ok
test_device_to_host_list (test_performance.TransferTest) ... 12.55758405745386 MiB/s
ok
test_host_to_device (test_performance.TransferTest) ... 45.35728618323115 MiB/s
ok
test_host_to_device_byte_list (test_performance.TransferTest) ... 20.744161099487005 MiB/s
ok
test_host_to_device_list (test_performance.TransferTest) ... 20.161186264845085 MiB/s
ok
```

These numbers are from the test at https://github.com/pca006132/artiq/blob/rpc/artiq/test/coredevice/test_performance.py.

Note that the current `test_performance.py` in master is broken for this measurement: async RPC on Zynq does not block until the buffer is sent, so we have to use normal RPC to test the throughput.

We can certainly do better later; right now we are optimized for size rather than speed. :)
And we haven't done any hacks there.
