enhanced RTIO event submission (ACPKI) #55

Closed
opened 3 years ago by sb10q · 14 comments
sb10q commented 3 years ago
Owner

https://github.com/m-labs/artiq/issues/1167#issuecomment-427188287

Implementation already done at Oxford, needs integration and testing (some potential for difficult bugs).

sb10q added the
priority:medium
label 3 years ago
Owner

https://git.m-labs.hk/pca006132/artiq-zynq/src/branch/rtio/src

Currently implemented wide output. It takes ~490ns for normal output, and up to ~650ns for wide output. We may be able to reduce the latency if we enable L2 cache, but that is currently problematic as it breaks drivers.


> Currently implemented wide output. It takes ~490ns for normal output, and up to ~650ns for wide output. We may be able to reduce the latency if we enable L2 cache, but that is currently problematic as it breaks drivers.

Compared to the numbers reported in https://github.com/m-labs/artiq/issues/1167#issuecomment-427188287, that's quite slow. Any idea as to the origin of the discrepancy?

Owner

> > Currently implemented wide output. It takes ~490ns for normal output, and up to ~650ns for wide output. We may be able to reduce the latency if we enable L2 cache, but that is currently problematic as it breaks drivers.
>
> Compared to the numbers reported in https://github.com/m-labs/artiq/issues/1167#issuecomment-427188287, that's quite slow. Any idea as to the origin of the discrepancy?

I think that is probably due to L2 cache. We have not enabled the L2 cache currently.

If I move the status check before sending events to RTIO, i.e. make the time the CPU spends executing instructions overlap with the time the gateware takes to send back the status buffer, I can get 300ns or less. So the time is probably spent on code execution.

I'm currently working on enabling the L2 cache, which requires some configuration in the MMU etc.
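
To illustrate why moving the status check helps, here is a minimal timing sketch. It is only a toy model, not the artiq-zynq code: the function names are hypothetical, and the 300ns/190ns split between CPU work and the gateware round trip is an assumed breakdown picked to match the ~490ns and ~300ns figures, not a measurement.

```python
def sequential_period_ns(cpu_ns, gateware_ns):
    """Submit an event, wait for the status buffer, then prepare the next event.

    CPU work and the gateware round trip happen one after the other.
    """
    return cpu_ns + gateware_ns


def overlapped_period_ns(cpu_ns, gateware_ns):
    """Check the previous event's status just before submitting the next one.

    CPU work for the next event runs while the gateware is still writing
    back the status buffer, so the slower of the two sets the period.
    """
    return max(cpu_ns, gateware_ns)


# Assumed (hypothetical) split of the per-event time, not measured values.
CPU_NS, GATEWARE_NS = 300, 190

print(sequential_period_ns(CPU_NS, GATEWARE_NS))
print(overlapped_period_ns(CPU_NS, GATEWARE_NS))
```

In this model the overlapped period cannot drop below the CPU time, which would be consistent with the time being dominated by code execution.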

Owner

With L2 cache enabled with instruction and data prefetch, the time is reduced to ~390ns, but that is still pretty slow compared to the number reported in the issue.

I don't have much idea about the discrepancy. Could it be related to static linking of the kernel? I'm not sure whether that could account for ~100ns of difference.

sb10q changed title from enhanced RTIO event submission to enhanced RTIO event submission (ACPKI) 2 years ago
Owner

The branch for ACPKI with L2 cache enabled: https://git.m-labs.hk/pca006132/artiq-zynq/src/branch/l2-cache

Owner

With the optimization level changed from `z` to `s`, the time decreased by 10ns. I would expect some further improvement later when we change the optimization level to 2. Not sure if we can eventually get to 280ns.

Owner

Sorry for the late update, I have been working on RPC optimization for some time.

After toggling some CPU options and optimization options, the sustained output rate for the ACPKI interface can reach one event per 300ns, but not faster. I will try to finish the implementation in the upcoming days.
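
For reference, the per-event times quoted in this thread can be turned into event rates with a few lines of arithmetic. This is just a conversion sketch; treating each figure (~490ns, ~390ns, 300ns) as a sustained per-event period is an assumption, and `events_per_second` is a hypothetical helper name.

```python
def events_per_second(period_ns):
    """Convert a per-event period in nanoseconds to events per second."""
    return 1e9 / period_ns


# Figures quoted in this thread, assumed here to be sustained per-event periods.
for label, period_ns in [("initial", 490), ("L2 cache", 390), ("tuned", 300)]:
    rate_mhz = events_per_second(period_ns) / 1e6
    print(f"{label}: {period_ns}ns per event, {rate_mhz:.2f} Mevents/s")
```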


Great! So that brings this on a par with Chris' implementation.

I assume the position is now that there isn't likely to be any big improvement left without sacrificing some level of exception granularity (e.g. some form of batching API as Chris suggested), which would be a separate project...

Poster
Owner

"Batching" is not particularly relevant when there is regular DMA.


It's generally nice to do things without relying on pre-recorded DMA sequences. If batching allows us to reduce the time for compound RTIO events like DDS setting to be comparable to a single RTIO event, then we wouldn't need DMA in several places, which would be a big win for us.

Poster
Owner

Isn't batching just like DMA but with a different syntax and slightly different performance trade-offs?


That's not my understanding of Chris' proposal, but I'm also not best qualified to comment. Anyway, the main thing is that we think that the current implementation is about as good as it's going to get for bare RTIO.

Owner

Batching would give a significant performance improvement over single RTIO events. I tried reordering before (when the time was ~490ns...) and got a pretty large improvement (~50%). However, there should be less improvement now, as the overhead for instruction execution has been reduced by the CPU option tweaks.

Batching would probably be faster than DMA as we don't have to send the slice to core0, and we don't have to wait for replay of the DMA sequence. However, I think this requires some benchmark to prove/disprove the claim. I could write a simple implementation for batching and do a microbenchmark if that is needed.
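
As a rough way to reason about what batching could buy before any benchmark exists, here is a toy cost model. It assumes each submission pays a fixed overhead (for example the status check) once per batch plus a per-event cost; `per_event_ns` and the 200ns/100ns split are hypothetical and not derived from measurements.

```python
def per_event_ns(fixed_ns, payload_ns, batch_size):
    """Average time per event when the fixed overhead is paid once per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return fixed_ns / batch_size + payload_ns


# Assumed (hypothetical) split of a ~300ns single-event submission.
FIXED_NS, PAYLOAD_NS = 200, 100

for batch_size in (1, 2, 4, 8):
    print(batch_size, per_event_ns(FIXED_NS, PAYLOAD_NS, batch_size))
```

In this model the gain flattens out as the batch grows, and the status is only seen once per batch, which is the loss of exception granularity mentioned earlier in the thread. A real microbenchmark would still be needed to check whether the actual numbers look anything like this.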


> I could write a simple implementation for batching and do a micro benchmark if that is needed.

If that's something you feel like doing, I would be curious to see how the numbers pan out.

sb10q closed this issue 2 years ago
Reference: M-Labs/artiq-zynq#55