Use MMU for context switch #451

Merged
sb10q merged 8 commits from mwojcik/artiq-zynq:mmu_context_switch into master 2026-01-23 18:26:28 +08:00
mwojcik (Owner)

Replacing the library rebinds used for switching between normal RTIO, DMA, and batch modes with virtual address remapping (a sketch of the idea follows below).
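
A minimal sketch of the mechanism, assuming the Cortex-A9 short-descriptor translation scheme used on the Zynq - the names, addresses, and descriptor bits below are illustrative, not this PR's actual code:

```rust
// Illustrative sketch only; symbols and addresses are hypothetical.
// On the Cortex-A9, a 1MB "section" entry in the L1 translation table maps
// one virtual megabyte onto one physical megabyte. A context switch then
// amounts to pointing a fixed virtual window at a different physical copy
// of the RTIO functions and invalidating the stale translation - no
// function pointers get rebound.
const RTIO_WINDOW_VA: usize = 0x0010_0000; // fixed VMA that all call sites use
const SECTION_SHIFT: usize = 20;           // 1MB sections

#[derive(Clone, Copy)]
#[repr(u32)]
enum Context {
    Rtio  = 0x0010_0000, // physical page with the normal RTIO functions
    Dma   = 0x0020_0000, // physical page with the DMA-recording functions
    Batch = 0x0030_0000, // physical page with the batch functions
}

unsafe fn switch_context(l1_table: *mut u32, ctx: Context) {
    // Section descriptor: base address in bits [31:20], type bits 0b10
    // (AP/domain/TEX/C/B bits elided for brevity).
    let descriptor = (ctx as u32) | 0b10;
    l1_table.add(RTIO_WINDOW_VA >> SECTION_SHIFT).write_volatile(descriptor);
    core::arch::asm!(
        "mcr p15, 0, {va}, c8, c7, 1", // TLBIMVA: drop the old translation
        "dsb",
        "isb",
        va = in(reg) RTIO_WINDOW_VA,
    );
}
```

A real implementation also needs instruction-cache and branch-predictor maintenance after the remap, since the window contains code.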

The good:
It works. Library rebinds are no longer necessary. It even leads to some DMA benchmarks improving a little in HITL tests.

The bad:

  • Using a 1MB page size at the moment, eating 3MB in total (no big deal - we're using around 279MB as verified by objdump).
  • `OVERLAY` cannot be used, as it doesn't allow the pointer operations I use to keep two functions at constant offsets. `OVERLAY` is nicely supported by the default checks, but doing it manually required `--no-check-sections` - a potential footgun. It is documented, but I can imagine it wouldn't be obvious to someone tinkering with the linker script. (See the sketch after this list.)
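
To make the constant-offset requirement concrete, here is a hedged sketch (the symbols are illustrative, not the PR's): every context's image must place each entry point at the same offset within its 1MB page, so that a single fixed virtual address is valid in all modes.

```rust
// Hypothetical symbols for illustration.
extern "C" {
    // Placed by the linker script at the base of the remapped window.
    static __overlay_base: u8;
    // Each context provides its own implementation of this entry point;
    // the linker script must pin all of them to the same offset.
    fn rtio_output(timestamp: i64, channel: i32, addr: i32, data: i32);
}

// Sanity check of the invariant the hand-written overlay has to maintain:
// the entry point lies inside the single remapped megabyte, at an offset
// that is identical across the RTIO, DMA, and batch images.
unsafe fn check_overlay_layout() {
    let base = &__overlay_base as *const u8 as usize;
    let offset = (rtio_output as usize).wrapping_sub(base);
    debug_assert!(offset < (1 << 20), "entry point outside the overlay page");
}
```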

The ugly:

  • The linker didn't want to set the VMA properly, so I separated the memory regions to force it to switch. A minor consequence is that the linker couldn't put other things after the last overlaid function.
  • The batch page is reserved even in CSR, which doesn't support it, but that's better than keeping two separate linker files.
  • Functions must be used somewhere to be kept in the binary, so I added them to `api`. `KEEP(*(...))` did not work - I believe LLVM, while compiling the Rust code, discarded these functions before they ever reached the linker's garbage collection. (A sketch of the problem follows this list.)
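
A hedged sketch of the retention problem; the actual PR keeps the functions alive by listing them in `api`, and the `#[used]` static shown here is merely another way to create the reference LLVM needs to see.

```rust
// Illustrative names only. rustc/LLVM may discard a function that is never
// referenced before the linker ever sees it, in which case KEEP(*(...)) in
// the linker script has nothing left to keep.
#[no_mangle]
#[link_section = ".text.dma_funcs"] // hypothetical section collected by the script
pub extern "C" fn dma_rtio_output(timestamp: i64, channel: i32, addr: i32, data: i32) {
    // ... DMA-recording implementation ...
    let _ = (timestamp, channel, addr, data);
}

// Kept in the emitted object even though nothing calls through it, which in
// turn anchors dma_rtio_output for the linker's garbage collection.
#[used]
static KEEP_DMA_FUNCS: [extern "C" fn(i64, i32, i32, i32); 1] = [dma_rtio_output];
```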

Further improvements (not a part of this PR):

  • Decreasing the page size to 64K to get most of the memory back. This requires setting up an L2 page table in `zynq-rs`.
  • Since swapping extra functions within these contexts is now free, I can also imagine remapping the input functions - which don't make sense in either DMA or batch mode - to something that would just raise an exception (sketched below).
  • Any other ideas what we could do with this? Loading multiple kernels?
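
A purely hypothetical sketch of that second idea - `raise_runtime_error` stands in for whatever error-raising hook the runtime actually exposes:

```rust
// In the DMA and batch pages, the RTIO input entry points could be backed
// by stubs that fail loudly instead of silently misbehaving. All names
// here are hypothetical.
extern "C" {
    fn raise_runtime_error(message: *const u8) -> !;
}

#[no_mangle]
#[link_section = ".text.dma_funcs"] // same hypothetical section as above
pub extern "C" fn dma_rtio_input_data(_channel: i32) -> i32 {
    unsafe { raise_runtime_error(b"RTIO input is not available in DMA mode\0".as_ptr()) }
}
```

Since a switch is now just a page-table write, carrying such stubs in the DMA and batch pages adds no switch-time cost.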
mwojcik added 7 commits 2026-01-15 17:07:16 +08:00
mwojcik (Author/Owner)

The overhead didn't matter that much, as it's a small part of the whole process, but on its own it's a difference of one to two orders of magnitude - especially for batch, since that is a pretty light switch, while DMA has to do some extra core0 comms. (I think it's actually possible to make DMA faster still: with some extra heap space for it, we could manage the data almost fully within core1, with the DDMA break-apart handled on context manager exit.) The benchmark tests added in m-labs/artiq#1736 show a rather stark comparison:

MMU remap:
test_dma_record_overhead ... 3.9295200000000025e-06 s (~3.93 µs)
test_acpki_batch_overhead ... 9.408800000000002e-07 s (~0.94 µs)

Library rebind:
test_dma_record_overhead ... 5.525568000000001e-05 s (~55.3 µs)
test_acpki_batch_overhead ... 2.9776880000000005e-05 s (~29.8 µs)

mwojcik added 1 commit 2026-01-23 11:47:48 +08:00
sb10q merged commit 1c4534637d into master 2026-01-23 18:26:28 +08:00
Reference: M-Labs/artiq-zynq#451