This commit moves DMA serialization code to the kernel CPU
(to cope with the existence of rtio_output_wide) and batches
the resulting sequences. This results in less data being transferred
between kernel and comms CPUs (24 octets with one pointer before,
18 octets with no pointers now, for the common case of rtio_output),
but most importantly reduces cache flushes, which now happen
once per 64k octets.
On average, it now takes about 15us to record a single RTIO event
in a DMA trace.
Fixes#712.
Before this commit, proto::io::{read,write}_* functions were taking
a &mut {Read,Write}. This means a lot of virtual dispatch.
After this commit, all these functions are specialized for
the specific IO trait.
This could be achieved with just changing the signature from
fn read_x(reader: &mut Read)
to
fn read_x<R: Read>(reader: &mut R)
but the functions were also grouped into ReadExt and WriteExt
traits as a refactoring.
Initially, it was expected that the generic traits from
the byteorder crate could be used, but they require endianness
to be specified on every call and thus aren't very ergonomic.
They also lack the equivalent to our read_string and read_bytes.
Thus, it seems fine to just define a slightly different extension
trait.
This also optimized the test_rpc_timing test: 1.7ms→1.2ms.