First, we use Arc to store the RPC buffer on core1, so core0 does not have to copy it again before sending it. This gives a ~10% performance improvement for large RPCs; for example, async RPC throughput improved from ~52 MiB/s to ~57 MiB/s. It requires modifying the zynq-rs RAM module to check the pointer range instead of the CPU id when deallocating.
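A rough Rust sketch of both ideas, under stated assumptions (`RpcBuffer`, `core1_prepare`, `core0_send`, and `RamRegion` are illustrative names, not the actual zynq-rs/ARTIQ-Zynq code):

```rust
use std::sync::Arc; // the firmware itself would use alloc::sync::Arc (no_std)

/// RPC buffer shared between cores without an extra copy: core1 fills it,
/// wraps it in an Arc, and hands the Arc to core0 for transmission.
type RpcBuffer = Arc<Vec<u8>>;

fn core1_prepare(payload: &[u8]) -> RpcBuffer {
    // Single copy into the buffer on core1; core0 will read it in place.
    Arc::new(payload.to_vec())
}

fn core0_send(buf: RpcBuffer, send: impl Fn(&[u8])) {
    // No second copy before sending.
    send(buf.as_slice());
    // Dropping `buf` here may free memory that core1's allocator handed out,
    // so deallocation cannot assume "current CPU == allocating CPU".
}

/// Ownership check by address range rather than by CPU id, so either core
/// can return a block to the allocator that actually manages it.
struct RamRegion {
    start: usize, // first managed address
    end: usize,   // one past the last managed address
}

impl RamRegion {
    fn owns(&self, ptr: *const u8) -> bool {
        let addr = ptr as usize;
        self.start <= addr && addr < self.end
    }
}
```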
Second, we poll for a bounded number of trials (100) without a context switch (await) while waiting for the RPC return buffer allocation. This is an attempt to mitigate the performance problem described in https://github.com/m-labs/artiq/issues/1324.
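A minimal sketch of the bounded polling, assuming a generic async context (`POLL_TRIALS`, `poll_then_await`, and `YieldOnce` are illustrative stand-ins, not the actual firmware API):

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Upper bound on busy-poll attempts before giving up the CPU.
const POLL_TRIALS: usize = 100;

/// Minimal stand-in for "yield to the executor once"; the real firmware has
/// its own primitive for this.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut(); // YieldOnce is Unpin
        if this.0 {
            Poll::Ready(())
        } else {
            this.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Call `check` up to POLL_TRIALS times without a context switch; only if the
/// allocation is still not ready, yield (await) and try again.
async fn poll_then_await<T>(mut check: impl FnMut() -> Option<T>) -> T {
    loop {
        for _ in 0..POLL_TRIALS {
            if let Some(value) = check() {
                return value;
            }
        }
        // Still pending after 100 trials: give other tasks a turn.
        YieldOnce(false).await;
    }
}
```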
With the modified test script:
```python
from artiq.experiment import *


class TestListRPC(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def make_data(self) -> TList(TList(TInt32)):
        return [] * 1000

    @kernel
    def run(self):
        t0 = self.core.get_rtio_counter_mu()
        data = self.make_data()
        print(self.core.mu_to_seconds(self.core.get_rtio_counter_mu() - t0))
        print("Got data, first element:", data)
```
The time taken for the make_data call drops from 0.023068 s to 0.009652 s, a >50% performance improvement in this case.