RPC Performance Optimization Take 3 #106

Closed
pca006132 wants to merge 2 commits from pca006132/artiq-zynq:runtime into master
Contributor

First, we use Arc to store the RPC buffer in core1, so core0 won't have to copy it once more before sending it. This gives a ~10% performance improvement for large RPCs; for example, the async throughput improved from ~52MiB/s to ~57MiB/s. This requires modifying the zynq-rs RAM module to check the pointer range instead of the CPU ID when deallocating, because a buffer allocated on core1 may now be freed on core0.
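To illustrate why deallocation has to look at the pointer rather than the current CPU, here is a toy model in Python. It is not zynq-rs code: the address ranges and the names `Heap`, `heap_by_cpu` and `heap_by_pointer` are hypothetical, and each core is assumed to own one contiguous heap region.

```python
from dataclasses import dataclass, field


@dataclass
class Heap:
    """Toy per-core heap covering the address range [start, end)."""
    start: int
    end: int
    live: set = field(default_factory=set)
    next_free: int = 0

    def __post_init__(self):
        self.next_free = self.start

    def alloc(self, size: int) -> int:
        # Simple bump allocation; enough to model block ownership.
        ptr = self.next_free
        if ptr + size > self.end:
            raise MemoryError("heap exhausted")
        self.next_free += size
        self.live.add(ptr)
        return ptr

    def owns(self, ptr: int) -> bool:
        return self.start <= ptr < self.end

    def free(self, ptr: int) -> None:
        if ptr not in self.live:
            raise ValueError(f"{ptr:#x} was not allocated from this heap")
        self.live.remove(ptr)


# Hypothetical layout: index 0 is core0's heap, index 1 is core1's heap.
HEAPS = [Heap(0x1000, 0x2000), Heap(0x2000, 0x3000)]


def heap_by_cpu(cpu_id: int) -> Heap:
    """Old strategy: pick the heap of the core that is running the free."""
    return HEAPS[cpu_id]


def heap_by_pointer(ptr: int) -> Heap:
    """New strategy: pick the heap whose address range contains the pointer."""
    return next(h for h in HEAPS if h.owns(ptr))
```

In this model, a block allocated with `HEAPS[1].alloc(64)` and released from core0 is returned to the right heap by `heap_by_pointer(ptr).free(ptr)`, whereas `heap_by_cpu(0).free(ptr)` raises `ValueError` because core0's heap never handed out that block.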

Second, when waiting for RPC return buffer allocation, we poll a bounded number of times (100 trials) without a context switch (await) before falling back to awaiting. This is an attempt to address the performance problem described in https://github.com/m-labs/artiq/issues/1324.
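The poll-then-await pattern can be sketched in Python with asyncio. This is only an illustration of the general idea, not the runtime code: `recv_with_spin`, `make_try_recv` and `demo` are hypothetical names, and only the limit of 100 comes from the description above. Note that in single-threaded asyncio nothing else runs during the spin, whereas in the runtime the other core can make progress while core0 polls.

```python
import asyncio

MAX_POLLS = 100  # trial count taken from the PR description


async def recv_with_spin(try_recv, recv, max_polls=MAX_POLLS):
    """Poll try_recv() up to max_polls times, then fall back to awaiting recv().

    Returns (item, how), where how records which path delivered the item.
    """
    for _ in range(max_polls):
        item = try_recv()
        if item is not None:
            return item, "polled"  # fast path: no context switch needed
    return await recv(), "awaited"  # slow path: yield to the scheduler


def make_try_recv(queue):
    """Wrap a queue in a non-blocking receive that returns None when empty."""
    def try_recv():
        try:
            return queue.get_nowait()
        except asyncio.QueueEmpty:
            return None
    return try_recv


async def demo():
    queue = asyncio.Queue()

    # Case 1: the message is already there, so polling finds it.
    queue.put_nowait(b"ready")
    fast = await recv_with_spin(make_try_recv(queue), queue.get)

    # Case 2: the message arrives later, so polling gives up and we await.
    asyncio.get_running_loop().call_later(0.01, queue.put_nowait, b"late")
    slow = await recv_with_spin(make_try_recv(queue), queue.get)
    return fast, slow
```

Running `asyncio.run(demo())` takes the polled path for the first message and the awaited path for the second. The bound matters: it caps how long the fast path can spin before the task yields.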

With the modified test script:

from artiq.experiment import *

class TestListRPC(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def make_data(self) -> TList(TList(TInt32)):
        return [[0]] * 1000

    @kernel
    def run(self):
        t0 = self.core.get_rtio_counter_mu()
        data = self.make_data()
        print(self.core.mu_to_seconds(self.core.get_rtio_counter_mu() - t0))
        print("Got data, first element:", data[0])

The time taken for the make_data call drops from 0.023068s to 0.009652s, a reduction of about 58% in this case.
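For reference, the improvement can be checked directly from the two timings quoted above; the variable names are illustrative only.

```python
before_s = 0.023068  # make_data round trip before the change
after_s = 0.009652   # make_data round trip after the change

reduction = 1 - after_s / before_s  # fraction of time saved
speedup = before_s / after_s        # how many times faster

print(f"{reduction:.1%} less time, {speedup:.2f}x faster")
```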


test_performance result over 100 samples:

| Test                 | Mean (MiB/s) | Std (MiB/s) |
| -------------------- | ------------ | ----------- |
| I32 Array (1MB) H2D  | 38.65        | 2.40        |
| I32 Array (1MB) D2H  | 35.82        | 0.66        |
| I32 Array (1KB) H2D  | 4.29         | 0.43        |
| I32 Array (1KB) D2H  | 3.47         | 0.20        |
| Bytes List (1MB) H2D | 32.71        | 3.07        |
| Bytes List (1MB) D2H | 34.25        | 2.97        |
| Bytes List (1KB) H2D | 4.49         | 0.38        |
| Bytes List (1KB) D2H | 3.76         | 0.20        |
| Bytes (1MB) H2D      | 52.09        | 2.59        |
| Bytes (1MB) D2H      | 48.91        | 0.79        |
| Bytes (1KB) H2D      | 4.63         | 0.43        |
| Bytes (1KB) D2H      | 3.74         | 0.29        |
| I32 List (1MB) H2D   | 30.51        | 3.87        |
| I32 List (1MB) D2H   | 28.30        | 2.57        |
| I32 List (1KB) H2D   | 4.69         | 0.56        |
| I32 List (1KB) D2H   | 3.98         | 0.43        |

Async throughput: 56.88MiB/s

pca006132 closed this pull request 2020-09-03 16:58:59 +08:00

Reference: M-Labs/artiq-zynq#106