moninj: hold aux mutex for inject for extra time #286
Reference: M-Labs/artiq-zynq#286
A customer has reported that moninj injection fails on their Kasli-SoC (master) -> Kasli (satellite) system. Provided logs showed multiple gateware errors:
These are usually a sign of packets being sent too quickly.
A closer look at the remote `inject` function shows that the aux mutex is taken, the request is sent, and the mutex is dropped immediately afterwards.

In every other use of the aux protocol, a transaction takes place: the master sends a request, the satellite mulls over it and returns a response, and only then may another request be sent. That's how the simple protocol is guarded against overrunning the satellites.
Since `inject` doesn't expect a response, another thread (e.g. a destination status request) can grab the mutex and send a new request too soon, before the (much slower) satellite has managed to serve the previous one.

So a delay is added to give the satellite some breathing room. This fixed the issue on my setup, which also consists of a Kasli-SoC master and a Kasli 2.0 satellite.
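To illustrate the idea only, here is a minimal host-side sketch using `std` threads and `std::sync::Mutex`. It is not the firmware code: `AuxLink`, `INJECT_HOLD`, `destination_status` and `gap_after_inject` are hypothetical names, and the 10 ms hold time is an assumed value chosen for the example.

```rust
use std::sync::Mutex;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the aux link: it only logs when packets are sent.
struct AuxLink {
    sent: Vec<(&'static str, Instant)>,
}

/// Assumed breathing room for the satellite (illustrative value only).
const INJECT_HOLD: Duration = Duration::from_millis(10);

/// Fire-and-forget request: no response is awaited, so the mutex is held
/// for a fixed extra time before anyone else may transmit.
fn inject(link: &Mutex<AuxLink>) {
    let mut aux = link.lock().unwrap();
    aux.sent.push(("inject", Instant::now()));
    thread::sleep(INJECT_HOLD); // the guard is still alive during the sleep
} // guard dropped here: the mutex is released only after the delay

/// Any other request has to acquire the same mutex before sending.
fn destination_status(link: &Mutex<AuxLink>) {
    let mut aux = link.lock().unwrap();
    aux.sent.push(("status", Instant::now()));
    // a real transaction would wait for the satellite's response here
}

/// Sends an inject followed by a status request and returns the time
/// between the two packets.
fn gap_after_inject() -> Duration {
    let link = Mutex::new(AuxLink { sent: Vec::new() });
    inject(&link);
    destination_status(&link);
    let aux = link.lock().unwrap();
    aux.sent[1].1.duration_since(aux.sent[0].1)
}

fn main() {
    let gap = gap_after_inject();
    assert!(gap >= INJECT_HOLD);
    println!("gap after inject: {:?}", gap);
}
```

Because the guard outlives the sleep, any other thread calling `lock()` on the same mutex would block for the remainder of the hold time, so the next packet cannot follow the inject sooner than `INJECT_HOLD`.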
The issue affects both Release 7 and the 8 beta. However, 8 will be getting an aux protocol overhaul that prevents such situations in the first place, so this fix doesn't need to be added there.