moninj: hold aux mutex for inject for extra time #286

Merged
sb10q merged 1 commits from mwojcik/artiq-zynq:slow_injection into release-7 2024-02-28 18:06:15 +08:00
Owner

A customer has reported that moninj injection fails on their Kasli-SoC (master) -> Kasli (satellite) system. The provided logs showed multiple gateware errors:

[   108.513315s]  WARN(satman): aux packet error (gateware reported error)
[   108.518480s]  WARN(satman): aux packet error (gateware reported error) 

These are usually a sign of packets being sent too quickly.

A closer look at the remote `inject` function shows that the aux mutex is taken, the request is sent, and the mutex is dropped immediately afterwards.

In every other use of the aux protocol, a transaction takes place: the master sends the request, the satellite processes it and returns a response, and only then may another request be sent. That is how this simple protocol is guarded against overrunning the satellites.

Because `inject` does not expect a response, another thread (e.g. a destination status request) can grab the mutex and send a new request too soon, before the (much slower) satellite has finished serving the previous one.

So a delay is added, with the aux mutex still held, to give the satellite some breathing room. This fixed the issue on my setup, which also consists of a Kasli-SoC master and a Kasli 2.0 satellite.
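To illustrate the idea, here is a minimal, self-contained sketch using `std` threads. It is not the firmware code from this PR: `AuxLink`, `SATELLITE_BUSY`, `status_request`, `run`, and the 20 ms figure are all hypothetical, and the satellite is modelled as simply counting an error whenever a packet arrives while it is still busy with the previous one.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Assumed time the satellite needs to handle one packet (illustrative value).
const SATELLITE_BUSY: Duration = Duration::from_millis(20);

/// Hypothetical model of the aux link to one satellite.
struct AuxLink {
    busy_until: Instant,
    errors: u32,
}

impl AuxLink {
    fn new() -> Self {
        AuxLink { busy_until: Instant::now(), errors: 0 }
    }

    /// Send one packet; count an error if the satellite is still busy.
    fn send(&mut self) {
        let now = Instant::now();
        if now < self.busy_until {
            self.errors += 1; // stands in for "aux packet error"
        }
        self.busy_until = now + SATELLITE_BUSY;
    }
}

/// Fire-and-forget inject: no reply is expected, so the lock is kept
/// for `hold` after sending instead of being dropped right away.
fn inject(link: &Mutex<AuxLink>, hold: Duration) {
    let mut aux = link.lock().unwrap();
    aux.send();
    thread::sleep(hold); // the guard is still alive here
}

/// Request/response transaction: waiting for the reply keeps the lock held.
fn status_request(link: &Mutex<AuxLink>) {
    let mut aux = link.lock().unwrap();
    aux.send();
    thread::sleep(SATELLITE_BUSY); // stands in for waiting for the reply
}

/// Run injects and status requests from two threads; return the error count.
fn run(hold: Duration) -> u32 {
    let link = Arc::new(Mutex::new(AuxLink::new()));
    let injector = {
        let link = Arc::clone(&link);
        thread::spawn(move || {
            for _ in 0..5 {
                inject(&link, hold);
            }
        })
    };
    let poller = {
        let link = Arc::clone(&link);
        thread::spawn(move || {
            for _ in 0..5 {
                status_request(&link);
            }
        })
    };
    injector.join().unwrap();
    poller.join().unwrap();
    let errors = link.lock().unwrap().errors;
    errors
}
```

In this model, `run(SATELLITE_BUSY)` reports no errors, because no thread can send until the hold has elapsed, while `run(Duration::ZERO)` mimics the old behaviour and lets packets pile up on the busy satellite.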

The issue affects both Release 7 and the 8 beta. However, release 8 will be getting an aux protocol overhaul that prevents such situations in the first place, so this fix does not need to be added there.

mwojcik added 1 commit 2024-02-28 14:30:48 +08:00
this should prevent gateware errors on satellites
sb10q merged commit 7d8268adf2 into release-7 2024-02-28 18:06:15 +08:00
Reference: M-Labs/artiq-zynq#286