Moninj connection causes core panic on zc706 #189
Labels
No Milestone
No Assignees
3 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: M-Labs/artiq-zynq#189
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Using zc706 DRTIO master gateware https://nixbld.m-labs.hk/build/121755, the moninj proxy causes a core panic:
The panic occurs when a dashboard is trying to connect through the proxy; the log message
[ 2250.420178s] INFO(runtime::moninj): received connection
happens when the proxy connects and does not directly result in the crash.Hacking the device_db to bypass the proxy also results in the same crash:
(Is there a canonical way to not use the proxy? We definitely want to use it in general but may be useful for testing purposes.)
Haven't tested with non-DRTIO gateware yet, will update once I do.
ARTIQ version: ARTIQ v7.8044.c440f9f.beta
Conda list attached
I don't appear to have the same issue with https://nixbld.m-labs.hk/build/121703, so this seems to be DRTIO gateware related.
I can't seem to reproduce the issue (although I have a hunch that I may have fixed it with
e34381094b
). I think it may have been a problem when moninj tried to monitor something on the satellite. Do you have any satellite data in the device_db? Could you provide me with the device_db you've been using?This happens only with satellite data in the device_db, good catch; I've attached a minimal device_db that still elicits a crash, regardless of whether the satellite is connected.
I'll try the latest build and report on that.
Built from e34381094b to test; at the moment I have a zc706 test setup with no satellite, and our live system with a single satellite. I'm working on having a satellite for the test setup to avoid disrupting the main experiment for debugging, although I can use the live setup for today.
On the test setup, there was no immediate crash, so that's promising. I expected some errors due to not having a satellite connected, so wasn't surprised to see:
What I was surprised by was the frequency of these messages - every 8ms or so. Is there a way of rate limiting identical error messages? (Is this even a sensible suggestion?)
On the live setup: worked initially (was able to override a TTL successfully) but the core device rapidly becomes unpingable. This is definitely due to moninj in some way; I restarted the zc706 and set a continuous ping from my machine to the zc706, which succeeded for many minutes. The ping continued to succeed after submitting experiments with
artiq_run
, and then with a master connected (but no dashboard). ~10-30s after starting the dashboard the zc706 was not pingable.Hey, sorry about the delay.
I looked into this and the 8ms logs came from faulty timeout calculation which should be fixed by
b7afd149c4
- I didn't have much time to test it out though with a live system with a satellite, and I would appreciate if you tested it out for me.I also got unresponsive/unpingable device until the moninj caused an illegal operation within the TCP stack, causing it to finally stop.
The broken timeout also would cause moninj internal loop to overtake any other tasks. Not sure why this didn't happen on standalone/drtio-less system as the code is mostly the same.
If that works I think we'll just merge it into master?
Tested 1f95b597 on both a live system and another with no satellite, I'm still getting the same behaviour as before (live system: unpingable, no satellite: 8ms error messages)
Okay, now I have a pretty decent idea of what's going on, at least with the satellite disconnected.
Opening the dashboard causes a lot of moninj requests, and a lot of them aimed towards the satellite. The timeout between checks is indeed kept at 200ms now, but the sheer number of monitored channels causes it to be sent almost all the time, clogging the CPU. That log message is around 90 characters long, and for it to be pushed through serial at 115200bps speed it would take at least 6.25ms - with back of the envelope math. With async/await it's not really multithreading after all...
If that message gets caught and ignored (like I did with
a51bfc49a0
), there's no problem with getting the ping. Frankly I'm not sure if such error should be even displayed - the user should be aware that the satellite is not connected and there's certainly no need for moninj to remind him of it.I have not yet checked it with satellite connected (been trying to get a setup with kc705 working, will probably use a kasli instead) - but I wager the response times from the satellite may be also at play.
How about still logging but at the
debug!
level instead?