Moninj connection causes core panic on zc706 #189

Closed
opened 2022-05-13 02:36:53 +08:00 by ljstephenson · 8 comments

Using zc706 DRTIO master gateware https://nixbld.m-labs.hk/build/121755, the moninj proxy causes a core panic:

[  2250.420178s]  INFO(runtime::moninj): received connection
Core 0 panic at /build/cargo-vendor-dir/libasync-0.0.0/src/executor.rs:67:17: nested `block_on`
Backtrace:
0x15309c 0x159644 0x1595a8 0x113be0 0x14c984 0x14db14 0x12406c 0x116c7c 0x113c20 0x113c54 0x13f130 0x155034 0x18f3b4 0x18f368 0x18f1dc 0x18f110 Core 0 nested panic!

The panic occurs when a dashboard tries to connect through the proxy; the log message `[ 2250.420178s] INFO(runtime::moninj): received connection` appears when the proxy itself connects and does not directly result in the crash.

Hacking the device_db to bypass the proxy also results in the same crash:

    "core_moninj": {
        "type": "controller",
        "host": core_addr
    }

(Is there a canonical way to *not* use the proxy? We definitely want to use it in general, but bypassing it may be useful for testing purposes.)

I haven't tested with non-DRTIO gateware yet; I'll update once I do.

ARTIQ version: ARTIQ v7.8044.c440f9f.beta
Conda list attached

Author

I don't appear to have the same issue with https://nixbld.m-labs.hk/build/121703, so this seems to be DRTIO gateware related.

mwojcik self-assigned this 2022-05-13 10:09:25 +08:00
Owner

I can't seem to reproduce the issue (although I have a hunch that I may have fixed it with e34381094b). I think it may have been a problem when moninj tried to monitor something on the satellite. Do you have any satellite data in the device_db? Could you provide me with the device_db you've been using?

Author

This happens only with satellite data in the device_db, good catch; I've attached a minimal device_db that still elicits a crash, regardless of whether the satellite is connected.
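
Roughly, the satellite part of it boils down to a single channel that lives on the satellite, along these lines (a simplified sketch with placeholder values, not the attached file; satellite channels are offset by destination × 0x10000):

```
# Simplified sketch (placeholder values): core device plus one TTL on a DRTIO
# satellite. Channels on destination n start at n * 0x10000, so 0x010000 is
# channel 0 on the first satellite.
core_addr = "192.168.1.70"

device_db = {
    "core": {
        "type": "local",
        "module": "artiq.coredevice.core",
        "class": "Core",
        "arguments": {"host": core_addr, "ref_period": 1e-9, "target": "cortexa9"},
    },
    "ttl_sat0": {
        "type": "local",
        "module": "artiq.coredevice.ttl",
        "class": "TTLOut",
        "arguments": {"channel": 0x010000},
    },
}
```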

I'll try the latest build and report on that.

Author

Built from e34381094b to test; at the moment I have a zc706 test setup with no satellite, and our live system with a single satellite. I'm working on having a satellite for the test setup to avoid disrupting the main experiment for debugging, although I can use the live setup for today.

On the test setup, there was no immediate crash, so that's promising. I expected some errors due to not having a satellite connected, so I wasn't surprised to see:

[    75.417148s] ERROR(runtime::moninj::remote_moninj): aux packet error (link went down)

What I *was* surprised by was the frequency of these messages: every 8 ms or so. Is there a way of rate-limiting identical error messages? (Is this even a sensible suggestion?)
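
What I mean is roughly the following: remember when each distinct message was last emitted and drop repeats within a window. This is only the concept, sketched in Python for brevity (the runtime is Rust, and the names here are made up), not a proposed patch:

```
import time

class RateLimitedLog:
    """Toy sketch: suppress repeats of an identical message within a window."""

    def __init__(self, window_s=5.0):
        self.window_s = window_s
        self.last_emit = {}    # message -> time of last emission
        self.suppressed = {}   # message -> count of dropped repeats

    def error(self, message):
        now = time.monotonic()
        last = self.last_emit.get(message)
        if last is not None and now - last < self.window_s:
            self.suppressed[message] = self.suppressed.get(message, 0) + 1
            return
        dropped = self.suppressed.pop(message, 0)
        suffix = " ({} identical messages suppressed)".format(dropped) if dropped else ""
        print("ERROR: {}{}".format(message, suffix))
        self.last_emit[message] = now

log = RateLimitedLog()
for _ in range(300):                                  # ~2.4 s of simulated spam
    log.error("aux packet error (link went down)")    # printed once per window
    time.sleep(0.008)
```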

On the live setup: moninj worked initially (I was able to override a TTL successfully), but the core device rapidly became unpingable. This is definitely due to moninj in some way: I restarted the zc706 and set a continuous ping from my machine to the zc706, which succeeded for many minutes. The ping continued to succeed after submitting experiments with `artiq_run`, and then with a master connected (but no dashboard). ~10-30s after starting the dashboard, the zc706 was no longer pingable.

Owner

Hey, sorry about the delay.

I looked into this and the 8 ms logs came from a faulty timeout calculation, which should be fixed by b7afd149c4. I didn't have much time to test it with a live system with a satellite, though, so I would appreciate it if you tested it for me.

I also got an unresponsive/unpingable device, until moninj caused an illegal operation within the TCP stack, which *finally* made it stop.

The broken timeout would also cause the moninj internal loop to crowd out all other tasks. I'm not sure why this didn't happen on a standalone/DRTIO-less system, as the code is mostly the same.

If that works I think we'll just merge it into master?

Author

I tested 1f95b597 on both a live system and another with no satellite; I'm still getting the same behaviour as before (live system: unpingable; no satellite: 8 ms error messages).

Owner

Okay, now I have a pretty decent idea of what's going on, at least with the satellite disconnected.

Opening the dashboard causes a lot of moninj requests, and a lot of them are aimed at the satellite. The timeout between checks is indeed kept at 200 ms now, but with the sheer number of monitored channels, requests (and the resulting error messages) end up being sent almost all the time, clogging the CPU. That log message is around 90 characters long, and pushing it through serial at 115200 bps takes at least 6.25 ms by back-of-the-envelope math. With async/await it's not really multithreading, after all...
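
Checking that estimate numerically (payload bits only; with 8N1 framing, i.e. one start and one stop bit per character, it comes out closer to 7.8 ms):

```
# Back-of-the-envelope check of the serial throughput figure above.
message_chars = 90      # approximate length of one log line
baud = 115_200          # UART speed in bits per second

payload_ms = message_chars * 8 / baud * 1e3
framed_ms = message_chars * 10 / baud * 1e3   # 1 start + 8 data + 1 stop bit

print("payload only: {:.2f} ms, with 8N1 framing: {:.2f} ms".format(payload_ms, framed_ms))
# payload only: 6.25 ms, with 8N1 framing: 7.81 ms
```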

If that message gets caught and ignored (as I did in a51bfc49a0), there's no problem with getting the ping. Frankly, I'm not sure whether such an error should even be displayed: the user should already be aware that the satellite is not connected, and there's certainly no need for moninj to keep reminding them of it.

I have not yet checked it with a satellite connected (I've been trying to get a setup with a kc705 working, and will probably use a kasli instead), but I wager the response times from the satellite may also be at play.

Owner

How about still logging, but at the `debug!` level instead?

sb10q closed this issue 2022-05-25 12:17:24 +08:00