Soft panic for RTIO PLL reasons #199

Merged
sb10q merged 16 commits from mwojcik/artiq-zynq:pll_error into master 2022-10-21 17:56:34 +08:00

Closes #198 #200

Making it a soft panic makes it more involved with a bit of code duplication - setting up mgmt requires setting up the interface and sockets. Maybe can be done a bit cleaner.

[spaqin@hera:~/m-labs/artiq-zynq]$ artiq_sinara_tester 
****** Sinara system tester ******
[...]
ConnectionRefusedError: [Errno 111] Connection refused

[spaqin@hera:~/m-labs/artiq-zynq]$ artiq_coremgmt -D 192.168.1.56 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005238s]  INFO(runtime): detected gateware: GenericMaster
[     0.016152s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.023004s]  WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly.
[     0.036730s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock
[     0.224443s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.256197s]  INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address
[     0.270183s] ERROR(runtime::comms): There has been an error configuring the device: RTIO PLL failed to lock. Only mgmt interface will be available.
[     4.000095s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    33.148521s]  INFO(runtime::mgmt): received connection
Closes #198 #200 Making it a soft panic makes it more involved with a bit of code duplication - setting up mgmt requires setting up the interface and sockets. Maybe can be done a bit cleaner. ``` [spaqin@hera:~/m-labs/artiq-zynq]$ artiq_sinara_tester ****** Sinara system tester ****** [...] ConnectionRefusedError: [Errno 111] Connection refused [spaqin@hera:~/m-labs/artiq-zynq]$ artiq_coremgmt -D 192.168.1.56 log [ 0.000067s] INFO(runtime): NAR3/Zynq7000 starting... [ 0.005238s] INFO(runtime): detected gateware: GenericMaster [ 0.016152s] INFO(libboard_zynq::i2c): PCA9548 detected [ 0.023004s] WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly. [ 0.036730s] WARN(runtime::rtio_clocking): error reading configuration. Falling back to default. [ 0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock [ 0.224443s] INFO(libboard_zynq::i2c): PCA9548 detected [ 0.256197s] INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address [ 0.270183s] ERROR(runtime::comms): There has been an error configuring the device: RTIO PLL failed to lock. Only mgmt interface will be available. [ 4.000095s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full } [ 33.148521s] INFO(runtime::mgmt): received connection ```

Isn't there a way to put this into the actual panic handler? (With correct handling of recursive panic, i.e. fall back on a simple panic that writes an error message to the UART and stops the CPU)

It would be a lot more general and useful then.

Isn't there a way to put this into the actual panic handler? (With correct handling of recursive panic, i.e. fall back on a simple panic that writes an error message to the UART and stops the CPU) It would be a lot more general and useful then.
Poster
Owner

Hm, doesn't seem like multiple layers of panicking is supported in Rust.

At best probably the soft_panic function could be separated further - that is, not require anything besides maybe error message - it get the config, timer itself. That could be implemented within Ok/Err checks manually, or with a checking macro to keep things simpler (but not as simple as )?;). There's also the issue of checking if mgmt was started yet or not and acting accordingly.

Hm, doesn't seem like multiple layers of panicking is supported in Rust. At best probably the soft_panic function could be separated further - that is, not require anything besides maybe error message - it get the config, timer itself. That could be implemented within Ok/Err checks manually, or with a checking macro to keep things simpler (but not as simple as ``)?;``). There's also the issue of checking if mgmt was started yet or not and acting accordingly.

Hm, doesn't seem like multiple layers of panicking is supported in Rust.

What happens if you panic in the panic handler?
Can recursive panic be detected with something as simple as static mut HAS_PANICKED = false; ... ?

> Hm, doesn't seem like multiple layers of panicking is supported in Rust. What happens if you panic in the panic handler? Can recursive panic be detected with something as simple as `static mut HAS_PANICKED = false; ...` ?
Poster
Owner

...yeah of course, runtime has its own panic handler too (https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master/src/runtime/src/panic.rs) and it already implements nested panic protection (in the same way you mentioned). I can piggyback on that, extend its behavior, but I found a rather concerning issue, it itself causes another panic before printing "End backtrace". I'll look into that a bit deeper.

...yeah of course, runtime has its own panic handler too (https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master/src/runtime/src/panic.rs) and it already implements nested panic protection (in the same way you mentioned). I can piggyback on that, extend its behavior, but I found a rather concerning issue, it itself causes another panic before printing "End backtrace". I'll look into that a bit deeper.
Poster
Owner

Now it's mixed in panic handler code.

coremgt:

[spaqin@hera:~/m-labs/artiq-zynq]$ artiq_coremgmt -D 192.168.1.56 log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005241s]  INFO(runtime): detected gateware: GenericMaster
[     0.016161s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.023004s]  WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly.
[     0.036724s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock
[     0.227672s] ERROR(runtime::panic): panic at runtime/src/main.rs:125:43
[     0.234268s] ERROR(runtime::panic): panic message: Could not set up RTIO PLL: "RTIO PLL failed to lock"
[     0.007193s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.038934s]  INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address
[     4.000094s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    70.272436s]  INFO(runtime::mgmt): received connection

excerpt from UART:

[     0.023004s]  WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly.
[     0.036724s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock
Core 0 panic at runtime/src/main.rs:125:43: Could not set up RTIO PLL: "RTIO PLL failed to lock"
[     0.227672s] ERROR(runtime::panic): panic at runtime/src/main.rs:125:43
[     0.234268s] ERROR(runtime::panic): panic message: Could not set up RTIO PLL: "RTIO PLL failed to lock"
[     0.007193s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.038934s]  INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address
[     4.000094s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    70.272436s]  INFO(runtime::mgmt): received connection

No backtrace (as mentioned in #200 it causes a nested panic). However if the soft panic panics, backtrace would be ran (and possibly cause a nested panic again).

I'm thinking I also could bring back the previous rtio_clocking, with one adjustment of panicking when RTIO PLL doesn't lock.

Now it's mixed in panic handler code. coremgt: ``` [spaqin@hera:~/m-labs/artiq-zynq]$ artiq_coremgmt -D 192.168.1.56 log [ 0.000067s] INFO(runtime): NAR3/Zynq7000 starting... [ 0.005241s] INFO(runtime): detected gateware: GenericMaster [ 0.016161s] INFO(libboard_zynq::i2c): PCA9548 detected [ 0.023004s] WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly. [ 0.036724s] WARN(runtime::rtio_clocking): error reading configuration. Falling back to default. [ 0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock [ 0.227672s] ERROR(runtime::panic): panic at runtime/src/main.rs:125:43 [ 0.234268s] ERROR(runtime::panic): panic message: Could not set up RTIO PLL: "RTIO PLL failed to lock" [ 0.007193s] INFO(libboard_zynq::i2c): PCA9548 detected [ 0.038934s] INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address [ 4.000094s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full } [ 70.272436s] INFO(runtime::mgmt): received connection ``` excerpt from UART: ``` [ 0.023004s] WARN(runtime): config initialization failed: SD error: Card initialization error: No card inserted, check if the card is inserted properly. [ 0.036724s] WARN(runtime::rtio_clocking): error reading configuration. Falling back to default. [ 0.213000s] ERROR(runtime::rtio_clocking): RTIO PLL failed to lock Core 0 panic at runtime/src/main.rs:125:43: Could not set up RTIO PLL: "RTIO PLL failed to lock" [ 0.227672s] ERROR(runtime::panic): panic at runtime/src/main.rs:125:43 [ 0.234268s] ERROR(runtime::panic): panic message: Could not set up RTIO PLL: "RTIO PLL failed to lock" [ 0.007193s] INFO(libboard_zynq::i2c): PCA9548 detected [ 0.038934s] INFO(runtime::comms): network addresses: MAC=e8-eb-1b-13-49-8b IPv4=192.168.1.56 IPv6-LL=fe80::eaeb:1bff:fe13:498b IPv6: no configured address [ 4.000094s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full } [ 70.272436s] INFO(runtime::mgmt): received connection ``` No backtrace (as mentioned in #200 it causes a nested panic). However if the soft panic panics, backtrace would be ran (and possibly cause a nested panic again). I'm thinking I also could bring back the previous rtio_clocking, with one adjustment of panicking when RTIO PLL doesn't lock.
sb10q reviewed 2022-10-05 17:02:50 +08:00
@ -493,0 +542,4 @@
mgmt::start(cfg);
//getting eth settings disables the LED - need enable it here

Why does it disable the LED?

Why does it disable the LED?
Poster
Owner

getting eth settings sets up I2C to communicate with the ethernet controller;

I2C setup also resets GPIO:
https://git.m-labs.hk/M-Labs/zynq-rs/src/branch/master/libboard_zynq/src/i2c/mod.rs#L57

and I2C GPIO (MIO 33, 50, 51) is in the same bank as the error led (MIO 37); so resetting the entire port again causes the LED to go off.

getting eth settings sets up I2C to communicate with the ethernet controller; I2C setup also resets GPIO: https://git.m-labs.hk/M-Labs/zynq-rs/src/branch/master/libboard_zynq/src/i2c/mod.rs#L57 and I2C GPIO (MIO 33, 50, 51) is in the same bank as the error led (MIO 37); so resetting the entire port again causes the LED to go off.

Okay I see. Maybe make the comment a bit more clear (e.g. "we reinitialized the GPIO controller, so only now can we turn on the error LED")
Also, the non-mgmt panic handler should also turn on the LED to make sure.

Okay I see. Maybe make the comment a bit more clear (e.g. "we reinitialized the GPIO controller, so only now can we turn on the error LED") Also, the non-mgmt panic handler should also turn on the LED to make sure.

No backtrace (as mentioned in #200 it causes a nested panic).

I don't understand why getting the backtrace would cause another panic - could you elaborate?

> No backtrace (as mentioned in #200 it causes a nested panic). I don't understand why getting the backtrace would cause another panic - could you elaborate?
sb10q reviewed 2022-10-05 17:21:33 +08:00
@ -84,6 +86,7 @@ enum Reply {
static CACHE_STORE: Mutex<BTreeMap<String, Vec<i32>>> = Mutex::new(BTreeMap::new());
static DMA_RECORD_STORE: Mutex<BTreeMap<String, (Vec<u8>, i64)>> = Mutex::new(BTreeMap::new());
static mut MGMT_STARTED: bool = false;

Where is this used?

Where is this used?
Poster
Owner

No backtrace (as mentioned in #200 it causes a nested panic).

I don't understand why getting the backtrace would cause another panic - could you elaborate?

Backtrace calls libunwind and it arrives here: https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master/src/runtime/src/kernel/core1.rs#L227

If there's no kernel loaded (and the kernel is not the cause of panic), it panics here again.

> > No backtrace (as mentioned in #200 it causes a nested panic). > > I don't understand why getting the backtrace would cause another panic - could you elaborate? Backtrace calls libunwind and it arrives here: https://git.m-labs.hk/M-Labs/artiq-zynq/src/branch/master/src/runtime/src/kernel/core1.rs#L227 If there's no kernel loaded (and the kernel is not the cause of panic), it panics here again.
sb10q reviewed 2022-10-05 17:26:09 +08:00
@ -39,1 +56,4 @@
}
fn soft_panic(info: &core::panic::PanicInfo) -> ! {
// log panic info to log (prints not visible in coremgmt logs)

Had to read that twice...

Maybe reformulate along the lines:
"Write panic info to the log, so that coremgmt can see it"

Had to read that twice... Maybe reformulate along the lines: "Write panic info to the log, so that coremgmt can see it"

If there's no kernel loaded (and the kernel is not the cause of panic), it panics here again.

Why would it take the second branch of the if in this case? It should only unwind in the text section, so the condition &__text_start as *const u32 <= pc && pc < &__text_end as *const u32 should be true, isn't it?

> If there's no kernel loaded (and the kernel is not the cause of panic), it panics here again. Why would it take the second branch of the ``if`` in this case? It should only unwind in the ``text`` section, so the condition `` &__text_start as *const u32 <= pc && pc < &__text_end as *const u32`` should be true, isn't it?
mwojcik force-pushed pll_error from 7323c1508a to c6d173a5c5 2022-10-21 17:20:53 +08:00 Compare
sb10q merged commit c834e4f503 into master 2022-10-21 17:56:34 +08:00
Sign in to join this conversation.
No reviewers
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: M-Labs/artiq-zynq#199
There is no content yet.