satellite: SYS CLK did not switch #246

Closed
opened 2023-08-09 16:57:57 +08:00 by esavkin · 23 comments
Owner

In about 70% of attempts, the Kasli-SoC fails to start with the error:

[     0.600298s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.553143s]  INFO(libboard_artiq::si5324):   ...locked
[     2.658356s]  INFO(satman): Switching SYS clocks...
Core 0 panic at satman/src/main.rs:641:9: SYS CLK did not switch

The master firmware works fine, but the satellite fails. Sometimes it works better after power-cycling.
I tried the latest master, 5e6dca61a9 and 165b1400ab.
master and 165b1400ab are broken, while 5e6dca61a9 works fine.

Owner

63594d7e3d maybe?

Author
Owner

> 63594d7e3d maybe?

well, it just doesn't build

Owner

How does changing a few IBUFDS_GTE2 parameters break the build? What is the error message?

Author
Owner

> How does changing a few IBUFDS_GTE2 parameters break the build? What is the error message?

ERROR: [DRC PDRC-155] CLKSWING_CFG: Invalid IBUFDS_GTE2 configuration: Cell IBUFDS_GTE2 CLKSWING_CFG setting of 2'b01 differs from IBUFDS_GTE2_1 setting of 2'b11.  They should have the same setting.

That's probably because the dependencies were not updated at that commit.

Owner

Related PR

I only tried network boot when I worked on this PR (https://github.com/m-labs/artiq/pull/2128), which was supposed to resolve this "SYS CLK did not switch" problem. SD card boot was not tested for that PR.

Retesting my code now, the problem does occur with SD card boot, while booting over the network does not show it.

Possible Solution

The "SYS CLK did not switch" problem goes away when I increase the delay value from 20_000 to 50_000.
The current line in satman/main.rs is:

timer.delay_us(20_000); // wait for CPLL/QPLL/MMCM lock

With the longer delay, SD card boot succeeds with the satellite inserted:

[     0.849714s]  INFO(szl): Preparing for runtime execution
[     0.855346s]  INFO(szl): executing payload
[     0.000067s]  INFO(satman): ARTIQ satellite manager starting...
[     0.005949s]  INFO(satman): gateware ident satellite
[     0.016164s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.599962s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.531034s]  INFO(libboard_artiq::si5324):   ...locked
[     2.636248s]  INFO(satman): Switching SYS clocks...
[     2.691110s]  INFO(satman): SYS CLK switched successfully
[    21.658827s]  INFO(satman): uplink is up, switching to recovered clock
[    21.704868s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[    23.427780s]  INFO(libboard_artiq::si5324):   ...locked
[    28.628630s]  INFO(libboard_artiq::si5324::siphaser): calibration successful, lead: 708, width: 263 (211deg)

I will test it further on the master branch or other commit.

Code Revision Tested

The code was tested on this combination of commits:
artiq: 7791f85
artiq-zynq: ee438105b2

Owner

@linuswck good find!

> The CLK does not switch problem goes away when I increase the delay value to 50_000.

There should be a theoretical minimum value that can be computed from the FPGA datasheets. What is it?

Owner

I do not think the minimum value can be computed.

Firstly, there is no specification of the lock time for the CPLL inside GTXE2_Channel.
(In UG476, page 64, the "CPLL Reset" section does not give a CPLL lock time calculation.)

Secondly, I think BruteforceClockAligner may make the minimum required setup time non-deterministic, as the SYS clock is considered switched only after the GTX transceiver has finished initializing.

Trace of code:

satman/main.rs:

let clk = unsafe { csr::sys_crg::current_clock_read() };

artiq/artiq/gateware/targets/kasli_generic.py

self.crg.configure(txout_buf, clk_sw=gtp.tx_init.done)

artiq/gateware/drtio/transceiver/gtx_7series_init.py

startup_fsm.act("READY",
    Xxuserrdy.eq(1),
    self.done.eq(1),
    If(self.restart, NextState("RESET_GTX"))
)
Owner

> Firstly, the CPLL inside GTXE2_Channel does not have specification on the CPLL lock time.

Surely, there must be some information about it somewhere?

> Secondly, I think BruteforceClockAligner may make the minimum setup time required not deterministic as Sys Clock is considered to be switched after the GTX transceiver finished initializing.

It is not involved here. It only resets the RX CDR. The system clock is from the TX system.

Owner

> I will test it further on the master branch or other commit.

I can confirm that the fix works on the latest master branch commit.
artiq-zynq: 583b629b40

Owner

A question I would rather have is why would the lock time differ between SD boot and netboot? It could be useful to know of any issues where things work in a developer (netboot) but suddenly break in production (SD) settings.

And why did it show in satellite? Is master affected as well?

Owner

Possibly temperature effects if the current value is marginal.

Owner

And/or slightly different execution times due to the state of the caches and SDRAM.

Owner

> Surely, there must be some information about it somewhere?

The maximum PLL lock time is 1ms; this is given in the AC/DC Switching Characteristics datasheet (DS187) on page 63.

The total minimum reset time also depends on the GTX reset time.

GTX Reset Flow ( ----> )

| PLL Reset | TXPMARESET | TXPCSRESET |
| -------- | -------- | -------- |
| 1ms | TXPMARESET_TIME | TXPCSRESET_TIME |

However, TXPMARESET_TIME and TXPCSRESET_TIME are reserved values (see UG476 page 64 for the parameter descriptions and page 69 for the timing diagram involving these two parameters).

I cannot find any useful information on TXPMARESET_TIME and TXPCSRESET_TIME.

So I think there is no way to compute the minimum reset duration required?

Owner

I'm pretty sure there is, it's just not obvious. Did you check the unisim model code?

Owner

We can't spend too much time on this when there are more urgent matters (e.g. shuttler). Let's commit the change now; we can close this issue after someone figures out the theoretical value.

Owner

> I'm pretty sure there is, it's just not obvious. Did you check the unisim model code?

I cannot find any useful information in the Xilinx unisim library.

I then looked at how Xilinx writes the TX init FSM for the GTX transceiver in their Vivado example design
(taken from gtwizard_0_tx_startup_fsm.v):

| State | Maximum Timeout Value |
| -------- | -------- |
| WAIT_FOR_PLL_LOCK | 500ns |
| RELEASE_PLL_RESET | 2ms x MAX_RETRIES |
| WAIT_FOR_TXOUTCLK | 500ns |
| RELEASE_MMCM_RESET | 100us x MAX_RETRIES |
| WAIT_FOR_TXUSRCLK | 500ns |
| WAIT_RESET_DONE | 500us x MAX_RETRIES |
| DO_PHASE_ALIGNMENT | 45824 TXUSRCLK cycles (366.592us) x MAX_RETRIES |

Note: All states share the same MAX_RETRIES register. Once MAX_RETRIES has reached 255, the whole reset sequence restarts again. TXUSRCLK = 125MHz

I think the MAX_RETRIES value they declared is arbitrary.
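
For reference, the arithmetic behind the table can be sketched in a few lines of Rust. This is only an illustration built from the numbers quoted above: the function names are made up here, and "single pass" assumes that every state runs to its timeout exactly once with no retries, which is an interpretation of the table rather than something taken from the Xilinx source.

```rust
/// TXUSRCLK frequency quoted in the note above (125 MHz).
pub const TXUSRCLK_HZ: u64 = 125_000_000;

/// Convert a number of clock cycles into nanoseconds (integer arithmetic).
pub fn cycles_to_ns(cycles: u64, clk_hz: u64) -> u64 {
    cycles * 1_000_000_000 / clk_hz
}

/// Sum of the per-state timeouts from the table, in nanoseconds.
/// Assumption: each state times out exactly once (MAX_RETRIES is ignored).
pub fn single_pass_timeout_ns() -> u64 {
    let states: [(&str, u64); 7] = [
        ("WAIT_FOR_PLL_LOCK", 500),
        ("RELEASE_PLL_RESET", 2_000_000),
        ("WAIT_FOR_TXOUTCLK", 500),
        ("RELEASE_MMCM_RESET", 100_000),
        ("WAIT_FOR_TXUSRCLK", 500),
        ("WAIT_RESET_DONE", 500_000),
        ("DO_PHASE_ALIGNMENT", cycles_to_ns(45_824, TXUSRCLK_HZ)),
    ];
    states.iter().map(|&(_, ns)| ns).sum()
}
```

Under these assumptions, 45824 cycles at 125MHz comes out as 366592ns, matching the table, and one pass through all states adds up to 2968092ns, i.e. a little under 3ms.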


I'm running into this same issue on ZC706 nist qc2 master variant, attempting to boot from SD with an external clock.

The most recent build 151734 does not boot.

The first failing build is 142790 (46b2687d70). The previous build 142703 (b85c870b82) boots just fine, so I know the clock signal is good.

Interestingly the qc2 standalone gateware (e.g. build 142849) is also fine, even though the DRTIO master from that evaluation is not.

Possibly the same change of waiting longer for a PLL lock is necessary for the master as well?


@sb10q I've confirmed (unsurprisingly) that this affects Kasli-SoC DRTIO masters also; the same commit introduces the breakage. This is with the external clock running at 100 MHz.

Just to push this home: no one using external clocking and DRTIO has been able to use the latest gateware since February. I did mean to raise this earlier but things got in the way.

Side note - I'm not sure what hardware unit testing gets done in general, but booting with an external clock should probably be part of the test suite.

Owner

I just checked with a 1.1.1 Kasli-SoC on newest master:

[     0.005237s]  INFO(runtime): gateware ident: master
[     0.015289s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.173596s]  INFO(runtime::rtio_clocking): using 100MHz reference to make 125MHz RTIO clock with PLL
[     0.546730s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     2.851048s]  INFO(libboard_artiq::si5324):   ...locked
[     2.876995s]  INFO(runtime::rtio_clocking): SYS CLK switched successfully
[     2.888969s]  INFO(libboard_zynq::i2c): PCA9548 detected

I even made the part of the clock switch wait/check into a loop instead of a straight up panic so I could verify how long it takes to lock... but it did it within the first iteration anyway, about 10ms.

With older (1.0) Kasli-SoCs I would be concerned about capacitors, that seemed to affect the DRTIO, see: https://github.com/sinara-hw/Kasli-SOC/issues/68

No idea about ZC706 though.

However I understand that with some slight hardware differences it may take longer for the clock to switch, so let's bring master's waiting time up to the same level as satellite's, which seems to have helped in that case.
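
A polling loop of the kind described above could look roughly like the sketch below. It is not the code that was actually used for the test: `wait_for_clock_switch`, `SwitchResult` and the closure parameters are hypothetical names, the clock read and the delay are passed in as closures so the snippet does not depend on the firmware crates, and it assumes that a readback value of 1 from `current_clock_read()` means the SYS clock has switched.

```rust
/// Outcome of waiting for the SYS clock mux to report the new clock source.
#[derive(Debug, PartialEq)]
pub enum SwitchResult {
    Switched { waited_us: u64 },
    TimedOut { waited_us: u64 },
}

/// Poll the clock status until it reports the switched clock or the timeout
/// expires. `read_current_clock` stands in for
/// `csr::sys_crg::current_clock_read()` and `delay_us` for `timer.delay_us()`.
pub fn wait_for_clock_switch<R, D>(
    mut read_current_clock: R,
    mut delay_us: D,
    poll_interval_us: u64,
    timeout_us: u64,
) -> SwitchResult
where
    R: FnMut() -> u8,
    D: FnMut(u64),
{
    let mut waited_us = 0;
    loop {
        // Assumption: 1 means "running from the switched clock".
        if read_current_clock() == 1 {
            return SwitchResult::Switched { waited_us };
        }
        if waited_us >= timeout_us {
            return SwitchResult::TimedOut { waited_us };
        }
        delay_us(poll_interval_us);
        waited_us += poll_interval_us;
    }
}
```

Compared with a single fixed delay followed by a panic, this returns as soon as the switch is observed and reports how long it took, which is the measurement being discussed here. The caller can still panic on `TimedOut`.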


We use the 100 MHz directly with rtio_clock = ext0_bypass, i.e. we aren't making a 125 MHz RTIO clock with PLL (I wasn't aware this was an option, although it's not one we're likely to use anyway). The Kasli SoC that I tested with was indeed a v1.0.1, so it's possible that it would work better with capacitors reworked, although it's built into someone else's experiment so that's a bit more downtime than just rebooting with different gateware.

It's unlikely that capacitors are the solution for ZC706, so the wait time fix seems appropriate.

Owner

Oh. Yeah that's probably not going to work, when you switch the frequencies from 125MHz to 100MHz for the delicate machine tuned for a specific frequency that are the GTX transceivers.

The good news is that for zc706 you only need to build a master variant with _100mhz suffix, although that is not exposed (but should be easily supported) in the flake.nix at the moment.

For Kasli-SoC, @sb10q should we support 100MHz (with PLL bypass) RTIO/SYS clock? That would require an additional PLL to get the on-board 125MHz oscillator down to 100 for bootstrap, and some way of indicating such setup within the JSON, and that's about it, I think.
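
To illustrate what "an additional PLL to get 125MHz down to 100" involves, here is a rough Rust sketch that searches for integer multiplier/divider pairs. Everything in it is an assumption for illustration: `find_pll_ratios` is a made-up helper, the multiplier and divider limits are placeholders, and the VCO range passed in by the caller must be taken from the device datasheet rather than from this snippet.

```rust
/// Find (multiplier, divider) pairs such that
/// f_in * multiplier / divider == f_out, with the VCO frequency
/// (f_in * multiplier) inside [vco_min, vco_max]. Frequencies are in kHz.
/// The search limits (multiplier 2..=64, divider 1..=128) are hypothetical.
pub fn find_pll_ratios(
    f_in_khz: u64,
    f_out_khz: u64,
    vco_min_khz: u64,
    vco_max_khz: u64,
) -> Vec<(u64, u64)> {
    let mut ratios = Vec::new();
    for mult in 2..=64u64 {
        let vco = f_in_khz * mult;
        if vco < vco_min_khz || vco > vco_max_khz {
            continue;
        }
        // Only exact integer dividers are accepted.
        if vco % f_out_khz == 0 {
            let div = vco / f_out_khz;
            if (1..=128).contains(&div) {
                ratios.push((mult, div));
            }
        }
    }
    ratios
}
```

For example, calling it with a 125MHz input, a 100MHz output and an assumed VCO window of 800MHz to 1600MHz yields the pairs (8, 10) and (12, 15), i.e. a 1000MHz or 1500MHz VCO divided down to 100MHz.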

Owner

> For Kasli-SoC, @sb10q should we support 100MHz (with PLL bypass) RTIO/SYS clock? That would require an additional PLL to get the on-board 125MHz oscillator down to 100 for bootstrap, and some way of indicating such setup within the JSON, and that's about it, I think.

Yes I think so, sounds good.


Just tested the newly built zc706 boot.bin for 125 and 100 MHz separately, both booting fine, thanks!

sb10q closed this issue 2023-11-07 18:55:11 +08:00
Reference: M-Labs/artiq-zynq#246