satellite: SYS CLK did not switch #246
Reference: M-Labs/artiq-zynq#246
In about 70% of attempts to run it, the Kasli-SoC fails to start with the error:
While master firmware works fine, satellite fails. Sometimes it works better after power-cycling.
Tried latest master, 5e6dca61a9 and 165b1400ab. master and 165b1400ab are broken, while 5e6dca61a9 works fine.
63594d7e3d maybe?
well, it just doesn't build
How does changing a few IBUFDS_GTE2 parameters break the build? What is the error message?
Probably because the dependencies were not updated at that commit.
Related PR
I only tried network boot when I worked on this PR, which should resolve this "CLK did not switch" problem. SD card boot was not tested on this PR.
Now that I have retested my code with SD card boot, the problem does appear, whereas booting over the network does not show it.
Possible Solution
The CLK does not switch problem goes away when I increase the delay value to 50_000.
In satman/main.rs
SD card boot succeeds with the satellite inserted.
I will test it further on the master branch or other commits.
Code Revision Tested
The code was tested on this combination of commits:
artiq: 7791f85
artiq-zynq: ee438105b2
@linuswck good find!
There should be a theoretical minimum value that can be computed from the FPGA datasheets. What is it?
I do not think the minimum value can be computed.
Firstly, there is no specification of the lock time for the CPLL inside GTXE2_Channel.
(In UG476, page 64, the "CPLL Reset" section does not provide a CPLL lock time calculation.)
Secondly, I think BruteforceClockAligner may make the minimum required setup time non-deterministic, as the Sys Clock is considered switched only after the GTX transceiver has finished initializing.
Trace of code:
satman/main.rs:
artiq/artiq/gateware/targets/kasli_generic.py
artiq/gateware/drtio/transceiver/gtx_7series_init.py
Surely, there must be some information about it somewhere?
It is not involved here. It only resets the RX CDR. The system clock is from the TX system.
I can confirm that the fix works on the latest master branch commit.
artiq-zynq: 583b629b40
The question I would rather ask is why the lock time would differ between SD boot and netboot. It could be useful to know of any issues where things work in a developer (netboot) setting but suddenly break in a production (SD) setting.
And why did it show in satellite? Is master affected as well?
Possibly temperature effects if the current value is marginal.
And/or slightly different execution times due to the state of the caches and SDRAM.
The maximum PLL lock time is 1 ms; this is given in the AC/DC Switching Characteristics datasheet on page 63.
The total min reset time required also depends on the GTX Reset time.
GTX Reset Flow ( ----> )
However, TXPMARESET_TIME and TXPCSRESET_TIME are reserved values (see UG476 page 64 for the parameter descriptions and page 69 for the timing diagram with these two parameters).
I cannot find any useful information on TXPMARESET_TIME and TXPCSRESET_TIME.
So I think there is no way to compute the minimum required reset duration?
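To make the arithmetic above concrete, here is a minimal sketch of the part that can be computed: the 1 ms maximum lock time expressed in clock cycles, and a wait-time bound that stays flagged as incomplete while the GTX reset time is unknown. The 125 MHz clock and the helper names are assumptions for illustration only; they are not taken from the firmware.

```python
def us_to_cycles(duration_us: int, clk_hz: int) -> int:
    """Number of clock cycles covering duration_us, rounded up (integer math only)."""
    return -(-duration_us * clk_hz // 1_000_000)


def wait_lower_bound_us(cpll_lock_max_us: int, gtx_reset_us=None):
    """Return (bound_us, is_complete).

    The bound is only complete if the GTX reset time is known. With the
    reserved TXPMARESET_TIME / TXPCSRESET_TIME values it is not, so the
    result is then just the PLL lock time and is_complete is False.
    """
    if gtx_reset_us is None:
        return cpll_lock_max_us, False
    return cpll_lock_max_us + gtx_reset_us, True


CPLL_LOCK_MAX_US = 1000        # 1 ms maximum lock time quoted above
ASSUMED_CLK_HZ = 125_000_000   # assumption: 125 MHz clock, for illustration only

print(us_to_cycles(CPLL_LOCK_MAX_US, ASSUMED_CLK_HZ))
print(wait_lower_bound_us(CPLL_LOCK_MAX_US))
```

Any delay chosen in firmware would need to sit above this partial bound by whatever the reset sequence adds, which is the part that remains unknown here.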
I'm pretty sure there is, it's just not obvious. Did you check the unisim model code?
We can't spend too much time on this when there are more urgent matters (e.g. shuttler). Let's commit the change now; we can close this issue once someone figures out the theoretical value.
I cannot find any useful info in the Xilinx unisim library.
I then looked at how Xilinx writes the TX init FSM for the GTX transceiver in their Vivado example design.
(obtained in gtwizard_0_tx_startup_fsm.v)
Note: All states share the same MAX_RETRIES register. Once MAX_RETRIES has reached 255, the whole reset sequence restarts again. TXUSRCLK = 125MHz
I think the MAX_RETRIES value they declared is arbitrary.
I'm running into this same issue on ZC706 nist qc2 master variant, attempting to boot from SD with an external clock.
The most recent build 151734 does not boot.
The first failing build is 142790 (46b2687d70). The previous build 142703 (b85c870b82) boots just fine, so I know the clock signal is good.
Interestingly the qc2 standalone gateware (e.g. build 142849) is also fine, even though the DRTIO master from that evaluation is not.
Possibly the same change of waiting longer for a PLL lock is necessary for the master as well?
@sb10q I've confirmed (unsurprisingly) that this affects Kasli-SoC DRTIO masters also; the same commit introduces the breakage. This is with the external clock running at 100 MHz.
Just to push this home: no one using external clocking with DRTIO has been able to use the latest gateware since February. I did mean to raise this earlier but things got in the way.
Side note - I'm not sure what hardware unit testing gets done in general, but booting with an external clock should probably be part of the test suite.
I just checked with a 1.1.1 Kasli-SoC on newest master:
I even made the part of the clock switch wait/check into a loop instead of a straight up panic so I could verify how long it takes to lock... but it did it within the first iteration anyway, about 10ms.
With older (1.0) Kasli-SoCs I would be concerned about capacitors, that seemed to affect the DRTIO, see: https://github.com/sinara-hw/Kasli-SOC/issues/68
No idea about ZC706 though.
However I understand that with some slight hardware differences it may take longer for the clock to switch, so let's bring master's waiting time up to the same level as satellite's, which seems to have helped in that case.
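The "loop instead of a straight up panic" idea can be sketched as a poll with a timeout that also reports how long the switch took. This is a host-side illustration in Python, not the firmware code; `is_switched` stands in for whatever status check the firmware performs, and every name here is hypothetical.

```python
import time


def wait_for_clock_switch(is_switched, timeout_s, poll_interval_s=0.001,
                          clock=time.monotonic, sleep=time.sleep):
    """Poll is_switched() until it returns True or timeout_s elapses.

    Returns the elapsed time in seconds on success and raises TimeoutError
    otherwise. clock and sleep are injectable so the loop can be exercised
    without real hardware or real delays.
    """
    start = clock()
    while True:
        if is_switched():
            return clock() - start
        if clock() - start >= timeout_s:
            raise TimeoutError("SYS CLK did not switch within %.3f s" % timeout_s)
        sleep(poll_interval_s)
```

Logging the returned elapsed time on each boot would show how much margin a given board has, instead of only learning that a fixed delay was too short.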
We use the 100 MHz directly with rtio_clock = ext0_bypass, i.e. we aren't making a 125 MHz RTIO clock with PLL (I wasn't aware this was an option, although it's not one we're likely to use anyway). The Kasli SoC that I tested with was indeed a v1.0.1, so it's possible that it would work better with capacitors reworked, although it's built into someone else's experiment so that's a bit more downtime than just rebooting with different gateware.
It's unlikely that capacitors are the solution for ZC706, so the wait time fix seems appropriate.
Oh. Yeah, that's probably not going to work: the GTX transceivers are delicate machines tuned for a specific frequency, and you are switching it from 125 MHz to 100 MHz.
The good news is that for zc706 you only need to build a master variant with the _100mhz suffix, although that is not exposed (but should be easily supported) in the flake.nix at the moment.
For Kasli-SoC, @sb10q should we support 100MHz (with PLL bypass) RTIO/SYS clock? That would require an additional PLL to get the on-board 125MHz oscillator down to 100 for bootstrap, and some way of indicating such setup within the JSON, and that's about it, I think.
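As a rough illustration of what "an additional PLL to get 125 MHz down to 100" involves, the sketch below searches for integer multiplier/divider pairs that turn 125 MHz into 100 MHz while keeping the VCO inside a given range. The VCO limits used in the example call are placeholder assumptions and must be checked against the datasheet for the actual part and speed grade; the function name is hypothetical and the input divider is ignored.

```python
def pll_settings(f_in_mhz, f_out_mhz, vco_min_mhz, vco_max_mhz,
                 max_mult=64, max_div=128):
    """List (mult, div, vco_mhz) with f_in * mult / div == f_out and VCO in range.

    Frequencies are integers in MHz so all comparisons are exact.
    """
    results = []
    for mult in range(1, max_mult + 1):
        vco = f_in_mhz * mult
        if not (vco_min_mhz <= vco <= vco_max_mhz):
            continue
        for div in range(1, max_div + 1):
            if vco == f_out_mhz * div:
                results.append((mult, div, vco))
    return results


# Assumed VCO range of 800 to 1600 MHz, for illustration only.
for mult, div, vco in pll_settings(125, 100, 800, 1600):
    print("mult=%d div=%d vco=%d MHz" % (mult, div, vco))
```

The ratio reduces to 4/5, so any valid pair is a multiple of that which lands the VCO in the allowed window.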
Yes I think so, sounds good.
Just tested the newly built zc706 boot.bin for 125 and 100 MHz separately, both booting fine, thanks!