ZC706/QC2 accessing devices from list causes hang/crash #188
Reference: M-Labs/artiq-zynq#188
Using NIST QC2 DRTIO master gateware from https://nixbld.m-labs.hk/build/118752, initialization of all DDSs either hangs or crashes. This happens regardless of whether the hardware is connected or not.
MWE:
Core log:
Increasing the delay between calls to `init()` makes it more likely that the coredevice will hang rather than crash outright. It's not clear to me what happens in that situation, since submitting subsequent experiments still works.

A 10 ms delay should already be plenty, since `self.core.mu_to_seconds(dds.init_duration_mu)` is 1.47 ms with a 125 MHz clock and 1.18 ms with 100 MHz. (NB: reading the AD9914 documentation, I believe that `dac_cal_duration_mu` has been miscalculated and is much larger than the actual required value, but it doesn't matter since `init()` is called so rarely.)

Will test with a recent build of the standalone version to see if it's a DRTIO gateware difference. We have been seeing occasional similar behaviour (hanging, no crashes) with older builds of the standalone version, but I wasn't sure whether that was on my end without a separate test setup.
I originally hypothesized that this might be due to `SYNC_CLK` glitches during initialization causing the RTIO clock to fail in weird ways, but ruled this out by testing with both an independent external clock and the internal clock, with none of the hardware connected.

Same story with https://nixbld.m-labs.hk/build/121055 (again no hardware connected).
BUT unrolling the loop and accessing attributes directly works slightly better, i.e.
This hangs about 30% of the time with https://nixbld.m-labs.hk/build/118752, no hangs in ~20 attempts with https://nixbld.m-labs.hk/build/121055, both clocked internally and no hardware connected.
This seems like unexpected behaviour to me, although definitely in a different way.
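As a rough check on how much weight the ~20 clean attempts carry, the sketch below assumes that each run hangs independently with a fixed probability. That independence is an assumption, not something established in this thread, and the figures are the approximate ones quoted above.

```python
# Assumption: runs are independent, each hanging with a fixed probability.
hang_rate_old = 0.30   # approximate rate reported for build 118752
attempts = 20          # approximate number of clean runs on build 121055

# Chance of seeing zero hangs in `attempts` runs if the rate were unchanged.
p_no_hangs = (1 - hang_rate_old) ** attempts

# Largest per-run hang rate still compatible with zero hangs at 95% confidence:
# solve (1 - p) ** attempts = 0.05 for p.
upper_bound_95 = 1 - 0.05 ** (1 / attempts)

print(f"P(no hangs in {attempts} runs at 30%) = {p_no_hangs:.4f}")
print(f"95% upper bound on hang rate = {upper_bound_95:.2f}")
```

Under that assumption, 20 clean runs make an unchanged 30% hang rate very unlikely (about 0.08%), but they do not exclude a residual rate of up to roughly 14% per run.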
Explicitly unrolling the loop rather than accessing attributes directly i.e.
does NOT work; this seems at least consistent with iterating over the list not working.
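The original snippets are not reproduced in this thread. To keep the three variants apart, here is a plain-Python sketch using hypothetical stub devices and attribute names (`StubDevice`, `dds0`, `dds1`, `ddses`). It is not ARTIQ kernel code and does not reproduce the hang; it only illustrates my reading of the three access patterns described above.

```python
class StubDevice:
    """Hypothetical stand-in for a DDS or TTL driver object (not ARTIQ code)."""

    def __init__(self, name):
        self.name = name
        self.initialised = False

    def init(self):
        self.initialised = True


class Experiment:
    def __init__(self):
        # Hypothetical attribute names; the thread does not show the real ones.
        self.dds0 = StubDevice("dds0")
        self.dds1 = StubDevice("dds1")
        self.ddses = [self.dds0, self.dds1]

    def init_by_loop(self):
        # Variant 1: iterate over the device list (reported to hang or crash).
        for dds in self.ddses:
            dds.init()

    def init_by_attribute(self):
        # Variant 2: loop unrolled, each device reached through its own
        # attribute (reported to work slightly better).
        self.dds0.init()
        self.dds1.init()

    def init_by_index(self):
        # Variant 3: loop unrolled, but devices still taken from the list
        # (reported NOT to work).
        self.ddses[0].init()
        self.ddses[1].init()
```

In ordinary Python all three are equivalent, which is why the differing behaviour on the core device points at how list elements are accessed rather than at the devices themselves.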
DataAbort points to a problem within the PS, which is not clocked by RTIO.
Does the problem also occur with ACPKI?
Same problem with ACPKI (https://nixbld.m-labs.hk/build/121087). Also not specific to DDSs; an exactly analogous experiment using 12 TTLs and `ttl.output()` in place of `dds.init()` failed in the same way.

Renaming the issue to reflect non-specificity to DDSs.
I'm having DDS hardware issues too; will raise a separate issue for that once I can accurately describe it.
Title changed from "ZC706/QC2 AD9914 DDS initialization causes hang/crash" to "ZC706/QC2 accessing devices from list causes hang/crash".

@sb10q the kernel hangs are pretty catastrophic: at the moment the DRTIO gateware is unusable because of them. There doesn't appear to be a workaround, since even accessing the device attribute directly causes hangs.
I didn't actually have any DDS issues, but the kernel hangs had left them in unrecoverable states with no output.
When testing with DRTIO gateware, I don't have a satellite connected, but I'm not trying to access any devices on satellites. This shouldn't be causing any issues, correct?
I'm not sure the hang is connected to the crash; I can work around the crash, but with the kernel hanging I can't get anything done at all.
I have reproduced it locally (thanks to your code). Looking for the error was quite demanding as I found that the hangs happen randomly when writing to any RTIO target without a specific pattern.
Either way, there was a little mix-up in the gateware code that went unnoticed while porting: one element that was supposed to be satellite-only also found its way into the master. After amending that, I could not reproduce the issue again in many tries, so hopefully that fixed it.
Still, do let us know whether this resolves the issue or whether it persists.
efc432352e appears to have fixed the hangs so far - thanks, sounds like it was a bear to identify! I'll continue to keep an eye out for any recurrences.

@mwojcik @sb10q we just saw another hang with a similar experiment (accessing devices from a list) - going back to a test setup to try to reproduce consistently. Using zc706 DRTIO master build 123830.