ZC706/QC2 accessing devices from list causes hang/crash #188

Open
opened 2022-04-27 06:01:46 +08:00 by ljstephenson · 8 comments

Using NIST QC2 DRTIO master gateware from https://nixbld.m-labs.hk/build/118752, initialization of all DDSs either hangs or crashes. This happens regardless of whether the hardware is connected or not.

MWE:

from artiq.experiment import *


class DDSCrash(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.dds_list = [self.get_device(f"ad9914dds{i}") for i in range(12)]

    @kernel
    def run(self):
        self.core.break_realtime()
        delay(10 * ms)

        for i in range(len(self.dds_list)):
            dds = self.dds_list[i]
            dds.init()
            delay(10 * ms)

Core log:

[    67.266603s]  INFO(runtime::kernel::core1): kernel starting
DataAbort on core 1
DFSR: 001

Increasing the delay between calls to init() makes it more likely that the coredevice will hang rather than crashing outright - it's not clear to me what happens in that situation since submitting subsequent experiments still works.

10 ms delay should already be plenty, since self.core.mu_to_seconds(dds.init_duration_mu) is 1.47 ms with 125 MHz clock, and 1.18 ms with 100 MHz. (NB reading the AD9914 documentation, I believe that dac_cal_duration_mu has been miscalculated and is much larger than the actual required value, but it doesn't matter since init() is called so rarely.)
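To put numbers on that margin, here is a small host-side sketch that uses only the figures quoted above (1.47 ms and 1.18 ms for the init duration, against the 10 ms delay). The variable names are made up for illustration and nothing here touches the hardware.

```python
# Init durations reported above, in seconds, keyed by RTIO clock frequency.
init_duration_s = {"125 MHz": 1.47e-3, "100 MHz": 1.18e-3}
delay_s = 10e-3  # delay between init() calls in the MWE

# Ratio of the programmed delay to the time init() itself needs.
margins = {clock: delay_s / duration for clock, duration in init_duration_s.items()}

for clock, margin in margins.items():
    print(f"{clock}: delay is {margin:.1f}x the init duration")
```

In both cases the delay is roughly 7 to 8 times longer than the init duration, so a lack of timing slack looks like an unlikely cause.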

Will test with a recent build of the standalone version to see if it's a DRTIO gateware difference. We have been seeing occasional similar behaviour (hanging, no crashes) with older builds of the standalone version, but I wasn't sure whether that was on my end without a separate test setup.

I originally hypothesized that this might be due to SYNC_CLK glitches during initialization causing the RTIO clock to fail in weird ways, but ruled this out by testing with both an independent external clock and the internal clock, with none of the hardware connected.

Author

Same story with https://nixbld.m-labs.hk/build/121055 (again no hardware connected)

BUT

unrolling the loop and accessing attributes directly works slightly better, i.e.

from artiq.experiment import *


class TestDDS(EnvExperiment):
    def build(self):
        self.setattr_device("core")

        self.dds_list = []
        for i in range(12):
            name = f"ad9914dds{i}"
            dev = self.get_device(name)
            self.dds_list.append(dev)
            setattr(self, name, dev)

    def prepare(self):
        print(self.core.mu_to_seconds(self.dds_list[0].init_duration_mu))

    @kernel
    def run(self):
        self.core.break_realtime()
        delay(10 * ms)

        self.ad9914dds0.init()
        delay(10 * ms)
        self.ad9914dds1.init()
        delay(10 * ms)
        self.ad9914dds2.init()
        delay(10 * ms)
        self.ad9914dds3.init()
        delay(10 * ms)
        self.ad9914dds4.init()
        delay(10 * ms)
        self.ad9914dds5.init()
        delay(10 * ms)
        self.ad9914dds6.init()
        delay(10 * ms)
        self.ad9914dds7.init()
        delay(10 * ms)
        self.ad9914dds8.init()
        delay(10 * ms)
        self.ad9914dds9.init()
        delay(10 * ms)
        self.ad9914dds10.init()
        delay(10 * ms)
        self.ad9914dds11.init()
        delay(10 * ms)

This hangs about 30% of the time with https://nixbld.m-labs.hk/build/118752, but showed no hangs in ~20 attempts with https://nixbld.m-labs.hk/build/121055; both were clocked internally with no hardware connected.

This seems like unexpected behaviour to me, although definitely in a different way.
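As a rough sanity check on those numbers, the sketch below asks how likely 20 clean runs would be if the hang rate were still 30%, and how large a hang rate could still hide behind zero hangs in 20 attempts. It assumes the runs are independent and takes "about 30%" and "~20 attempts" at face value, so treat the results as ballpark figures only.

```python
attempts = 20         # "~20 attempts" with build 121055
old_hang_rate = 0.30  # approximate rate seen with build 118752

# Chance of 20 clean runs in a row if the hang rate were still 30%.
p_all_clean = (1 - old_hang_rate) ** attempts

# Largest hang rate still compatible (at 95% confidence) with zero hangs
# in `attempts` runs: solve (1 - p) ** attempts = 0.05 for p.
upper_bound_95 = 1 - 0.05 ** (1 / attempts)

print(f"P(no hangs in {attempts} runs at 30%) = {p_all_clean:.4f}")
print(f"95% upper bound on the hang rate      = {upper_bound_95:.3f}")
```

Under these assumptions the difference between the two builds is very unlikely to be chance (well under 1%), but 20 clean runs cannot by themselves exclude a residual hang rate of around 14%.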

Explicitly unrolling the loop while still indexing the list, rather than accessing attributes directly, i.e.

        ...
        self.dds_list[0].init()
        delay(10 * ms)
        self.dds_list[1].init()
        delay(10 * ms)
        self.dds_list[2].init()
        delay(10 * ms)
        ...

does NOT work; this is at least consistent with iterating over the list not working.
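For anyone repeating this comparison with a different number of devices, a small host-side generator can produce both unrolled variants instead of typing them out. This is only a convenience sketch: `unrolled_body` and its parameters are hypothetical names, and it emits source text to paste into a kernel rather than calling ARTIQ.

```python
def unrolled_body(n, accessor, method="init", delay_ms=10, indent=" " * 8):
    """Return kernel source lines calling `method` on n devices, one after another.

    `accessor` is a format string with an `{i}` placeholder, for example
    "self.ad9914dds{i}" (attribute access) or "self.dds_list[{i}]" (list index).
    """
    lines = []
    for i in range(n):
        lines.append(f"{indent}{accessor.format(i=i)}.{method}()")
        lines.append(f"{indent}delay({delay_ms} * ms)")
    return "\n".join(lines)


# Attribute-access variant (the one that worked slightly better above).
print(unrolled_body(3, "self.ad9914dds{i}"))
# List-index variant (the one that still fails).
print(unrolled_body(3, "self.dds_list[{i}]"))
```

Generating both variants from the same function keeps everything except the access pattern identical, which makes the comparison between them cleaner.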

Owner

> I originally hypothesized that this might be due to SYNC_CLK glitches during initialization causing the RTIO clock to fail in weird ways

DataAbort points to a problem within the PS, which is not clocked by RTIO.
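The `DFSR: 001` value in the core log can be decoded a little further. The sketch below assumes the runtime prints the raw Data Fault Status Register in hexadecimal and that the ARMv7-A short-descriptor format applies; neither is confirmed in this thread. The table is a deliberately partial subset of the fault status encodings, and `decode_dfsr` is a hypothetical helper name.

```python
# Subset of ARMv7-A short-descriptor fault status encodings, FS[4:0].
FAULT_STATUS = {
    0b00001: "alignment fault",
    0b00101: "translation fault (section)",
    0b00111: "translation fault (page)",
    0b01001: "domain fault (section)",
    0b01011: "domain fault (page)",
    0b01101: "permission fault (section)",
    0b01111: "permission fault (page)",
}


def decode_dfsr(dfsr):
    """Split a raw DFSR value into fault status, domain and write flag."""
    # FS[4] lives in bit 10, FS[3:0] in bits 3..0.
    fs = ((dfsr >> 10) & 1) << 4 | (dfsr & 0xF)
    return {
        "fault": FAULT_STATUS.get(fs, f"other (FS={fs:#07b})"),
        "domain": (dfsr >> 4) & 0xF,
        "write": bool((dfsr >> 11) & 1),  # WnR bit: set when a write caused the abort
    }


print(decode_dfsr(0x001))
```

If those assumptions hold, the logged value would decode as an alignment fault, which fits a fault inside the processing system rather than anything tied to the RTIO clock.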

Owner

Does the problem also occur with ACPKI?

mwojcik was assigned by sb10q 2022-04-27 10:49:09 +08:00
Author

Same problem with ACPKI (https://nixbld.m-labs.hk/build/121087). Also not specific to DDSs; an exactly analogous experiment using 12 TTLs and ttl.output() in place of dds.init() failed in the same way.

Renaming the issue to reflect non-specificity to DDSs.

I'm having DDS hardware issues too; will raise a separate issue for that once I can accurately describe it.

ljstephenson changed title from ZC706/QC2 AD9914 DDS initialization causes hang/crash to ZC706/QC2 accessing devices from list causes hang/crash 2022-04-28 00:27:27 +08:00
Author

@sb10q the kernel hangs are pretty catastrophic: the DRTIO gateware is currently unusable because of them. There doesn't appear to be a workaround, since even accessing the device attribute directly causes hangs.

I didn't actually have any DDS issues, but the kernel hangs had left the DDSs in unrecoverable states with no output.

When testing with DRTIO gateware, I don't have a satellite connected, but I'm not trying to access any devices on satellites. This shouldn't be causing any issues, correct?

I'm not sure the hang is connected to the crash; I can work around the crash, but with the kernel hanging I can't get anything done at all.

Owner

I have reproduced it locally (thanks to your code). Looking for the error was quite demanding as I found that the hangs happen randomly when writing to any RTIO target without a specific pattern.

Either way, there was a small mix-up in the gateware code that went unnoticed during porting: one element that was supposed to be satellite-only also found its way into the master. After fixing that, I could no longer reproduce the issue over many attempts, so hopefully that resolves it.

Still, do let us know whether this fixes the issue or whether it persists.

Author

efc432352e appears to have fixed the hangs so far - thanks, sounds like it was a bear to identify! I'll continue to keep an eye out for any recurrences.

sb10q closed this issue 2022-05-04 18:03:19 +08:00
Author

@mwojcik @sb10q we just saw another hang with a similar experiment (accessing devices from a list) - going back to a test setup to try to reproduce consistently. Using zc706 DRTIO master build 123830.
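For reproducing an intermittent hang consistently, one option is to run the same experiment many times from the host and count the runs that never return. The sketch below is a generic harness under two assumptions not confirmed in this thread: that the experiment is launched through a command-line tool such as `artiq_run`, and that a hang shows up as that command not exiting. `count_hangs` is a hypothetical helper, and the demo at the bottom uses a harmless stand-in command so the script runs without any hardware.

```python
import subprocess
import sys


def count_hangs(cmd, attempts=20, timeout_s=30.0):
    """Run `cmd` repeatedly and return (hangs, failures).

    A run that exceeds `timeout_s` is counted as a hang; a run that exits
    with a non-zero status is counted as a failure (for example a crash).
    """
    hangs = 0
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run() kills the child before raising, so nothing is left behind.
            hangs += 1
            continue
        if result.returncode != 0:
            failures += 1
    return hangs, failures


# Stand-in command so the sketch is runnable anywhere; replace it with the
# real launcher, e.g. ["artiq_run", "your_experiment.py"] (placeholder file name).
demo_cmd = [sys.executable, "-c", "pass"]
print(count_hangs(demo_cmd, attempts=3, timeout_s=30.0))
```

Recording the hang count per gateware build over a fixed number of attempts gives figures that can be compared directly with the earlier "about 30%" and "no hangs in ~20 attempts" observations.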

Reference: M-Labs/artiq-zynq#188