List temporaries/literals in loops can cause stack overflows without compiler diagnostics #1724
Reference in New Issue
Block a user
Delete Branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
Running the attached sample code will freeze the ARTIQ core device. This demo code repeatedly calls a function that returns the sum of the first element of two lists.
Expected behavior
Numbers printed out on the computer in 10000 increments, until self.nLoops. Once the total delay of 2*us*self.nLoops reached, "Finished" is printed out.
Actual behavior
When self.nLoops is set to large enough number, the core device freezes, and "Finished" is never printed out. The only way to recover seems to be power cycling the crate. By recompiling a firmware with larger stack1 size, the number self.nLoops that can trigger the problem gets bigger.
I checked LLVM IR dump, it seems that the two lists are allocated on the stack, but the function call is optimized away. These memory allocated on the stack is not released until the end of the main function, before when the stack already overflows.
Although this test case seems quite arbitrary, but is actually quite relevant to some core device drivers. The Zotino core device driver contains several functions like set_dac() that have lists as input arguments. An infinite loop of updating Zotino outputs using set_dac() can easily crash the core device.
Logging info
No logging message is printed out on the computer when core device is frozen. "UndefinedInstruction" is printed out on Kasli-SoC UART.
System
Components involved: ARTIQ core device, compiler?
Operating system used: Windows 11 with MSYS2
ARTIQ version: ARTIQ v8.9006+17be223
Hardware involved: Kasli-SoC
Returning lists from a kernel is not currently supported (the bug is actually that the compiler should have thrown an error but did not). It will be supported after the implementation of the CTRC proposal in NAC3.
Are you sure?
I believe this function is returning an integer value, not a list.
The following code also crashes the device.
Oh I see, that's because you allocate a list by writing [0.0] and they aren't freed until the function returns. I suppose the MWE could be this?
I tried this briefly. It does run without crashing the core device, but the IR shows that allocation of the local variable a is completely optimized away. Not surprising, maybe this is too 'minimal'...
un_opt version
optimized version
This is indeed quite a brittle area in the current implementation, in that the compiler barely helps in ensuring the correct lifetimes. That being said, the aspects at play can be understood from a typical C/C++/… perspective, and as long as they are taken into account, code quite extensively reliant on arrays can be used in practice without issues. There are two different considerations here
(1) the lifetime tracking of parameters/return values is subtly faulty (e.g. https://github.com/m-labs/artiq/issues/1497 and https://github.com/m-labs/artiq/issues/1677), which can affect lists, and
(2) temporary lists (from literals, array operations, etc.) are allocated on the caller stack using the equivalent of C's
alloca(), and so only deallocated on function exit.(1) is not an issue here, but (2) is.
a = [0.0]; b = [20]; while True: self.zotino.set_dac(a, b)should work just fine, as wouldwhile True: foo()withdef foo(): self.zotino.set_dac([0.0], [20]).