Regression: exception causes abort #49
Reference: M-Labs/artiq-zynq#49
For exception.py I get a very unexpected panic from a repeated block_on call, but only after the LoadKernel read_chunk has already received 4087/6700 bytes. Seems like consistent memory corruption.
This seems fixed in M-Labs/zc706#48 and the associated branch `better-reset` by @pca006132.
Can you give a more detailed procedure for reproducing the problem? I want to look into it. The exception handling code should only use static or stack memory, but I may have gotten some of the unsafe code wrong and accessed the wrong location.
If the problem is due to a memory leak or memory corruption, it is possibly not fixed: the allocator just gets reset every time the kernel is reset. Running the try/except in a loop within a single kernel may reproduce the problem.
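A minimal sketch of such a loop, assuming a standard ARTIQ experiment layout. The class name `ExceptionLoop`, the `iterations` and `caught` attributes, and the iteration count are hypothetical, and the `ImportError` fallback exists only so the loop logic can be checked on a host without ARTIQ. This has not been run on hardware.

```python
try:
    from artiq.experiment import EnvExperiment, kernel
except ImportError:
    # Host-only stand-ins (assumption): just enough to run the loop
    # without ARTIQ installed. They do not model the real runtime.
    class EnvExperiment:
        def __init__(self):
            self.build()

        def setattr_device(self, name):
            setattr(self, name, None)

    def kernel(fn):
        return fn


class ExceptionLoop(EnvExperiment):
    """Raise and catch an exception many times inside one kernel."""

    def build(self):
        self.setattr_device("core")
        self.iterations = 10000  # hypothetical count, increase as needed
        self.caught = 0

    @kernel
    def run(self):
        caught = 0
        for _ in range(self.iterations):
            try:
                raise ValueError("loop test")
            except ValueError:
                caught += 1
        # Written back to the host object when the kernel finishes.
        self.caught = caught
```

If the allocator state really is only cleared on kernel reset, a leak or corruption should show up here well before the kernel returns, rather than being hidden by restarting the kernel between attempts.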
I've spotted a memory leak in the RPC exception system, but I'm not sure there is a good way to fix it. I'm also not sure it is the same bug @astro observed, as it takes a large number of RPC exceptions to cause an allocation error.
An RPC exception is thrown directly by `eh_artiq::raise` without dropping the memory, as we still have to reference the fields. It is possible to clone the fields and drop the message before raising the exception, but we would still have to free the memory somehow in the catch portion of the code, and it is hard to determine when to free the memory in Python.
Also, the current exception implementation stores the exception data in stack memory for a raise within the kernel, but not for RPC. If the user catches the exception and calls functions afterwards, could that cause memory corruption once execution reaches that part of the stack? I'm not sure about this.
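To illustrate why the leak only shows up after many RPC exceptions, here is a toy model in Python. It is not the firmware code: `Heap`, `RpcError`, `raise_leaky`, the 1024-byte capacity and the message text are all made up for the example, and the heap only counts bytes in use.

```python
class Heap:
    """Toy fixed-size heap that only tracks how many bytes are in use."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def alloc(self, size):
        if self.used + size > self.capacity:
            raise MemoryError("allocation error")
        self.used += size


class RpcError(Exception):
    """Stands in for an exception coming back from an RPC."""


def raise_leaky(heap, message):
    # The message buffer is allocated and never freed, because the raised
    # exception still refers to its fields.
    heap.alloc(len(message))
    raise RpcError(message)


def exceptions_until_exhausted(capacity, message):
    """Count how many RPC exceptions succeed before the heap runs out."""
    heap = Heap(capacity)
    count = 0
    while True:
        try:
            raise_leaky(heap, message)
        except RpcError:
            count += 1
        except MemoryError:
            return count


exhausted_after = exceptions_until_exhausted(1024, "RPC failed")
```

With a 10-byte message and a 1024-byte heap the model survives 102 exceptions before failing, so on a real heap the number would be far larger, which fits the leak being hard to hit by accident.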
https://github.com/m-labs/artiq/issues/1491
And if that is fixed by disallowing users from capturing the exception object, the RPC memory leak can be fixed by keeping a dangling pointer to the exception object and freeing it. The exception would be caught before any other core1 allocation, so the memory the pointer points to would not be corrupted by other allocations, and this should be sound. This requires a separate allocator for core1; otherwise core0 could allocate something that overwrites the memory of the exception object (possible in theory, but unlikely).
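A rough Python model of why the dangling-pointer approach depends on core1 having its own allocator. Everything here is hypothetical (`SlotHeap`, `raise_and_catch`, the strings), and the "allocator" simply reuses the most recently freed slot, which is the worst case for this scheme.

```python
class SlotHeap:
    """Toy allocator: freed slots are reused, most recently freed first."""

    def __init__(self):
        self.slots = []
        self.free_list = []

    def alloc(self, data):
        if self.free_list:
            index = self.free_list.pop()
            self.slots[index] = data  # overwrites the old contents
        else:
            index = len(self.slots)
            self.slots.append(data)
        return index

    def free(self, index):
        # The contents stay in place until the slot is reused.
        self.free_list.append(index)

    def read(self, index):
        return self.slots[index]


def raise_and_catch(core1_heap, core0_heap):
    """Free the exception early, then read it through the stale index."""
    ptr = core1_heap.alloc("RPC failed")
    core1_heap.free(ptr)              # ptr is now dangling
    core0_heap.alloc("core0 buffer")  # core0 allocates before the catch
    return core1_heap.read(ptr)       # what the handler would see


shared = SlotHeap()
seen_with_shared_heap = raise_and_catch(shared, shared)
seen_with_separate_heaps = raise_and_catch(SlotHeap(), SlotHeap())
```

With one shared heap the handler reads `"core0 buffer"` instead of the exception message. With separate heaps the message is still `"RPC failed"`, as long as core1 itself does not allocate before the catch.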
Is there anything Zynq-specific that still needs to be done here, or is it "just" https://github.com/m-labs/artiq/issues/1491?
Not sure, I haven't been able to reproduce the panic for a long time. Maybe we can close this and reopen it if we find the bug again.