Regression: exception causes abort #49

Closed
opened 2020-07-13 15:33:03 +08:00 by pca006132 · 7 comments
Contributor
No description provided.
pca006132 self-assigned this 2020-07-13 15:33:03 +08:00
sb10q added the priority:critical label 2020-07-15 18:04:13 +08:00
Contributor

For exception.py I get a very unexpected panic from a repeated block_on call, but only after the LoadKernel read_chunk has already received 4087/6700 bytes. Seems like consistent memory corruption.

Contributor

This seems fixed in M-Labs/zc706#48 and the associated branch better-reset by @pca006132.

Author
Contributor

> For exception.py I get a very unexpected panic from a repeated block_on call, but only after the LoadKernel read_chunk has already received 4087/6700 bytes. Seems like consistent memory corruption.

Can you give a more detailed procedure for reproducing the problem? I want to look into it. The exception handling code should only use static or stack memory; I'm not sure if I got some of the unsafe code wrong and accessed a wrong location...

> This seems fixed in M-Labs/zc706#48 and the associated branch better-reset by @pca006132.

If the problem is due to a memory leak or memory corruption, this is possibly not fixed; the allocator just gets reset on every kernel reset. Running the try/except in a loop within one kernel may reproduce the problem.

Author
Contributor

I've spotted a memory leak in the RPC exception system, but I'm not sure there is a good way to fix it. I'm not sure if it is the same bug @astro observed, as this requires a large number of RPC exceptions to cause an allocation error...

```rust
pub extern fn rpc_recv(slot: *mut ()) -> usize {
    let core1_rx: &mut sync_channel::Receiver<Message> = unsafe { mem::transmute(KERNEL_CHANNEL_0TO1) };
    let core1_tx: &mut sync_channel::Sender<Message> = unsafe { mem::transmute(KERNEL_CHANNEL_1TO0) };
    core1_tx.send(Message::RpcRecvRequest(slot));
    let reply = core1_rx.recv();
    match *reply {
        Message::RpcRecvReply(Ok(alloc_size)) => alloc_size,
        Message::RpcRecvReply(Err(exception)) => unsafe {
            eh_artiq::raise(&eh_artiq::Exception {
                name:     exception.name.as_bytes().as_c_slice(),
                file:     exception.file.as_bytes().as_c_slice(),
                line:     exception.line as u32,
                column:   exception.column as u32,
                function: exception.function.as_bytes().as_c_slice(),
                message:  exception.message.as_bytes().as_c_slice(),
                param:    exception.param
            })
        },
        _ => panic!("received unexpected reply to RpcRecvRequest: {:?}", reply)
    }
}
```

An RPC exception is thrown directly by `eh_artiq::raise` without dropping the message's memory, since we still have to reference its fields. It is possible to clone the fields and drop the message before raising the exception, but we would still have to free the cloned memory somewhere in the catch portion of the code, and it is hard to determine when to free it from Python...
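To make the trade-off concrete, here is a minimal sketch of the "clone the fields, then drop the message" idea, using stand-in types rather than the actual eh_artiq/artiq-zynq API; the leaked strings are exactly the memory the catch path would still have to free somehow:

```rust
// Stand-in for eh_artiq::Exception: the fields must outlive the raise, so the
// cloned strings are leaked to get a 'static lifetime.
struct OwnedException {
    name: &'static str,
    message: &'static str,
}

// Clone a field out of the message and leak it so the message itself can be
// dropped before the exception is raised.
fn leak_clone(s: &str) -> &'static str {
    Box::leak(s.to_owned().into_boxed_str())
}

fn main() {
    let message = String::from("proxy raised ValueError"); // stand-in for the sync_channel message
    let exn = OwnedException {
        name: leak_clone("ValueError"),
        message: leak_clone(&message),
    };
    drop(message); // the channel message can now be freed before raising...
    println!("{}: {}", exn.name, exn.message);
    // ...but exn.name / exn.message are still leaked allocations; freeing them
    // from the catch side (or from Python) is the open question above.
}
```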

Also, the current exception implementation stores the exception data in stack memory for raises within the kernel, but not for RPC. If the user catches the exception and calls functions afterwards, it may be possible to cause memory corruption if execution reaches that part of the stack again? I'm not sure about this.

Author
Contributor

> Also, the current exception implementation stores the exception data in stack memory for raises within the kernel, but not for RPC. If the user catches the exception and calls functions afterwards, it may be possible to cause memory corruption if execution reaches that part of the stack again? I'm not sure about this.

https://github.com/m-labs/artiq/issues/1491

And if that is fixed by disallowing users to capture the exception object, the RPC memory leak can be fixed by keeping a (now dangling) pointer to the exception object and freeing it afterwards. The exception would be caught before any other core1 allocation, so the memory the pointer points to would not have been corrupted by other allocations, and this should be sound. It requires a separate allocator for core1, though; otherwise core0 could allocate something that overwrites the memory of the exception object (possible in theory, but unlikely).
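Roughly, that could look like the following minimal sketch; the names (`stash`, `reclaim`, the static slot) are illustrative assumptions, not existing artiq-zynq code, and the real version would need the per-core allocator mentioned above:

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// Remember the (soon to be dangling) allocation backing the last RPC exception.
static PENDING_EXN_PTR: AtomicPtr<u8> = AtomicPtr::new(std::ptr::null_mut());
static PENDING_EXN_LEN: AtomicUsize = AtomicUsize::new(0);

// Leak a buffer so the raised exception can point into it, but record where it
// is so it can be reclaimed once the exception has been caught.
fn stash(buf: Vec<u8>) -> &'static [u8] {
    let leaked: &'static mut [u8] = Box::leak(buf.into_boxed_slice());
    PENDING_EXN_PTR.store(leaked.as_mut_ptr(), Ordering::SeqCst);
    PENDING_EXN_LEN.store(leaked.len(), Ordering::SeqCst);
    leaked
}

// Called right after the catch, before any further core1 allocation, so the
// stale allocation is freed while it is still untouched.
fn reclaim() {
    let ptr = PENDING_EXN_PTR.swap(std::ptr::null_mut(), Ordering::SeqCst);
    if !ptr.is_null() {
        let len = PENDING_EXN_LEN.load(Ordering::SeqCst);
        unsafe { drop(Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len))) };
    }
}

fn main() {
    {
        let name = stash(b"ValueError".to_vec());
        println!("raised: {}", String::from_utf8_lossy(name)); // exception in flight
    } // no references into the buffer remain past this point
    reclaim(); // exception caught; free the buffer before anything else allocates
}
```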

Owner

Is there anything Zynq-specific that still needs to be done here, or is it "just" https://github.com/m-labs/artiq/issues/1491?

Author
Contributor

> Is there anything Zynq-specific that still needs to be done here, or is it "just" https://github.com/m-labs/artiq/issues/1491?

Not sure; I haven't been able to reproduce the panic for a long time. Maybe we can close this and reopen it if we run into the bug again.

sb10q closed this issue 2020-08-04 13:58:44 +08:00
Reference: M-Labs/artiq-zynq#49