kernel takeover + idle kernel #32

Closed
opened 2020-07-06 00:02:07 +08:00 by sb10q · 6 comments
  • When a control connection is established (i.e. immediately after it has sent the ARTIQ coredev magic string), then any other established control connections must be terminated, including any kernels they may have started.
  • If an idle kernel is defined in the configuration, it must be run when there is no established control connection.
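
The two requirements above can be modelled as a small state machine. This is only a sketch: `Runtime`, `Core1`, and the method names are hypothetical and do not come from the artiq-zynq code base, and stopping the idle kernel when a connection is established is an assumption.

```rust
/// What core1 is doing in this simplified model.
#[derive(Debug, Clone, PartialEq)]
pub enum Core1 {
    /// Nothing is loaded.
    Stopped,
    /// The idle kernel from the configuration is running.
    IdleKernel,
    /// A kernel started by the control connection with this id is running.
    UserKernel(u32),
}

/// Hypothetical runtime state: at most one established control connection.
pub struct Runtime {
    has_idle_kernel: bool,
    active: Option<u32>,
    pub core1: Core1,
    /// Ids of connections terminated by a takeover, in order.
    pub terminated: Vec<u32>,
}

impl Runtime {
    pub fn new(has_idle_kernel: bool) -> Self {
        let mut rt = Runtime {
            has_idle_kernel,
            active: None,
            core1: Core1::Stopped,
            terminated: Vec::new(),
        };
        rt.run_idle_if_unconnected();
        rt
    }

    /// Start the idle kernel (if configured) when no connection is established.
    fn run_idle_if_unconnected(&mut self) {
        if self.active.is_none() {
            self.core1 = if self.has_idle_kernel {
                Core1::IdleKernel
            } else {
                Core1::Stopped
            };
        }
    }

    /// Called right after a client has sent the magic string.
    pub fn establish(&mut self, id: u32) {
        // Takeover: terminate the previous connection, if any.
        if let Some(old) = self.active.take() {
            self.terminated.push(old);
        }
        // Whatever was running (idle kernel or the old connection's kernel) stops.
        self.core1 = Core1::Stopped;
        self.active = Some(id);
    }

    /// Only the currently established connection may start a kernel.
    pub fn start_kernel(&mut self, id: u32) -> bool {
        if self.active == Some(id) {
            self.core1 = Core1::UserKernel(id);
            true
        } else {
            false
        }
    }

    /// Closing the established connection stops its kernel and falls back to idle.
    pub fn disconnect(&mut self, id: u32) {
        if self.active == Some(id) {
            self.active = None;
            self.core1 = Core1::Stopped;
            self.run_idle_if_unconnected();
        }
    }
}
```

In this model a second `establish` call records the first connection as terminated and stops its kernel, and the idle kernel only comes back once the last connection disconnects.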
sb10q added the priority:high label 2020-07-15 18:04:44 +08:00

The current path being explored is to send an IRQ to core1 to terminate the kernel, and to call the unwinder from the ISR.

The advantage of this technique is that we can use the multicore Rust features throughout the runtime, which would otherwise break if we reset core1 (for example, Rust cannot handle core1 being reset while it is holding a mutex).


An alternative pathway:

  1. Allow only one connection at a time: drop the previous connection (the handle_connection future) before handling the new one. This is implemented in PR #45.
  2. Give each core its own allocator, and reset the core1 allocator every time core1 is restarted. See M-Labs/zc706#47 for details.
  3. Core1 should not own or reference memory/resources allocated in core0, except within critical sections.
  4. Use a mutex (or whatever is more ergonomic) to protect critical sections in core1, and acquire that mutex on core0 before resetting core1.

All kernel interaction should be handled in handle_connection, so dropping the handle_connection future should drop all handles related to core1 (memory, Rc, mutex).
A later reset should then be safe, as no dangling pointers into memory allocated by core1 remain.

Rules 3 and 4 ensure that resetting core1 neither leaks nor corrupts core0 resources. Corruption of core1 resources is not a problem, as rule 1 leaves no references to core1 resources.

Critical sections that I can currently think of include:

  • Core1 allocation/deallocation. The core1 allocator mutex must not be held when core1 is reset, otherwise the later core1 allocator initialization would be blocked.
    Possible workaround: introduce an unsafe function for resetting the mutex, and reset the mutex when we initialize the allocator.
  • Core1 kernel loading. This section would use an Arc<Vec<u8>> from core0, so we have to wait until the kernel finishes loading before decrementing the reference count.
  • Logger access. We already use a mutex there, so this should not be a problem.
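
The workaround for the allocator mutex could look roughly like the following. This is a minimal sketch, assuming a spin lock built on an atomic flag; `ResettableLock` and `force_reset` are hypothetical names, not the mutex type used in the runtime.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Minimal spin lock with a reset hook; stands in for the core1 allocator mutex.
pub struct ResettableLock {
    locked: AtomicBool,
}

impl ResettableLock {
    pub const fn new() -> Self {
        ResettableLock {
            locked: AtomicBool::new(false),
        }
    }

    /// Try to take the lock once; returns true on success.
    pub fn try_lock(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// Release the lock on the normal path.
    pub fn unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }

    /// Force the lock back to the unlocked state.
    ///
    /// # Safety
    /// The caller must guarantee that no core is still executing inside the
    /// critical section, e.g. core1 has been reset and not yet restarted.
    pub unsafe fn force_reset(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

If core1 is reset between `try_lock` and `unlock`, the flag stays set and every later `try_lock` fails; calling `force_reset` during allocator initialization clears it again.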

sync_channel would leak memory in this scheme, as it uses Into<Box<T>>, which allocates the message on the core0 heap if it is sent from core0.
If core1 gets reset while handling the message (before dropping it), that memory is leaked.

I wonder if it would be possible to allocate the box on the core1 heap from core0, maybe by manually calling the allocator or something.
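
The leak can be reproduced in a small host-side model. This is only a sketch: `Message`, `send`, `receive_and_drop`, and `reset_receiver` are hypothetical stand-ins and not the actual sync_channel implementation; the counter plays the role of the core0 heap.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Number of `Message` values currently alive (allocated and not yet dropped).
static LIVE_MESSAGES: AtomicUsize = AtomicUsize::new(0);

/// Read the live-message counter.
pub fn live() -> usize {
    LIVE_MESSAGES.load(Ordering::SeqCst)
}

pub struct Message(pub u32);

impl Message {
    /// Allocate a boxed message on the sender's (core0) heap.
    pub fn new(v: u32) -> Box<Message> {
        LIVE_MESSAGES.fetch_add(1, Ordering::SeqCst);
        Box::new(Message(v))
    }
}

impl Drop for Message {
    fn drop(&mut self) {
        LIVE_MESSAGES.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Models the channel slot: ownership of the box is handed over as a raw pointer.
pub fn send(msg: Box<Message>) -> *mut Message {
    Box::into_raw(msg)
}

/// Normal path: the receiver rebuilds the box, reads it, and drops it.
pub fn receive_and_drop(ptr: *mut Message) -> u32 {
    // Safety: `ptr` came from `Box::into_raw` in `send` and is consumed once.
    let msg = unsafe { Box::from_raw(ptr) };
    msg.0
}

/// Reset path: the receiver's state is discarded, so the pointer is simply
/// forgotten and the allocation is never freed.
pub fn reset_receiver(_ptr: *mut Message) {}
```

After `receive_and_drop` the live count returns to its previous value, while after `reset_receiver` it stays one higher, which is the leak described above.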


Implementation: https://git.m-labs.hk/pca006132/artiq-zynq/src/branch/better-reset

It does not yet cover all the cases mentioned above, such as the allocator lock.


Another point in favor of resetting: if the kernel runs out of memory (OOM), the panic handler in core1 could somehow signal core0, which would then resume operation by resetting core1, without core0 being affected.
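
One way the signalling could work is a shared atomic flag that core0 polls. This is a rough sketch under that assumption; `Core1Supervisor` and its methods are hypothetical, and the actual core1 reset is reduced to a counter.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

/// Shared state between the core1 panic handler and the core0 main loop.
pub struct Core1Supervisor {
    oom: AtomicBool,
    resets: AtomicUsize,
}

impl Core1Supervisor {
    pub const fn new() -> Self {
        Core1Supervisor {
            oom: AtomicBool::new(false),
            resets: AtomicUsize::new(0),
        }
    }

    /// Called from (the model of) the core1 panic handler on OOM.
    pub fn signal_oom(&self) {
        self.oom.store(true, Ordering::Release);
    }

    /// Called periodically on core0; returns true if core1 was reset.
    pub fn poll(&self) -> bool {
        // `swap` clears the flag so that each signal triggers exactly one reset.
        if self.oom.swap(false, Ordering::AcqRel) {
            // A real runtime would reset core1 and its allocator here.
            self.resets.fetch_add(1, Ordering::Relaxed);
            true
        } else {
            false
        }
    }

    /// Number of resets performed so far.
    pub fn reset_count(&self) -> usize {
        self.resets.load(Ordering::Relaxed)
    }
}
```

Core0 keeps running throughout; it only observes the flag and restarts core1 when it is set.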


Done in #79

Reference: M-Labs/artiq-zynq#32