core0 memory leak #85

Closed
opened 2020-08-05 10:10:45 +08:00 by pca006132 · 3 comments

When core1 gets restarted after core0 has put some messages in the channel, the channel is reset and the messages are lost, so we cannot free them.
This can be solved by exposing an unsafe API that drives the read index forward and frees the messages until it reaches the write index. This should be done before asking core1 to restart.
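For illustration, a minimal sketch of what such an unsafe drain API could look like, on a toy channel type (the `Channel`/`drain_unread` names and layout are placeholders, not the actual artiq-zynq code):

```rust
// Illustrative only: a toy SPSC channel with free-running read/write indices
// and heap-allocated messages. The real artiq-zynq channel type differs; this
// just shows the shape of an "unsafe drain" API.
use std::sync::atomic::AtomicUsize;

pub struct Channel<T> {
    read: AtomicUsize,            // advanced by the consumer (core1)
    write: AtomicUsize,           // advanced by the producer (core0)
    slots: [Option<Box<T>>; 16],  // messages waiting to be consumed
}

impl<T> Channel<T> {
    /// Free every message that the previous core1 incarnation never consumed
    /// and advance the read index to the write index.
    ///
    /// # Safety
    /// Must only be called while core1 is known not to be touching the
    /// channel, otherwise this races with the reader.
    pub unsafe fn drain_unread(&mut self) {
        let cap = self.slots.len();
        let write = *self.write.get_mut();
        let read = self.read.get_mut();
        while *read != write {
            // Dropping the Box frees a message that was left in the queue.
            self.slots[*read % cap].take();
            *read = (*read).wrapping_add(1);
        }
    }
}
```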

Can this be done before restarting core1, where it could still be reading from the queue?

I'd imagined we'd do all such maintenance (resetting queue state, in particular freeing memory) after resetting core1, where it would wait for core0 to signal a condition. The flow would look like this:

  1. core0 resets core1.
  2. core0 waits for an "I am starting up" flag to be set by core1.
  3. Meanwhile, core1 does any cache/… setup, and then sets the flag.
  4. core1 now waits until core0 sets a second flag, giving the go-ahead for startup.
  5. core0, after having read the flag, now knows that core1 isn't doing anything, and all memory writes from before reset are visible. It can reset all the queue pointers, etc. safely now (in particular, "unsafely" free any messages not consumed by core1's previous incarnation).
  6. core0 sets the second flag, sending core1 on its merry way.

This way, any cache coherence/… weirdness during startup is neatly encapsulated away, and things are easy to reason about – during step 5, core0 is effectively a single-threaded program with exclusive access to all shared resources.
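For illustration, roughly what that handshake could look like with two shared atomics (the statics and helper functions here are placeholders for whatever the firmware actually provides; only the ordering matters):

```rust
// Sketch of the two-flag handshake described above. All names are
// illustrative, not the actual firmware API.
use std::sync::atomic::{AtomicBool, Ordering};

static CORE1_STARTING: AtomicBool = AtomicBool::new(false);
static CORE1_GO: AtomicBool = AtomicBool::new(false);

fn reset_core1() { /* platform-specific: reset/restart core1 */ }
fn setup_caches() { /* platform-specific cache/MMU setup on core1 */ }
unsafe fn drain_queues() { /* free unread messages, reset indices */ }

// core0 side (steps 1, 2, 5, 6)
fn core0_restart_core1() {
    CORE1_GO.store(false, Ordering::SeqCst);
    reset_core1();                                          // 1. reset core1
    while !CORE1_STARTING.swap(false, Ordering::SeqCst) {}  // 2. wait for "starting up"
    // 5. core1 is parked, so core0 has exclusive access to shared state.
    unsafe { drain_queues() };
    CORE1_GO.store(true, Ordering::SeqCst);                 // 6. give the go-ahead
}

// core1 side (steps 3, 4)
fn core1_startup() {
    setup_caches();                                         // 3. cache/… setup,
    CORE1_STARTING.store(true, Ordering::SeqCst);           //    then set the flag
    while !CORE1_GO.load(Ordering::SeqCst) {}               // 4. wait for go-ahead
    // ...continue normal startup...
}
```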

> Can this be done before restarting core1, where it could still be reading from the queue?
>
> I'd imagined we'd do all such maintenance (resetting queue state, in particular freeing memory) after resetting core1, where it would wait for core0 to signal a condition. The flow would look like this:
>
> 1. core0 resets core1.
> 2. core0 waits for an "I am starting up" flag to be set by core1.
> 3. Meanwhile, core1 does any cache/… setup, and then sets the flag.
> 4. core1 now waits until core0 sets a second flag, giving the go-ahead for startup.
> 5. core0, after having read the flag, now knows that core1 isn't doing anything, and all memory writes from before reset are visible. It can reset all the queue pointers, etc. safely now (in particular, "unsafely" free any messages not consumed by core1's previous incarnation).
> 6. core0 sets the second flag, sending core1 on its merry way.
>
> This way, any cache coherence/… weirdness during startup is neatly encapsulated away, and things are easy to reason about – during step 5, core0 is effectively a single-threaded program with exclusive access to all shared resources.

Currently we are not actually doing a core reset; we just trigger an interrupt and jump to the main function on core1. The cache maintenance operations are not needed, and cache coherency is provided by the SCU.

And yes, we should probably add an init routine to clean up the memory instead of modifying the queue while core1 might still be reading from it.
My previous solution could cause a race condition if we drop the memory while core1 is copying from it.

A simpler routine might be:

  1. Core 0 acquires the startup mutex.
  2. Core 0 sends an interrupt asking core1 to restart.
  3. Core 0 waits for the core1 restart flag.
  4. Core 1 restarts and sets the restart flag.
  5. Core 1 acquires (waits for) the startup mutex.
  6. Core 0 cleans up the message queue.
  7. Core 0 releases the startup mutex.
  8. Core 1 continues its startup, e.g. cleaning up the static state.

Most of this is done now; we only have to add the startup mutex and do some cleanup in core0.
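For illustration, roughly how that routine could be expressed, with a bare spinlock standing in for the startup mutex (all names and helper functions are placeholders, not the actual artiq-zynq symbols):

```rust
// Sketch of the startup-mutex routine above. Illustrative only.
use std::sync::atomic::{AtomicBool, Ordering};

static STARTUP_MUTEX: AtomicBool = AtomicBool::new(false); // false = unlocked
static RESTART_FLAG: AtomicBool = AtomicBool::new(false);

fn lock(m: &AtomicBool) {
    while m
        .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
        .is_err()
    {}
}
fn unlock(m: &AtomicBool) {
    m.store(false, Ordering::Release);
}

fn send_restart_interrupt() { /* platform-specific IRQ to core1 */ }
unsafe fn clean_up_message_queue() { /* free unread messages, reset indices */ }

fn core0_restart_core1() {
    lock(&STARTUP_MUTEX);                                // 1. acquire startup mutex
    send_restart_interrupt();                            // 2. ask core1 to restart
    while !RESTART_FLAG.swap(false, Ordering::SeqCst) {} // 3. wait for restart flag
    unsafe { clean_up_message_queue() };                 // 6. exclusive access here
    unlock(&STARTUP_MUTEX);                              // 7. let core1 proceed
}

fn core1_main() {
    RESTART_FLAG.store(true, Ordering::SeqCst);          // 4. announce the restart
    lock(&STARTUP_MUTEX);                                // 5. blocks until step 7
    unlock(&STARTUP_MUTEX);
    // 8. continue startup: reinitialize static state, etc.
}
```

Acquiring the mutex before sending the restart interrupt is what guarantees core1 cannot get past step 5 until the queue cleanup in step 6 has finished.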

Ah, yes, I haven't yet had a look at the latest reset implementation changes.

Your routine is as I described, only making the "reset" of core0's flag explicit (acquiring the mutex) – it should definitely work.

sb10q closed this issue 2020-08-06 10:05:45 +08:00