Scheduler strict priority option #1306

Open
opened 2026-01-18 19:00:34 +08:00 by bradbqc · 6 comments

Migrated from GitHub: #1546 (https://github.com/m-labs/artiq/issues/1546)


ARTIQ Feature Request

Problem this request addresses

Currently, the scheduler only looks at prepared experiments when deciding which one to run() next. While this behavior makes sense for maximizing the use of the core device in wall-clock time, it doesn't guarantee strict priority enforcement across all experiments in the pipeline (i.e. including experiments that may still be pending or preparing).

Example scenario:
Experiments A and B are scheduled with RIDs 1 and 2 respectively, and with the same priority, let's say priority = 0. Experiment A will prepare, then run, and B will prepare while A is running. Suppose now another experiment, C, is submitted with priority = 1. C will take precedence over B (which I'm assuming is at prepare_done now) and start preparing, but if A finishes running before C finishes preparing, then B will run before C even though it has a lower priority. This example is somewhat of an edge case, but it is the simplest demonstration of this possibly undesired behavior; there are more realistic cases in which this could occur. It has become an issue for us as we've started to create experiments that submit other (higher-priority) experiments while they're running.
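To make the scenario concrete, here is a minimal toy model of the selection step. It is not ARTIQ's scheduler code: the `Run` class, the `pick_next` function and the status strings are hypothetical names chosen for illustration, due dates are ignored, and the tie-break (lower RID first among equal priorities) is an assumption based on the description above.

```python
from dataclasses import dataclass


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # "pending", "preparing", "prepare_done" or "running"


def pick_next(runs):
    """Toy model: only runs that have finished preparing are candidates."""
    ready = [r for r in runs if r.status == "prepare_done"]
    # Higher priority wins; among equal priorities the lower RID wins.
    return max(ready, key=lambda r: (r.priority, -r.rid), default=None)


# State at the moment A (RID 1) finishes running:
b = Run(rid=2, priority=0, status="prepare_done")
c = Run(rid=3, priority=1, status="preparing")

chosen = pick_next([b, c])
print(chosen.rid)  # 2: B runs ahead of the higher-priority C
```

Because C is filtered out before priorities are compared, its higher priority never enters the decision.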

Describe the solution you'd like

IMO the most obvious/intuitive, but also probably the most intrusive, solution would be to add an optional flag when starting the scheduler, called strict_priority or something to that effect (set to False by default, of course, so as not to silently change the scheduler behavior). If the flag is True, then when the scheduler decides what to run next, it will look at pending/preparing experiments in addition to prepare_done ones. If there is an experiment in the pipeline that would take precedence over every prepare_done experiment, the scheduler will wait for that experiment to become ready to run.
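A rough sketch of how such a flag could change the selection step, again as a toy model rather than a patch against the real scheduler. `Run`, `pick_next`, `sort_key` and `IN_PIPELINE` are hypothetical names; only the `strict_priority` flag name comes from the proposal itself.

```python
from dataclasses import dataclass

# Stages that precede run(); the names follow the statuses used in this issue.
IN_PIPELINE = ("pending", "preparing", "prepare_done")


@dataclass
class Run:
    rid: int
    priority: int
    status: str


def sort_key(run):
    # Higher priority first; among equal priorities the lower RID first.
    return (run.priority, -run.rid)


def pick_next(runs, strict_priority=False):
    """Return the run to start now, or None if nothing should start yet."""
    if strict_priority:
        candidates = [r for r in runs if r.status in IN_PIPELINE]
    else:
        candidates = [r for r in runs if r.status == "prepare_done"]
    best = max(candidates, key=sort_key, default=None)
    if best is not None and best.status != "prepare_done":
        return None  # the best run is not ready yet, so wait for it
    return best


b = Run(rid=2, priority=0, status="prepare_done")
c = Run(rid=3, priority=1, status="preparing")

default_choice = pick_next([b, c])                      # B, as today
strict_choice = pick_next([b, c], strict_priority=True)  # None: wait for C
c.status = "prepare_done"
after_prepare = pick_next([b, c], strict_priority=True)  # C
```

The cost is visible in the `None` case: the core device sits idle while the higher-priority run finishes preparing.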

Another option would be to modify the behavior of the flush flag. The current behavior actually might be considered a bug - there isn't much documentation on the flush flag so I'm not sure exactly what the intended behavior is. Currently, once an experiment enters the flushing "stage", it prevents any experiments behind it in the pipeline (even experiments with the same priority, but a higher RID) from preparing (and thus from running). That also includes higher priority experiments that are submitted after the first experiment enters the flushing stage. My proposed change would make the flushing stage non-blocking, i.e. stop it from preventing same/higher priority experiments from entering the prepare stage. How this relates to strict priority scheduling: if the user were to set flush=True for all experiments (or at least all experiments they want to guarantee strict scheduling for), then this non-blocking behavior would make it so that experiments which are submitted while another experiment is running would all accumulate in a sort of "queue" of flushing experiments, and then once the first experiment finished running they would prepare, and subsequently run, in strict priority order.
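One possible reading of the difference between the current and the proposed flush behavior, as a toy predicate. This is a sketch under stated assumptions, not the actual implementation: `Run`, `may_enter_prepare_stage` and `non_blocking_flush` are hypothetical names, and the rule that lower-priority runs stay blocked in the proposed mode is inferred from the wording "same/higher priority" above.

```python
from dataclasses import dataclass


@dataclass
class Run:
    rid: int
    priority: int
    status: str  # e.g. "pending", "flushing", "preparing"


def may_enter_prepare_stage(candidate, runs, non_blocking_flush=False):
    """Toy rule for whether `candidate` may move on while others are flushing."""
    for other in runs:
        if other is candidate or other.status != "flushing":
            continue
        if not non_blocking_flush:
            return False  # current behavior: any flushing run blocks everyone
        if candidate.priority < other.priority:
            return False  # proposed: only lower-priority runs stay blocked
    return True


flushing = Run(rid=2, priority=0, status="flushing")
higher = Run(rid=3, priority=1, status="pending")
lower = Run(rid=4, priority=-1, status="pending")
runs = [flushing, higher, lower]

blocked_today = may_enter_prepare_stage(higher, runs)
allowed_proposed = may_enter_prepare_stage(higher, runs, non_blocking_flush=True)
lower_proposed = may_enter_prepare_stage(lower, runs, non_blocking_flush=True)
```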

Additional context

While I did say that adding a flag to the scheduler seemed like the most intuitive option to me, I think the best solution in terms of efficacy and minimizing changes to the scheduler would be to change/fix the flushing behavior. It seems unlikely to me that many users (if any) are depending on the current behavior, although if I'm wrong about that then of course I would reconsider my opinion.


Thanks for posting this, @b-bondurant. I've seen similar issues locally at UMD. @sbourdeauducq @dnadlinger.

Possibly related: 966ed5d013 by @dnadlinger.

Contributor

My referenced commit shouldn't be related, as it only fixed cases where runs were mistakenly not prepared at all (whereas here, the issue is with the intended priority semantics of the scheduler).

Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs.

I wonder whether flush is actually in use (perhaps at NIST)? I've been avoiding thinking about changing its behaviour for exactly the reasons you mention: it's badly documented, and we aren't actually using it at all.

Author

> Another, very simple solution would be to add a mode in which the prepare phase is skipped entirely, and prepare() is just called when the experiment runs.

Yeah, that sounds very similar to the behavior I was describing, but more explicit than using the flush flag which is nice.

One characteristic that both methods share, though, is the subversion of the pipelining. In the scenario I'm running into, it's really just the run phase that I care about running in strict priority order, so there really isn't any need to prevent the rest of the pipeline from operating the way it currently does. However, for completely strict priority order (i.e. including the prepare phase), something like what you're suggesting seems necessary. And we might even consider including the analyze phase as well, effectively removing all pipelining from the scheduler.


In the original design discussions for ARTIQ, the purpose of flush was to ensure that no experiments were prepared during the run of the preceding experiment, for example if you want to guarantee that dataset values modified by the running experiment were fully updated before any subsequent experiments pulled their values in their prepare() stage. This issue of experiments being prepared with old values of datasets before the preceding experiments can finish updating them is perhaps less important now, with the ability to store some values on the core device that persist across kernels, but in general it is handy. There is a time cost for the loss of pipelining, of course. At the time, we were not really considering the case that @b-bondurant is describing, which is certainly valid. But hopefully this sheds some light on the rationale for the current flush behavior.
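The stale-dataset problem that flush guards against can be illustrated with a small toy example. Everything here is hypothetical (the `Calibrate` and `Measure` classes, the `trap_frequency` key, and the plain dict standing in for the dataset store), and the ordering of calls only mimics what pipelining versus flushing would produce.

```python
class Calibrate:
    def __init__(self, datasets):
        self.datasets = datasets

    def run(self):
        # The running experiment updates a dataset value.
        self.datasets["trap_frequency"] = 1.05


class Measure:
    def __init__(self, datasets):
        self.datasets = datasets

    def prepare(self):
        # The value is read once, at prepare time.
        self.frequency = self.datasets["trap_frequency"]

    def run(self):
        return self.frequency


def execute(flush):
    datasets = {"trap_frequency": 1.00}
    calibrate, measure = Calibrate(datasets), Measure(datasets)
    if flush:
        calibrate.run()    # the preceding run finishes first
        measure.prepare()  # then the next experiment prepares
    else:
        measure.prepare()  # pipelined: prepared before the update lands
        calibrate.run()
    return measure.run()


stale = execute(flush=False)
fresh = execute(flush=True)
print(stale, fresh)  # 1.0 1.05
```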

I think a strict_priority flag for the scheduler (defaulting to False) that considers both experiments that are awaiting prepare and those that have reached prepare_done seems like a reasonable option.


ping @b-bondurant is a strict_priority flag still something that feels important?

Author

@dhslichter oops, sorry for letting this thread die. I developed a workaround (https://gitlab.com/duke-artiq/dax/-/blob/master/dax/util/artiq.py#L492) that we're pretty happy with, although I think it wouldn't actually be relevant for the specific example scenario I described, since it requires an explicit call in order for a lower-priority experiment to give way to higher-priority ones.

In general I think a strict_priority flag in the scheduler itself could still be useful, but afaik it's not something we desperately need anymore. For any experiments that we know might need to be superseded, we can just use the above workaround.