See test case – previously, the highest-priority pending run would
be used to calculate the timeout, rather than the earliest one.
This probably managed to go undetected for so long because any
unrelated change to the pipeline (e.g. a new submission, or an
experiment pausing) would also cause _get_run() to be re-evaluated.
Previously, long-running experiments carried a significant risk of
losing experimental results: any stray exception raised while run()ing
the experiment – for instance, due to an infrequent network glitch or
a hardware reliability issue – would cause no HDF5 file to be written
at all. This was especially troublesome because long experiments
suffer from a higher probability of unanticipated failures, while at
the same time being more costly to retake in terms of wall-clock
time.
Unanticipated uncaught exceptions like that were enough of an issue
that several Oxford codebases had come up with their own half-baked
mitigation strategies, from swallowing all exceptions in run() by
convention, to always broadcasting all results to uniquely named
datasets so that the partial results could later be recovered and
written to HDF5 by manually-run recovery experiments.
This commit addresses the problem at its source, changing the worker
behaviour such that an HDF5 file is always written as soon as run()
starts.
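As a minimal sketch of the idea (not the actual worker code: the h5py
usage, experiment.get_datasets() and the surrounding names are
assumptions made for illustration), the result file is created before
run() is entered, so whatever has been produced by the time of a
crash still ends up on disk:

    import h5py

    def run_and_write(experiment, output_filename):
        # The HDF5 file exists from the moment run() starts, so a crash
        # inside run() no longer discards everything produced so far.
        with h5py.File(output_filename, "w") as f:
            try:
                experiment.run()
            finally:
                # Write whatever datasets are available, partial or complete.
                for name, value in experiment.get_datasets().items():
                    f[name] = value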
From looking at the code, it wasn't obvious to me that this is
supposed to handle multiple calls to delete(). This is the case,
however, when e.g. Scheduler.delete()ing a run, which is then also
deleted again from AnalyzeStage.
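Making delete() idempotent essentially comes down to a guard like the
following (a sketch only; self._runs and rid are illustrative names,
not the actual attributes):

    def delete(self, rid):
        # A run may be deleted from several places (e.g. by the Scheduler
        # and again from AnalyzeStage); deleting an already-removed run
        # is a no-op rather than an error.
        if rid not in self._runs:
            return
        del self._runs[rid]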
Introduces a watchdog context manager for use in experiment code;
if it times out, the process is terminated with an error. The syntax
is:
    with self.scheduler.watchdog(20*s):
        ...
Watchdog timers are implemented by the master process (and the worker
communicates the necessary information about them) so that they can be
enforced even if the worker crashes. They can be nested arbitrarily.
During yields, all watchdog timers for the yielding worker are
suspended [TODO]. Setting up watchdogs is not supported in kernels;
however, a kernel can be called within watchdog contexts (and
terminating the worker will terminate the kernel [TODO]).
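For example (the method names and durations here are illustrative;
only scheduler.watchdog() itself is part of the API), a tighter
watchdog can be nested around a kernel call inside a longer one:

    with self.scheduler.watchdog(5*60*s):
        self.prepare()               # host-side setup, outer timer only
        with self.scheduler.watchdog(20*s):
            self.run_kernel()        # the kernel runs under both timers

If the inner 20 s budget is exceeded, the worker process is terminated
with an error, as with any other watchdog timeout.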
It is possible to implement a heartbeat mechanism using a watchdog, e.g.:
    for i in range(...):
        with self.scheduler.watchdog(...):
            ...
Crashes/freezes within the iterator or the loop management would not be
detected, but they should be rare enough.