Previously, a significant risk of losing experimental results would
be associated with long-running experiments, as any stray exceptions
while run()ing the experiment – for instance, due to infrequent
network glitches or hardware reliability issue – would cause no
HDF5 file to be written. This was especially troublesome as long
experiments would suffer from a higher probability of unanticipated
failures, while at the same time being more costly to re-take in
terms of wall-clock time.
Unanticipated uncaught exceptions like that were enough of an issue
that several Oxford codebases had come up with their own half-baked
mitigation strategies, from swallowing all exceptions in run() by
convention, to always broadcasting all results to uniquely named
datasets such that the partial results could be recovered and written
to HDF5 by manually run recovery experiments.
This commit addresses the problem at its source, changing the worker
behaviour such that an HDF5 file is always written as soon as run()
starts.
From looking at the code, it wasn't obvious to me that this is
supposed to handle multiple calls to delete(). This is the case,
however, when for instance Scheduler.delete()ing a run, which
will then also be deleted again from AnalyzeStage.
This helps debugging the cause of TypeErrors arising from types
not handled by the HDF5 serializer, as the backtrace doesn't
otherwise include any useful information.
The docstrings are quite minimal still, but should already
help with navigating the different layers when getting
accustomed with the code base.
RIDCounter was moved to its own module, as it isn't really
related to the other classes (used from master only).
In larger experiments, it is quite natural for the same dataset
to be read from multiple unrelated components. The only situation
where multiple reads from an archived dataset are problematic is
when the valeu actually changes between reads. Hence, this commit
restricts the warning to the latter situation.
* just returning `None` as dummy device (like ExamineDeviceMgr)
is not explicit enough, certainly hard to debug
* introducing a special flag for the `build` action does not
seem the right place
This only makes a difference on POSIX. It prevents subprocesses
from receiving the signals that the parent receives. For ctlmgr
and master is cuts down on spam on the console (KeyboardInterrupt
tracebacks from all controllers) and enforces that proper
termination is followed.
This does not help if the parent gets SIGKILL (subprocesses
may linger).
* master: (44 commits)
Revert "conda: restrict binutils-or1k-linux dependency to linux."
manual/installing: refresh
use https for m-labs.hk
gui/log: top cell alignment
master/log: do not break lines
conda: fix pyqt package name
gui/applets: log warning if IPC address not in command
applets: make sure pyqtgraph imports qt5
applets: avoid argparse subparser mess
examples/histogram: artiq -> artiq.experiment
gui/applets: save dock UID in state
setup.py: give up trying to check for PyQt
setup.py: fix PyQt5 package name
Use Qt5
applets: fix error message text
applets: handle dataset mutations
applets: properly name docks to support state save/restore
applets: clean shutdown
protocols/pyon: set support
protocols/pyon: remove FlatFileDB
...