Commit Graph

52 Commits

Author SHA1 Message Date
David Nadlinger dd928fc014 master: Fixup 32db6ff978 (argument_ui support)
This was lost in the ndscan diff upstreaming process
due to other Oxford-local changes in artiq.master.worker.
2022-06-19 11:33:40 +01:00
Sebastien Bourdeauducq 80d412a8bf support submitting experiments by content 2022-03-20 12:58:55 +08:00
David Nadlinger 7955b63b00 master: Always write results to HDF5 once run stage is reached
Previously, a significant risk of losing experimental results would
be associated with long-running experiments, as any stray exceptions
while run()ing the experiment – for instance, due to infrequent
network glitches or hardware reliability issues – would cause no
HDF5 file to be written. This was especially troublesome as long
experiments would suffer from a higher probability of unanticipated
failures, while at the same time being more costly to re-take in
terms of wall-clock time.

Unanticipated uncaught exceptions like that were enough of an issue
that several Oxford codebases had come up with their own half-baked
mitigation strategies, from swallowing all exceptions in run() by
convention, to always broadcasting all results to uniquely named
datasets such that the partial results could be recovered and written
to HDF5 by manually run recovery experiments.

This commit addresses the problem at its source, changing the worker
behaviour such that an HDF5 file is always written as soon as run()
starts.
2020-06-18 17:47:26 +01:00
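As a rough illustration of the behaviour described in 7955b63b00 above, the sketch below uses try/finally so that a write step still runs when run() raises. The names run_then_always_write, run and write_results are hypothetical and are not taken from artiq.master.worker; the actual worker change may be structured differently.

```python
def run_then_always_write(run, write_results):
    """Call run(), then call write_results() whether or not run() raised.

    Both arguments are hypothetical callables standing in for the
    experiment's run stage and the HDF5 write step.
    """
    try:
        run()
    finally:
        # Executed on success and on failure. Any exception raised by
        # run() still propagates to the caller after the results are written.
        write_results()
```

With this shape, a stray exception in a long-running experiment is still reported, but whatever results exist at that point are written first.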
Sebastien Bourdeauducq 3fd6962bd2 use sipyco (#585) 2019-11-10 15:55:17 +08:00
Chris Ballance 8659c769cb master/language: add methods to set experiment pipeline/priority/flush defaults 2019-03-12 10:54:15 +01:00
David Nadlinger 0dab7ecd73 master: Include RID in worker exception messages
This helps when debugging unexpected shutdown problems
after the fact.
2019-01-20 19:45:50 +00:00
Sebastien Bourdeauducq 387688354c master: optimize repository scan, closes #546 2016-09-09 19:19:01 +08:00
Sebastien Bourdeauducq 4c8a8357b0 worker: increase send_timeout (Windows can be really slow) 2016-07-03 12:18:34 +08:00
Sebastien Bourdeauducq aa61c29efb transfer Python builtin exceptions over pc_rpc and master/worker 2016-04-04 22:02:42 +08:00
Robert Jördens fef72506e4 ctlmgr/gui/master: start subprocesses in new pgroup
This only makes a difference on POSIX. It prevents subprocesses
from receiving the signals that the parent receives. For ctlmgr
and master it cuts down on spam on the console (KeyboardInterrupt
tracebacks from all controllers) and enforces that proper
termination is followed.

This does not help if the parent gets SIGKILL (subprocesses
may linger).
2016-02-18 23:51:12 +01:00
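The effect described in fef72506e4 above can be sketched with the standard library. This is an assumption-laden example, not the code from that commit: it uses subprocess.Popen with start_new_session=True, which calls setsid() in the child on POSIX and thereby also places it in a new process group. The helper name child_process_group is hypothetical.

```python
import os
import subprocess
import sys


def child_process_group():
    """Spawn a child in its own session and return the child's process group ID.

    POSIX only: os.getpgid() is not available on Windows.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", "import os; print(os.getpgid(0))"],
        stdout=subprocess.PIPE,
        start_new_session=True,  # setsid() in the child: new session, new pgroup
    )
    out, _ = proc.communicate()
    return int(out.decode().strip())
```

Because the child is in a different process group, a Ctrl-C typed at the parent's terminal is delivered to the parent's foreground group only, so the parent has to terminate the child explicitly.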
Sebastien Bourdeauducq 155c2ec2ef ctlmgr,worker: set PYTHONUNBUFFERED for subprocesses 2016-02-18 12:41:08 +01:00
Sebastien Bourdeauducq 6196aaf2f5 master/worker: increase timeouts. Windows VMs can be really slow. 2016-02-16 09:44:50 +01:00
Robert Jördens 53e5d0a7bb worker: flake8 style cleanup 2016-02-02 15:32:40 -07:00
Robert Jördens 55006119c8 subprocesses: unify termination logic 2016-02-02 15:32:36 -07:00
Sebastien Bourdeauducq 5076c85ed6 worker: Windows VMs are slow, increase send_timeout 2016-01-27 21:15:22 +01:00
Sebastien Bourdeauducq 5aa4de8e89 refactor logging and implement in worker 2016-01-26 20:31:42 +01:00
Sebastien Bourdeauducq a583a923d8 worker: use pipe_ipc (no log) 2016-01-26 14:59:36 +01:00
Sebastien Bourdeauducq ae19f1c75d master: add filename in worker log entries. Closes #226 2016-01-23 21:43:24 -05:00
Sebastien Bourdeauducq cc6b808bf8 master: finer control of worker exception reporting. Closes #233 2016-01-23 21:23:02 -05:00
whitequark 6bf48e60ba worker: make parent errors readable in log. 2016-01-16 02:06:40 +00:00
Sebastien Bourdeauducq 8467013160 master,gui: support recomputation+reset of arguments 2015-12-06 17:27:15 +08:00
Sebastien Bourdeauducq 32c95f24d0 worker: reduce some logging levels 2015-10-29 09:34:41 +08:00
Sebastien Bourdeauducq 0d53f7ab0d ignore ProcessLookupError when killing subprocesses. Closes #167 2015-10-28 20:57:28 +08:00
Sebastien Bourdeauducq 1ada15ae5d master: simplify worker/parent RPC 2015-10-28 17:35:57 +08:00
Sebastien Bourdeauducq d13b368a65 build logging into worker 2015-10-20 18:11:50 +08:00
Sebastien Bourdeauducq 1d14975bd5 worker: cleaner termination on exception in user code, improve unittest 2015-10-13 01:11:57 +08:00
Sebastien Bourdeauducq 139072d402 Graceful experiment termination. Closes #76 2015-10-06 13:50:00 +08:00
Sebastien Bourdeauducq b3584bc190 language,master,run: support raw access to DDB from experiments. Closes #123 2015-10-04 18:29:39 +08:00
Sebastien Bourdeauducq f552d62b69 use Python 3.5 coroutines 2015-10-03 19:28:57 +08:00
Sebastien Bourdeauducq 125503139e remove workaround for Python bug in asyncio process.wait(). Requires Python 3.5. Closes #58 2015-10-03 14:33:18 +08:00
Sebastien Bourdeauducq 7ed8fe57fa Git support 2015-08-07 15:51:56 +08:00
Sebastien Bourdeauducq 8402f1cdcd master,gui: basic log support 2015-07-22 05:13:50 +08:00
Sebastien Bourdeauducq 9ed4dcd7d1 repository: load experiments in worker, list arguments 2015-07-15 10:54:44 +02:00
Sebastien Bourdeauducq 7770ab64f2 worker: factor timeouts 2015-07-14 23:43:08 +02:00
Sebastien Bourdeauducq 96a5d73c81 worker: split build stage from prepare 2015-07-09 13:18:12 +02:00
Sebastien Bourdeauducq c71fe29792 simplify unit system and use floats by default 2015-06-26 16:34:37 +02:00
Sebastien Bourdeauducq a6a476593e worker: wait for process termination
This prevents stray SIGCHLDs from crashing the program e.g. if the asyncio event loop is closed before the process actually terminates.
2015-06-05 00:37:26 +08:00
Sebastien Bourdeauducq c843c353d7 worker: remove useless process wait 2015-06-05 00:05:38 +08:00
Yann Sionneau 60bdf74137 tests: use try/finally to close event loop + wait for process to die after killing it 2015-06-04 13:40:13 +02:00
Sebastien Bourdeauducq 78f9268277 worker: add note about correct use of close() 2015-06-04 11:30:34 +08:00
Sebastien Bourdeauducq fc449509b8 scheduler: pass priority to experiments 2015-05-24 20:37:47 +08:00
Sebastien Bourdeauducq b74b8d5826 Scheduling TNG 2015-05-17 16:11:00 +08:00
Sebastien Bourdeauducq 43a05c783d worker: split write_results action 2015-03-11 19:06:46 +01:00
Sebastien Bourdeauducq d5795fd619 master: watchdog support
Introduces a watchdog context manager to use in the experiment code that
terminates the process with an error if it times out. The syntax is:

with self.scheduler.watchdog(20*s):
    ...

Watchdog timers are implemented by the master process (and the worker
communicates the necessary information about them) so that they can be
enforced even if the worker crashes. They can be nested arbitrarily.
During yields, all watchdog timers for the yielding worker are
suspended [TODO]. Setting up watchdogs is not supported in kernels;
however, a kernel can be called within watchdog contexts (and terminating
the worker will terminate the kernel [TODO]).

It is possible to implement a heartbeat mechanism using a watchdog, e.g.:

for i in range(...):
    with self.scheduler.watchdog(...):
        ...

Crashes/freezes within the iterator or the loop management would not be
detected, but they should be rare enough.
2015-03-11 16:43:14 +01:00
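The watchdog and heartbeat pattern from d5795fd619 above can be approximated in a single process as follows. This is only a sketch under stated assumptions: the commit implements the timers in the master process so they survive a worker crash, whereas this version uses threading.Timer locally and calls a hypothetical on_expire callback instead of terminating a process. The names watchdog, run_steps and on_expire are illustrative and are not the scheduler API.

```python
import threading
from contextlib import contextmanager


@contextmanager
def watchdog(timeout, on_expire):
    """Call on_expire() if the with-block runs longer than timeout seconds."""
    timer = threading.Timer(timeout, on_expire)
    timer.daemon = True  # do not keep the interpreter alive for the timer
    timer.start()
    try:
        yield
    finally:
        # Disarm the timer on exit; this is a no-op if it has already fired.
        timer.cancel()


def run_steps(steps, timeout, on_expire):
    """Heartbeat pattern: each step gets its own fresh watchdog."""
    for step in steps:
        with watchdog(timeout, on_expire):
            step()
```

Contexts created this way can be nested, since each one owns an independent timer. As in the commit message, time spent between iterations is not covered by any watchdog.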
Sebastien Bourdeauducq f2134fa4b2 master,worker: split prepare/run/analyze 2015-03-09 23:34:09 +01:00
Sebastien Bourdeauducq 4c280d5fcc master: use a new worker process for each experiment 2015-03-09 16:22:41 +01:00
Sebastien Bourdeauducq ec1d082730 remove timeout from run_params (to be replaced by a better mechanism) 2015-03-09 10:51:32 +01:00
Sebastien Bourdeauducq cc172699ea master: use RID + unit class name for HDF5 filenames 2015-02-20 14:11:55 -07:00
Sebastien Bourdeauducq 4d21b78314 master,client,gui: factor timeout into run_params 2015-02-19 20:03:55 -07:00
Sebastien Bourdeauducq c69c4d5ce9 master: expose scheduler API to experiments 2015-02-19 12:09:11 -07:00