Repository scans sometimes skip non-problematic experiments #1707
Bug Report
One-Line Summary
Currently, when we rescan the experiment repository, ARTIQ master sometimes skips non-problematic experiments.
Issue Details
See other instances of the issue on the m-labs forum:
When an experiment is skipped, the worker throws a WorkerError (“Worker ended while attempting to receive data (RID scan)”).
Can happen multiple times within a single repo scan - more likely to happen the more experiments there are.
Which experiments are skipped depends on what files are inside the repo, but it seems to be deterministic (i.e. two scans will skip the same experiments if nothing is changed).
This issue seems to happen only when repo scans take over ~20s, at which point the WorkerErrors get thrown every ~20s.
Straightforward workaround is to put a “dummy” experiment immediately before (alphabetically) the skipped experiment, though this gets a bit silly.
Expected Behavior
ARTIQ master is able to rescan the repo and process all valid (i.e. no underlying problem) experiments.
Actual (undesired) Behavior
Experiments get skipped, often several per repo scan, though the same experiment files are skipped each time as long as the underlying experiment repo stays the same.
The errors are roughly 20s apart (about 18s-19s by eye), which is roughly consistent with the timeout in Worker.examine (i.e. timeout=20).
Attached image shows the dashboard log when experiments get skipped - here, ARTIQ master skips 2 experiments.

Your System (omit irrelevant parts)
Maybe it actually takes over 20s to examine one experiment? Are you running an antivirus or other bloatware that might further slow down Windows 10? Are you importing heavy packages or running slow code at the top level in these experiments?
Thanks for the quick response!
My feeling is that this isn't the case, since the skipped experiments seem to mostly depend on the "ordering" in the exp repo (putting in a dummy experiment immediately before the target exp makes the error "skip"); if it was just a single experiment, I'd imagine it'd be the same experiment each time?
The skipped experiments are also sometimes very low complexity/overhead experiments; e.g. a "fastino set" experiment that just sets a voltage on a fastino.
We try not to run significant bloat and use relatively beefy lab computers; the Windows task manager doesn't show any significant resource consumption.
The issue is also consistent; it happens regardless of whether other software is open at all.
We generally don't have any top-level code in our experiments beyond imports - we use the "--experiment-subdir" flag to artiq_master with a directory that contains only actual experiment files.
We try not to have heavy imports beyond artiq/numpy/scipy etc., and try to import only specific modules where possible.
For some background, we have 63 experiments in our experiment subdirectory.
Well, it's not necessary to use feelings; logging (e.g. running the master in debug mode) would provide a more definitive answer. Add the timestamps patch https://github.com/m-labs/artiq/pull/2795 to make things even clearer.
Of course! I'll test that out when the experiment is free in a few hours.
Scanning doesn't use hardware, so you can also just copy the files and run them on another computer.
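One way to check the "does a single experiment take over 20s?" question on a copy of the files is to time how long each file takes to import. The following is only a rough sketch: it is not ARTIQ's scanning code, the function name time_imports and the module-name prefix are made up for illustration, and it assumes the files can be imported on the test machine (i.e. artiq and any other dependencies are installed there).

```python
import importlib.util
import time
from pathlib import Path


def time_imports(repo_dir):
    """Import every .py file under repo_dir in sorted order and time each one.

    Returns a list of (relative_path, seconds, error_or_None) tuples.
    This only approximates the cost of loading a file; it does not
    reproduce what the ARTIQ worker does when it examines an experiment.
    """
    repo = Path(repo_dir)
    results = []
    for path in sorted(repo.rglob("*.py")):
        rel = path.relative_to(repo)
        # Hypothetical throwaway module name, unique per file.
        mod_name = "_scan_probe_" + "_".join(rel.with_suffix("").parts)
        spec = importlib.util.spec_from_file_location(mod_name, path)
        module = importlib.util.module_from_spec(spec)
        error = None
        start = time.monotonic()
        try:
            spec.loader.exec_module(module)
        except Exception as exc:  # record the failure and keep going
            error = repr(exc)
        elapsed = time.monotonic() - start
        results.append((rel.as_posix(), elapsed, error))
    return results
```

Calling time_imports("path/to/copied/experiments") and printing the tuples shows whether any single file comes close to 20s. Packages such as numpy are cached after the first import within one process, so the first file that uses them will look slower than the rest.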
These are screenshots of the timestamped logs during a repo rescan, which show that the worker is able to work its way through the repo, rather than being blocked by a single exp.
@clayton-ho Are the failing experiments consistently the same across scans, or is it random? Have you tried removing the failing experiments to isolate the issue? If a specific experiment is causing the problem, could you share its contents to help reproduce the issue?
@fsagbuya
Thanks for checking in!
The failing experiments are the same between scans if nothing is changed in the repo (i.e. two clicks of "Scan repository" will give the same results).
If the repository is rearranged (e.g. the problematic files are moved to a different subdirectory, or renamed so they appear earlier in the exp scan list), then a different file is skipped instead - generally whichever file takes its "place in line."
The skipped exps seem to consistently happen at ~20s intervals.
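To put a number on the "~20s intervals" rather than judging by eye, the gaps between the WorkerError timestamps can be computed from the timestamped log. This is a minimal sketch: the function names error_gaps and near_timeout are hypothetical, the timestamp format is an assumption that has to be adjusted to whatever the log actually prints, and the 20s value is the Worker.examine timeout mentioned in the report.

```python
from datetime import datetime


def error_gaps(timestamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the gaps in seconds between consecutive error timestamps.

    fmt is an assumed format; change it to match the real log lines.
    """
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


def near_timeout(gaps, timeout_s=20.0, tolerance_s=2.0):
    """Flag each gap that lies within tolerance_s of timeout_s."""
    return [abs(g - timeout_s) <= tolerance_s for g in gaps]
```

If nearly every gap is flagged, that supports the idea that the errors are tied to the timeout rather than to the particular experiment files.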
We also have quite long repository scan times here (>20 s definitely), but never ran into this issue (unless it was introduced in the last couple of months). Maybe this is some sort of Windows-specific issue with the way IPC is used?