Repository scans sometimes skip non-problematic experiments #1707

Open
opened 2026-01-18 19:06:18 +08:00 by clayton-ho · 9 comments
clayton-ho commented 2026-01-18 19:06:18 +08:00

Migrated from GitHub: #2799





Bug Report

One-Line Summary

Currently, when we rescan the experiment repository, ARTIQ master sometimes skips non-problematic experiments.

Issue Details

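See other instances of the issue on the m-labs forum:

  • https://forum.m-labs.hk/d/786-scan-repository-error
  • https://forum.m-labs.hk/d/690-workererror-when-scanning-repository-head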

When an experiment is skipped, the worker throws a WorkerError (“Worker ended while attempting to receive data (RID scan)”).
This can happen multiple times within a single repo scan, and is more likely the more experiments there are.
Which experiments are skipped depends on what files are inside the repo, but seems to be deterministic (i.e. two scans will skip the same experiments if nothing is changed).
This issue seems to happen only when repo scans take over ~20s, at which point a WorkerError gets thrown every ~20s.

A straightforward workaround is to put a “dummy” experiment immediately before (alphabetically) the skipped experiment, though this gets a bit silly.

Expected Behavior

ARTIQ master is able to rescan the repo and process all valid (i.e. no underlying problem) experiments.

Actual (undesired) Behavior

Experiments get skipped, often several per repo scan; the same files are skipped each time as long as the underlying experiment repo stays the same.
Errors are generally ~20s apart (~18-19s by eye), which is roughly consistent with the timeout in Worker.examine (i.e. timeout=20).
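The spacing between errors could be measured from a saved log instead of estimated by eye. The following is a minimal sketch, assuming each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp; that format and the helper name `error_intervals` are assumptions for illustration, not the actual ARTIQ log format.

```python
from datetime import datetime


def error_intervals(lines, marker="WorkerError", fmt="%Y-%m-%d %H:%M:%S"):
    """Return the gaps in seconds between consecutive lines containing `marker`.

    Assumes (hypothetically) that the first 19 characters of each line are a
    timestamp in `fmt`; lines that do not parse are ignored.
    """
    stamps = []
    for line in lines:
        if marker not in line:
            continue
        try:
            stamps.append(datetime.strptime(line[:19], fmt))
        except ValueError:
            # No leading timestamp in the assumed format: skip this line.
            continue
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Gaps that cluster tightly around 20 s would support the link to the timeout; widely scattered gaps would point elsewhere.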

Attached image shows the dashboard log when experiments get skipped - here, ARTIQ master skips 2 experiments.
Image: https://github.com/user-attachments/assets/b13551e3-5e02-4dc5-b038-8c4a8e85ca21

Your System (omit irrelevant parts)

  • Operating System: Windows 10
  • ARTIQ version: v8.0+unknown.beta
    • commit: 69c0f81
fsagbuya was assigned by sb10q 2026-01-18 19:06:18 +08:00

Maybe it actually takes over 20s to examine one experiment? Are you running an antivirus or other bloatware that might further slow down Windows 10? Are you importing heavy packages or running slow code at the top level in these experiments?

clayton-ho commented 2026-01-18 19:06:18 +08:00

Thanks for the quick response!

> Maybe it actually takes over 20s to examine one experiment?

My feeling is that this isn't the case, since the skipped experiments seem to mostly depend on the "ordering" in the exp repo (putting in a dummy experiment immediately before the target exp makes the error "skip"); if it was just a single experiment, I'd imagine it'd be the same experiment each time?
The skipped experiments are also sometimes very low complexity/overhead experiments; e.g. a "fastino set" experiment that just sets a voltage on a fastino.

> Are you running an antivirus or other bloatware that might further slow down Windows 10?

We try not to run significant bloat and use relatively beefy lab computers; the Windows task manager doesn't show any significant resource consumption.
The issue is also consistent; it happens regardless of whether other software is open at all.

> Are you importing heavy packages or running slow code at the top level in these experiments?

We generally don't have any top-level code in our experiments beyond imports - we use the "--experiment-subdir" flag to artiq_master with a directory that contains only actual experiment files.
We try not to have heavy imports beyond artiq/numpy/scipy etc., and try to import only specific modules where possible.
For some background, we have 63 experiments in our experiment subdirectory.


> My feeling is that this isn't the case

Well it's not necessary to use feelings, using logging (e.g. run the master in debug mode) would provide a more definitive answer. Add the timestamps patch https://github.com/m-labs/artiq/pull/2795 to make things even clearer.

clayton-ho commented 2026-01-18 19:06:18 +08:00

> > My feeling is that this isn't the case
>
> Well it's not necessary to use feelings, using logging (e.g. run the master in debug mode) would provide a more definitive answer. Add the timestamps patch #2795 to make things even clearer.

Of course! I'll test that out when the experiment is free in a few hours.


Scanning doesn't use hardware, so you can also just copy the files and run them on another computer.
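To check offline whether any single file is slow to load, the per-file import time can be measured outside of ARTIQ. The following is a minimal sketch using only the Python standard library; the helper name `time_imports` is hypothetical, and this is not how the ARTIQ master examines experiments, only a rough way to spot heavy top-level code or imports.

```python
import importlib.util
import time
from pathlib import Path


def time_imports(repo_dir):
    """Import every .py file under repo_dir; return (path, seconds), slowest first."""
    results = []
    for path in sorted(Path(repo_dir).rglob("*.py")):
        # Load each file under a throwaway module name (illustrative choice).
        spec = importlib.util.spec_from_file_location(f"_timed_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        start = time.perf_counter()
        try:
            spec.loader.exec_module(module)
        except Exception as exc:
            # A failing import is reported but still timed.
            print(f"{path}: import failed: {exc!r}")
        results.append((str(path), time.perf_counter() - start))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Calling `time_imports("path/to/experiments")` in an environment where the experiments' dependencies are installed returns the slowest files first. Packages such as numpy are cached after the first import in a process, so the first file that imports them absorbs that cost.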

clayton-ho commented 2026-01-18 19:06:18 +08:00

> Well it's not necessary to use feelings, using logging (e.g. run the master in debug mode) would provide a more definitive answer. Add the timestamps patch #2795 to make things even clearer.

These are screenshots of the timestamped logs during a repo rescan, which show that the worker is able to work its way through the repo, rather than being blocked by a single exp.

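Image: https://github.com/user-attachments/assets/a099b18b-ddc1-43ec-8edd-db3915e39558

Image: https://github.com/user-attachments/assets/3e224b06-04b5-4466-a844-96c898a4b855

Image: https://github.com/user-attachments/assets/498e77c1-7503-4e23-afe7-7cfd9148481f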

Member

@clayton-ho Are the failing experiments consistently the same across scans, or is it random? Have you tried removing the failing experiments to isolate the issue? If a specific experiment is causing the problem, could you share its contents to help reproduce the issue.

clayton-ho commented 2026-01-18 19:06:19 +08:00

@fsagbuya

> @clayton-ho Are the failing experiments consistently the same across scans, or is it random? Have you tried removing the failing experiments to isolate the issue? If a specific experiment is causing the problem, could you share its contents to help reproduce the issue.

Thanks for checking in!
The failing experiments are the same between scans if nothing is changed in the repo (i.e. two clicks of "Scan repository" will give the same results).
If the repository is rearranged (e.g. the problematic files are moved to a different subdirectory, or renamed so they appear earlier in the exp scan list), then a different file is skipped instead - generally whichever file takes its "place in line."
The skipped exps seem to consistently happen at ~20s intervals.
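These observations fit a toy model in which a timer runs over the whole scan rather than per file, and whichever file is being examined when it expires gets skipped. The sketch below is only an illustration of that hypothesis, not ARTIQ's actual implementation; the function `simulate_scan`, the 1.5 s per-file cost, and the timer restarting after each error are all assumptions.

```python
def simulate_scan(files, period=20.0):
    """Toy model: return names of files in progress when a scan-wide timer expires.

    files  -- list of (name, seconds_to_examine) in scan order
    period -- assumed timer length; the timer restarts after each expiry
    """
    skipped = []
    now = 0.0
    deadline = period
    for name, cost in files:
        if now + cost > deadline:
            # The timer expires mid-file: this file is skipped,
            # however cheap it is to examine.
            skipped.append(name)
            now = deadline
            deadline = now + period
        else:
            now += cost
    return skipped


# 63 files at an assumed 1.5 s each, in alphabetical scan order.
repo = [(f"exp{i:02d}", 1.5) for i in range(63)]
print(simulate_scan(repo))   # ['exp13', 'exp27', 'exp41', 'exp55']

# Placing a dummy just before the first skipped file moves the skip onto it.
with_dummy = repo[:13] + [("dummy", 1.5)] + repo[13:]
print(simulate_scan(with_dummy))   # ['dummy', 'exp26', 'exp40', 'exp54']
```

In this model the skipped file depends only on its position in the scan order, which would explain why trivial experiments are affected and why renaming or moving files changes which one is skipped.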

Contributor

We also have quite long repository scan times here (>20 s definitely), but never ran into this issue (unless it was introduced in the last couple of months). Maybe this is some sort of Windows-specific issue with the way IPC is used?
