incorrect log section is intermittently sent #84

Closed
opened 2020-08-04 23:06:06 +08:00 by sb10q · 3 comments

This bug sometimes causes the test_rtio.CoredeviceTest.test_address_collision test to fail:

$ env ARTIQ_ROOT=/home/sb/artiq-zynq/examples python -m unittest test_rtio.CoredeviceTest.test_address_collision
F/nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/unittest/case.py:656: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.1', 45746), raddr=('192.168.1.52', 1380)>
  outcome.errors.clear()
ResourceWarning: Enable tracemalloc to get the object allocation traceback

======================================================================
FAIL: test_address_collision (test_rtio.CoredeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/sb/artiq/artiq/test/coredevice/test_rtio.py", line 512, in test_address_collision
    self.execute_and_test_in_log(AddressCollision, "RTIO collision")
  File "/home/sb/artiq/artiq/test/coredevice/test_rtio.py", line 502, in execute_and_test_in_log
    self.assertIn(string, log)
AssertionError: 'RTIO collision' not found in '[   154.496630s] DEBUG(runtime::kernel::core1): Core1 started\n[   154.502026s] DEBUG(runtime::kernel::core1): FPU enabled on Core1\n[   154.509309s] DEBUG(dyld): ELF target: 69680 bytes, align to 10000, allocated at 00A30000\n[   154.517445s] DEBUG(dyld): Relocating 0 rela, 0 rel, 64 pltrel\n[   154.523233s] DEBUG(runtime::kernel::core1): kernel loaded\n[   154.529002s]  INFO(runtime::kernel::core1): kernel starting\n[   154.534830s]  INFO(runtime::kernel::core1): kernel finished\n'

The collision message was printed on the UART but is not present in the log pulled by the test. It seems that sometimes the core device does not send the latest log messages, but only some earlier ones.

The issue is more frequent when running the whole ARTIQ test suite, and not this test in isolation.

This bug sometimes causes the test_rtio.CoredeviceTest.test_address_collision test to fail: ```text $ env ARTIQ_ROOT=/home/sb/artiq-zynq/examples python -m unittest test_rtio.CoredeviceTest.test_address_collision F/nix/store/r94aa2gj4drkhfvkm2p4ab6cblb6kxlq-python3-3.7.6/lib/python3.7/unittest/case.py:656: ResourceWarning: unclosed <socket.socket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.1.1', 45746), raddr=('192.168.1.52', 1380)> outcome.errors.clear() ResourceWarning: Enable tracemalloc to get the object allocation traceback ====================================================================== FAIL: test_address_collision (test_rtio.CoredeviceTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home/sb/artiq/artiq/test/coredevice/test_rtio.py", line 512, in test_address_collision self.execute_and_test_in_log(AddressCollision, "RTIO collision") File "/home/sb/artiq/artiq/test/coredevice/test_rtio.py", line 502, in execute_and_test_in_log self.assertIn(string, log) AssertionError: 'RTIO collision' not found in '[ 154.496630s] DEBUG(runtime::kernel::core1): Core1 started\n[ 154.502026s] DEBUG(runtime::kernel::core1): FPU enabled on Core1\n[ 154.509309s] DEBUG(dyld): ELF target: 69680 bytes, align to 10000, allocated at 00A30000\n[ 154.517445s] DEBUG(dyld): Relocating 0 rela, 0 rel, 64 pltrel\n[ 154.523233s] DEBUG(runtime::kernel::core1): kernel loaded\n[ 154.529002s] INFO(runtime::kernel::core1): kernel starting\n[ 154.534830s] INFO(runtime::kernel::core1): kernel finished\n' ``` The collision message was printed on the UART but is not present in the log pulled by the test. It seems that sometimes the core device does not send the latest log messages, but only some earlier ones. The issue is more frequent when running the whole ARTIQ test suite, and not this test in isolation.

What is the core device log level? If it is trace, it could be caused by the logger holding the log buffer while flushing the stream.

What is the core device log level? If it is trace, it could be caused by the logger holding the log buffer while flushing the stream.
Poster
Owner

The default, DEBUG.

The default, DEBUG.

It seems that core0 and core1 are both trying to log at the same time.

if let Some(mut buffer) = self.buffer.try_lock() {
    writeln!(buffer, "[{:6}.{:06}s] {:>5}({}): {}", seconds, micros,
        record.level(), record.target(), record.args()).unwrap();
}

Currently, the logger would use try_lock to try to lock the log buffer, and give up if it is currently used by other tasks/core. This is mainly used to discard log when poll log is in operation as it would hold the buffer for a long time.

There are two possible fixes:

  1. Use lock directly and modify mgmt to reduce the time it holds the buffer. In order to reduce the time we hold the buffer, we need to copy the buffer content and drop the mutex guard before sending the log out. We could raise the log level temporarily when we are sending the log to avoid getting trace logs from the networking modules.
  2. Use lock but separate core0 and core1 buffer, and swap the mutex together with the buffer with an empty one when we poll the log. This could have better performance, but we need more static buffer.
It seems that core0 and core1 are both trying to log at the same time. ```rust if let Some(mut buffer) = self.buffer.try_lock() { writeln!(buffer, "[{:6}.{:06}s] {:>5}({}): {}", seconds, micros, record.level(), record.target(), record.args()).unwrap(); } ``` Currently, the logger would use `try_lock` to try to lock the log buffer, and give up if it is currently used by other tasks/core. This is mainly used to discard log when poll log is in operation as it would hold the buffer for a long time. There are two possible fixes: 1. Use `lock` directly and modify `mgmt` to reduce the time it holds the buffer. In order to reduce the time we hold the buffer, we need to copy the buffer content and drop the mutex guard before sending the log out. We could raise the log level temporarily when we are sending the log to avoid getting trace logs from the networking modules. 2. Use `lock` but separate core0 and core1 buffer, and swap the mutex together with the buffer with an empty one when we poll the log. This could have better performance, but we need more static buffer.
sb10q closed this issue 2020-08-05 18:51:27 +08:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: M-Labs/artiq-zynq#84
There is no content yet.