crash when attempting to load kernel on one of the ZC706 #7

Closed
opened 2020-06-05 11:14:06 +08:00 by sb10q · 4 comments

One of the two ZC706 can load and start kernels fine, the other one exhibits XilinxⓇ behavior when the kernel is being loaded (i.e. during the handling of Message::LoadRequest):

[     0.000000s]  INFO(szl): Simple Zynq Loader starting...
[     0.005391s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 1600000000 Hz
[     0.010571s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz
[     0.017788s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1333333332 Hz
[     0.025253s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 666666660/444444440
[     0.032371s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10582010 Hz
[     0.041338s]  INFO(szl): decompressing payload
[     0.089714s]  INFO(szl): executing payload
[     0.000003s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005384s]  INFO(runtime): gateware already loaded
[     0.010342s]  INFO(runtime): detected gateware: Simple
[     0.018920s] DEBUG(runtime::kernel): Core1 started
[     3.690100s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half }
[     7.586649s] DEBUG(runtime::comms): received connection
G�W�
P�����U�2*`���`�V�
������ ���<��< ����
�����
��$�������������������(��M��
                            ����������������������������
                                                        ������

Might be related to #5. I will check what happens when the board is booted with the FSBL and not JTAG.

One of the two ZC706 can load and start kernels fine, the other one exhibits XilinxⓇ behavior when the kernel is being loaded (i.e. during the handling of ``Message::LoadRequest``): ```text [ 0.000000s] INFO(szl): Simple Zynq Loader starting... [ 0.005391s] DEBUG(libboard_zynq::clocks::source): Set ARM_PLL to 1600000000 Hz [ 0.010571s] DEBUG(libboard_zynq::clocks::source): Set IO_PLL to 1000000000 Hz [ 0.017788s] DEBUG(libboard_zynq::clocks::source): Set DDR_PLL to 1333333332 Hz [ 0.025253s] DEBUG(libboard_zynq::ddr): DDR 3x/2x clocks: 666666660/444444440 [ 0.032371s] DEBUG(libboard_zynq::ddr): DDR DCI clock: 10582010 Hz [ 0.041338s] INFO(szl): decompressing payload [ 0.089714s] INFO(szl): executing payload [ 0.000003s] INFO(runtime): NAR3/Zynq7000 starting... [ 0.005384s] INFO(runtime): gateware already loaded [ 0.010342s] INFO(runtime): detected gateware: Simple [ 0.018920s] DEBUG(runtime::kernel): Core1 started [ 3.690100s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half } [ 7.586649s] DEBUG(runtime::comms): received connection G�W� P�����U�2*`���`�V� ������ ���<��< ���� ����� ��$�������������������(��M�� ���������������������������� ������ ``` Might be related to https://git.m-labs.hk/M-Labs/artiq-zynq/issues/5. I will check what happens when the board is booted with the FSBL and not JTAG.
Poster
Owner

As I suspected the problematic board is behaving itself when booted from FSBL:

In FsblHookBeforeHandoff function 

SUCCESSFUL_HANDOFF

FSBL Status = 0x1

[     0.000003s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005449s]  INFO(runtime): gateware already loaded
[     0.010409s]  INFO(runtime): detected gateware: Simple
[     0.019134s] DEBUG(runtime::kernel): Core1 started
[    44.566006s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half }
[    52.234554s] DEBUG(runtime::comms): received connection
[    52.253917s] DEBUG(runtime::kernel): Core1 started
[    52.259057s]  INFO(dyld): ELF target: 12324 bytes, align to 1000, allocated at 00141000
[    52.267284s]  INFO(dyld): Relocating 0 rela, 0 rel, 40 pltrel
[    52.273142s] DEBUG(runtime::kernel): kernel loaded
[    52.278922s] DEBUG(runtime::kernel): kernel starting
[    52.283890s] DEBUG(runtime::kernel): rpc_send_async 1 [110, 58, 110]
[    52.290283s] DEBUG(runtime::kernel): kernel finished
[    52.290349s] DEBUG(runtime::comms): RPC: is_async=true data=[0, 0, 0, 1, 110, 0, 0, 0, 0, 1, 110]

So it's most likely that XilinxⓇ DDR controller acting up again.

As I suspected the problematic board is behaving itself when booted from FSBL: ```text In FsblHookBeforeHandoff function SUCCESSFUL_HANDOFF FSBL Status = 0x1 [ 0.000003s] INFO(runtime): NAR3/Zynq7000 starting... [ 0.005449s] INFO(runtime): gateware already loaded [ 0.010409s] INFO(runtime): detected gateware: Simple [ 0.019134s] DEBUG(runtime::kernel): Core1 started [ 44.566006s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half } [ 52.234554s] DEBUG(runtime::comms): received connection [ 52.253917s] DEBUG(runtime::kernel): Core1 started [ 52.259057s] INFO(dyld): ELF target: 12324 bytes, align to 1000, allocated at 00141000 [ 52.267284s] INFO(dyld): Relocating 0 rela, 0 rel, 40 pltrel [ 52.273142s] DEBUG(runtime::kernel): kernel loaded [ 52.278922s] DEBUG(runtime::kernel): kernel starting [ 52.283890s] DEBUG(runtime::kernel): rpc_send_async 1 [110, 58, 110] [ 52.290283s] DEBUG(runtime::kernel): kernel finished [ 52.290349s] DEBUG(runtime::comms): RPC: is_async=true data=[0, 0, 0, 1, 110, 0, 0, 0, 0, 1, 110] ``` So it's most likely that XilinxⓇ DDR controller acting up again.
Poster
Owner

Strangely enough, ddr.memtest() from SZL passes on that board, but it still crashes during kernel loading immediately afterwards.

1020 MB
1021 MB
1022 MB Ok
[   135.555248s]  INFO(szl): decompressing payload
[   135.605051s]  INFO(szl): executing payload
[     0.000003s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005387s]  INFO(runtime): gateware already loaded
[     0.010346s]  INFO(runtime): detected gateware: Simple
[     0.018994s] DEBUG(runtime::kernel): Core1 started
[     3.718360s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half }
[    16.753441s] DEBUG(runtime::comms): received connection

ataAbort

Strangely enough, ``ddr.memtest()`` from SZL passes on that board, but it still crashes during kernel loading immediately afterwards. ```text 1020 MB 1021 MB 1022 MB Ok [ 135.555248s] INFO(szl): decompressing payload [ 135.605051s] INFO(szl): executing payload [ 0.000003s] INFO(runtime): NAR3/Zynq7000 starting... [ 0.005387s] INFO(runtime): gateware already loaded [ 0.010346s] INFO(runtime): detected gateware: Simple [ 0.018994s] DEBUG(runtime::kernel): Core1 started [ 3.718360s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half } [ 16.753441s] DEBUG(runtime::comms): received connection ataAbort ```
Poster
Owner

This turns out to be quite random and the SZL vs. FSBL difference might be noise/coincidence. We can sometimes run kernels on that board with SZL as well...

This turns out to be quite random and the SZL vs. FSBL difference might be noise/coincidence. We can sometimes run kernels on that board with SZL as well...

Just added some logs to the comms module, it seems that one of the zc706 have problem restarting the core1:

The good one:

[    14.884584s]  INFO(runtime::comms): received connection
[    14.890630s] TRACE(runtime::comms): Load kernel
[    14.943035s] TRACE(runtime::comms): Get kernel control
[    14.948154s] TRACE(runtime::comms): Restart kernel control
[    14.953631s] TRACE(runtime::comms): Send load request
[    14.953899s] DEBUG(runtime::kernel): Core1 started
[    14.958670s] TRACE(runtime::comms): Wait for kernel reply
[    14.963439s] DEBUG(runtime::kernel): FPU enabled on Core1
[    14.976739s] DEBUG(dyld): ELF target: 168020 bytes, align to 10000, allocated at 00180000
[    14.984897s] TRACE(dyld): Program header: 00018D48+00000018 to 00180000
[    14.991494s] TRACE(dyld): Program header: 00000000+00018D60 to 00180000
[    14.999867s] TRACE(dyld): Program header: 00018F40+00000114 to 00180000
[    15.006471s] TRACE(dyld): Program header: 00018F68+00000098 to 00180000
[    15.013072s] TRACE(dyld): Program header: 00000000+00000000 to 00180000
[    15.019668s] TRACE(dyld): Program header: 00018F40+000000C0 to 00180000
[    15.026270s] DEBUG(dyld): Relocating 0 rela, 96 rel, 40 pltrel
[    15.032028s] TRACE(dyld::reloc): resolved symbol "rtio_get_counter"
[    15.038347s] TRACE(dyld::reloc): resolved symbol "rpc_recv"
[    15.043909s] TRACE(dyld::reloc): resolved symbol "rpc_send"
[    15.049478s] TRACE(dyld::reloc): resolved symbol "__artiq_personality"
[    15.055992s] TRACE(dyld::reloc): resolved symbol "rpc_send_async"
[    15.062079s] DEBUG(runtime::kernel): kernel loaded
[    15.067412s] TRACE(runtime::comms): Got kernel reply
[    15.072365s] TRACE(runtime::comms): Kernel load completed
[    15.078498s] TRACE(runtime::comms): Run kernel
[    15.082933s]  INFO(runtime::kernel): kernel starting

The bad one:

[     4.760978s]  INFO(runtime::comms): network addresses: MAC=02-00-00-00-00-53 IPv4=192.168.1.53 IPv6-LL=fe80::ff:fe00:53 IPv6=no configured address
[     4.775206s] DEBUG(runtime::kernel): Core1 started
[     4.779986s] DEBUG(runtime::kernel): FPU enabled on Core1
[     8.430582s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half }
[    18.432174s]  INFO(runtime::comms): received connection
[    18.438220s] TRACE(runtime::comms): Load kernel
[    18.687638s] TRACE(runtime::comms): Get kernel control
[    18.692759s] TRACE(runtime::comms): Restart kernel control
[    18.698236s] TRACE(runtime::comms): Send load request
[    18.703197s] TRACE(runtime::comms): Wait for kernel reply
^C
Just added some logs to the comms module, it seems that one of the zc706 have problem restarting the core1: The good one: ``` [ 14.884584s] INFO(runtime::comms): received connection [ 14.890630s] TRACE(runtime::comms): Load kernel [ 14.943035s] TRACE(runtime::comms): Get kernel control [ 14.948154s] TRACE(runtime::comms): Restart kernel control [ 14.953631s] TRACE(runtime::comms): Send load request [ 14.953899s] DEBUG(runtime::kernel): Core1 started [ 14.958670s] TRACE(runtime::comms): Wait for kernel reply [ 14.963439s] DEBUG(runtime::kernel): FPU enabled on Core1 [ 14.976739s] DEBUG(dyld): ELF target: 168020 bytes, align to 10000, allocated at 00180000 [ 14.984897s] TRACE(dyld): Program header: 00018D48+00000018 to 00180000 [ 14.991494s] TRACE(dyld): Program header: 00000000+00018D60 to 00180000 [ 14.999867s] TRACE(dyld): Program header: 00018F40+00000114 to 00180000 [ 15.006471s] TRACE(dyld): Program header: 00018F68+00000098 to 00180000 [ 15.013072s] TRACE(dyld): Program header: 00000000+00000000 to 00180000 [ 15.019668s] TRACE(dyld): Program header: 00018F40+000000C0 to 00180000 [ 15.026270s] DEBUG(dyld): Relocating 0 rela, 96 rel, 40 pltrel [ 15.032028s] TRACE(dyld::reloc): resolved symbol "rtio_get_counter" [ 15.038347s] TRACE(dyld::reloc): resolved symbol "rpc_recv" [ 15.043909s] TRACE(dyld::reloc): resolved symbol "rpc_send" [ 15.049478s] TRACE(dyld::reloc): resolved symbol "__artiq_personality" [ 15.055992s] TRACE(dyld::reloc): resolved symbol "rpc_send_async" [ 15.062079s] DEBUG(runtime::kernel): kernel loaded [ 15.067412s] TRACE(runtime::comms): Got kernel reply [ 15.072365s] TRACE(runtime::comms): Kernel load completed [ 15.078498s] TRACE(runtime::comms): Run kernel [ 15.082933s] INFO(runtime::kernel): kernel starting ``` The bad one: ``` [ 4.760978s] INFO(runtime::comms): network addresses: MAC=02-00-00-00-00-53 IPv4=192.168.1.53 IPv6-LL=fe80::ff:fe00:53 IPv6=no configured address [ 4.775206s] DEBUG(runtime::kernel): Core1 started [ 4.779986s] DEBUG(runtime::kernel): FPU enabled on Core1 [ 8.430582s] INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Half } [ 18.432174s] INFO(runtime::comms): received connection [ 18.438220s] TRACE(runtime::comms): Load kernel [ 18.687638s] TRACE(runtime::comms): Get kernel control [ 18.692759s] TRACE(runtime::comms): Restart kernel control [ 18.698236s] TRACE(runtime::comms): Send load request [ 18.703197s] TRACE(runtime::comms): Wait for kernel reply ^C ```
sb10q closed this issue 2020-07-07 12:47:22 +08:00
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date
The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: M-Labs/artiq-zynq#7
There is no content yet.