Problems with DMA on Kasli-Soc #196
Reference: M-Labs/artiq-zynq#196
I've been having a play with a Kasli-SoC, and I'm having some trouble with DMA. It seems that whenever I try to play back a DMA recording containing 128 events or more, it hangs. I guess this is to do with the FIFO size. In addition, if I try to play back substantially more events (say 250), have it hang, delete the experiment, and then try a simple experiment that just writes to a channel, I get RTIO destination unreachable errors. I put a print at the start of rtio_csr::output, and csr::rtio::o_status_read was returning 4 immediately. As far as I can tell, that persisted until the Kasli was power cycled.
@sb10q can you guys reproduce this?
The kernel will hang until some events in the FIFO are consumed; then a new event is queued into the FIFO, and this repeats until everything has been pushed into the FIFO. The behavior should be the same as on Kasli. It is not exclusive to DMA either.
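A minimal sketch of the blocking behaviour described above, written as a toy Python model rather than the actual gateware. The FIFO depth of 128, the event counts, and the assumption that each event leaves the FIFO exactly at its timestamp are illustrative; kernel_unblock_time_us is a hypothetical helper, not an ARTIQ function.

```python
from collections import deque


def kernel_unblock_time_us(n_events, period_us, fifo_depth):
    """Return the time (in us) at which the last event is accepted by the FIFO.

    Toy assumptions: event k carries timestamp (k + 1) * period_us, the FIFO
    releases an event exactly at its timestamp, and a write takes zero time.
    """
    fifo = deque()  # timestamps of events currently waiting in the FIFO
    now = 0         # current time as seen by the writer (the kernel)
    for k in range(n_events):
        if len(fifo) == fifo_depth:
            # FIFO full: the writer stalls until the oldest event is released.
            now = max(now, fifo.popleft())
        fifo.append((k + 1) * period_us)
    return now


# 1000 events with an assumed FIFO depth of 128
slow = kernel_unblock_time_us(1000, 100_000, 128)  # 100 ms spacing
fast = kernel_unblock_time_us(1000, 30, 128)       # 30 us spacing
print(slow / 1e6, "s blocked;", fast / 1e3, "ms blocked")
```

In this model the writer is blocked for roughly (n_events - fifo_depth) * period, so how long the kernel appears to hang depends directly on the spacing between events.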
So is this a DRTIO system? I tried Kasli-SoC standalone and the pairing of a Kasli-SoC Master & a Kasli 1.1 satellite. The DMA hangs as usual, but I don't see the error.
This is my experiment code:
I first ran the DMA experiment, and it hung. I then cancelled the experiment from the dashboard and ran the LED experiment.
Did you patch the capacitors on your board? The transceiver power supplies are unstable on the first batch of TS boards and need a capacitor replacement. Without it the DRTIO links fail to establish or are not stable. M-Labs boards should not be affected.
(Sorry been on holiday)
Yes, I understand that. When I say it's hung I mean that I've waited many orders of magnitude longer than I expect the DMA sequence playout to take and it's not finished.
No, there's no DRTIO involved. This is a single Kasli-SoC.
Is this an attempt to reproduce the RTIO unreachable errors? I'm a little bit concerned that I mis-reported the number of RTIO events that I had used to generate that. Would you mind trying with 10,000 events?
We did get them from Technosystems. We're just in the process of getting the capacitors and patching the boards. I'll try reproducing this again once we've done that.
We've now tried this with the boards patched and see the same misbehaviour.
I'm sorry @occheung, I had not read your experiment closely enough before to spot the size of the delays you are using. I understand that the DMA playout blocks until all of the events have been queued in the FIFO. The difference between what I'm doing and what you are doing is that my DMA sequences should be played out in a few ms. E.g. if you change the delays in your experiments from 100ms to 30us, you'd expect the whole thing to play out in ~30ms. That's more representative of the kind of DMA sequence that I'm interested in. And this doesn't complete playout even if you wait many minutes.
I think I should be clear that when I say hanging I mean that it's taking so much longer than I think that it should take that I strongly suspect that it will never finish. In this case the DMA hanging is the primary bug that I'm reporting here, my observations about the unreachable errors are secondary and I mention them because I thought they might be helpful in diagnosing the underlying issue.
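The arithmetic behind "taking so much longer than it should" can be sketched as follows. The event count of 1000 is inferred from the 30us to ~30ms figures above, and the factor of 1000 used to call something hung is an arbitrary illustrative threshold; both helper functions are hypothetical.

```python
def expected_playout_us(n_events, delay_us):
    """Nominal playout time if every event is followed by a fixed delay."""
    return n_events * delay_us


def looks_hung(elapsed_us, expected_us, factor=1000):
    """Treat a playback as hung once it exceeds `factor` times its nominal time."""
    return elapsed_us > factor * expected_us


five_minutes_us = 5 * 60 * 1_000_000

short_seq = expected_playout_us(1000, 30)       # 30 us delays
long_seq = expected_playout_us(1000, 100_000)   # 100 ms delays

print(short_seq, looks_hung(five_minutes_us, short_seq))
print(long_seq, looks_hung(five_minutes_us, long_seq))
```

With 30us delays, a five-minute wait is far beyond the nominal playout time, whereas with 100ms delays the same wait is within the same order of magnitude as the expected blocking time.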
So, if we take this code as an example:
This runs on a Kasli v2.0 and prints:
Which suggests that playout takes on the order of 8ms. On the Kasli-SoC this doesn't finish in the time that I'm willing to wait. If you do get this to work on the Kasli-SoC you'll likely encounter another bug that I'm about to raise.
I've raised the other issue that I was referring to as #201 (get_rtio_counter_mu always returns 0).
I also forgot to mention yesterday that I was reviewing the firmware code and I couldn't find anywhere that we add a sentinel event to the DMA buffer. In the Kasli v2.0 firmware we add a record with length == 0 and no other bytes here, which the gateware seems to require to signal the end of the DMA sequence (here). I couldn't find a similar line in the Kasli Zynq firmware, so I added one:
This didn't change the behaviour for me. It seems likely that that byte happens to be 0. But unless I've missed something we do need a change like this.
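To illustrate why a zero-length sentinel matters for a length-prefixed record stream, here is a toy Python model. The record layout (a single length byte that counts itself, followed by the payload) is a simplification assumed for illustration, not the actual ARTIQ DMA format, and make_record and iter_records are hypothetical helpers.

```python
def make_record(payload: bytes) -> bytes:
    """Prefix payload with a one-byte total length (toy layout, assumed)."""
    length = len(payload) + 1
    if length > 255:
        raise ValueError("payload too long for a one-byte length field")
    return bytes([length]) + payload


def iter_records(buf: bytes):
    """Yield payloads until a zero-length record ends the sequence."""
    pos = 0
    while True:
        if pos >= len(buf):
            # Real hardware would keep reading whatever follows in memory.
            raise ValueError("no zero-length sentinel before end of buffer")
        length = buf[pos]
        if length == 0:
            return  # sentinel: end of the DMA sequence
        if pos + length > len(buf):
            raise ValueError("truncated record")
        yield buf[pos + 1:pos + length]
        pos += length


trace = make_record(b"\x01\x02") + make_record(b"\x03")
records = list(iter_records(trace + b"\x00"))  # explicit sentinel appended
print(records)
```

Without the trailing zero byte, the reader in this model runs off the end of the buffer. A reader that does not bounds-check would only stop if the next byte in memory happened to be 0, which is consistent with the observation that adding the sentinel did not change the behaviour.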
I am trying to reproduce the issue, but I'm unable to make it hang. Moreover, I've run this code
and got nearly the same result as you mentioned for Kasli v2.0.
@esavkin, which firmware/gateware are you using? I'd like to try the same ones just in case anything is changing.
I started out with the builds from https://nixbld.m-labs.hk/eval/4640, but I built my own from master when they didn't appear to be working.
current master's kasli_soc-master-jtag
@mbirtwell is this solved?
I've not had a chance to retest this. I'd ideally like to use a build from hydra that's known to work for someone else when I do.
There are no hydra builds for Kasli-SoC, it's all AFWS.
Other than the demo/master/satellite, but nobody uses that, it's mostly there just to check that the build doesn't break.
So I've been using the master builds just to get going and do some benchmarking ahead of trying to use this in an actual system.
I've just done some more testing and it does in fact work for me now. I've tried to bisect where it starts working and I have:
(Admittedly I was using my experiment from #201, but based on brief testing that seems to be a good proxy for these issues)
It looks kind of modal, as if something has changed to fix it. But the commit between 4801 and 4862 is 4a4f7b0ddc, which updates the artiq version and doesn't seem to have any relevant changes. The only dim possibility is a3ae82502c, but I think that's for a different board.
So yay, it works. I'd love to know why it didn't work, why it does now, and why it won't break again. But I suppose you don't always get what you want.
I wanted to add onto this issue. We aren't using Kasli-SoC, but a ZC706 FPGA+Evaluation board. We also observe DMA hanging like @mbirtwell did (glad it's resolved for you!). The sequence plays but won't finish or raise any errors. We're using the zc706-nist_qc2-sd:134016 binary build.
We will try flashing to 138454 this afternoon or tomorrow and will let you know if that fixes the bug for us.
There seems to be some insane Vivado behavior if you connect a BRAM to the AXI HP ports directly, which is worked around in another project in 13a44fc185.
Someone should take a look at the netlist and see if there's any such connection with the ARTIQ DMA core.
@sb10q By "someone should take a look", do you mean someone at M-Labs will do this?
We finally got around to testing this with the latest build, zc706-nist_qc2-sd:138422, and the issue is persisting with no change in symptoms. Are there any other debugging steps we should take, @sb10q?
@sb10q I wanted to check back in about this and see if there is anything else we should try, as the problem is persisting.
@sb10q This is still broken as of stable gateware build zc706-nist_qc2-sd:145490. It seems to work on beta build 145564, but that build causes the FPGA to disconnect from the host sporadically. Please let us know if there's anything we can do to help troubleshoot; it's starting to limit what we can do experimentally.
I tried running on our zc706 with zc706-nist_qc2-jtag on the latest release-7 and master branches (https://nixbld.m-labs.hk/eval/5814 and https://nixbld.m-labs.hk/eval/5815 respectively) this code:
and got nearly the same result:
I tried the same script from my previous comment on kasli-soc at firmware/gateware revision 0812f22423 (the commit just before the mentioned 4a4f7b0ddc), in that configuration with real hardware attached (I changed ttl0 to ttl4 in the script):
and got the same result.
Please advise which exact configuration I should try, since I now suspect it could be some Vivado-related issue.
Please pay attention to the posts. They reference certain binary builds such as zc706-nist_qc2-sd:138422 and zc706-nist_qc2-sd:145490.
I tried builds 138422 and 145446 (same evaluation as 145490, but sd variant) again, downloaded from hydra. I still got the same results on my test script.
Possibly fixed by ea9fe9b4e1.
@jfniedermeyer can you check?
Thanks for the update, @sb10q. We're about to travel for two weeks of conferences. We'll test it as soon as we're back.
@sb10q, using build 147779, which pulls in that PR, the issue persists with the same symptoms.