fix (workaround) drtioaux packets being corrupted #176
No reviewers
Labels
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: M-Labs/artiq-zynq#176
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "mwojcik/artiq-zynq:fix_drtio_corruption"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Following the thread on m-labs forum: https://forum.m-labs.hk/d/377-drtio-with-zc706-master-and-kasli-v11-satellite/6
Issue was reproduced on Kasli-SoC - for some reason writes to DRTIOAUX memory may fail, not changing the contents, causing the satellite to report CRC issues. Investigation turned nothing really (AXI-specific instructions/burst modes were thought to be the culprit, but it turned out there were none)
There is a little more to be said here - if there's no kernel running, master sends
DestinationStatusRequest
packets periodically to the satellite with no problems at all; but once a kernel is run, the corruption starts occuring (and does not stop on its own). More weird is that running the kernel doesn't really change the communications (there are no additional packets sent), but it does starts corrupting.Running a loop that checks the contents and rewrites if necessary solves the problem, at least from a practical standpoint.
However, this does not fix
self.core.get_rtio_destination_status
- the function in Zynq is not implemented (yet). That function currently returnstrue
only if the destination is 0 (that is, if the master is up). It will be implemented soon, however.Did you check caches/write buffers?
Cache could make sense from what I read about them - with being involved in AXI transactions quite directly, for one. However... this is not the case.
I tried disabling L2 cache (well, not enabling it in the first place) - didn't help; experimented with various combinations of flushing and invalidating the cache as well, from the parts that would be affected, to the entire content, to no avail.
The entire AXI thing is managed by a memory controller actually (according to TRM), that's why there aren't any specific instructions in the disassembly that I looked through. I think I'll have a deeper look into that later.
It could also just be some bugs with the AXI implementation of the SRAM. It's easy to get it wrong. http://zipcpu.com/ has some information about common AXI problems and verification of AXI cores.
b1d0c1d8a2
to77ca020450
Because the fix seems like it's doing nothing in particular, I feel like it requires a bit of explanation.
Alright so: the implementation of the axi2csr bridge is alright - it does the job in very specific conditions - there are no bursts, all writes are 32-bit, as there's no granularity for write enable, reads can be arbitrary. Of course throughput is limited, but it's OK for most CSR uses (single word writes and reads) and given not-so-big sizes of DRTIO AUX buffers the performance penalty is somewhat acceptable. Sure, it would be good if it behaved natively, but that would require a massive overhaul and implementing most of the AXI specification.
If bursts are generated by the core, then this module will cause issues and unwritten bytes, which would (and did in previous tests) cause issues down the line. Adding to that, there was problem with unaligned access - while permitted by the specification, caused a data abort on Zynq. Thus the work buffer was created, so the DRTIO modules could write freely in the RAM, which would be then copied over to DRTIO AUX transmitter. The good part is that Zynq TRM (ug585) mentions straightforwardly that bursts are not generated for 8- and 16-bit writes/reads, so 16-bit copy loop was used. And that worked, for startup and establishing a connection.
As you may notice, there was a lot of worry about the burst modes, and the other condition was omitted (or rather, went unnoticed) - all writes are 32-bit - because the bus is 32 bit wide, and even the 2 lsbs of the address are discarded; and AXI's
WSTRB
signal (which would indicate width of the write) is not even translated over, but a singlewe
is used.Thus, looking at it I'm not even sure how it worked in the way it did before.
The AXI master controller would send a normal 16-bit write with some garbage on the unused lanes of the data bus, and that garbage would get written as well, since the axi2csr bridge doesn't work with
WSTRB
, causing corruption (and then CRC errors). So, at least in my view, there are three ways of dealing with this:amending
axi2csr
with some interpretation ofwstrb
(probably a tiny extra "bus" parallel to internal_csr, only connected to Memory/SRAMwe
with increased granularity) - feels like a workaround still and doesn't really solve anything, just attaching a hack on top of a hack, plus would need a PR in migen-axi with me explaining why another hack is a necessityrewriting
axi2csr
(or SRAM/memory controller, or AXI slave bus) with full AXI support - that's a lot of work involved, would probably take months to get it all down correctly within migen-axiwriting 32 bits at a time in software, making sure bursts don't happen still - there's a bit of reliance on trying to confuse the memory controller on Zynq, but if this works (spoiler: it does), it will fix the original issue while also not adding more complexity to the FPGA part.
So I went with the last option and tested it with Kasli-SoC/Kasli 1.1 configuration. Neither internal checks detected any discrepancies, nor did Kasli satellite complain about CRC errors anymore.
IIRC they work on Zynq but you need the MMU enabled. The asm startup code used by ARTIQ should enable it already. Are you pulsing POR before each JTAG load?
Well then that didn't exactly work then, not sure why then. However in current situation writes and reads have to be aligned and 32 bit wide anyway, since separate bytes aren't addressable with this setup. Even if that didn't cause a data abort (maybe something axi-related is the case actually?) that could still lead to corruption. I think it's something to take a closer look at when the gateware side is rewritten.
@ -67,1 +65,4 @@
for i in (0..(len/4)).step_by(2) {
//mix offsets to prevent bursts
*dst.offset(i+1) = *src.offset(i+1);
*dst.offset(i) = *src.offset(i);
I think the compiler is allowed to reorder those, unless the memory is marked as volatile.
Looks like there's no need for volatile memory - data memory barrier (
dmb
) seems to work as a good substitute.I think bursts are not a problem with other (consecutive) CSR accesses as they're done within API usually and there's a bit of logic in between so they're not scheduled as bursts.
Except csr timestamps as their type is
i64
- they should be looked at and in case of issues this could be a culprit.That doesn't sound reliable and we may want to use
dmb
there as well.That will have to be done internally within auto-generated CSR rust file (which is done within misoc, rather than this repo) as
dmb
would have to be added between consecutiveu32
reads. Adding it post read (ie. in csr code for runtime) won't help. Actually, have been any issues noticed for that?I'm also testing returning
axi.Response.slverr
on bursts longer than 1, but that's part of migen-axi too.Sure, send a misoc patch.
I went quite deep into it (e.g. learned that despite errors, axi transactions still need to go through in full), turns out the master happily ignores the error (absolutely no change in behavior, even trying to force something out by using
u64
).While testing that I also tried a variant where it would deliberately hang if a burst was issued - and that seemed to work still. Looks like csr writes/reads over 32 bit (which are broken down beforehand) don't generate bursts after all.
I think this could be mergable for now - with CSR/Misoc fix losing priority, and not really being a part of this PR.