LLVM pass optimization #119
See https://github.com/dnadlinger/artiq/tree/passes.
The pass would fail the following tests for zynq:

and requires an additional API (`memset`). I'm not sure if it would require more APIs.

This is very interesting.

I think we should just add `memset` to the exported runtime symbols, but in general, we can control the functions LLVM assumes to be available using one of the target info structs if need be.

What is very curious, though, is the embedding test failure. I haven't seen a bona-fide LLVM miscompilation for "straight-ahead" code in quite a while, so my immediate reaction would be to suspect that the code we currently emit is already invalid IR, and this is just brought to light by the optimizer. (There is a small chance there is just something funny going on with our alloca-style allocations, of course.)
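On the `memset` point, a minimal host-side sketch of the idea (hypothetical; on the actual target the kernel is linked against the runtime's exported symbol table rather than the host libc, but the requirement is the same — the symbol LLVM's loop-idiom recognition emits has to resolve to something):

```python
import ctypes
import llvmlite.binding as llvm

# Hypothetical sketch: make memset resolvable for JIT-linked code by
# pointing LLVM's dynamic symbol table at the host libc implementation.
# LLVM's loop-idiom pass rewrites initialization loops into memset
# calls, so the linker must be able to find one.
libc = ctypes.CDLL(None)
llvm.add_symbol("memset", ctypes.cast(libc.memset, ctypes.c_void_p).value)
```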
The `test_wait_for_rtio_counter` one is interesting; `nowrite` (which is translated to "trivial" TBAA metadata) causes the wait loop to be optimized out.
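For context, the shape of the affected code is roughly a busy-wait on the counter (a paraphrase, not the exact test source; the function name is made up):

```python
# Paraphrased shape of the affected pattern: poll the RTIO counter until
# it passes a deadline. If the counter getter is marked "nowrite"
# (never writes memory), LLVM may treat the repeated calls as redundant,
# hoist the load, and fold the wait loop away entirely.
def wait_for(self, deadline_mu):
    while self.core.get_rtio_counter_mu() < deadline_mu:
        pass
```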
The `numpy_things` one is actually invalid code, and compiles only because of https://github.com/m-labs/artiq/issues/1497 (the memory just happens to be uncorrupted with less aggressive optimisations).

Shall we remove this specific test case?
Isn't there the same problem with `numpy_full`, `numpy_full_matrix` and `numpy_nan`?
Yes; in fact, I did that on that `passes` branch just now. (There are also a few similarly broken test cases, which I also removed – RPC of arrays is also exercised in `test_numpy`.)

I've also made `nowrite` lower to the `inaccessiblememonly` LLVM function attribute instead, which fixes the RTIO counter test. Don't really have time to run the full test suite on hardware right now – if you do, feel free to go ahead.
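As an illustration of that lowering, a minimal llvmlite sketch (the function name is made up, and it assumes an llvmlite build that accepts this attribute name — stock releases may not list it):

```python
import llvmlite.ir as ll

# Minimal sketch: declare an external function and mark it
# inaccessiblememonly, i.e. it may only touch memory not visible to the
# compiled module. Unlike readnone/"nowrite", it may still have effects
# on that hidden state, so repeated calls in a wait loop cannot be
# folded into one.
mod = ll.Module(name="sketch")
fnty = ll.FunctionType(ll.IntType(64), [])
fn = ll.Function(mod, fnty, name="rtio_get_counter")  # hypothetical name
fn.attributes.add("inaccessiblememonly")
print(mod)
```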
Currently, the invariant_propagation lit test fails, as the slightly changed pass order causes IPSCCP not to trigger on the `self` argument (so it then doesn't matter whether `kernel_invariant` even does what it is supposed to). Running the IR through `opt -O2` again manually cleans it right up, so it really is only a pass ordering issue. If this turns out to be a performance issue in real-world code, we could add some extra IPSCCP passes (or whatever turns out to be needed) at the right pipeline hook points, but that would require a closer look at why it doesn't trigger first.

Yes, there is/was.
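For reference, the "run it through `opt -O2` again" experiment looks roughly like this from Python, using llvmlite's legacy pass-manager API (a sketch, not the actual compiler code):

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# Sketch: feed already-optimized IR through a fresh -O2 module pipeline,
# giving IPSCCP (and everything else) a second chance to fire.
def reoptimize(ir_text):
    mod = llvm.parse_assembly(ir_text)
    mod.verify()
    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = 2
    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    pm.run(mod)
    return str(mod)
```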
I've checked with an updated llvmlite (the one for numba, in the 20.09 channel), using this branch with some modifications (I removed the variable debug information, as it breaks the compilation). It seems that the autovectorizer is not invoked either.
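If the pipeline goes through llvmlite's pass-manager builder, one thing to rule out first (an assumption about the setup here) is that the vectorizers simply aren't switched on — they are gated by explicit flags, not by `opt_level` alone:

```python
import llvmlite.binding as llvm

# Sketch: the PassManagerBuilder flags gating the autovectorizers in
# llvmlite. If left at their defaults (off), the loop and SLP
# vectorizers never run regardless of opt_level.
pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 3
pmb.loop_vectorize = True
pmb.slp_vectorize = True
```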
I wonder if the IR is too complicated for the optimizer to handle.

Not sure if that is valid, though; it would break other code like DMA...
Oh, the autovectorizer definitely does something for a few tests, like for instance some of the (IIRC integer) math throughput tests from the paper. It's just that interestingly, it leaves e.g. the floating point dot product test alone. You can check the optimizer remarks for details (either by passing in the flag through llvmlite's backend options, or running standalone opt); in the case of the floating point benchmarks, it apparently thought it wasn't profitable to vectorize according to the cost function. (I didn't investigate further.)
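Concretely, surfacing those remarks from llvmlite could look like this (`set_option` passes raw LLVM command-line flags, so it must run before the first compilation in the process; the flag names are upstream LLVM's):

```python
import llvmlite.binding as llvm

# Emit vectorizer remarks: successful vectorizations, missed ones, and
# the cost-model analysis explaining why.
# Standalone equivalent: opt -O2 -pass-remarks-missed=loop-vectorize ...
llvm.set_option("", "--pass-remarks=loop-vectorize")
llvm.set_option("", "--pass-remarks-missed=loop-vectorize")
llvm.set_option("", "--pass-remarks-analysis=loop-vectorize")
```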
What are you worried about regarding DMA? Accesses to memory-mapped locations of course need to be marked as volatile, but that isn't really related to the autovectorizer or DMA.
My remark was about the changes to numba's llvmlite; that change causes experiments requiring DMA to fail (firmware assertion failure). I don't know what exactly is causing this issue.

I'll have a look at that tomorrow.
Ah, I understand – strange nonetheless. Makes me slightly worried about issues in the generated IR that might just not be as visible with the older LLVM, but still cause problems (cf. the memory corruption issues we've been seeing on my experiment using Kasli/or1k). This might be interesting to look into.
To be honest, understanding/tuning the autovectorizer probably wouldn't be very high up on my priority list, at least as a user – until somebody actually has a compute-bound deployment of ARTIQ/Zynq, that is. (Perhaps the vectorizer is just tuned to leave memory-bound loops alone, or something like that.)
Maybe I can have a look into the IR problem, but I'm not sure I can find the cause tomorrow. I have classes starting next week :).