LLVM pass optimization #119

Open
opened 2021-01-15 11:39:23 +08:00 by sb10q · 14 comments
Owner
See https://github.com/dnadlinger/artiq/tree/passes.
Contributor

The pass would fail the following tests for zynq:

```
======================================================================
ERROR: test_wait_for_rtio_counter (artiq.test.coredevice.test_rtio.CoredeviceTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pca006132/code/artiq/artiq/test/coredevice/test_rtio.py", line 448, in test_wait_for_rtio_counter
    self.execute(WaitForRTIOCounter)
  File "/home/pca006132/code/artiq/artiq/test/hardware_testbench.py", line 63, in execute
    raise exn
  File "/home/pca006132/code/artiq/artiq/test/hardware_testbench.py", line 53, in execute
    exp.run()
  File "/home/pca006132/code/artiq/artiq/language/core.py", line 54, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "/home/pca006132/code/artiq/artiq/coredevice/core.py", line 137, in run
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "/home/pca006132/code/artiq/artiq/coredevice/comm_kernel.py", line 628, in serve
    self._serve_exception(embedding_map, symbolizer, demangler)
  File "/home/pca006132/code/artiq/artiq/coredevice/comm_kernel.py", line 620, in _serve_exception
    raise python_exn
artiq.test.coredevice.test_rtio.InvalidCounter: artiq.test.coredevice.test_rtio.InvalidCounter
Core Device Traceback (most recent call last):
  File "<artiq>/test/coredevice/test_rtio.py", line 47, in artiq.test.coredevice.test_rtio.WaitForRTIOCounter.run(..., ...) (RA=+0x518)
    raise InvalidCounter
  File "<artiq>/test/coredevice/test_rtio.py", line 47, column 13, in artiq.test.coredevice.test_rtio.WaitForRTIOCounter.run(..., ...)
    raise InvalidCounter
    ^
artiq.test.coredevice.test_rtio.InvalidCounter(2): artiq.test.coredevice.test_rtio.InvalidCounter

======================================================================
FAIL: test_args (artiq.test.coredevice.test_embedding.RPCCallsTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/pca006132/code/artiq/artiq/test/coredevice/test_embedding.py", line 286, in test_args
    self.assertEqual(exp.numpy_things(),
AssertionError: Tuples differ: (10, 20, array([287637232], dtype=int32)) != (10, 20, array([42]))

First differing element 2:
array([287637232], dtype=int32)
array([42])

- (10, 20, array([287637232], dtype=int32))
+ (10, 20, array([42]))

```

It also requires an additional API (`memset`). I'm not sure if it would require more APIs.

Collaborator

This is very interesting. `memset` I think we should just add to the exported runtime symbols, but in general, we can control the functions LLVM assumes to be available using one of the target info structs if need be.

What is very curious, though, is the embedding test failure. I haven't seen a bona-fide LLVM miscompilation for "straight-ahead" code in quite a while, so my immediate reaction would be to suspect that the code we currently emit is already invalid IR, and this is just brought to light by the optimizer. (There is a small chance there is just something funny going on with our alloca-style allocations, of course.)
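
For context on where the `memset` dependency appears at all, here is a minimal llvmlite sketch (a self-contained illustration, not ARTIQ's actual pipeline code; the ARMv7 triple is just a Zynq-ish assumption): at `-O2`, LLVM's loop-idiom recognition rewrites a byte-zeroing loop into `@llvm.memset`, which later lowers to a call to the `memset` symbol the runtime then has to export.

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# A simple byte-zeroing loop over a buffer.
mod = llvm.parse_assembly(r"""
target triple = "armv7-unknown-linux-gnueabihf"

define void @zero_fill(i8* %buf, i64 %n) {
entry:
  br label %loop
loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %loop ]
  %p = getelementptr i8, i8* %buf, i64 %i
  store i8 0, i8* %p
  %i.next = add i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
""")

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 2
pm = llvm.create_module_pass_manager()
pmb.populate(pm)
pm.run(mod)
print(mod)  # the loop is typically now a single @llvm.memset intrinsic call
```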

Collaborator

The `test_wait_for_rtio_counter` one is interesting; `nowrite` (which is translated to "trivial" TBAA metadata) causes the wait loop to be optimized out.
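
A hedged illustration of that failure mode (approximating the effect of the trivial TBAA metadata with the `readnone` attribute, which likewise tells LLVM the call has no visible side effects; `rtio_get_counter` is the syscall the test polls):

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# rtio_get_counter is declared side-effect-free, so LLVM may assume every
# call returns the same value and fold the polling loop.
mod = llvm.parse_assembly(r"""
declare i64 @rtio_get_counter() readnone nounwind

define void @wait_until(i64 %deadline) {
entry:
  br label %loop
loop:
  %now = call i64 @rtio_get_counter()
  %done = icmp sge i64 %now, %deadline
  br i1 %done, label %exit, label %loop
exit:
  ret void
}
""")

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 2
pm = llvm.create_module_pass_manager()
pmb.populate(pm)
pm.run(mod)
# Depending on the LLVM version, the wait loop is folded away entirely,
# which is exactly what test_wait_for_rtio_counter trips over.
print(mod)
```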

Collaborator

The `numpy_things` one is actually invalid code, and compiles only because of https://github.com/m-labs/artiq/issues/1497 (the memory just happens to be uncorrupted with less aggressive optimisations).

Contributor

> The `numpy_things` one is actually invalid code, and compiles only because of https://github.com/m-labs/artiq/issues/1497 (the memory just happens to be uncorrupted with less aggressive optimisations).

Shall we remove this specific test case?

Author
Owner

Isn't there the same problem with `numpy_full`, `numpy_full_matrix` and `numpy_nan`?

Collaborator

> Shall we remove this specific test case?

Yes; in fact, I did that on that `passes` branch just now. (There are also a few similarly broken test cases, which I also removed – RPC of arrays is also exercised in `test_numpy`.)

I've also made `nowrite` lower to the `inaccessiblememonly` LLVM function attribute instead, which fixes the RTIO counter test.
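
For reference, a minimal sketch of what that lowering looks like on the llvmlite side (assuming an llvmlite recent enough to know the attribute): unlike the trivial TBAA metadata, `inaccessiblememonly` still lets LLVM assume the callee touches *some* state outside the module, so repeated calls in a polling loop cannot be folded into one.

```python
import llvmlite.ir as ir

module = ir.Module(name="kernel")
i64 = ir.IntType(64)

# Declare the runtime call and mark it inaccessiblememonly instead of
# attaching "trivial" TBAA metadata to its uses.
counter = ir.Function(module, ir.FunctionType(i64, []),
                      name="rtio_get_counter")
counter.attributes.add("inaccessiblememonly")
counter.attributes.add("nounwind")

print(module)  # declare i64 @"rtio_get_counter"() inaccessiblememonly nounwind
```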

Don't really have time to run the full test suite on hardware right now – if you do, feel free to go ahead.

Currently, the `invariant_propagation` lit test fails, as the slightly changed pass order causes IPSCCP not to trigger on the `self` argument (so it then doesn't matter whether `kernel_invariant` even does what it is supposed to). Running the IR through `opt -O2` again manually cleans it right up, so it really is only a pass ordering issue. If this turns out to be a performance issue in real-world code, we could add some extra IPSCCP passes (or whatever turns out to be needed) at the right pipeline hook points, but that would require a closer look at why it doesn't trigger first.
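
The "run it through `opt -O2` again" experiment is easy to reproduce from Python too; a rough sketch with llvmlite (assuming `ir_text` holds the already-optimized module as textual IR):

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

def reoptimize(ir_text):
    """Run a standard -O2 pipeline (which includes IPSCCP) a second time."""
    mod = llvm.parse_assembly(ir_text)
    pmb = llvm.create_pass_manager_builder()
    pmb.opt_level = 2
    pm = llvm.create_module_pass_manager()
    pmb.populate(pm)
    pm.run(mod)
    return str(mod)
```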

Collaborator

> Isn't there the same problem with `numpy_full`, `numpy_full_matrix` and `numpy_nan`?

Yes, there is/was.

Contributor

I've checked with an updated llvmlite (the one for numba, in channel 20.09), using this branch with some modifications (I removed the variable debug information... it breaks the compilation). It seems that the autovectorizer is not invoked either.

I wonder if the IR is too complicated for the optimizer to handle.

Not sure if that is valid though; it would break other code like DMA...

Collaborator

Oh, the autovectorizer definitely does something for a few tests, for instance some of the (IIRC integer) math throughput tests from the paper. It's just that, interestingly, it leaves e.g. the floating point dot product test alone. You can check the optimizer remarks for details (either by passing in the flag through llvmlite's backend options, or running standalone `opt`); in the case of the floating point benchmarks, it apparently thought it wasn't profitable to vectorize according to the cost function. (I didn't investigate further.)
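
For the llvmlite route, a hedged sketch: llvmlite forwards raw LLVM command-line flags via `set_option`, so the usual remark filters can be switched on before running the pipeline. (Whether the remarks actually reach stderr depends on the diagnostic handler in this embedding; standalone `opt -O2 -pass-remarks-missed=loop-vectorize in.ll` is the sure path.)

```python
import llvmlite.binding as llvm

# Ask LLVM to report loops the vectorizer skipped, and why.
llvm.set_option("", "--pass-remarks-missed=loop-vectorize")
llvm.set_option("", "--pass-remarks-analysis=loop-vectorize")

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()
# ... then run the normal compile; matching remarks (e.g. cost-model
# verdicts) are emitted as LLVM diagnostics.
```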

Collaborator

What are you worried about regarding DMA? Accesses to memory-mapped locations of course need to be marked as volatile, but that isn't really related to the autovectorizer or DMA.
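
To make the `volatile` point concrete, here is the same parse-and-optimize sketch as above, now with a busy-wait on a made-up memory-mapped register (the address is purely illustrative): the `volatile` marker is what keeps `-O2` from folding the repeated read.

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

# 0x40000000 is an arbitrary example MMIO address.
mod = llvm.parse_assembly(r"""
define i32 @spin() {
entry:
  br label %loop
loop:
  %v = load volatile i32, i32* inttoptr (i64 1073741824 to i32*)
  %ready = icmp ne i32 %v, 0
  br i1 %ready, label %exit, label %loop
exit:
  ret i32 %v
}
""")

pmb = llvm.create_pass_manager_builder()
pmb.opt_level = 2
pm = llvm.create_module_pass_manager()
pmb.populate(pm)
pm.run(mod)
print(mod)  # the volatile load, and hence the wait loop, is preserved
```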

Contributor

> What are you worried about regarding DMA? Accesses to memory-mapped locations of course need to be marked as volatile, but that isn't really related to the autovectorizer or DMA.

My remark was about the changes to numba llvmlite; that change causes experiments requiring DMA to fail (firmware assert failed). I don't know what exactly is causing this issue.

> Oh, the autovectorizer definitely does something for a few tests, for instance some of the (IIRC integer) math throughput tests from the paper. It's just that, interestingly, it leaves e.g. the floating point dot product test alone. You can check the optimizer remarks for details (either by passing in the flag through llvmlite's backend options, or running standalone `opt`); in the case of the floating point benchmarks, it apparently thought it wasn't profitable to vectorize according to the cost function. (I didn't investigate further.)

Will have a look at that tomorrow.

Collaborator

> My remark was about the changes to numba llvmlite; that change causes experiments requiring DMA to fail (firmware assert failed). I don't know what exactly is causing this issue.

Ah, I understand – strange nonetheless. Makes me slightly worried about issues in the generated IR that might just not be as visible with the older LLVM, but still cause problems (cf. the memory corruption issues we've been seeing on my experiment using Kasli/or1k). This might be interesting to look into.

> Will have a look at that tomorrow.

To be honest, understanding/tuning the autovectorizer probably wouldn't be very high up on my priority list, at least as a user – until somebody actually has a compute-bound deployment of ARTIQ/Zynq, that is. (Perhaps the vectorizer is just tuned to leave memory-bound loops alone, or something like that.)

Contributor

> > My remark was about the changes to numba llvmlite; that change causes experiments requiring DMA to fail (firmware assert failed). I don't know what exactly is causing this issue.
>
> Ah, I understand – strange nonetheless. Makes me slightly worried about issues in the generated IR that might just not be as visible with the older LLVM, but still cause problems (cf. the memory corruption issues we've been seeing on my experiment using Kasli/or1k). This might be interesting to look into.
>
> > Will have a look at that tomorrow.
>
> To be honest, understanding/tuning the autovectorizer probably wouldn't be very high up on my priority list, at least as a user – until somebody actually has a compute-bound deployment of ARTIQ/Zynq, that is. (Perhaps the vectorizer is just tuned to leave memory-bound loops alone, or something like that.)

Maybe I can have a look into the IR problem, but I'm not sure if I can find the cause tomorrow. I have classes starting next week :).

Reference: M-Labs/artiq-zynq#119