First, this calling convention doesn't actually exist in the OR1K
backend, and trying to use it in an Asserts build causes an UNREACHABLE.
Second, I tried to introduce it and it does not appear to produce
any measurable benefit: not only does OR1K have a ton of CSRs
(callee-saved registers), but it is also quite hard, if not
realistically impossible, to produce the kind of register pressure
that would be relieved by sparing a few more CSRs for our
exception-raising function calls, since temporaries don't have to be
preserved before a noreturn call, and spilling over ten registers
across an exceptional edge is not something the code we care about
would do.
Third, it produces measurable drawbacks: it inflates the code size
of check:* functions by adding spills. Of course, this could be
alleviated by making __artiq_raise coldcc as well, but what's
the point anyway?
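For reference, a minimal llvmlite sketch of what the rejected change
would have amounted to; the __artiq_raise signature is simplified and
the details are illustrative only:

    from llvmlite import ir

    module = ir.Module(name="example")

    # Simplified declaration of the runtime raising function; the real
    # __artiq_raise takes an exception argument, elided here.
    raise_ty = ir.FunctionType(ir.VoidType(), [])
    artiq_raise = ir.Function(module, raise_ty, name="__artiq_raise")
    artiq_raise.attributes.add("noreturn")

    # Mark the callee cold so that check:* callers are not forced to
    # keep live values in callee-saved registers; every call site
    # would also have to carry the same calling convention.
    artiq_raise.calling_convention = "coldcc"

    print(module)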
Fascinatingly, the fact that you can mark call instructions with
!tbaa metadata is completely undocumented. Regardless, it is true:
!tbaa metadata for an "immutable" type will cause
AliasAnalysis::getModRefBehavior to return OnlyReadsMemory for that
call site.
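For illustration, a hedged llvmlite sketch of tagging a call site
this way; the node and function names are made up, and it assumes
the scalar TBAA node form where a third operand of i64 1 marks the
type as constant ("immutable"):

    from llvmlite import ir

    module = ir.Module(name="example")
    i64 = ir.IntType(64)

    # Hypothetical accessor whose result never changes for the
    # lifetime of the kernel.
    getter = ir.Function(module, ir.FunctionType(i64, []),
                         name="example.getter")

    caller = ir.Function(module, ir.FunctionType(ir.VoidType(), []),
                         name="caller")
    builder = ir.IRBuilder(caller.append_basic_block(name="entry"))

    # Scalar TBAA nodes: a root, and an "immutable" type whose third
    # operand flags accesses through it as constant memory.
    tbaa_root = module.add_metadata(["example TBAA root"])
    tbaa_immutable = module.add_metadata(
        ["immutable", tbaa_root, ir.Constant(i64, 1)])

    # The undocumented part: !tbaa is honored on call instructions,
    # and a constant tag makes alias analysis treat the call site as
    # OnlyReadsMemory.
    call = builder.call(getter, [])
    call.set_metadata("tbaa", tbaa_immutable)
    builder.ret_void()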
Don't bother marking loads with TBAA yet, since we already place
!invariant.load on them (which is as good as the TBAA "immutable"
flag) and after that we're limited by the lack of !nonnull anyway.
Also, add the TBAA analysis passes to our pipeline to actually
engage it.
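The counterpart for loads, for reference; this assumes the usual
empty-node encoding of these metadata kinds, and the pass pipeline
change itself is not shown:

    from llvmlite import ir

    module = ir.Module(name="example")
    i32 = ir.IntType(32)

    fn = ir.Function(module,
                     ir.FunctionType(i32, [i32.as_pointer()]),
                     name="load_example")
    builder = ir.IRBuilder(fn.append_basic_block(name="entry"))

    ptr, = fn.args
    value = builder.load(ptr)

    # !invariant.load already tells the optimizer that the loaded
    # memory never changes, which is as strong as a constant
    # ("immutable") TBAA tag; the limiting factor remains the absent
    # !nonnull, which only applies to loads of pointer type.
    value.set_metadata("invariant.load", module.add_metadata([]))

    builder.ret(value)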
The reasons are:
1. Shadow memory manipulation added ~12 instructions to TTLOut.pulse
(without inlining), and it's already barely fast enough.
2. More importantly, writes such as self.ts[1] = ... did not
trigger attribute writeback, and there seems to be no easy way to
fix that (see the sketch below).
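A hypothetical kernel illustrating the second point; the device and
attribute names are invented, and it only sketches the failure mode:

    from artiq.experiment import *

    class WritebackExample(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("ttl0")
            self.ts = [0, 0]  # host attribute mutated from the kernel

        @kernel
        def run(self):
            self.core.reset()
            self.ttl0.pulse(2*us)
            # An element assignment mutates the list in place; under
            # the shadow memory scheme this did not mark self.ts dirty,
            # so the updated value never made it back to the host.
            self.ts[1] = self.ts[1] + 1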
After this commit, the delay instruction (again) does not generate
any LLVM IR: all heavy lifting is relegated to the delay and delay_mu
intrinsics. When the interleave transform needs to adjust the global
timeline, it synthesizes a delay_mu intrinsic. This way,
the interleave transformation becomes composable, as the input and
the output IR invariants are the same.
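In kernel terms, the restored invariant is roughly the following
equivalence (a sketch; the device setup is reduced to the minimum):

    from artiq.experiment import *

    class DelayEquivalence(EnvExperiment):
        def build(self):
            self.setattr_device("core")

        @kernel
        def run(self):
            # Both statements advance the timeline by one microsecond:
            # delay() emits no IR of its own and lowers to the
            # delay_mu() intrinsic applied to the converted duration.
            delay(1*us)
            delay_mu(self.core.seconds_to_mu(1*us))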
Also, code generation is adjusted so that a basic block is split off
not only after a delay call, but also before one; otherwise, e.g.,
code immediately at the beginning of a `with parallel:` branch
would have no choice but to execute after another branch has already
advanced the timeline.
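For example, in a kernel like this (device names invented), the ttl1
pulse opening the second branch must be free to execute at the
parallel block's start time rather than after the first branch's
delay:

    from artiq.experiment import *

    class ParallelExample(EnvExperiment):
        def build(self):
            self.setattr_device("core")
            self.setattr_device("ttl0")
            self.setattr_device("ttl1")

        @kernel
        def run(self):
            self.core.reset()
            with parallel:
                with sequential:
                    delay(10*us)
                    self.ttl0.pulse(2*us)
                with sequential:
                    # Splitting a basic block *before* the delay in the
                    # other branch is what lets this pulse be scheduled
                    # at the start of the parallel block instead of
                    # after that branch has advanced the timeline.
                    self.ttl1.pulse(2*us)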
This takes care of issue #1 described in 50e7b44 and is a step
toward solving issue #2.