The previous implementation was completely wrong: it always advanced
the global timeline by the same amount as the non-interleaved basic
block did.
The new implementation only advances the global timeline by
the difference between the global timeline's current time and
the virtual time of the branch, which requires adjusting the delay
instructions in the interleaved basic blocks.
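As a rough illustration of the rule above, here is a toy model of
the interleaving, not the actual compiler code. It assumes each branch
is just a list of basic block delays in machine units; the function
name `interleave` and the tuple layout are hypothetical.

```python
def interleave(branches):
    """Toy model: `branches` is a list of branches, each a list of
    basic block delays. Returns (branch_index, adjusted_delay) pairs
    in the order the blocks would be emitted."""
    virtual_times = [0] * len(branches)  # per-branch virtual time
    positions = [0] * len(branches)      # next block in each branch
    global_time = 0                      # time on the global timeline
    schedule = []

    def time_after_block(index):
        return virtual_times[index] + branches[index][positions[index]]

    while True:
        pending = [i for i in range(len(branches))
                   if positions[i] < len(branches[i])]
        if not pending:
            break
        # Pick the block that finishes earliest in virtual time.
        index = min(pending, key=time_after_block)
        new_time = time_after_block(index)
        # Advance the global timeline only by the difference, not by
        # the block's full, non-interleaved delay.
        schedule.append((index, new_time - global_time))
        global_time = new_time
        virtual_times[index] = new_time
        positions[index] += 1
    return schedule


print(interleave([[10, 10], [15]]))
```

In this model the adjusted delays add up to the duration of the longest
branch (20), whereas advancing by each block's full delay would give
10 + 15 + 10 = 35.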
Previously, the delay expression was present in the IR twice: once
as the iodelay.Expr transformation-visible form, and once as regular
IR instructions, with the latter form being passed to the delay_mu
builtin and advancing the runtime timeline.
As a result of this change, this strategy is no longer valid:
we can meaningfully mutate the iodelay.Expr form but not the IR
instruction form. Thus, IR instructions are no longer generated for
delay expressions, and the LLVM lowering pass now has to lower
the iodelay.Expr objects as well.
This works OK for flat `with parallel:` blocks, but breaks down
outside of `with parallel:` or when calls are present. The reasons
it breaks down are as follows:
* Outside of `with parallel:`, delay() and delay_mu() must accept
any expression, but iodelay.Expr objects are not nearly expressive
enough. So, the IR instruction form must actually be kept as well.
* A delay instruction is currently inserted after a call to
a user-defined function; this delay instruction both introduces
a point where basic block reordering is possible and provides
delay information. However, the callee knows nothing about
the context in which it is called and advances the timeline
itself, so once the delay instruction is lowered too,
the runtime timeline is advanced twice. So, a new terminator
instruction must be added that combines the properties of delay
and call instructions (and another for delay and invoke as well).
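
The double advance described in the second point can be modelled with
a small sketch. This is a toy runtime, not the actual IR or lowering
code; the `Timeline` class and the function names are hypothetical.

```python
class Timeline:
    """Toy stand-in for the runtime timeline, in machine units."""

    def __init__(self):
        self.now_mu = 0

    def delay_mu(self, amount):
        self.now_mu += amount


def callee(timeline):
    # The callee advances the timeline on its own; it does not know
    # that the caller also carries a delay describing this call.
    timeline.delay_mu(100)


def call_then_delay(timeline):
    # Separate call and delay instructions: if the delay after
    # the call is lowered too, the same 100 mu is applied again.
    callee(timeline)
    timeline.delay_mu(100)


def combined_call(timeline):
    # Combined terminator: the delay is kept as information for
    # the interleaver only, and nothing extra is applied at runtime.
    callee(timeline)


separate = Timeline()
call_then_delay(separate)
combined = Timeline()
combined_call(combined)
print(separate.now_mu, combined.now_mu)
```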