Introduce IRRT and implement power operator (#50) #160

Closed
ychenfo wants to merge 1 commit from power_operator into master
Collaborator

All the functional parts are implemented using the exponentiation-by-squaring algorithm adapted from the [GNU Scientific Library](https://git.savannah.gnu.org/cgit/gsl.git/tree/sys/pow_int.c) for integers (the bug mentioned in #50 is also fixed):

  • `int32**int32 -> int32` (only non-negative exponents are allowed)
  • `double**double -> double`
  • `int64**int64 -> int64` (do we need to change this to `int64**int32 -> int64`?)
  • `double**int32 -> double` (it is only in our type system that the exponent is `int32` - LLVM only supports `i16` in its intrinsic functions, so the current implementation throws an exception when the exponent is bigger than what an `i16` can hold)

~~This is a WIP PR mainly because it does not seem appropriate to merge it into the `list_slice` branch (the IRRT in that branch is also used to implement the exponentiation-by-squaring algorithm). So for now it is for review only; I will change the target branch to master after `list_slice` is merged.~~
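For reference, the exponentiation-by-squaring loop can be sketched in C roughly as follows (an illustrative sketch modelled on GSL's `pow_int.c`; the actual IRRT function names and exception handling differ):

```c
#include <stdint.h>

/* Illustrative sketch of exponentiation by squaring for int32,
 * modelled on GSL's pow_int.c; not the actual IRRT code.
 * Assumes a non-negative exponent. */
int32_t pow_i32(int32_t base, int32_t exp) {
    int32_t result = 1;
    while (exp > 0) {
        if (exp & 1)       /* odd exponent: fold current base into result */
            result *= base;
        base *= base;      /* square the base */
        exp >>= 1;         /* halve the exponent */
    }
    return result;
}
```

The loop performs O(log n) multiplications instead of the n multiplications of the naive approach.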


Does it play nice with constant propagation/folding? A common situation is shifting bits by multiplying by a power of 2, and this absolutely has to be folded. The code in the legacy compiler uses LLVM intrinsics so it potentially has an advantage there.

Poster
Collaborator

> Does it play nice with constant propagation/folding? A common situation is shifting bits by multiplying by a power of 2, and this absolutely has to be folded. The code in the legacy compiler uses LLVM intrinsics so it potentially has an advantage there.

Oh sorry, I did not think much about constant folding here. It seems the current implementation for `int**int` does not do very well on that. I will look more into it.


You can print the IR and run it through `opt` to see if it can fold well.


And why can't we use the intrinsic? Is it removed in newer versions?

Poster
Collaborator

> And why can't we use the intrinsic? Is it removed in newer versions?

Yes, another way is to use the `double**int -> double` LLVM intrinsic (LLVM does not provide an `int**int -> int` intrinsic). And to address [this bug](https://github.com/m-labs/artiq/issues/1505), use some float-to-int conversion that can take care of integer overflow.


OK. It seems that libm doesn't provide an integer power function.

> Does it play nice with constant propagation/folding? A common situation is shifting bits by multiplying by a power of 2, and this absolutely has to be folded. The code in the legacy compiler uses LLVM intrinsics so it potentially has an advantage there.

I think this is better for constant folding if the function is inlined. I think we can just compile some examples and inspect the IR.


> I think this is better for constant folding if the function is inlined.

We're building IRRT at the IR level precisely to allow this kind of inlining.

BTW this relatively simple PR may be a good occasion to introduce IRRT, instead of the complicated list slicing one.


Example use case:
https://github.com/m-labs/artiq/blob/nac3/artiq/coredevice/ttl.py#L516-L519

The user calls that function with a fixed frequency. We should be able to fold/inline everything so that it results in a single call to `rtio_output` with constant parameters.


> And to address this bug use some float to int conversion that can take care of integer overflow.

I'm not really sure if that's possible... there's also the issue of loss of precision with floats and int64. A float64 has only 52 bits of mantissa and large int64 results will be imprecise and potentially result in very confusing behavior.
If we can implement the integer case with IRRT exponentiation-by-squaring and LLVM handles well the "constant inputs" case, then we're good.

ychenfo force-pushed power_operator from 78c0561b02 to 9586910bec 2022-01-08 06:30:25 +08:00 Compare
ychenfo changed title from WIP: implement power operator (#50) to Introduce IRRT and implement power operator (#50) 2022-01-08 06:31:31 +08:00
ychenfo changed target branch from list_slice to master 2022-01-08 06:31:31 +08:00
Poster
Collaborator

> I'm not really sure if that's possible... there's also the issue of loss of precision with floats and int64. A float64 has only 52 bits of mantissa and large int64 results will be imprecise and potentially result in very confusing behavior.
> If we can implement the integer case with IRRT exponentiation-by-squaring and LLVM handles well the "constant inputs" case, then we're good.

Sure, the exponentiation-by-squaring method in IRRT should be OK for constant folding/inlining.

> BTW this relatively simple PR may be a good occasion to introduce IRRT, instead of the complicated list slicing one.

I have removed the list-slice-related parts from this branch and changed the target branch to master, thanks for the suggestion!


> I think this is better for constant folding if the function is inlined. I think we can just compile some examples and inspect the IR.

I think I must have messed up some settings earlier... the inlining and constant folding seem to be OK. How about this result?

```python
from numpy import int32
from min_artiq import *
from random import randint

a = randint(2, 5)
print(a)

@nac3
class Demo:
    core: KernelInvariant[Core]

    def __init__(self):
        self.core = Core()

    @kernel
    def run(self):
        self.indirect1(a**a)
        print_int32(10**10)

    @kernel
    def indirect1(self, v: int32):
        self.indirect2(v**2)

    @kernel
    def indirect2(self, v: int32):
        self.indirect3(v**2)

    @kernel
    def indirect3(self, v: int32):
        print_int32(v)


if __name__ == "__main__":
    Demo().run()
```

output:

```llvm
2 ; output from python, randint returns 2

; ModuleID = 'main'
source_filename = "main"

@"139621056490272" = global { double } { double 1.000000e-09 }
@"139621056491280" = local_unnamed_addr global { { double }* } { { double }* @"139621056490272" }

define void @__modinit__() local_unnamed_addr {
init:
  tail call void @print_int32(i32 256)
  tail call void @print_int32(i32 1410065408)
  ret void
}

declare void @print_int32(i32) local_unnamed_addr
```

You can't get any better than this :)

sb10q reviewed 2022-01-08 21:00:46 +08:00
sb10q reviewed 2022-01-08 21:01:34 +08:00
@ -0,0 +75,4 @@
(64, 64) => IrrtSymbolTable::POWER_I64,
_ => unreachable!(),
};
let pow_fun = ctx.module.get_function(symbol).unwrap_or_else(|| load_irrt(ctx, symbol));

This creates multiple copies of IRRT with multithreaded codegen.
Why not have only one IRRT module that is simply linked in with the main thread?

sb10q reviewed 2022-01-08 21:02:26 +08:00
@ -0,0 +8,4 @@
# define MAX(a, b) (a > b ? a : b)
# define MIN(a, b) (a > b ? b : a)
int32_t __nac3_irrt_range_slice_len(const int32_t start, const int32_t end, const int32_t step) {

Let's remove list stuff and other unrelated code from this PR, which could then be merged before the more complex list features.

sb10q reviewed 2022-01-08 21:09:53 +08:00
@ -0,0 +30,4 @@
if f == &IrrtSymbolTable::POWER_I32 || f == &IrrtSymbolTable::POWER_I64 {
// add alwaysinline attributes to power function to help them get inlined
// alwaysinline enum = 1, see release/13.x/llvm/include/llvm/IR/Attributes.td
fun.add_attribute(AttributeLoc::Function, ctx.ctx.create_enum_attribute(1, 0));

Isn't there an easier way to do this with inkwell?

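One possible alternative (an assumption about the build setup, not what the PR does) is to put the attribute in the IRRT C source itself, so clang records `alwaysinline` in the emitted bitcode and no attribute manipulation is needed on the inkwell side:

```c
#include <stdint.h>

/* Hypothetical IRRT helper: annotating it in the C source makes clang
 * attach the alwaysinline attribute in the generated bitcode, so the
 * Rust code consuming the module never has to add it manually. */
__attribute__((always_inline))
int32_t __nac3_irrt_example(int32_t x) {
    return x + 1;
}
```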
sb10q reviewed 2022-01-08 21:13:12 +08:00
@ -0,0 +22,4 @@
);
let irrt_mod = Module::parse_bitcode_from_buffer(&bitcode_buf, ctx.ctx).unwrap();
irrt_mod.set_data_layout(&ctx.module.get_data_layout());
irrt_mod.set_triple(&ctx.module.get_triple());

Is set_data_layout and set_triple necessary?

sb10q reviewed 2022-01-08 21:35:48 +08:00
@ -437,3 +449,3 @@
args.insert(0, FuncArg { name: "self".into(), ty: obj.0, default_value: None });
}
let params = args.iter().map(|arg| ctx.get_llvm_type(generator, arg.ty).into()).collect_vec();
let params =

Please put unrelated style changes in separate PR (or commit them directly).

sb10q reviewed 2022-01-08 21:37:47 +08:00
@ -633,0 +662,4 @@
ctx.gen_int_ops(op, left, right)
} else if ty1 == ty2 && ctx.primitives.float == ty1 {
ctx.gen_float_ops(op, left, right)
} else if ty1 == ctx.primitives.float && ty2 == ctx.primitives.int32 {

Shouldn't we assert that `op` is `Pow` in this case?


I've reimplemented it based on your code. Can we delete the `power_operator` branch?

Poster
Collaborator

> I've reimplemented it based on your code. Can we delete the `power_operator` branch?

Thanks! Sure, the `power_operator` branch can be deleted, and I will open a new branch to handle the IRRT linking in nac3standalone and the nix build failure.

ychenfo closed this pull request 2022-01-09 00:52:37 +08:00

Pull request closed

Reference: M-Labs/nac3#160