Tuple by value (#172) #174

Merged
sb10q merged 1 commit from tuple_by_value into master 2022-02-07 11:59:34 +08:00
Collaborator

Always handle nac3 Python tuples by LLVM struct value, as specified in #172.

sb10q reviewed 2022-01-22 18:26:54 +08:00
@@ -130,2 +130,4 @@
DEF_SLICE_ASSIGN(uint32_t)
DEF_SLICE_ASSIGN(uint64_t)
int32_t __nac3_list_slice_assign_var_size(
Owner

Isn't this new functionality that should go into a separate PR/commit?

Author
Collaborator

I think it is not new functionality, as this just makes lists of tuples be handled correctly. Previously a list of tuples was a list of pointers, but after this patch a list of tuples is a list of structs.

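As an aside, a minimal C-level sketch of the representation change being described (the type names here are illustrative only, not the actual NAC3 codegen types):

```c
#include <stdint.h>

/* A Python tuple such as (int32, float), lowered as a struct. */
typedef struct { int32_t f0; double f1; } tuple_i32_f64;

/* Before this patch: a list of tuples held pointers to separately
   allocated tuple values, so slice assignment only copied pointers. */
typedef struct { tuple_i32_f64 **data; int32_t len; } list_of_tuple_ptrs;

/* After this patch: tuples are stored by value, inline in the list
   buffer, so slice assignment must copy sizeof(tuple_i32_f64) bytes
   per element. */
typedef struct { tuple_i32_f64 *data; int32_t len; } list_of_tuple_values;
```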
Owner

We really could use some unit tests...

ychenfo force-pushed tuple_by_value from b74731589c to e042db0734 2022-01-26 04:14:38 +08:00 Compare
Author
Collaborator

Rebased to include tests and the latest commits on the master branch.

sb10q reviewed 2022-01-26 07:31:53 +08:00
@@ -130,2 +130,4 @@
DEF_SLICE_ASSIGN(uint32_t)
DEF_SLICE_ASSIGN(uint64_t)
int32_t __nac3_list_slice_assign_var_size(
Owner

Could all the other functions be replaced with this one (and without performance impact)? Seems there's a lot of code duplication currently.

Author
Collaborator

Yes, there is indeed a lot of code duplication now. Functionality-wise, all the other ones can be replaced by this one, but this one will be a bit slower when `step` is not 1, since the original assignment in the for loop is replaced by `memcpy` called repeatedly; and since this `memcpy` is inside the function, I am not sure whether it will eventually be inlined conditionally when `__nac3_list_slice_assign_var_size` is called with a small `size`.

Author
Collaborator

Ah, what about this? It might still be a bit less efficient because of the conditionals, but it would be more compact.

```diff
diff --git a/nac3core/src/codegen/irrt/irrt.c b/nac3core/src/codegen/irrt/irrt.c
index 7381fac..2b2c4a9 100644
--- a/nac3core/src/codegen/irrt/irrt.c
+++ b/nac3core/src/codegen/irrt/irrt.c
@@ -126,9 +126,9 @@ int32_t __nac3_range_slice_len(const int32_t start, const int32_t end, const int
     return dest_arr_len; \
 } \
 
-DEF_SLICE_ASSIGN(uint8_t)
-DEF_SLICE_ASSIGN(uint32_t)
-DEF_SLICE_ASSIGN(uint64_t)
+// DEF_SLICE_ASSIGN(uint8_t)
+// DEF_SLICE_ASSIGN(uint32_t)
+// DEF_SLICE_ASSIGN(uint64_t)
 
 int32_t __nac3_list_slice_assign_var_size(
     int32_t dest_start,
@@ -185,8 +185,16 @@ int32_t __nac3_list_slice_assign_var_size(
         (src_step > 0) ? (src_ind <= src_end) : (src_ind >= src_end);
         src_ind += src_step, dest_ind += dest_step
     ) {
-        /* memcpy for var size, cannot overlap after previous alloca */
-        __builtin_memcpy(dest_arr + dest_ind * size, src_arr + src_ind * size, size);
+        if (size == 1) {
+            dest_arr[dest_ind] = src_arr[src_ind];
+        } else if (size == 4) {
+            ((uint32_t *)dest_arr)[dest_ind] = ((uint32_t *)src_arr)[src_ind];
+        } else if (size == 8) {
+            ((uint64_t *)dest_arr)[dest_ind] = ((uint64_t *)src_arr)[src_ind];
+        } else {
+            /* memcpy for var size, cannot overlap after previous alloca */
+            __builtin_memcpy(dest_arr + dest_ind * size, src_arr + src_ind * size, size);
+        }
     }
     /* only dest_step == 1 can we shrink the dest list. */
     /* size should be ensured prior to calling this function */
```
Owner

Isn't it the point of `__builtin_memcpy` that the compiler can automatically replace it with an efficient implementation for small constant sizes? Please take a look.

Contributor

I think calling `memcpy` will already be optimized for small constant sizes.

```c
#include <string.h>  /* for memcpy */

void copy(int* a, int* b) {
    memcpy(a, b, 10);
}
```

will be compiled as

```llvm
define dso_local void @copy(i32* noundef %0, i32* noundef %1) #0 !dbg !9 {
  %3 = alloca i32*, align 4
  %4 = alloca i32*, align 4
  store i32* %0, i32** %3, align 4
  call void @llvm.dbg.declare(metadata i32** %3, metadata !16, metadata !DIExpression()), !dbg !17
  store i32* %1, i32** %4, align 4
  call void @llvm.dbg.declare(metadata i32** %4, metadata !18, metadata !DIExpression()), !dbg !19
  %5 = load i32*, i32** %3, align 4, !dbg !20
  %6 = bitcast i32* %5 to i8*, !dbg !21
  %7 = load i32*, i32** %4, align 4, !dbg !22
  %8 = bitcast i32* %7 to i8*, !dbg !21
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %6, i8* align 4 %8, i32 10, i1 false), !dbg !21
  ret void, !dbg !23
}
```

and lowered as (RISC-V)

```
copy:                                   # @copy
        addi    sp, sp, -16
        sw      ra, 12(sp)                      # 4-byte Folded Spill
        sw      s0, 8(sp)                       # 4-byte Folded Spill
        addi    s0, sp, 16
        sw      a0, -12(s0)
        sw      a1, -16(s0)
        lw      a1, -12(s0)
        lw      a0, -16(s0)
        lh      a2, 8(a0)
        sh      a2, 8(a1)
        lw      a2, 4(a0)
        sw      a2, 4(a1)
        lw      a0, 0(a0)
        sw      a0, 0(a1)
        lw      ra, 12(sp)                      # 4-byte Folded Reload
        lw      s0, 8(sp)                       # 4-byte Folded Reload
        addi    sp, sp, 16
        ret
```

~~[`__builtin_memcpy`](https://clang.llvm.org/docs/LanguageExtensions.html#guaranteed-inlined-copy) guarantees the memcpy will be inlined, but this is probably not what we want (for larger arrays we should call the fast SIMD memcpy implementation in our runtime).~~ Sorry, I forgot there is a `_inline` suffix there... I guess for our purpose, the compiler will emit a memcpy intrinsic call and already optimize for small sizes.
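For reference, a minimal sketch of the distinction being referred to (both builtins exist in Clang; `__builtin_memcpy_inline` is the one documented under "Guaranteed inlined copy" and requires a compile-time-constant size):

```c
void copy_maybe_call(char *dst, const char *src, unsigned long n) {
    /* Ordinary builtin: the compiler may expand it inline for small
       constant sizes or emit a call to the runtime memcpy. */
    __builtin_memcpy(dst, src, n);
}

void copy_always_inline(char *dst, const char *src) {
    /* The _inline variant is always expanded inline and never becomes a
       library call; its size argument must be a compile-time constant. */
    __builtin_memcpy_inline(dst, src, 16);
}
```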

Contributor

btw I am quite concerned about this one:

```
+ else if (size == 4) {
+     ((uint32_t *)dest_arr)[dest_ind] = ((uint32_t *)src_arr)[src_ind];
+ }
```

It is possible to have a tuple of `(u8, u8, u8, u8)` that is only aligned to a byte boundary, and casting the array to `uint32_t` would cause an alignment fault... I guess you should just let memcpy do its job. If you are worried about not getting the constant optimizations, you can do something like this:

```c
void copy(char* a, char* b, unsigned int size) {
    if (size == 1)
        __builtin_memcpy(a, b, 1);
    else if (size == 4)
        __builtin_memcpy(a, b, 4);
    else
        __builtin_memcpy(a, b, size);
}
```

Which will be optimized to

```
copy:                                   # @copy
        li      a3, 4
        beq     a2, a3, .LBB0_3
        li      a3, 1
        bne     a2, a3, .LBB0_4
        lb      a1, 0(a1)
        sb      a1, 0(a0)
        ret
.LBB0_3:
        lbu     a2, 2(a1)
        lbu     a3, 3(a1)
        lbu     a4, 0(a1)
        lbu     a1, 1(a1)
        sb      a2, 2(a0)
        sb      a3, 3(a0)
        sb      a4, 0(a0)
        sb      a1, 1(a0)
        ret
.LBB0_4:
        tail    memcpy@plt
```

(Not very beautiful, but it works; this is common when doing performance tuning...)

Author
Collaborator

Thanks for the suggestions! Indeed, using that trick for constant optimization will be safer; I was just a bit worried about constant optimization of the call to the `__builtin_memcpy` intrinsic, as previously I always called it with the `size` argument being a variable.

Sorry that I do not quite understand the `(u8, u8, u8, u8)` alignment error case... I roughly tested this on the host (something like `l = [(True, False, True, False), (True, True, True, True), (False, False, False, False)]`) and it seems to give me the expected output?

Contributor

> Thanks for the suggestions! Indeed, using that trick for constant optimization will be safer; I was just a bit worried about constant optimization of the call to the `__builtin_memcpy` intrinsic, as previously I always called it with the `size` argument being a variable.
>
> Sorry that I do not quite understand the `(u8, u8, u8, u8)` alignment error case... I roughly tested this on the host (something like `l = [(True, False, True, False), (True, True, True, True), (False, False, False, False)]`) and it seems to give me the expected output?

It may give the expected result, but it is UB (undefined behavior).
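A minimal sketch of the concern, with hypothetical names (not code from this PR): even when the load happens to work on a forgiving target, it violates C's alignment rules, and `-fsanitize=alignment` would flag it.

```c
#include <stdint.h>

/* byte_arr points into a list buffer whose element size is 4 but which is
   only guaranteed 1-byte alignment (e.g. elements are tuples of four u8). */
uint32_t load_element(const uint8_t *byte_arr, int32_t ind) {
    /* UB: the cast may produce a uint32_t* that is not 4-byte aligned. */
    return ((const uint32_t *)byte_arr)[ind];
}
```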

Author
Collaborator

I see, thanks!

pca006132 marked this conversation as resolved
pca006132 reviewed 2022-01-27 16:49:51 +08:00
@@ -132,0 +150,4 @@
const int32_t src_len = (src_end >= src_start) ? (src_end - src_start + 1) : 0;
const int32_t dest_len = (dest_end >= dest_start) ? (dest_end - dest_start + 1) : 0;
if (src_len > 0) {
__builtin_memmove(
Contributor

what if `src_len > dest_len`?

Author
Collaborator

I think then it would be extending the list, and we would later add a check that throws an exception prior to entering this function?

Contributor

Maybe mark it as a TODO.

Author
Collaborator

I mean this IRRT function is then supposed to be run only with valid input parameters; for checks on things like `src_len > dest_len` and throwing exceptions about them, we already have a `TODO` [here](https://git.m-labs.hk/M-Labs/nac3/src/branch/master/nac3core/src/codegen/irrt/mod.rs#L304-L306)?

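Purely as a hypothetical sketch of the kind of guard being discussed (the function name, the exception helper, and the exact condition are assumptions, not the actual nac3core code; per the conversation it would be emitted by the code generator before calling the IRRT helper):

```c
#include <stdint.h>

/* Hypothetical exception hook. */
extern void nac3_raise_value_error(const char *msg);

/* Sketch: reject a slice assignment that would have to grow the list,
   since the IRRT helper only copies or shrinks in place. */
static inline void check_slice_assign(int32_t dest_slice_len, int32_t src_slice_len) {
    if (src_slice_len > dest_slice_len) {
        nac3_raise_value_error("slice assignment would grow the list");
    }
}
```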
pca006132 marked this conversation as resolved
ychenfo force-pushed tuple_by_value from e042db0734 to ef4598ba99 2022-02-02 03:32:52 +08:00 Compare
Author
Collaborator

Rebased on the latest master branch, and now a single list slice assignment function handles the various list element sizes.

Contributor

By the way, I just found a style problem in the existing code; can you fix it in this PR as well?

https://git.m-labs.hk/M-Labs/nac3/src/branch/master/nac3core/src/codegen/stmt.rs#L86-L154

Please change this to a match with 3 clauses.

ychenfo force-pushed tuple_by_value from ef4598ba99 to d6ab73afb0 2022-02-07 02:19:23 +08:00 Compare
Author
Collaborator

> Please change this to a match with 3 clauses.

Rebased on the latest master branch and fixed the style problem.

pca006132 approved these changes 2022-02-07 11:57:41 +08:00
sb10q merged commit d6ab73afb0 into master 2022-02-07 11:59:34 +08:00
sb10q deleted branch tuple_by_value 2022-02-07 11:59:34 +08:00
Reference: M-Labs/nac3#174