Fix RPC of ndarrays from host to device #527
Labels
No Milestone
No Assignees
2 Participants
Notifications
Due Date
No due date set.
Dependencies
No dependencies set.
Reference: M-Labs/nac3#527
Loading…
Reference in New Issue
Block a user
No description provided.
Delete Branch "fix/ndarray-rpc"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Thanks @lyken for the help debugging this issue.
Kernel used to test this functionality:
Fixes #496.
006e60bc50
tof98d91c3c6
@ -531,0 +673,4 @@
let ndarray_nbytes = ctx
.build_call_or_invoke(
rpc_recv,
&[buffer.base_ptr(ctx, generator).into()], // Reads [i32; ndims]. NOTE: We are allocated [size_t; ndims].
We should delete/change my comment here. I think it is not
[i32; ndims]
. Also I messed up the sentence afterNOTE:
.Updated.
@ -531,0 +680,4 @@
.unwrap();
// debug_assert(ndarray_nbytes > 0)
if ctx.registry.llvm_options.opt_level == OptimizationLevel::None {
Unrelated: I think there are inconsistencies with how we insert assertion tests in NAC3.
Sometimes we use
cfg!(debug_assertions)
and sometimes we are testing againstOptimizationLevel::None
. It might be necessary to standardize this approach in the future, say with a purpose-defined function likeCodeGenContext::should_add_assertions() -> bool
.IMO
debug_assertions
are for assertions that can be checked at compile-time (think function contract violations in Rust), whereasOptimizationNone
are for assertions that can only be checked at runtime...I should write some documentation on this.
@ -531,0 +716,4 @@
// `ndarray.shape` must be initialized beforehand in this implementation
// (for ndarray.create_data() to know how many elements to allocate)
let num_elements =
call_ndarray_calc_size(generator, ctx, &ndarray.dim_sizes(), (None, None));
Probably some super insignificant details: I used
ndarray_nbytes / itemsize
to derive thenum_elements
forndarray.create_data
because it is computationally less expensive. Doing it like this is a little bit less efficient but I am not sure if this is concerning.Also, it is possible to simply do
let ndarray_data = ctx.builder.build_alloca(ndarray_nbytes, i8_type)
and just setndarray->data = ndarray_data
.Unless LLVM just figures everything out and optimizes this.
Either way, probably not a concern for now.
Yeah, but using
alloca
with typeT
guarantees that objects are properly aligned. I think let's get the correct implementation out there first, then we can think about how to optimize it further.@ -531,0 +767,4 @@
let alloc_ptr =
ctx.builder.build_array_alloca(llvm_pi8, alloc_size, "rpc.alloc").unwrap();
let alloc_ptr =
ctx.builder.build_pointer_cast(alloc_ptr, llvm_pi8, "rpc.alloc.ptr").unwrap();
Redundant pointer cast I think.
Not redundant; First
alloc_ptr
isi8**
(note that we arealloca
-ingi8*
), second one casts it back toi8*
.Oh
Wait actually, why do we allocate
[i8*; alloc_size]
here? I thoughtalloc_size
is the number of bytes to allocate on the kernel's end.Ah, it's for 4-byte alignment. And it's overallocating 4 times the buffer size as well...
Let me fix that.
We now allocate
alloc_size
by allocatinground_up(alloc_size, sizeof(T)) / sizeof(T)
objects ofT
, which should also address any alignment issues.f98d91c3c6
toa59c26aa99