List Slice Support (#72) #140
Reference: M-Labs/nac3#140
Now getting a list slice is done.
Unsure: what is a good way to support dropping/extending part of the list via list slice assignment?
Changed title from WIP: List Slice Support to WIP: List Slice Support (#72)

Do we have to support extending the list by slice assignment? We would have to reallocate the list in that case. @sb10q
No, not very important. Not sure it can be done without fancy memory management, which is for later - there are far more pressing issues at the moment.
Should we merge this then?
We need an error message if it attempts to extend the list though. Right now it seems it would just corrupt memory?
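For reference, this is the CPython behaviour that would require reallocation, and that nac3 would instead have to reject with an error (plain Python, just illustrating the semantics under discussion):

```python
# CPython semantics: a step-1 slice assignment with a longer RHS grows the
# list in place; nac3 cannot do this without reallocation, so it should
# raise an error instead of corrupting memory.
a = [1, 2, 3]
a[1:2] = [7, 8, 9]   # replaces one element with three
assert a == [1, 7, 8, 9, 3]
assert len(a) == 5   # the list was extended
```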
I wonder if we need such a complicated implementation... Can't we implement a builtin function for this, and call that builtin function when we have slice assignment? It seems to me that such complicated code should be put in another function; otherwise the generated code would be bloated and probably not efficient (in terms of i-cache...). It would also be much more human-readable this way.
Oh, I think I get what you mean. Should we do this on the Python side or on the Rust side?
For getting a list slice, I propose we still do it on the Rust side, since doing it on the Python side would require users to allocate a list of the correct size prior to calling the builtin function, if I understand correctly.
For list slice assignment, we can implement it on the Python side, and maybe somehow hard-code codegen to call that builtin function implicitly, so that users can still use the Python list slice assignment syntax.
Yes I think we can do it like this.
Could it go inside the compiler just like the current `"def __modinit__():\n base.{}({})"` of nac3artiq?

Maybe we can embed some Python files that contain things like this? Similar to how we handle the linker script now.
As long as it is embedded in the nac3 library and doesn't create a package management mess, I'm fine with any method.
Dropping part of the list is done. We still need to throw exceptions later when slice assignment/access is out of bounds, when the step and the size mismatch, or when the step is 0.
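For reference, these are the error cases as CPython reports them (plain Python illustrating the semantics to mimic, not nac3 output):

```python
# CPython's error cases for slice assignment, which nac3 will need to
# mirror with exceptions later.
a = [1, 2, 3, 4]

step_zero = False
try:
    a[::0] = [9]          # a step of 0 is rejected outright
except ValueError:
    step_zero = True

mismatch = False
try:
    a[::2] = [9]          # extended slice selects 2 elements, RHS has 1
except ValueError:
    mismatch = True

assert step_zero and mismatch
assert a == [1, 2, 3, 4]  # the list is unchanged after the failed assignments
```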
I found that directly using Python to write such internal functions is a bit messy, as we need to parse them, type-check them, and make another symbol resolver just to deal with them (like adding another small nac3standalone inside nac3core...), which might also reduce compilation speed. So I still write the LLVM IR from Rust, and internally create functions for these complex operations so that the output IR is not bloated. I am not fully confident that adding functions to the LLVM module this way (and their symbol names) is the most adequate approach, but it should work. Please have a review, thanks!
This still looks a bit too complicated.
`a[1:2] = a[0:1]`, this implementation will do...

Yes, getting list slices like `a[0:1]` will always copy the sliced part according to Python semantics, and does the LLVM memmove intrinsic allow overlap? I can also optimize this later, but I think if there are overlaps we would need to do some allocation anyway. And we also do not support extending the list (only `a[:] = a` will be correct), so if I did not miss anything, I think currently the assignment RHS will never be the exact same list as the LHS?

Thanks, I see; will implement later.
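The copy semantics referred to here can be checked in plain Python (for reference):

```python
# Getting a slice always materialises a new list, so the RHS of a slice
# assignment is never the same object as the LHS.
a = [1, 2, 3]
b = a[0:1]
assert b is not a
b[0] = 99
assert a == [1, 2, 3]   # mutating the copy does not touch the original

a[:] = a                # well-defined: the RHS copy is built first
assert a == [1, 2, 3]
```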
I think I get the point, but I am not sure how to examine the size (at compile time?) for now. I will need to look into the inkwell and LLVM documentation later.
So it seems that we indeed need to implement this directly in Rust, since we need things like memmove, dropping the list, and modifying the list's len behind the scenes? I am not sure how to do this... or can we implement this in C or Rust and link it into our library?
Output LLVM IR from rustc, use `include_bytes!`, and then read it with `Module.parse_bitcode_from_buffer`?

Thanks for the suggestion! I will look into this and maybe also refer to this CPython implementation to further confirm our implementation.
Note that #146 changes the slice representation, so this PR should be merged after that and changed to use the correct representation.
Yes. The ‘llvm.memmove.*’ intrinsics move a block of memory from the source location to the destination location. It is similar to the ‘llvm.memcpy’ intrinsic but allows the two memory locations to overlap.
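The overlap guarantee quoted above can be observed from Python via `ctypes.memmove`, which wraps the C `memmove` (a small demonstration, not part of the PR):

```python
import ctypes

# memmove's overlap guarantee, observed through ctypes: shift five bytes
# right by one within the same buffer.
buf = ctypes.create_string_buffer(b"abcdef")
ctypes.memmove(ctypes.addressof(buf) + 1, ctypes.addressof(buf), 5)
assert buf.raw[:6] == b"aabcde"   # no corruption despite the overlap
```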
No.
You can now get the pointer size from the code generator struct after merging #146. You can use `inkwell::types::BasicType::size_of` to get the size of basic types (integers, floats, etc.); if the result is `None`, it means the type is a pointer.
I think there can be two `memmove` calls: one for copying the content, and one for shifting the remaining elements in the dest array. Note that you have to check whether `dest.len >= src.len`.

8986da22d3 to 1f71eacbd4
We may need to document that we do not support `a[:None:-1]` (maybe we can after implementing the Option type?), which is equivalent to `a[::-1]`, which is NOT equivalent to `a[:0:-1]`, since the latter would exclude index 0. (I didn't even know you could write something like this.)
Neither did I and I suspect none of our users need it. We can omit it.
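The distinction being documented, in plain Python (for reference):

```python
a = [0, 1, 2]
assert a[::-1]     == [2, 1, 0]
assert a[:None:-1] == [2, 1, 0]   # an explicit None stop behaves like an omitted stop
assert a[:0:-1]    == [2, 1]      # a stop of 0 excludes index 0
```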
e10ed55914 to ec43d7df7e

ec43d7df7e to e00ddfbc68
I ended up still using C to implement the slice assignment function, because if I did not miss something, we need VLAs or an explicit stack alloca for list assignment, and in Rust these two features are not easily available (rfc#618, rfc#1909). (It can be done, for sure, but only with some external crates, which do not seem very popular, that still call some wrapper C code.)
An issue is that I cannot seem to find a way to make clang emit bitcode without target information in our case, so when compiling for non-host targets there will be some invalid-flag warnings in the LLVM function attributes. This should be fine since they seem to be ignored, but I am not fully confident that this will not cause problems later... I will look into how to delete all these function attributes in inkwell.
Changed title from WIP: List Slice Support (#72) to List Slice Support (#72)

@ -87,3 +96,1 @@
"trunc",
)
.into(),
codegen_callback: Some(Arc::new(GenCall::new(Box::new(|ctx, _, fun, args| {
Is this changing more than the indentation?
@ -135,3 +144,1 @@
"zext",
)
.into(),
codegen_callback: Some(Arc::new(GenCall::new(Box::new(|ctx, _, fun, args| {
Is this changing more than the indentation?
@ -626,0 +642,4 @@
) -> FunctionValue<'ctx> {
let out_dir = env!("OUT_DIR");
let bitcode_path = format!("{}/slice_assign.bc", out_dir);
let new_mod = Module::parse_bitcode_from_path(bitcode_path, ctx.ctx).unwrap();
This will break with packaging and create a mess that nobody wants to solve. I had told you several times to use `include_bytes!` and `Module.parse_bitcode_from_buffer`.

@ -0,0 +1,380 @@
use indoc::indoc;
Can this go in a separate PR? This one is already very large and hard to review.
@ -437,3 +438,3 @@
args.insert(0, FuncArg { name: "self".into(), ty: obj.0, default_value: None });
}
let params = args.iter().map(|arg| ctx.get_llvm_type(generator, arg.ty).into()).collect_vec();
let params =
Please put all pure style changes into a separate PR.
That doesn't sound very nice... Are you sure there isn't a way to make Clang output generic LLVM bitcode? `clang -target wasm32` seems to at least remove many CPU-specific options. Note that you need to use `llvmPackages_13.clang-unwrapped` for `-target` to work.

Oh ok, I will use `include_bytes!` and `parse_bitcode_from_buffer`. Sorry, I think I previously did not fully get the point... I thought it was only about outputting the LLVM bitcode to the appropriate cargo OUT_DIR directory. Just curious, may I ask why `include_bytes!` and `parse_bitcode_from_buffer` are better, and what package management mess there would be when parsing from a path? Thanks!

Ok, I will put style changes into a separate PR. Sorry, I enabled auto-format this time so the changes are melded together.
Ok I will also put tests into a separate PR after this one
If nac3artiq.so depends on some external files:
It is much simpler to just embed all the required data into the nac3artiq.so binary.
Thanks very much! Now I get the point!
e00ddfbc68 to d494b6a3cc

d494b6a3cc to a323a8475b
If I did not miss anything, I think I confirmed that there is currently no way to make clang or rustc output target-independent LLVM bitcode (here and here).
Deleting function attributes in inkwell is indeed also not great... there seems to be no way to iterate over all the attributes of a function in inkwell.
Specifying `-target wasm32` also makes LLVM refuse to work on the targets rv32g and rv32ima:

One workaround I can currently think of: refactor to make nac3core target-aware (just add another function in the `generator` trait to return the ISA?), then compile `slice_assign.c` for all the possible targets in `build.rs`, and choose the bitcode of the correct version during codegen according to the ISA information.

a323a8475b to 5e545c229b
You just pass it to that one clang invocation in build.rs, without touching the rest of LLVM. There is no rv32 involved there.
@ -0,0 +13,4 @@
"-Wno-implicit-function-declaration",
"-o",
];
let host_compile_ok = Command::new("clang-13")
Just use "clang" to make changing the version easier (only flake.nix needs to be edited).
So is it ok to use a patch like the one below, so that we can call the unwrapped clang without invoking `clang-13` to output wasm32 LLVM bitcode here?

@ -0,0 +28,4 @@
.unwrap()
.success();
// TODO: confirm flags
let cortexa9_compile_ok = Command::new("clang-13")
They probably all produce the same bitcode in the part that we care about, so this is getting too complicated. IMO the wasm32 output should be generic enough for our purposes. Can inkwell load it without warning?
I mean, if I just do this in `build.rs`, without touching any other part of LLVM:

and then in nac3artiq, in `device_db.py`, set `target` to rv32g, modify `./demo.py` to make it generate slice assignment code, and run `python ./demo.py`, LLVM will give this error (not a warning):

Oh I see, you mean that it doesn't like the included raw bitcode when targeting rv32.
Yes.
OK. What about something like:
Not pretty but at least the consequences of this new LLVM disappointment stay close to their source, and it will be easy to fix later if the LLVM developers ever implement a generic bitcode output.
We should probably remove those lines as well:
and perhaps even all of the attributes, or most of them:
In fact we really just want the functions...
Thanks for the idea it works! Implemented this in the new commit.
I think including only the parts we want (the function definitions) instead of removing the others might be more robust. That `awk` command does that. Later LLVM versions could add other unwanted attributes or information.

52581a213b to b1635c7ab9
Thanks! Since there are metadata and attributes inside the function body, I still use a regex in Rust to filter out the unwanted parts in the new commit.
@ -0,0 +42,4 @@
.unwrap()
.replace_all(&filtered_output, "");
println!("{}", filtered_output);
I don't think this should be here. AFAIK build.rs is only supposed to print cargo commands.
Oh sure, I can delete this. But I think it should be ok anyway, since cargo just ignores all printed lines that do not start with `cargo:` in build scripts.

@ -623,6 +642,24 @@ pub fn get_builtins(primitives: &mut (PrimitiveStore, Unifier)) -> BuiltinInfo {
)
}
fn load_fn_from_bit_code<'ctx, 'a>(
Rename to load_irrt
@ -626,0 +647,4 @@
fun: &str,
) -> FunctionValue<'ctx> {
let bitcode_buf = MemoryBuffer::create_from_memory_range(
include_bytes!(concat!(env!("OUT_DIR"), "/slice_assign.bc")),
Rename to irrt.bc
Ok I will do the renaming.
@ -11,0 +26,4 @@
pub const IRRT_LEN_ID: usize = 1;
pub const IRRT_SLICE_ASSIGN_8_ID: usize = 2;
pub const IRRT_SLICE_ASSIGN_32_ID: usize = 3;
pub const IRRT_SLICE_ASSIGN_64_ID: usize = 4;
Isn't there an idiomatic way to combine this list of integer constants with IRRT_SYMBOL_TABLE? This looks like C programming to me.
Oh I just found this, should be useful to make things here prettier.
@ -0,0 +8,4 @@
# define MAX(a, b) (a > b ? a : b)
# define MIN(a, b) (a > b ? b : a)
int32_t __nac3_irrt_slice_index_bound(int32_t i, int32_t const len) {
It's equivalent, but you usually see `const int` and not `int const`.

Ok, I will change to `const int`.
Even with wasm32 target?
Tbh, do we really need 700+ lines for this feature? It doesn't make sense to me... I thought it would be relatively simple to implement those functions with (multiple) memmoves.
I think I would have a look at how to implement it later after the exception allocation stuff.
Yes, the attached file is the output from running the unwrapped clang: still some amount of attributes and metadata.
Sure, the core is just ~250 LOC (including C and Rust); the rest is mainly about special cases (for `a[:] = b[:]`, directly call the list slice runtime function; for `a[:] = [...]`, the list on the RHS should be dropped after assignment by stackrestore; a fast path for list.len == 0; etc.).

I would not say that the code is currently free of repetitive boilerplate, but as far as I can see there is no way to fundamentally reduce the code size further... The CPython implementation also takes more than ~350 lines of C code (here, here, and here).
cebf5b16a2 to d4969d2ee2

d4969d2ee2 to 1cbddebc6d

1cbddebc6d to 0bb799a72b
@ -375,2 +422,4 @@
}
// return true if we take this fast path, false if we do not
fn slice_assign_fast_path<'ctx, 'a, G: CodeGenerator>(
Do we need this "fast path"?
I'm also a bit taken aback by the size and complexity of this PR.
Without this fast path, all the list slice functionality still works.
This fast path handles these 2 cases:

1. `a[::] = b[::]` (both LHS and RHS are list slices)
2. `a[::] = [..]` (LHS is a list slice and RHS is a new list)

For case 1, without the fast path it is equivalent to `tmp = b[::]; a[::] = tmp`, whereas with the fast path `b` will only be copied if `b == a` and their ranges overlap. And if `b` is copied, the memory taken by the temporary array is freed after we exit the slice-assign IRRT function.

For case 2, without the fast path it is `tmp = [..]; a[::] = tmp`, where the stack memory taken by `tmp` is not freed before exiting the current function, which is bad if we do this in a loop. The fast path calls stacksave and stackrestore to free the memory taken by the `tmp` array immediately after the assignment is done.

This fast path could indeed be simpler, though, by always allocating the tmp array and not optimizing the cases where we do not actually need to do `stacksave, alloca tmparr, stackrestore`.

Actually, I think we don't need the fast path now. The length calculation is already pretty fast for these cases since we don't have to do division, and the actual assignment operation already does a memcpy for the fast case.
Removed in the new commits. Also, this fast path indeed only optimized a small number of common cases (not even things like `a[:], b[:] = c[:], [..]`, due to the assignment unpacking).

I just had a quick look at this. One problem I see is the use of conditions: zext-ing them and doing a multiplication to implement something like `c if cond else 0`. Please use the select instruction for this, which would generate better IR that is easier to optimize.

Oh, I was not aware of the select instruction before; will implement this, thanks!
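For context, this is the overlapping case that forces a temporary copy, in plain Python (illustrating the semantics, not generated code):

```python
# CPython materialises the RHS before assigning, which is the behaviour a
# temporary copy preserves when the source and destination ranges overlap.
a = [0, 1, 2, 3, 4]
a[1:] = a[:-1]        # the RHS [0, 1, 2, 3] is copied out before the write
assert a == [0, 0, 1, 2, 3]
```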
@ -0,0 +8,4 @@
# define MAX(a, b) (a > b ? a : b)
# define MIN(a, b) (a > b ? b : a)
int32_t __nac3_irrt_slice_index_bound(int32_t i, const int32_t len) {
For the slice index bound, I think you do not need to bound it like this, just allow it to have invalid access. We will fix this later when we implement exception. With this, you can just have a rust helper function to help you generate the access (something like icmp and select). Using a rust helper function can also allow us to modify this and add exception later.
The reason I do this is because Python allows slice indices to be out of bounds:

I intentionally left the C functions handling only correct input, leaving the parts where we may need to raise exceptions to Rust.

Interesting, I never knew that they allow this...
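The CPython behaviour referred to here, for reference (only plain indexing raises; slice bounds are clamped):

```python
# Out-of-range slice bounds are silently clamped in CPython; only plain
# indexing raises IndexError.
a = [1, 2, 3]
assert a[1:100]  == [2, 3]
assert a[-100:2] == [1, 2]
assert a[5:10]   == []
a[5:100] = []         # assignment targets clamp the same way: a no-op here
assert a == [1, 2, 3]
```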
A simple way to reduce the code size: check for common patterns...
@ -0,0 +113,4 @@
(*dest_arr_len - dest_end - 1) * sizeof(T) \
); \
*dest_arr_len = *dest_arr_len - (dest_end - dest_ind) - 1; \
} \
Should we have a stackrestore to remove that alloca?
In C I also seem to have failed to find a way to call `stacksave` and `stackrestore` manually... But I think this should be fine here, since this temp array will be freed when exiting the slice assignment function? And later in this function we do not make another function call, and most operations inside the function after this alloca need it. So currently the only place I call `stacksave` and `stackrestore` is in that fast path.

Sorry, I forgot that this is a separate function call. If so, it should be fine. I guess your fast path can be removed too.
@ -0,0 +93,4 @@
); \
if (need_alloca) { \
T *tmp = alloca(src_arr_len * sizeof(T)); \
__builtin_memmove(tmp, src_arr, src_arr_len * sizeof(T)); \
You should use `memcpy`, which is usually faster than `memmove`. `memmove` should only be used when the buffers overlap, i.e. when you shift the remaining elements in the list.

Changed to memcpy in the new commit.
Is memcpy really faster? I thought memcpy was just a relic from 1970.
I think it should be a bit faster: both `__builtin_memmove` and `__builtin_memcpy` are translated into LLVM intrinsics, and the LLVM memmove intrinsic allows overlap while the memcpy one does not. So memmove might do more work to guarantee its correctness.

@ -0,0 +54,4 @@
/* if dest_arr_len == 0, do nothing since we do not support extending list */ \
if (*dest_arr_len == 0) return; \
/* if src_arr_len == 0 and need drop */ \
if (src_arr_len == 0) { \
I guess you can make this a special case of the next case. Just ignore the step if `src_len == 0`, and avoid the first memcpy but keep the second.

Handled in the new commits.
@ -699,0 +946,4 @@
ctx.builder
.build_bitcast(
unsafe { ctx.builder.build_gep(dest_arr, &[zero, one], "dest_arr") },
int32.ptr_type(AddressSpace::Generic),
Well, this is actually unsafe. We may have a u64 length pointer on 64-bit targets... Maybe you can make a temporary integer, take its address, and store it back? I don't think we will ever do something that big, so it should be fine; I think LLVM should be able to optimize this for 32-bit targets.
Ah thanks I see, then I think I also need to change the code to modify the list length in the rust side.
0bb799a72b to 7b9da1835c

Ah, thanks for pointing this out! Indeed I should have noticed this one.

7b9da1835c to 94216fb1ab

0b1c3df091 to 8e16ecab5f
I tentatively moved all the IR runtime library related code to `nac3core/codegen/irrt/`; hopefully this is better.

8e16ecab5f to a227d99516

a227d99516 to 5edec788b2
ychenfo referenced this pull request 2022-01-09 00:44:18 +08:00
5edec788b2 to 96a2881a08
rebased on the current master branch
@ -6,0 +14,4 @@
use nac3parser::ast::Expr;
pub struct IrrtSymbols;
impl IrrtSymbols {
Not sure if there's much point in this structure vs. just using the names directly. I had removed it.
Oh, as the names are getting more numerous, I thought that since there are still some places in the following code where these names repeatedly appear (especially when adding inline attributes, and in the list slice assignment function when checking the bit width), it might be cleaner to add this back...

The `IrrtSymbols::...` constants would also repeatedly appear...

Ok I see, I will delete this struct and just use the names directly.
Just wondering: do we really need the IRRT prefix? The symbol names are super long, and I think the `__nac3` prefix is enough.

They might indeed be enough. After all, I think the purpose of the prefix is just to 1. mark that these are functions nac3 calls internally to perform some operations, and 2. avoid colliding with symbol names that users would use. I am not sure... would long symbol names be a potential problem?

They're just more typing for the developer. The toolchain doesn't have issues with long symbols; by C++ name mangler standards those aren't even remotely long symbols.

Ok I see, so should I just change the prefix to `__nac3` only?

Yes
96a2881a08 to 23db915046

Renamed the prefix to `__nac3_` and removed the IrrtSymbols struct in the new commit.

@ -46,0 +273,4 @@
int32.into(),
int32.into(),
elem_ptr_type.into(),
int32.into(),
Add comments with the parameter names. It's hard to follow this long list.
@ -46,0 +165,4 @@
ctx.builder
.build_select(
ctx.builder.build_and(
ctx.builder.build_int_compare(EQ, s, length, "s_eq_len"),
Better use `IntPredicate::EQ` here; those short constants are a bit confusing.

We really should add some unit tests (probably using the LLVM JIT or the bitcode interpreter to run the compiler's output). It seems there are lots of corner cases, and it's not obvious from reading the code that they are all handled correctly. That could go into a separate PR, though.
Yes, all the code in this PR is tested with a full-stack test I wrote in nac3standalone (though I don't think it is ideal, as the test just uses a lot of `Command`s to compile and run programs and compare their printed output). I will port these tests into codegen/irrt using the LLVM JIT or bitcode interpreter in a new PR.

23db915046 to 276f55b1d4
Updated to use `IntPredicate`, added comments, and rebased on the latest master branch.

276f55b1d4 to 4976e89ae2
Fixed some typos.
@ -48,0 +241,4 @@
let int64_ptr = ctx.ctx.i64_type().ptr_type(AddressSpace::Generic);
let fun_symbol = if let BasicTypeEnum::IntType(ty) = ty {
match ty.get_bit_width() {
w if w < 32 => "__nac3_list_slice_assign_uint8_t",
When would we use that?
We can have lists of booleans, which have bit width 1 and are represented by a byte on our supported targets.
I tried to make clang output `i1`, but `_ExtInt` requires `bitwidth >= 2`.

Maybe it would be better to change the condition here to exactly `w == 1`, or maybe `w <= 8`? I think we do not have any other type represented by a bit width smaller than 32.