investigate poor compilation speed with KernelInvariant #125
Removing the inlining pass reduces optimization time by 50%.
Keeping the inlining pass but lowering the optimization level from Aggressive to Less can reduce optimization time by ~30%. We already use the aggressive optimization level for individual functions, so this might be applicable.
This gets the LTO build to work (the nac3artiq test fails due to some problem with ncurses; you also need to set `doCheck = false`).
A few random ideas:

- an `@inline` decorator (see the sketch below)
- compile `@inline` functions first and then make them accessible to all modules

I'm just brainstorming here; not all of these may be good. Maybe topics for the next meeting.
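As a rough sketch of the first idea: nac3 emits code through inkwell, so a function flagged by a hypothetical `@inline` decorator could simply be given LLVM's `alwaysinline` attribute. The function and module names below are made up:

```rust
use inkwell::attributes::{Attribute, AttributeLoc};
use inkwell::context::Context;

fn main() {
    let context = Context::create();
    let module = context.create_module("demo");

    // Declare a function standing in for one marked with the
    // hypothetical @inline decorator.
    let fn_type = context.void_type().fn_type(&[], false);
    let function = module.add_function("kernel_hot_path", fn_type, None);

    // Attach LLVM's alwaysinline attribute so the inlining pass
    // inlines it regardless of its cost heuristics.
    let kind = Attribute::get_named_enum_kind_id("alwaysinline");
    function.add_attribute(AttributeLoc::Function, context.create_enum_attribute(kind, 0));

    module.print_to_stderr();
}
```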
The most expensive optimization currently is inlining and optimization for the inlined code. Functions are already optimized before the final global optimization.
Probably hard. I've tried doing this, and the resulting binary looks quite different from the one built with aggressive global optimization. I did not look into the details to see how performance differs, though. Also, based on my previous experiments, enabling aggressive optimization for each function and then doing light global optimization seems to take more time (more time per function, no change in global optimization time) and produces a different binary (not sure if it is better or worse; I guess worse).
Do you have any ideas about possible heuristics? We only generate called functions now, so I'm not sure how we would do this based on call graphs.
I have an idea about inlining: mark certain code as performance-critical and optimize that part (and all of its callees) aggressively, while performing no global optimization on the other functions (probably setup code, error-reporting code, etc.). This would be simple to implement right now.
We can also cache the codegen result for performance-insensitive code if we are certain that the referenced Python values did not change, which I think can be done. Performance-insensitive code might be library code that rarely changes, so it might be worth caching.
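A minimal sketch of what such a cache could look like (all names hypothetical), keyed by a hash of the referenced Python values so an entry is only reused when nothing it depends on changed:

```rust
use std::collections::HashMap;

// Hypothetical sketch: cache generated IR for performance-insensitive
// functions, keyed by a hash of the referenced host (Python) values.
struct CodegenCache {
    entries: HashMap<u64, Vec<u8>>, // hash of inputs -> serialized IR
}

impl CodegenCache {
    // Regenerate only when the key (i.e. the referenced values) changed.
    fn get_or_insert_with(&mut self, key: u64, gen: impl FnOnce() -> Vec<u8>) -> &Vec<u8> {
        self.entries.entry(key).or_insert_with(gen)
    }
}

fn main() {
    let mut cache = CodegenCache { entries: HashMap::new() };
    let ir = cache.get_or_insert_with(0xdead_beef, || b"; llvm ir here".to_vec());
    println!("{} bytes cached", ir.len());
}
```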
Caching codegen results without avoiding global optimization would not save us much time, considering it is the global optimization part that takes most of our time. As for marking code as inline, I think users may not be aware that certain code is the bottleneck if they don't have experience writing high-performance code and don't have profiling tools.
Yes - this has to be combined with at least one of the other suggestions.
And indeed, the kernel invariant feature is what causes the long compilation time. Disabling it can improve compilation time by about 30%.
Just to list out the things I've tried:

- a `with optimize` construct to selectively optimize code (see the sketch below). Significant improvement, but I haven't yet tested whether the optimized code is still fast enough to avoid RTIO underflow.

It would be better if I had more complicated code that takes a bit more time to compile, and perhaps hits some slow path in our code.
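A sketch of the mechanism such a `with optimize` construct could drive, using the LLVM-13-era inkwell pass-manager API (the function name is a placeholder for code marked by the construct):

```rust
use inkwell::context::Context;
use inkwell::passes::{PassManager, PassManagerBuilder};
use inkwell::OptimizationLevel;

fn main() {
    let context = Context::create();
    let module = context.create_module("demo");
    let builder = context.create_builder();

    // Stand-in for a kernel function the user wrapped in `with optimize`.
    let fn_type = context.void_type().fn_type(&[], false);
    let hot = module.add_function("marked_by_with_optimize", fn_type, None);
    builder.position_at_end(context.append_basic_block(hot, "entry"));
    builder.build_return(None);

    // Aggressive pipeline applied per function instead of module-wide,
    // so only the marked function pays the optimization cost.
    let fpm = PassManager::create(&module);
    let pmb = PassManagerBuilder::create();
    pmb.set_optimization_level(OptimizationLevel::Aggressive);
    pmb.populate_function_pass_manager(&fpm);

    fpm.initialize();
    fpm.run_on(&hot);
    fpm.finalize();
}
```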
Things left to try:

- the `with optimize` construct. It seems the most promising to me.

Why should we support bigint at all in the parser? The compiler wouldn't know what to do with big numbers.
I suggest using int64 only, which likely has no or negligible performance impact on modern PCs. The net performance impact could even be positive, since there would be no need to deconstruct an enum.
We need to support bigint: there can be non-ARTIQ code that uses bigint, and it has to be parsed correctly instead of throwing an error.
We can also optimize by modifying the tokenizer implementation to refer to the string slice directly instead of pushing strings... I don't understand why they do this; it is extremely inefficient.
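A toy sketch of the zero-copy idea (a whitespace tokenizer, not the actual nac3 lexer): hand out `&str` slices borrowed from the source instead of building an owned `String` per token.

```rust
// Each token is a borrowed slice of the source; no per-token allocation.
struct Lexer<'a> {
    src: &'a str,
    pos: usize,
}

impl<'a> Lexer<'a> {
    fn next_word(&mut self) -> Option<&'a str> {
        let rest = &self.src[self.pos..];
        // Skip leading whitespace; None when the input is exhausted.
        let start = rest.find(|c: char| !c.is_whitespace())?;
        // Token ends at the next whitespace or at end of input.
        let end = rest[start..]
            .find(char::is_whitespace)
            .map_or(rest.len(), |e| start + e);
        self.pos += end;
        Some(&rest[start..end])
    }
}

fn main() {
    let mut lx = Lexer { src: "let x = 1", pos: 0 };
    while let Some(tok) = lx.next_word() {
        println!("{tok}"); // "let", "x", "=", "1"
    }
}
```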
FYI: the parser code is a bit slow; it is spending around 20% of the time (parser time, not total time) in memcpy. It seems that the Stmt type is quite large (136 bytes), which causes much of the match code in the parser to generate memcpy calls.
Valgrind output:

indicating it is memmoving something like 2 `Stmt` objects... very frequently, causing parser slowness. I guess we can try to minimize the AST node size if needed (see the sketch below).

We can also make the custom field mandatory... and modify the fold implementation to make it mutate in place. This can probably give some performance improvement.
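To illustrate what shrinking the node size buys (toy stand-ins, not the real AST types): boxing a large variant payload makes the enum itself pointer-sized, so moving it no longer memcpys the whole payload.

```rust
use std::mem::size_of;

// Toy stand-ins for AST nodes, not the real nac3 types.
#[allow(dead_code)]
enum FatStmt {
    // Large inline payload: every move of FatStmt copies all of it.
    Assign { targets: [u64; 12], value: u64 },
    Pass,
}

#[allow(dead_code)]
enum BoxedStmt {
    // Payload behind one allocation: moves copy only a pointer.
    Assign(Box<([u64; 12], u64)>),
    Pass,
}

fn main() {
    println!("FatStmt:   {} bytes", size_of::<FatStmt>()); // ~112 on x86-64
    println!("BoxedStmt: {} bytes", size_of::<BoxedStmt>()); // pointer-sized
}
```

Note the follow-up below, though: in practice this traded fewer memmoves for more allocations.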
`Option<i64>` then?

I think we should still parse it, as users can write bigints that we do not support. It only takes a very small amount of time now, so it should not be a performance issue.
Yes, parse it and put `None` in the AST if `int64` would overflow. The compiler can then still emit a meaningful error message when it encounters `None` instead of `Some(i64)`. Removing the bigint dependencies may also improve NAC3 build time and reduce bloat a little bit.
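A minimal sketch of that rule (hypothetical helper name; the real literal grammar also handles radix prefixes and underscores):

```rust
// Sketch: parse an integer literal into Option<i64>, storing None on
// overflow so the compiler can report a meaningful error later.
fn parse_int_literal(text: &str) -> Option<i64> {
    // Assuming the tokenizer already validated the digits, Err here
    // can only mean the value does not fit in an i64.
    text.parse::<i64>().ok()
}

fn main() {
    assert_eq!(parse_int_literal("42"), Some(42));
    // 2^63 does not fit in i64, so it becomes None in the AST.
    assert_eq!(parse_int_literal("9223372036854775808"), None);
}
```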
Reducing the size of Stmt by boxing some of its fields does not work: it uses fewer memmoves but more memory allocations, which is even worse.
Tabled as discussed in today's meeting.
Some notes and results on PGO:

- Commit: 9cc9a0284a
- Building with GCC produces `*.gcda` files instead of the `*.profraw` files that LLVM wants, hence the Clang-built LLVM below.
- The profiling sample is just `python demo.py`. We probably want a better sample.
- Integrated into `flake.nix`. Simply use the `nac3artiq-pgo` package to enable PGO.

Original:
LLVM compiled with Clang, no PGO:
LLVM compiled with Clang + PGO:
In these results, about 170 ms of the total execution time is spent on CPython and one-time initialization. The NAC3 compilation contribution would be roughly 110 ms / 110 ms / 89 ms for the three configurations above.
So, PGO works, at least on Linux!
@pca006132 How did you enable mimalloc? Is there a reason not to use it all the time?
https://github.com/microsoft/mimalloc/#dynamic-override
Probably not, but the gain from using an alternate memory allocator might vary depending on the OS.
Tried this - compiles but segfaults. Function list taken from 752594e764/include/mimalloc-override.h
With mimalloc (dynamically loaded) + PGO:
(For some reason, the fish `time` command is broken with mimalloc, so it's using bash.)

Alternative approach for enabling mimalloc by patching rustc:
This causes the stage2 rustc build to fail:
https://github.com/microsoft/mimalloc/issues/147
Do we really need to patch rustc just to link a library? Does https://github.com/microsoft/mimalloc#static-override work on Linux?
Yes, we do. Sadly, rust/cargo has very limited options to modify the linker command. That rustc patch simply implements this static override technique by inserting the object file at the beginning of the linker invocation...
But the object file you linked is not the one that does the override. You should link `mimalloc-override.o`.

`mimalloc-override.o` isn't in the nixpkgs outputs. It seems it does override anyway? Otherwise, why would the TLS problem (https://github.com/microsoft/mimalloc/issues/147) manifest itself in rustc stage2? I'll take a closer look though.

I have figured out the problem - one bonus of the rustc rebuild is that you'll get mimalloc in rustc (stage2) and hopefully faster build times :)
`python demo.py` still segfaults though.

Interesting. Do we really need to put that at the beginning of the linker command? glibc malloc and free should be weak, so putting mimalloc at the end of the linker command should still work?
Maybe it works, but I suspect the linker is dumber than you think. When doing that, you even need to add another `-lc` at the end, otherwise mimalloc doesn't find `atexit`.

I'm at a loss as to why Python segfaults, so I'm just following the mimalloc instructions carefully...
https://github.com/microsoft/mimalloc/issues/80
So I'm using the override correctly.
I also checked that rustc does invoke the linker as per mimalloc's recommendations.
Python seems to segfault because dynamically loaded parts of nac3artiq, such as libstdc++, still use the old malloc, and pointers get passed around between allocators.
Using `patchelf` on nac3artiq.so to dynamically link mimalloc before libc does NOT work, because Python has already loaded the libc malloc by the time it loads nac3artiq - so the malloc calls resolve to libc. It would work on nac3standalone.

Linking statically with Rust (to remove annoying DSOs like libstdc++) is only possible with musl libc. Surprisingly, building things with musl (LLVM, rustc, ...) works out of the box with nixpkgs - it just takes time, as a lot of dependencies are not on cache.nixos.org. I find glibc to be a fairly disgusting piece of software, so I'm happy to switch to musl, which claims to address my concerns with glibc.
But it seems there are issues with thread-local storage (TLS) when mixing libcs, which result in memory corruption/crashes. Are we directly or indirectly using TLS anywhere in NAC3?
A more practical issue is that Rust also refuses to build a statically-linked/freestanding DSO; we need to make it build a static library and then use some linker hacks to convert it into a DSO. Or write another rustc patch so it stops complaining about cdylib not being a legitimate crate type with musl.
We definitely need musl because of this glibc nonsense: https://stackoverflow.com/questions/57476533/why-is-statically-linking-glibc-discouraged#57478728
Current status:
When additionally linking mimalloc, the crash trace is:
https://git.musl-libc.org/cgit/musl/tree/src/internal/vdso.c#n43
Seems we're still not out of the woods with this ELF/Linux dynamic linker mess. I'm impressed by how crappy it is.
See this musl mailing list thread: https://www.openwall.com/lists/musl/2022/01/01/1
The solution to the MWE I posted there is:
(and pass argc and argv to the library of course)
But I'm not sure how much more of this kind of crap we would have to deal with until statically-linked nac3artiq works.
Other options: (1) LD_PRELOAD (but it would have to be in every user's nix-shell), (2) patchelf on the Python interpreter, (3) maybe get LLVM to use the `mi_` functions, though with cmake and C++ it looks like it'll be a pain.

As expected from something that uses cmake, https://reviews.llvm.org/D101427 does not work, and it continues to use the glibc allocator.
In fact, this mimalloc support in LLVM seems pretty silly: all it does is add the mimalloc sources to the LLVM build system (as if it wasn't complicated enough) and then attempt to use this flimsy symbol-interposition hack to enable it.
Enabled mimalloc via LD_PRELOAD and the Nix Python wrapper. Seems fine, with good usability.
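As an aside: a pure-Rust binary such as nac3standalone could sidestep the override machinery by installing mimalloc as the Rust global allocator with the `mimalloc` crate. A sketch (this only routes Rust-side allocations; C/C++ dependencies like LLVM keep their own malloc, which is exactly why LD_PRELOAD was needed for nac3artiq):

```rust
// Sketch: use the mimalloc crate as the Rust global allocator.
// This covers all Rust heap allocations in the binary; C/C++
// dependencies such as LLVM keep whatever malloc they linked against.
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    let v: Vec<u64> = (0..1_000_000).collect();
    println!("allocated {} u64s through mimalloc", v.len());
}
```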