investigate poor compilation speed with KernelInvariant #125

Open
opened 2021-12-06 14:13:43 +08:00 by sb10q · 39 comments
There is no content yet.
sb10q added this to the Prealpha milestone 2021-12-06 14:13:43 +08:00
sb10q added the high-priority label 2021-12-06 14:13:43 +08:00
pca006132 was assigned by sb10q 2021-12-06 14:14:03 +08:00

Removing the inlining pass reduces the optimization time by 50%.

Keeping the inlining pass but changing the optimization level from Aggressive to Less can reduce the optimization time by ~30%. We already use the aggressive optimization level for individual functions, so this might be applicable.
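
A rough sketch of the knob being discussed (hypothetical code, not what NAC3 actually does), assuming the inkwell LLVM bindings: the module-level pass manager is driven by a PassManagerBuilder whose optimization level can be dropped from Aggressive to Less while keeping the inliner.

```rust
use inkwell::module::Module;
use inkwell::passes::{PassManager, PassManagerBuilder};
use inkwell::OptimizationLevel;

// Hypothetical global-optimization setup; names and the inliner threshold are assumptions.
fn optimize_linked_module(module: &Module) {
    let builder = PassManagerBuilder::create();
    // Dropping this from Aggressive to Less is the ~30% saving measured above.
    builder.set_optimization_level(OptimizationLevel::Less);
    // Keep the inliner (removing it entirely was the separate 50% measurement);
    // 225 is just an example threshold.
    builder.set_inliner_with_threshold(225);

    let mpm = PassManager::create(());
    builder.populate_module_pass_manager(&mpm);
    mpm.run_on(module);
}
```

The per-function optimization mentioned above can stay at Aggressive independently of this module-level setting.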

Poster
Owner

This gets the LTO build to work (the nac3artiq test fails due to some problem with ncurses; you also need to set doCheck=false).

diff --git a/.cargo/config b/.cargo/config
new file mode 100644
index 0000000..99a616d
--- /dev/null
+++ b/.cargo/config
@@ -0,0 +1,2 @@
+[target.x86_64-unknown-linux-gnu]
+rustflags = ["-C", "linker=clang", "-C", "link-arg=-fuse-ld=lld"]
diff --git a/flake.nix b/flake.nix
index 60d18c6..b0239ed 100644
--- a/flake.nix
+++ b/flake.nix
@@ -55,7 +55,7 @@
             name = "nac3artiq";
             src = self;
             inherit cargoSha256;
-            nativeBuildInputs = [ pkgs.python3 llvm-nac3 ];
+            nativeBuildInputs = [ pkgs.python3 pkgs.clang_12 pkgs.lld_12 llvm-nac3 ];
             buildInputs = [ pkgs.python3 llvm-nac3 ];
             cargoBuildFlags = [ "--package" "nac3artiq" ];
             cargoTestFlags = [ "--package" "nac3ast" "--package" "nac3parser" "--package" "nac3core" "--package" "nac3artiq" ];
diff --git a/llvm/default.nix b/llvm/default.nix
index ca37267..cd3e891 100644
--- a/llvm/default.nix
+++ b/llvm/default.nix
@@ -1,4 +1,4 @@
-{ lib, stdenv
+{ lib
 , pkgsBuildBuild
 , fetchurl
 , fetchpatch
@@ -17,6 +17,8 @@
 let
   inherit (lib) optional optionals optionalString;
 
+  stdenv = llvmPackages_12.stdenv;
+
   release_version = "12.0.1";
   candidate = ""; # empty or "rcN"
   dash-candidate = lib.optionalString (candidate != "") "-${candidate}";
@@ -48,7 +50,7 @@ in stdenv.mkDerivation (rec {
 
   outputs = [ "out" "lib" "dev" "python" ];
 
-  nativeBuildInputs = [ cmake python3 ]
+  nativeBuildInputs = [ cmake python3 llvmPackages_12.bintools ]
     ++ optionals enableManpages [ python3.pkgs.sphinx python3.pkgs.recommonmark ];
 
   buildInputs = [ ];
@@ -119,6 +121,8 @@ in stdenv.mkDerivation (rec {
   cmakeFlags = with stdenv; [
     "-DLLVM_INSTALL_CMAKE_DIR=${placeholder "dev"}/lib/cmake/llvm/"
     "-DCMAKE_BUILD_TYPE=${if debugVersion then "Debug" else "Release"}"
+    "-DLLVM_ENABLE_LTO=Full"
+    "-DLLVM_USE_LINKER=lld"
     "-DLLVM_BUILD_TESTS=${if stdenv.targetPlatform.isMinGW then "OFF" else "ON"}"
     "-DLLVM_HOST_TRIPLE=${stdenv.hostPlatform.config}"
     "-DLLVM_DEFAULT_TARGET_TRIPLE=${stdenv.hostPlatform.config}"

Poster
Owner

A few random ideas:

  1. Mark functions to be inlined explicitly with a @inline decorator.
  2. Compile @inline functions first and then make them accessible to all modules.
  3. Run expensive optimizations in parallel in each thread.
  4. Run only lighter optimizations in single-thread code after linking all the modules. Though unless we do (2) we probably want at least passes of inlining, constant propagation, and dead code elimination.
  5. To maximize parallel optimization, choose carefully what goes into each per-thread LLVM module using some heuristics e.g. based on call graphs. Probably should be done at the same time as #7

I'm just brainstorming here, not all of these may be good. Maybe topics for the next meeting.

> Run expensive optimizations in parallel in each thread.

The most expensive optimization currently is inlining and optimization for the inlined code. Functions are already optimized before the final global optimization.

> Run only lighter optimizations in single-thread code after linking all the modules.

Probably hard. I've tried doing this and the resulting binary looks quite different from the one with aggressive global optimization; I did not look into the details to see how the performance differs, though. Also, based on my previous experiments, enabling aggressive optimization for each function and then doing light global optimization seems to take more time (more time per function, no change in global optimization time) and produces a different binary (not sure if it is better or worse, I guess worse).

> To maximize parallel optimization, choose carefully what goes into each per-thread LLVM module using some heuristics e.g. based on call graphs.

Do you have any ideas about possible heuristics? We only generate functions that are actually called now, so I am not sure how we should do this based on call graphs.
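
One conceivable heuristic, sketched as hypothetical Rust (an illustration only, not NAC3 code; all names are made up): greedily pack functions into per-thread LLVM modules so that call-graph neighbours land in the same module and cross-module calls stay rare.

```rust
use std::collections::HashMap;

/// `calls` maps each generated function to the functions it calls;
/// `cost` is a rough per-function size estimate (e.g. number of IR instructions).
/// Returns groups of function names, one group per worker-thread module.
fn partition_by_call_graph(
    calls: &HashMap<String, Vec<String>>,
    cost: &HashMap<String, usize>,
    max_module_cost: usize,
) -> Vec<Vec<String>> {
    let mut modules: Vec<(Vec<String>, usize)> = Vec::new(); // (functions, total cost)
    let mut assigned: HashMap<&str, usize> = HashMap::new();
    for (func, callees) in calls {
        let c = *cost.get(func).unwrap_or(&1);
        // Prefer a module that already holds one of our callees and still has room.
        let target = callees
            .iter()
            .filter_map(|callee| assigned.get(callee.as_str()).copied())
            .find(|&i| modules[i].1 + c <= max_module_cost);
        let idx = match target {
            Some(i) => i,
            None => {
                modules.push((Vec::new(), 0));
                modules.len() - 1
            }
        };
        modules[idx].0.push(func.clone());
        modules[idx].1 += c;
        assigned.insert(func, idx);
    }
    modules.into_iter().map(|(funcs, _)| funcs).collect()
}
```

Whether such grouping actually beats the current scheme would of course need to be measured.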

I have an idea about inlining: mark certain code as performance critical and optimize that part (and all of its callees) aggressively, while for other functions (probably setup code, error reporting code, etc.) we do not perform global optimization. This is simple to implement right now.

We can also cache the codegen result for performance-insensitive code if we are certain that the referenced Python values did not change, which I think can be done. Performance-insensitive code might be library code which rarely changes, so it might be worth caching.

Caching the codegen result without avoiding global optimization would not save us much time, considering it is the global optimization part that takes most of our time. As for marking code as inline, I think users may not be aware that certain code is the bottleneck if they don't have experience writing high-performance code and lack profiling tools.

Poster
Owner

> The most expensive optimization currently is inlining and optimization for the inlined code. Functions are already optimized before the final global optimization.

Yes - this has to be combined with at least one of the other suggestions.

And indeed the kernel invariant feature is what causes the long compilation time. Disabling it can improve compilation time by about 30%.

Just listing out the things I've tried:

  1. Changing the memory allocator to mimalloc. Significant improvement.
  2. Using with optimize to selectively optimize code. Significant improvement, but I haven't yet tested if the optimized code is still fast enough to avoid RTIO underflow.
  3. Optimize the parser implementation. For number parsing, use a const generic for passing the radix. Not a very significant improvement.
  4. Use busy-polling with exponential backoff for channel receives. Reduced codegen time a bit, but not very significant.
  5. Use a faster hasher for all hashmaps/hashsets. Reduced overall compilation time a bit, but still not very significant. (A sketch follows this list.)
  6. Thin LTO. Not very effective.
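
A minimal sketch of item 5 (hypothetical, assuming the rustc-hash crate rather than whatever NAC3 actually uses): swap the default SipHash-based hasher for FxHash behind type aliases, so call sites stay unchanged.

```rust
use rustc_hash::{FxHashMap, FxHashSet};

// Aliases keep the rest of the code base oblivious to the hasher swap.
type SymbolMap<V> = FxHashMap<String, V>;
type SymbolSet = FxHashSet<String>;

fn example() {
    let mut symbols: SymbolMap<u32> = Default::default();
    symbols.insert("kernel_invariant".to_string(), 1);

    let mut seen: SymbolSet = SymbolSet::default();
    seen.insert("run".to_string());
    assert!(seen.contains("run"));
}
```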

It would be better if I had some more complicated code that takes a bit more time to compile, and perhaps hits some slow paths in our code.

Things left to try:

  1. Validate and improve the with optimize construct. It seems the most promising to me.
  2. Use named types for LLVM codegen. This may reduce memory allocation overhead in LLVM and improve locality. I have problems implementing this, as there would be a type mismatch if we use different names for the same type in the caller and the callee or initializer.
  3. Change the integer representation in the parser implementation. The current parser implementation uses BigInt, which requires memory allocation. There is significant overhead when parsing programs with a lot of numbers. We can probably use an enum to represent int32/int64/bigint, and make the common case (int32/int64) a lot faster.
  4. Benchmark on Windows. We should try this on Windows and see how it goes, as some of the optimizations may be system-dependent (thread locals, busy-polling, changing the memory allocator).
  5. Modify the parser implementation to reduce the number of clones. Profiling shows that the generated parser does a lot of cloning which may be unneeded (from manually inspecting the generated code). But I have no idea how to modify this (we would perhaps just need to change the RustPython parser definition, or might have to change LALRPOP; I don't have much insight into this).
Poster
Owner

Why should we support bigint at all in the parser? The compiler wouldn't know what to do with big numbers.

Poster
Owner

I suggest using int64 only which likely has no/negligible performance impact on modern PCs. The net performance impact could even be positive since there would be no need to deconstruct an enum.

> I suggest using int64 only which likely has no/negligible performance impact on modern PCs. The net performance impact could even be positive since there would be no need to deconstruct an enum.

We need to support bigint. There can be non-artiq code that uses bigint that has to be parsed correctly instead of throwing an error.

We can also optimize by modifying the tokenizer implementation to refer to string slices directly instead of pushing strings... I don't understand why they do this; it is extremely inefficient.
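
A tiny illustration of that idea (hypothetical code, not the actual RustPython lexer): have the tokenizer hand out slices borrowed from the source instead of freshly allocated Strings.

```rust
// Tokens borrow from the source text; no per-token String allocation.
enum Tok<'a> {
    Name(&'a str),
    Number(&'a str),
}

// Scan an identifier starting at `start` and return the borrowed token
// together with the index just past it.
fn lex_name(source: &str, start: usize) -> (Tok<'_>, usize) {
    let end = source[start..]
        .find(|c: char| !c.is_alphanumeric() && c != '_')
        .map_or(source.len(), |i| start + i);
    (Tok::Name(&source[start..end]), end)
}

fn example() {
    let src = "kernel_invariant = 1";
    let (tok, end) = lex_name(src, 0);
    assert!(matches!(tok, Tok::Name("kernel_invariant")));
    assert_eq!(end, 16);
}
```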

FYI: the parser code is a bit slow; it is spending around 20% of the time (parser time, not total time) in memcpy. It seems that the Stmt type is quite large (136 bytes), which causes much of the match code in the parser to generate memcpy calls.

Valgrind output:

  │   │     Total:     1,351,296 bytes (6.74%, 56,506.6/Minstr) in 4,968 blocks (6.42%, 207.74/Minstr), avg size 272 bytes
  │   │     Copied at {
  │   │       ^1: 0x4848B20: memmove (in /nix/store/8pq21lpkiprwpkjm2kg7dnjlj08amm23-valgrind-3.18.1/libexec/valgrind/vgpreload_dhat-amd64-linux.so)
  │   │       #2: 0x13ACF2: nac3parser::parser::parse (mod.rs:890)
  │   │       #3: 0x138DE2: nac3parser::parser::parse_program (parser.rs:24)
  │   │       #4: 0x11D838: nac3standalone::main (main.rs:69)
  │   │       #5: 0x1210D2: std::sys_common::backtrace::__rust_begin_short_backtrace (function.rs:227)
  │   │       #6: 0x121088: _ZN3std2rt10lang_start28_$u7b$$u7b$closure$u7d$$u7d$17h76b8f2d80ac1f3f3E.llvm.14983614383746848584 (rt.rs:63)
  │   │       #7: 0x2B9770: std::rt::lang_start_internal (in /home/pca006132/code/rust/nac3/target/release/nac3standalone)
  │   │       #8: 0x11DB17: main (in /home/pca006132/code/rust/nac3/target/release/nac3standalone)
  │   │     }

This indicates it is memmoving something like two Stmts at a time (avg. 272 bytes = 2 × 136) very frequently, causing parser slowness. I guess we can try to minimize the AST node size if needed.

We can also make the custom field mandatory... and modify the fold implementation to make it mutate in-place. This can probably give some performance improvement.
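
A self-contained sketch of the size/memmove trade-off being discussed (hypothetical types, not the actual nac3ast definitions); note that the follow-up further down found boxing to be a net loss here because of the extra allocations.

```rust
use std::mem::size_of;

// Stand-in for a large statement payload (the real Stmt is 136 bytes).
struct BigPayload {
    data: [u64; 16],
}

enum StmtUnboxed {
    Expr(BigPayload), // moving this variant by value memmoves the whole payload
    Pass,
}

enum StmtBoxed {
    Expr(Box<BigPayload>), // moving this only copies a pointer, at the price of an allocation
    Pass,
}

fn main() {
    println!("unboxed Stmt: {} bytes", size_of::<StmtUnboxed>());
    println!("boxed Stmt:   {} bytes", size_of::<StmtBoxed>());
}
```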

Poster
Owner

> We need to support bigint. There can be non-artiq code that uses bigint that has to be parsed correctly instead of throwing an error.

Option<i64> then?

> > We need to support bigint. There can be non-artiq code that uses bigint that has to be parsed correctly instead of throwing an error.
>
> Option<i64> then?

I think we should still parse it, as users can write bigints that we do not support. It only takes a very small amount of time now, so it should not be a performance issue.

Poster
Owner

Yes, parse it and put None in the AST if int64 would overflow. The compiler can then still emit a meaningful error message when it encounters None instead of Some(i64). Removing the bigint dependencies may also improve NAC3 build time and reduce bloat a little bit.
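
A minimal sketch of that scheme (hypothetical helper, not the nac3parser code): parse the literal and store None when it does not fit into an i64, leaving the error for the compiler to report.

```rust
// Hypothetical helper; the tokenizer is assumed to have already validated the digits,
// so an error from from_str_radix here can only mean the value overflows i64.
fn parse_int_literal(digits: &str, radix: u32) -> Option<i64> {
    i64::from_str_radix(digits, radix).ok()
}

fn example() {
    assert_eq!(parse_int_literal("123", 10), Some(123));
    // Too large for i64: the AST stores None and the compiler errors out later.
    assert_eq!(parse_int_literal("99999999999999999999", 10), None);
}
```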

Reducing the size of Stmt by boxing some of its fields does not work. It needs fewer memmoves, but more memory allocations, which is even worse.

Poster
Owner

Tabled as discussed in today's meeting.

sb10q removed this from the Prealpha milestone 2021-12-21 18:44:49 +08:00
sb10q removed the high-priority label 2021-12-21 18:44:51 +08:00
pca006132 was unassigned by sb10q 2021-12-21 18:46:51 +08:00
Poster
Owner

Some notes and results on PGO:

  • Test conditions: Core i9-9900K, ARTIQ 51cb8adba092850a826af0817f6db590b7f08151 + NAC3 9cc9a0284a
  • Profile data collected on Ryzen 9 5950X
  • PGO only works with LLVM compiled with Clang. GCC builds are easier but produce some *.gcda files instead of the *.profraw files that LLVM wants.
  • LLVM won't build with Clang for mingw/Windows. CMake attempts to use -fPIC and fails, even when LLVM_ENABLE_PIC is turned OFF.
  • Unlike GCC instrumented binaries, LLVM's profiler with compiler-rt is a pain in the neck to use outside of pure Clang builds, such as when rustc is involved. Enabling instrumentation on the Rust side does not enable it on the C++ side. So, there are some slightly unsavory hacks to force linking NAC3 against compiler-rt and initialization of the profiler.
  • PGO builds involve recompiling LLVM at each NAC3 change. We probably want that only on release branches when we have them.
  • PGO data collection is simply done with python demo.py. We probably want a better sample.
  • Only LLVM PGO is done, PGO is not enabled for Rust code.
  • All relevant code is committed to flake.nix. Simply use the nac3artiq-pgo package to enable PGO.

Original:

> time artiq_compile nac3devices.py
________________________________________________________
Executed in  280.33 millis    fish           external
   usr time  273.08 millis    0.00 micros  273.08 millis
   sys time   26.14 millis  163.00 micros   25.98 millis

LLVM compiled with Clang, no PGO:

> time artiq_compile nac3devices.py
________________________________________________________
Executed in  279.58 millis    fish           external
   usr time  274.44 millis  125.00 micros  274.31 millis
   sys time   24.14 millis   31.00 micros   24.10 millis


LLVM compiled with Clang + PGO:

> time artiq_compile nac3devices.py
________________________________________________________
Executed in  258.85 millis    fish           external
   usr time  249.79 millis  122.00 micros  249.67 millis
   sys time   26.69 millis   34.00 micros   26.65 millis

In these results, about 170ms of the total execution time is spent on CPython and one-time initialization. The NAC3 compilation contribution would be roughly 110ms/110ms/89ms.

So, PGO works, at least on Linux!

Poster
Owner

> Changing memory allocator to mimalloc. Significant improvement.

@pca006132 How did you enable mimalloc? Is there a reason not to use it all the time?

> > Changing memory allocator to mimalloc. Significant improvement.
>
> @pca006132 How did you enable mimalloc?

https://github.com/microsoft/mimalloc/#dynamic-override

diff --git a/flake.nix b/flake.nix
index b1876254..98840731 100644
--- a/flake.nix
+++ b/flake.nix
@@ -315,6 +315,7 @@
           packages.x86_64-linux.vivado
           packages.x86_64-linux.openocd-bscanspi
         ];
+        LD_PRELOAD = "${pkgs.mimalloc}/lib/libmimalloc.so";
       };
 
       hydraJobs = {

> Is there a reason not to use it all the time?

Probably not, but the gain from using an alternate memory allocator might vary depending on the OS.
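
As a side note, a hypothetical alternative on the Rust side (not what was done in this thread), assuming the mimalloc crate: make mimalloc the global allocator for the Rust code only. This would not cover allocations made inside the LLVM C++ libraries, which is exactly the part that matters here, hence the LD_PRELOAD/override route above.

```rust
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // All Rust-side heap allocations now go through mimalloc;
    // allocations made inside the LLVM C++ libraries are unaffected.
    let v: Vec<u8> = Vec::with_capacity(1024);
    drop(v);
}
```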

Poster
Owner

Tried this - compiles but segfaults. Function list taken from https://github.com/microsoft/mimalloc/blob/752594e76423526e108413731518a26e3322b9ca/include/mimalloc-override.h

diff --git a/flake.nix b/flake.nix
index 1e37b47..1bd98bb 100644
--- a/flake.nix
+++ b/flake.nix
@@ -46,6 +46,15 @@
           suppress_build_script_link_lines=false
           '';
       };
+      mimalloc-link = (
+        (pkgs.lib.concatMapStrings (f: "-C link-arg=-Wl,--wrap=${f} -C link-arg=-Wl,--defsym=__wrap_${f}=mi_${f} ")
+          ["malloc" "calloc" "realloc" "free"
+           "strdup" "strndup" "realpath"
+           "reallocf" "malloc_size" "malloc_usable_size" "cfree"
+           "valloc" "pvalloc" "reallocarray" "memalign" "aligned_alloc" "posix_memalign"])
+        + "-C link-arg=-Wl,--wrap=_posix_memalign -C link-arg=-Wl,--defsym=__wrap__posix_memalign=mi_posix_memalign "
+        + "-C link-arg=${pkgs.mimalloc}/lib/mimalloc.o -C link-arg=-lc"
+      );
     in rec {
       packages.x86_64-linux = rec {
         llvm-nac3 = pkgs.callPackage "${self}/llvm" {};
@@ -58,6 +67,10 @@
             buildInputs = [ pkgs.python3 llvm-nac3 ];
             cargoBuildFlags = [ "--package" "nac3artiq" ];
             cargoTestFlags = [ "--package" "nac3ast" "--package" "nac3parser" "--package" "nac3core" "--package" "nac3artiq" ];
+            configurePhase =
+              ''
+              export RUSTFLAGS="${mimalloc-link}"
+              '';
             installPhase =
               ''
               TARGET_DIR=$out/${pkgs.python3Packages.python.sitePackages}
@@ -83,7 +96,7 @@
             doCheck = false;
             configurePhase =
               ''
-              export CARGO_TARGET_X86_64_UNKNOWN_LINUX_GNU_RUSTFLAGS="-C link-arg=-L${pkgs.llvmPackages_13.compiler-rt}/lib/linux -C link-arg=-lclang_rt.profile-x86_64"
+              export RUSTFLAGS="-C link-arg=-L${pkgs.llvmPackages_13.compiler-rt}/lib/linux -C link-arg=-lclang_rt.profile-x86_64 ${mimalloc-link}"
               '';
             installPhase =
               ''
@@ -119,6 +132,10 @@
             buildInputs = [ pkgs.python3 llvm-nac3-pgo ];
             cargoBuildFlags = [ "--package" "nac3artiq" ];
             cargoTestFlags = [ "--package" "nac3ast" "--package" "nac3parser" "--package" "nac3core" "--package" "nac3artiq" ];
+            configurePhase =
+              ''
+              export RUSTFLAGS="${mimalloc-link}"
+              '';
             installPhase =
               ''
               TARGET_DIR=$out/${pkgs.python3Packages.python.sitePackages}
Program received signal SIGSEGV, Segmentation fault.
0x00007fff610dda9f in mi_free_generic () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
(gdb) backtrace
#0  0x00007fff610dda9f in mi_free_generic () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
#1  0x00007fff5efdec7d in _GLOBAL__sub_I_ConstraintElimination.cpp () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
#2  0x00007fff61fc0420 in ?? () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
#3  0x00000391220d0020 in ?? ()
#4  0x00007fffffff64e0 in ?? ()
#5  0x00007fffffff64f0 in ?? ()
#6  0x00007fffffff6500 in ?? ()
#7  0x00007fffffff64c8 in ?? ()
#8  0x00007fffffff64d0 in ?? ()
#9  0x00000391220e0070 in ?? ()
#10 0x000000000084a5c0 in ?? ()
#11 0x0000000000000010 in ?? ()
#12 0x0000039122030080 in ?? ()
#13 0x0000000000000010 in ?? ()
#14 0x000000000084a5c0 in ?? ()
#15 0x0000000000000010 in ?? ()
#16 0x0000000000000010 in ?? ()
#17 0x00007fff61fc0420 in ?? () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
#18 0x0000000000814b60 in ?? ()
#19 0x0000000000000028 in ?? ()
#20 0x0000000000000028 in ?? ()
#21 0x00007fff60804414 in llvm::cl::Option::addArgument() [clone .localalias] () from /home/sb/nac3/nac3artiq/demo/nac3artiq.so
#22 0x0000000000000000 in ?? ()
Poster
Owner

With mimalloc (dynamically loaded) + PGO:

$ time artiq_compile nac3devices.py 

real	0m0.249s
user	0m0.232s
sys	0m0.029s

(for some reason, the fish time command is broken with mimalloc, so it's using bash)

Poster
Owner

Alternative approach for enabling mimalloc by patching rustc:

diff --git a/flake.nix b/flake.nix
index 1e37b47..4965b91 100644
--- a/flake.nix
+++ b/flake.nix
@@ -5,7 +5,19 @@
 
   outputs = { self, nixpkgs }:
     let
-      pkgs = import nixpkgs { system = "x86_64-linux"; };
+      pkgs = import nixpkgs {
+        system = "x86_64-linux";
+        overlays = [
+          (self: super: {
+            rustc = super.rustc.overrideAttrs(oa: {
+              patchPhase = (oa.patchPhase or "") + ''
+                substituteInPlace compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs \
+                  --replace 'push("-m64".to_string())' 'extend(["-m64".to_string(), "${pkgs.mimalloc}/lib/mimalloc.o".to_string()])'
+              '';
+            });
+          })
+        ];
+      };
       pkgs-mingw = import nixpkgs {
         system = "x86_64-linux";
         crossSystem = { config = "x86_64-w64-mingw32"; libc = "msvcrt"; };

This causes the stage2 rustc build to fail:

   Compiling rustdoc v0.0.0 (/build/rustc-1.56.1-src/src/librustdoc)
error: /build/rustc-1.56.1-src/build/x86_64-unknown-linux-gnu/stage1-tools/release/deps/libpest_derive-489389269d71b50b.so: cannot allocate memory in static TLS block
  --> src/librustdoc/html/layout.rs:58:17
   |
58 |     templates: &tera::Tera,
   |                 ^^^^

error: could not compile `rustdoc` due to previous error
warning: build failed, waiting for other jobs to finish...
error: build failed


command did not execute successfully: "/nix/store/w8r7zj2bp8avhsdf119ngb93i23fzjsh-cargo-bootstrap-1.55.0/bin/.cargo-wrapped" "build" "--target" "x86_64-unknown-linux-gnu" "-Zbinary-dep-depinfo" "-j>
expected success, got: exit status: 101

https://github.com/microsoft/mimalloc/issues/147

Do we really need to patch rustc just to link a library? Does https://github.com/microsoft/mimalloc#static-override work on Linux?

Poster
Owner

> Do we really need to patch rustc just to link a library?

Yes we do. Sadly, rust/cargo has very limited options to modify the linker command. That rustc patch is simply implementing this static override technique by inserting the object file at the beginning of the linker invocation...

> > Do we really need to patch rustc just to link a library?
>
> Yes we do. Sadly, rust/cargo has very limited options to modify the linker command. That rustc patch is simply implementing this static override technique by inserting the object file at the beginning of the linker invocation...

But the object file you linked is not the one that does the override. You should link mimalloc-override.o.

Poster
Owner

> But the object file you linked is not the one that does the override. You should link mimalloc-override.o.

mimalloc-override.o isn't in the nixpkgs outputs. It seems it does override anyway? Otherwise why would the TLS problem (https://github.com/microsoft/mimalloc/issues/147) manifest itself in rustc stage2? I'll take a closer look though.

I have figured out the problem - one bonus of the rustc rebuild is you'll get mimalloc in rustc (stage2) and hopefully faster build times :)

Poster
Owner
diff --git a/flake.nix b/flake.nix
index 1e37b47..cedd183 100644
--- a/flake.nix
+++ b/flake.nix
@@ -5,7 +5,24 @@
 
   outputs = { self, nixpkgs }:
     let
-      pkgs = import nixpkgs { system = "x86_64-linux"; };
+      pkgs = import nixpkgs {
+        system = "x86_64-linux";
+        overlays = [
+          (self: super: {
+            # unbreak rustc stage2 build (https://github.com/microsoft/mimalloc/issues/147)
+            mimalloc = super.mimalloc.overrideAttrs(oa: {
+              cmakeFlags = super.mimalloc.cmakeFlags ++ [ "-DMI_LOCAL_DYNAMIC_TLS=ON" ];
+            });
+            # you wish rustc would let you specify the linker command...
+            rustc = super.rustc.overrideAttrs(oa: {
+              patchPhase = (oa.patchPhase or "") + ''
+                substituteInPlace compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs \
+                  --replace 'push("-m64".to_string())' 'extend(["-m64".to_string(), "${pkgs.mimalloc}/lib/mimalloc.o".to_string()])'
+              '';
+            });
+          })
+        ];
+      };
       pkgs-mingw = import nixpkgs {
         system = "x86_64-linux";
         crossSystem = { config = "x86_64-w64-mingw32"; libc = "msvcrt"; };

python demo.py still segfaults though.

> > Do we really need to patch rustc just to link a library?
>
> Yes we do. Sadly, rust/cargo has very limited options to modify the linker command. That rustc patch is simply implementing this static override technique by inserting the object file at the beginning of the linker invocation...

Interesting, do we really need to put that at the beginning of the linker command? glibc malloc and free should be weak, so putting mimalloc at the end of the linker command should still work?

Poster
Owner

Maybe it works, but I suspect the linker is dumber than you think. When doing that, you even need to add another -lc at the end, otherwise mimalloc doesn't find atexit.
I'm at a loss as to why Python segfaults, so I'm just following the mimalloc instructions carefully...

Poster
Owner

> You should link mimalloc-override.o.

https://github.com/microsoft/mimalloc/issues/80

So I'm using the override correctly.
I also checked that rustc does invoke the linker as per mimalloc's recommendations.

Poster
Owner

Python seems to segfault because dynamically loaded parts of nac3artiq such as libstdc++ still use the old malloc, and pointers get passed around between allocators.

Using patchelf on nac3artiq.so to dynamically link mimalloc before libc does NOT work because Python has already loaded the libc malloc when it loads nac3artiq - so the malloc calls resolve to libc. It would work on nac3standalone.

Linking statically with Rust (to remove annoying DSOs like libstdc++) is only possible with musl libc. Surprisingly, building things with musl (LLVM, rustc, ...) works out-of-the-box with nixpkgs - it just takes time as a lot of dependencies are not on cache.nixos.org. I find glibc to be a fairly disgusting piece of software so I'm happy to switch to musl, which claims to address my concerns with glibc.
But it seems there are issues with thread-local storage (TLS) when mixing libcs, which result in memory corruption/crashes. Are we directly or indirectly using TLS anywhere in NAC3?
A more practical issue is Rust also refuses to build a statically-linked/freestanding DSO, we need to make it build a static library and then use some linker hacks to convert it into a DSO. Or write another rustc patch so it stops complaining about cdylib not being a legit crate type with musl.

Poster
Owner

We definitely need musl because of this glibc nonsense: https://stackoverflow.com/questions/57476533/why-is-statically-linking-glibc-discouraged#57478728

Current status:

diff --git a/flake.nix b/flake.nix
index 1e37b47..e22630c 100644
--- a/flake.nix
+++ b/flake.nix
@@ -6,6 +6,10 @@
   outputs = { self, nixpkgs }:
     let
       pkgs = import nixpkgs { system = "x86_64-linux"; };
+      pkgs-musl = import nixpkgs {
+        system = "x86_64-linux";
+        crossSystem = { config = "x86_64-unknown-linux-musl"; };
+      };
       pkgs-mingw = import nixpkgs {
         system = "x86_64-linux";
         crossSystem = { config = "x86_64-w64-mingw32"; libc = "msvcrt"; };
@@ -48,9 +52,9 @@
       };
     in rec {
       packages.x86_64-linux = rec {
-        llvm-nac3 = pkgs.callPackage "${self}/llvm" {};
+        llvm-nac3 = pkgs-musl.callPackage "${self}/llvm" {};
         nac3artiq = pkgs.python3Packages.toPythonModule (
-          pkgs.rustPlatform.buildRustPackage {
+          pkgs-musl.rustPlatform.buildRustPackage {
             name = "nac3artiq";
             src = self;
             cargoLock = { lockFile = ./Cargo.lock; };
@@ -58,11 +62,37 @@
             buildInputs = [ pkgs.python3 llvm-nac3 ];
             cargoBuildFlags = [ "--package" "nac3artiq" ];
             cargoTestFlags = [ "--package" "nac3ast" "--package" "nac3parser" "--package" "nac3core" "--package" "nac3artiq" ];
+            # HACK: Wrap the linker to force static linking, and use -crt-static to trick rustc into thinking it's linking
+            # dynamically, so it accepts to build a cdylib crate.
+            configurePhase =
+              ''
+              cat << EOF > hacked-cc
+              #!${pkgs.bash}/bin/bash
+              set -e
+
+              declare -a finalopts
+              finalopts=()
+              for o in "\$@"; do
+                  if [ "\$o" = "-lgcc_s" ] || [ "\$o" = "-Wl,-Bdynamic" ] ; then
+                      continue
+                  fi
+                  finalopts+=("\$o")
+              done
+
+              exec ${pkgs-musl.stdenv.cc}/bin/x86_64-unknown-linux-musl-cc -Wl,-Bstatic -L${pkgs-musl.zlib.static}/lib "\''\${finalopts[@]}"
+              EOF
+              chmod +x hacked-cc
+
+              cat hacked-cc
+
+              export CARGO_TARGET_X86_64_UNKNOWN_LINUX_MUSL_LINKER=`pwd`/hacked-cc
+              export RUSTFLAGS="-C target-feature=-crt-static"
+              '';
             installPhase =
               ''
               TARGET_DIR=$out/${pkgs.python3Packages.python.sitePackages}
               mkdir -p $TARGET_DIR
-              cp target/x86_64-unknown-linux-gnu/release/libnac3artiq.so $TARGET_DIR/nac3artiq.so
+              cp target/x86_64-unknown-linux-musl/release/libnac3artiq.so $TARGET_DIR/nac3artiq.so
               '';
           }
         );
> ldd result/lib/python3.9/site-packages/nac3artiq.so
	statically linked
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import nac3artiq
Segmentation fault (core dumped)
Poster
Owner

When additionally linking mimalloc, the crash trace is:

0x00007fffe8f3d78b in __vdsosym () from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
(gdb) backtrace
#0  0x00007fffe8f3d78b in __vdsosym ()
   from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
#1  0x00007fffe8f3a3ca in cgt_init ()
   from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
#2  0x00007fffe8f3a40d in clock_gettime ()
   from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
#3  0x00007fffe6dfe436 in mi_heap_main_init.part ()
   from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
#4  0x00007fffe6d82e04 in _mi_process_init ()
   from /nix/store/b4gm0fv517dq4w2nh43r53s2a1vnrhf7-nac3artiq-x86_64-unknown-linux-musl/lib/python3.9/site-packages/nac3artiq.so
#5  0x00007ffff7fdbbce in call_init () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2
#6  0x00007ffff7fdbcb4 in _dl_init () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2
#7  0x00007ffff791cf5c in _dl_catch_exception () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#8  0x00007ffff7fdff9c in dl_open_worker () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2
#9  0x00007ffff791cf15 in _dl_catch_exception () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#10 0x00007ffff7fdf7bd in _dl_open () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/ld-linux-x86-64.so.2
#11 0x00007ffff7bc0236 in dlopen_doit () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libdl.so.2
#12 0x00007ffff791cf15 in _dl_catch_exception () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#13 0x00007ffff791cfaf in _dl_catch_error () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libc.so.6
#14 0x00007ffff7bc08e9 in _dlerror_run () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libdl.so.2
#15 0x00007ffff7bc02b6 in dlopen@@GLIBC_2.2.5 () from /nix/store/vjq3q7dq8vmc13c3py97v27qwizvq7fd-glibc-2.33-59/lib/libdl.so.2
...

https://git.musl-libc.org/cgit/musl/tree/src/internal/vdso.c#n43

Seems we're still not out of the woods with this ELF/Linux dynamic linker mess. I'm impressed by how crappy it is.

Poster
Owner

See this musl mailing list thread: https://www.openwall.com/lists/musl/2022/01/01/1

The solution to the MWE I posted there is:

#include <stdio.h>
#include <sys/time.h>

void __init_libc(char **envp, char *pn);

typedef void (*lsm2_fn)();

static void libc_start_main_stage2() {
  struct timeval tv;
  gettimeofday(&tv, NULL);
  printf("%ld\n", tv.tv_sec);  
}

void test_func(int argc, char **argv) {
  char **envp = argv+argc+1;
  __init_libc(envp, argv[0]);
  lsm2_fn stage2 = libc_start_main_stage2;
  __asm__ ( "" : "+r"(stage2) : : "memory" );
  stage2();
}

(and pass argc and argv to the library of course)

But I'm not sure how much more of this kind of crap we would have to deal with until a statically-linked nac3artiq works.

Other options: (1) LD_PRELOAD (but it would have to be in every user's nix-shell) (2) patchelf on the Python interpreter (3) maybe get LLVM to use the mi_ functions, though with cmake and C++ it looks like it'll be a pain.

Poster
Owner

As expected from something that uses cmake, https://reviews.llvm.org/D101427 does not work and it continues to use the glibc allocator.
In fact this mimalloc support in LLVM seems pretty silly, all it does is add the mimalloc sources to the LLVM build system (as if it wasn't complicated enough) and then attempt to use this flimsy symbol interposition hack to enable it.

Poster
Owner

Enabled mimalloc via LD_PRELOAD and the Nix Python wrapper. Seems fine, with good usability.

Reference: M-Labs/nac3#125