performance tracking #13
Reference: M-Labs/nac3#13
One data point for now.
Generated a synthetic workload with:

The result takes 6.3 s to compile on my i9-9900K with 8 codegen threads configured, and 10.3 s with one codegen thread. NAC3 commit: `35a94a8fc0`.
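The generator script itself is not reproduced in the thread. As a rough sketch only (hypothetical, not the author's actual script; the `@kernel`/`int32` names assume ARTIQ-style NAC3 input), a workload of many small monomorphic kernels could be emitted like this:

```rust
// Hypothetical workload generator (the script used in this issue is not
// shown). It emits many small, monomorphic NAC3/Python kernels so that
// parsing, type inference, and code generation dominate compile time.
fn gen_workload(n_funcs: usize) -> String {
    let mut src = String::new();
    for i in 0..n_funcs {
        // Each function is monomorphic: concrete int32 signature, no type vars.
        src.push_str(&format!(
            "@kernel\ndef f{i}(a: int32, b: int32) -> int32:\n    return a * b + {i}\n\n"
        ));
    }
    src
}

fn main() {
    let src = gen_workload(3);
    assert_eq!(src.matches("@kernel").count(), 3);
    print!("{src}");
}
```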
This is NOT a high-priority topic (it's already a huge improvement over the current compiler!). Let's work on this substantially only after we can run all existing ARTIQ drivers on NAC3.
A few things to be done here:

- String interning, which is done in the optimization branch. One thing to note is that the current implementation uses a static string interner (something like a hashmap that also supports inverse lookup) behind a mutex. A thread-local cache can speed up access (it reduces calls to `lock`) and reduce lock contention once we do multi-threaded parsing, but this is only faster when the unstable `#[thread_local]` attribute is enabled, and slower otherwise. It also requires a modified version of `rustpython_parser`.
- Avoid cloning some structs by wrapping them in an `Arc`. This is currently done in the optimization branch.
- In `nac3core/src/codegen/mod.rs`: skip `TCall` for monomorphic functions, e.g. integer additions. This can be done by checking whether the function and class contain type variables. `TCall` is mainly used for aggregating calls for later unification with a `TFunc` and for telling the codegen which function instance to pick, neither of which is needed for monomorphic functions. This is also implemented in the optimization branch.

With the above optimizations, the time required for parsing is mostly unchanged (actually slightly faster in a simple single-threaded benchmark), while type inference and code generation are faster by about 30–50%.
Some additional things to be done:
For demonstration, consider the slightly modified version of the script above,
The time required for the standalone branch is , while the time required for the optimization branch is .
Now implemented in https://git.m-labs.hk/M-Labs/nac3/src/branch/optimization (the additional things mentioned previously are not implemented yet).
For the source generated by this script:
The running time of the current master (`a508baae20`) is , while the running time of the optimized branch (`105d605e6d`) is . Both are using 4 cores, IIRC.
Further optimized a bit...

Your synthetic workload would now take 996 ms on 1 core, and 616 ms on 4 cores.

My synthetic workload would now take 6078 ms on 4 cores, instead of 12608 ms (double the time) in the previous commit.

The flamegraph looks pretty good now; hard to optimize further (except parallel type checking...).
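Parallel type checking, mentioned above as the remaining opportunity, could in principle look like the following sketch. It is hypothetical, not NAC3 code, and assumes each function body can be inferred independently after a sequential pass over global signatures; `check_one` is a stand-in for the real inference pass:

```rust
use std::thread;

// Stand-in for real per-function type inference: here a body is
// "well-typed" iff it is non-empty.
fn check_one(body: &str) -> bool {
    !body.is_empty()
}

// Check independent function bodies on separate scoped threads.
fn check_all(bodies: &[&str]) -> Vec<bool> {
    thread::scope(|s| {
        let handles: Vec<_> = bodies
            .iter()
            .map(|b| s.spawn(move || check_one(b)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}

fn main() {
    let results = check_all(&["return 1", "", "pass"]);
    assert_eq!(results, vec![true, false, true]);
}
```

A real implementation would need the unification state to be shareable (or partitioned) across threads, which is exactly why this is harder than the other optimizations.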