performance tracking #13

Open
opened 2021-09-20 12:42:13 +08:00 by sb10q · 4 comments

One data point for now.

Generated synthetic workload with:

```python
ndummy = 100

print("class X:")
print("    def __init__(self):")
print("        1")
for j in range(ndummy):
    print("    def dummy{}(self) -> int32:".format(j))

    print("        a0 = 1")
    for i in range(1000):
        print("        a{} = a{} + 1".format(i+1, i))

    print("        return 0")


print("def run() -> int32:")
print("    x = X()")
print("    " + "; ".join("dummy{} = x.dummy{}()".format(j, j) for j in range(ndummy))) # work around issue 12
print("    return " + " + ".join("dummy{}".format(j) for j in range(ndummy)))
```

The result takes 6.3s to compile on my i9-9900K with 8 codegen threads configured, and 10.3s with one codegen thread (NAC3 35a94a8fc0).

This is NOT a high priority topic (it's already a huge improvement over the current compiler!). Let's substantially work on this only after we can run all existing ARTIQ drivers on NAC3.


A few things to be done here:

  1. String interning, which is done in the optimization branch. One thing to note is that the current implementation uses a static string interner (essentially a hashmap that also supports inverse lookup) behind a mutex. A thread-local cache can speed up access (it reduces calls to `lock`) and reduce lock contention once we do multi-threaded parsing, but this is only faster when the unstable `#[thread_local]` attribute is enabled and [slower otherwise](https://matklad.github.io/2020/10/03/fast-thread-locals-in-rust.html). This requires a modified version of `rustpython_parser`. (A minimal sketch of the mutex-plus-thread-local-cache scheme is included after this list.)

  2. Avoid cloning some structs by wrapping them in an `Arc`; this is currently done in the optimization branch in `nac3core/src/codegen/mod.rs`:

     ```diff
     @@ -187,8 +187,8 @@ pub struct CodeGenTask {
          pub subst: Vec<(Type, Type)>,
          pub symbol_name: String,
          pub signature: FunSignature,
     -    pub body: Vec<Stmt<Option<Type>>>,
     -    pub calls: HashMap<CodeLocation, CallId>,
     +    pub body: Arc<Vec<Stmt<Option<Type>>>>,
     +    pub calls: Arc<HashMap<CodeLocation, CallId>>,
     ```
  3. Avoid generating `TCall` for monomorphic functions, e.g. integer additions. This can be done by checking whether the function and its class contain type variables. `TCall` is mainly used to aggregate calls for later unification with a `TFunc` and to tell the codegen which function instance to pick, which is not needed for monomorphic functions. This is also implemented in the optimization branch. (A toy version of this monomorphism check is sketched after this list.)
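
Regarding item 1: the following is not NAC3's actual code, just a minimal sketch of the mutex-plus-thread-local-cache scheme, assuming a `StrRef(u32)` handle type and the `lazy_static` crate for the global table. It uses the stable `thread_local!` macro; as noted above, the fast path only really pays off with the unstable `#[thread_local]` attribute.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::sync::Mutex;

use lazy_static::lazy_static; // assumption: lazy_static (or once_cell) is available

/// Interned string handle; cheap to copy, compare and hash.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct StrRef(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<String, StrRef>, // forward lookup: string -> handle
    strings: Vec<String>,         // inverse lookup: handle -> string
}

impl Interner {
    fn intern(&mut self, s: &str) -> StrRef {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = StrRef(self.strings.len() as u32);
        self.strings.push(s.to_owned());
        self.map.insert(s.to_owned(), id);
        id
    }
}

lazy_static! {
    // Global interner; every cache miss takes this lock.
    static ref INTERNER: Mutex<Interner> = Mutex::new(Interner::default());
}

thread_local! {
    // Per-thread cache of already-interned strings, so repeated identifiers
    // (the common case while parsing) avoid the global lock entirely.
    static CACHE: RefCell<HashMap<String, StrRef>> = RefCell::new(HashMap::new());
}

pub fn intern(s: &str) -> StrRef {
    CACHE.with(|cache| {
        if let Some(&id) = cache.borrow().get(s) {
            return id;
        }
        let id = INTERNER.lock().unwrap().intern(s);
        cache.borrow_mut().insert(s.to_owned(), id);
        id
    })
}
```

A fresh thread pays the mutex cost once per distinct string and afterwards hits only its private cache.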
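
For item 3, the decision boils down to a "does this signature still mention a type variable?" predicate. This is a toy illustration with a made-up `Type` enum, not NAC3's real type representation:

```rust
// Toy illustration only: NAC3's real Type/TCall machinery lives in its
// unification table; this just shows the shape of the monomorphism check.
#[derive(Clone)]
enum Type {
    Primitive(&'static str),                  // e.g. "int32", "float"
    Var(u32),                                 // an unresolved type variable
    Func { args: Vec<Type>, ret: Box<Type> }, // a function signature
}

/// Does this type still mention any type variable?
fn contains_type_var(ty: &Type) -> bool {
    match ty {
        Type::Var(_) => true,
        Type::Primitive(_) => false,
        Type::Func { args, ret } => {
            args.iter().any(contains_type_var) || contains_type_var(ret)
        }
    }
}

/// Only polymorphic callees need a TCall recorded for later unification with
/// a TFunc; a fully concrete signature can be resolved on the spot.
fn needs_tcall(signature: &Type) -> bool {
    contains_type_var(signature)
}
```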

With the above optimizations, the time required for parsing does not change much (it is actually slightly faster in a simple single-threaded benchmark), while type inference and code generation are faster by about 30~50%.

Some additional things to be done:

  1. Parallel type inference. This can be done by cloning the unification table, after parsing the top-level definitions, into a number of thread-local unification tables and running type inference in those threads (see the sketch after this list). Note that class constructors need additional care, as their unification should be visible in all threads (we can extract these and do them in the top-level parsing stage, or repeat the unification in every thread).
  2. Track the time spent in each stage of the compiler, so we can see which part is slow (a minimal timing sketch also follows this list). This is also implemented in the optimization branch.
  3. Store the list of function calls in the function instance, so that during code generation we can immediately queue the required function instances instead of waiting until we reach the function call expression. This enables better parallelization.
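
The parallel type inference of item 1 could look roughly like the sketch below. The data structures are hypothetical stand-ins (NAC3's real unification table and function definitions look nothing like this), and constructor unification is assumed to have been handled in `base` before the clone, as discussed above.

```rust
use std::collections::HashMap;
use std::thread;

// Hypothetical, heavily simplified stand-ins for the real data structures.
#[derive(Clone, Default)]
struct UnificationTable {
    bindings: HashMap<u32, String>, // type variable id -> resolved type (as text here)
}

struct FunctionDef {
    name: String,
    // signature, body, ... omitted
}

fn infer_function(table: &mut UnificationTable, def: &FunctionDef) {
    // Run type inference for one function body against this thread's private table.
    let _ = (table, def);
}

/// Sketch of the proposed scheme: after top-level parsing has filled `base`
/// (including all constructor-related unification), clone it into each worker
/// thread and infer function bodies in parallel.
fn parallel_inference(base: &UnificationTable, defs: Vec<FunctionDef>, workers: usize) {
    let handles: Vec<_> = split_into(defs, workers)
        .into_iter()
        .map(|chunk| {
            let mut table = base.clone(); // thread-local copy of the unification table
            thread::spawn(move || {
                for def in &chunk {
                    infer_function(&mut table, def);
                }
                table // hand the specialised table back for codegen
            })
        })
        .collect();
    for handle in handles {
        let _table = handle.join().unwrap();
    }
}

// Round-robin split of the work items over the worker threads.
fn split_into(defs: Vec<FunctionDef>, n: usize) -> Vec<Vec<FunctionDef>> {
    let n = n.max(1);
    let mut chunks: Vec<Vec<FunctionDef>> = (0..n).map(|_| Vec::new()).collect();
    for (i, def) in defs.into_iter().enumerate() {
        chunks[i % n].push(def);
    }
    chunks
}
```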
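
Item 2 is straightforward with `std::time::Instant`. A minimal sketch, with placeholder stage functions and an output format matching the measurements quoted later in this thread:

```rust
use std::time::Instant;

// Placeholder stage functions standing in for the real compiler passes.
fn setup() {}
fn parse() {}
fn analyze() {}
fn codegen() {}

fn main() {
    let t0 = Instant::now();
    setup();
    let t1 = Instant::now();
    parse();
    let t2 = Instant::now();
    analyze();
    let t3 = Instant::now();
    codegen();
    let t4 = Instant::now();

    println!("setup time: {}ms", (t1 - t0).as_millis());
    println!("parse time: {}ms", (t2 - t1).as_millis());
    println!("analysis time: {}ms", (t3 - t2).as_millis());
    println!("codegen time (including LLVM): {}ms", (t4 - t3).as_millis());
    println!("total time: {}ms", (t4 - t0).as_millis());
}
```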

For demonstration, consider this slightly modified version of the script above:

```python
ndummy = 100

for j in range(ndummy):
    print("def dummy{}() -> int32:".format(j))

    print("    a0 = 1")
    for i in range(1000):
        print("    a{} = a{} + 1".format(i+1, i))

    print("    return 0")


print("def run() -> int32:")
print("    " + "; ".join("v{} = dummy{}()".format(j, j) for j in range(ndummy))) # work around issue 12
print("    return " + " + ".join("v{}".format(j) for j in range(ndummy)))
```

The time required for the standalone branch is

```
cargo run --release  7.42s user 1.75s system 100% cpu 9.164 total
```

while the time required for the optimization branch is

```
cargo run --release  3.46s user 0.51s system 100% cpu 3.965 total
```

Now implemented in https://git.m-labs.hk/M-Labs/nac3/src/branch/optimization

(additional things mentioned previously are not implemented yet)

For the source generated by this script:

```python
ndummy = 100

print("class X:")
print("    def __init__(self):")
print("        1")
for j in range(ndummy):
    print("    def dummy{}(self) -> int32:".format(j))

    print("        a0 = 1")
    for i in range(10000):
        print("        a{} = a{} + 1".format(i+1, i))

    print("        return 0")


print("def run() -> int32:")
print("    x = X()")
print("    " + "; ".join("dummy{} = x.dummy{}()".format(j, j) for j in range(ndummy))) # work around issue 12
print("    return " + " + ".join("dummy{}".format(j) for j in range(ndummy)))
```

The running time of the current master (a508baae20) is

```
setup time: 15ms
parse time: 3439ms
analysis time: 27035ms
codegen time (including LLVM): 52904ms
total time: 83395ms
```

While the running time of the optimized branch (105d605e6d) is

```
setup time: 16ms
parse time: 3212ms
analysis time: 3382ms
codegen time (including LLVM): 5996ms
total time: 12608ms
```

Both runs used 4 cores, IIRC.


Further optimized a bit...
Your synthetic workload now takes 996ms to compile on 1 core, and 616ms on 4 cores.

```
setup time: 2ms
parse time: 322ms
analysis time: 132ms
codegen time (including LLVM): 539ms
total time: 996ms

setup time: 2ms
parse time: 315ms
analysis time: 130ms
codegen time (including LLVM): 168ms
total time: 616ms
```

My synthetic workload now takes 6078ms to compile on 4 cores, instead of the 12608ms (about twice as long) it took at the previous commit.

```
setup time: 17ms
parse time: 3175ms
analysis time: 1251ms
codegen time (including LLVM): 1633ms
total time: 6078ms
```

The flamegraph looks pretty good now; it is hard to optimize further (except for parallel type checking...).

sb10q closed this issue 2021-09-24 10:46:56 +08:00
sb10q reopened this issue 2021-09-30 22:24:08 +08:00