Here are my benchmark results, comparing three different workloads between master (OLD) and this PR (NEW). The first workload currently thrashes the GC implementation on master; the other two are more lightweight.
In the tables below, GC "disabled" means disabled at runtime, so roots are still collected.
// Very, very, very many objects
GC | OLD | NEW
disabled | 1.32s | 1.50s
enabled | 12.75s | 2.32s
// Very many objects
GC | OLD | NEW
disabled | 0.87s | 0.87s
enabled | 1.48s | 0.94s
// Fewer objects
GC | OLD | NEW
disabled | 1.65s | 1.62s
enabled | 1.75s | 1.62s
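
To make the relative changes easier to read, here is a small Python sketch that turns the timings above into OLD/NEW ratios. The timings are copied from the tables; the workload labels and the `speedup` and `summarize` helpers are names made up for this illustration and are not part of the PR or the benchmark scripts.

```python
# Timings in seconds, copied from the tables above as (OLD, NEW).
# The workload labels are illustrative, not the benchmark names.
RESULTS = {
    "first workload": {"disabled": (1.32, 1.50), "enabled": (12.75, 2.32)},
    "second workload": {"disabled": (0.87, 0.87), "enabled": (1.48, 0.94)},
    "third workload": {"disabled": (1.65, 1.62), "enabled": (1.75, 1.62)},
}


def speedup(old, new):
    """Return OLD / NEW: above 1 means NEW is faster, below 1 means slower."""
    return old / new


def summarize(results):
    """Flatten the tables into (workload, gc_mode, ratio) rows."""
    rows = []
    for workload, modes in results.items():
        for mode, (old, new) in modes.items():
            rows.append((workload, mode, round(speedup(old, new), 2)))
    return rows


if __name__ == "__main__":
    for workload, mode, ratio in summarize(RESULTS):
        print(f"{workload:16} GC {mode:8} {ratio:.2f}x")
```

With these numbers, the first workload with GC enabled comes out at roughly 5.5x faster on NEW, while the same workload with GC disabled comes out at about 0.88x, i.e. somewhat slower. These are single timings per cell, so small differences in the lighter workloads may be within noise.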