This time I went back with a profiler and optimized the actual hot path.
A huge chunk of time was going to Objects.equals() because of profile pollution / missed devirtualization.
After fixing that, the next bottleneck was ARM/NEON “movemask” pain (VectorMask.toLong()), so I tried SWAR… and it ended up faster (even on x86, which I did not expect).
Can see the rest of that file and the adjacent `raw_hashtable.h` for the rest of the SwissTable-like implementation and `hashing.h` for the hash function.
FWIW, it consistently out-performs SwissTable in some respects, but uses a weaker but faster hash function that is good enough for the hash table, but not good for other use cases.
I wanted to share a follow-up to this post. https://bluuewhale.github.io/posts/further-optimizing-my-jav...
This time I went back with a profiler and optimized the actual hot path.
A huge chunk of time was going to Objects.equals() because of profile pollution / missed devirtualization.
After fixing that, the next bottleneck was ARM/NEON “movemask” pain (VectorMask.toLong()), so I tried SWAR… and it ended up faster (even on x86, which I did not expect).