Migrate encoding_rs from packed_simd to core::simd
Categories: Core :: Internationalization, task
People: Reporter: hsivonen, Assigned: hsivonen
Attachments (1 file): 60.91 KB, text/plain
Once core_simd becomes available as core::simd behind a feature gate and doesn't regress performance, migrate to it.
Currently, I have a port to core_simd but it appears to severely regress performance on 32-bit ARM (Exynos 5). (Not sure yet why. The usual suspect is what any()/all() compile to.)
Also, on x86_64 (Haswell), the pattern of what gets faster and what gets slower suggests that the prediction for the branch after the boolean reduction in the "are these 16 UTF-16 code units all ASCII?" check flips to favor "they aren't all ASCII" and disfavor "they are all ASCII".
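For context, here is a minimal sketch of the kind of check being discussed; it is not the actual encoding_rs code, the function name and chunk layout are illustrative, and it assumes the current nightly std::simd API:

#![feature(portable_simd)]

use std::simd::prelude::*;

// Illustrative only: true if all 16 UTF-16 code units (two 8-lane vectors)
// are ASCII, i.e. below 0x80. The boolean reduction (all()) is the operation
// whose codegen and branch behavior are discussed in this bug.
#[inline(always)]
fn utf16_chunk_is_ascii(a: u16x8, b: u16x8) -> bool {
    (a | b).simd_lt(u16x8::splat(0x80)).all()
}

fn main() {
    let ascii = u16x8::splat(b'a' as u16);
    let non_ascii = u16x8::splat(0x00E4); // 'ä'
    assert!(utf16_chunk_is_ascii(ascii, ascii));
    assert!(!utf16_chunk_is_ascii(ascii, non_ascii));
}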
I do have a 32-bit ARM device (rpi2). Can you please share what the performance regression is all about and how to test for it? Thanks.
Comment 2•4 years ago (Assignee)
Here's how these numbers are derived on Exynos 5 (Samsung Chromebook 2 running Crouton):
1. rustup default nightly-2021-07-01
2. Clone encoding_rs, safe_encoding_rs_mem, encoding_bench, and stdsimd so that they are siblings inside a common parent directory.
3. Ensure that the current branch of encoding_rs is master.
4. With encoding_bench as the working directory, run cargo bench --target thumbv7neon-unknown-linux-gnueabihf --features 'simd-accel self mem' > packed_simd.txt (I actually ran this four times and merged the results, picking the minimum time for each benchmark, since noise can only make a benchmark run slower, using https://github.com/hsivonen/cargo-benchcmp/commits/faster ).
5. Change encoding_rs to the branch core_simd.
6. Run step 4 again into core_simd.txt.
7. Run cargo benchcmp --threshold 4 packed_simd.txt core_simd.txt.
Comment 3•4 years ago (Assignee)
The next step in actually investigating this would be to create a minimal non-inline function that calls all() on m8x16 with packed_simd and on mask8x16 with core_simd (with opt_level=3 and opt_level=2) and to check the generated instructions.
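A sketch of what such a probe function might look like on the core_simd side, assuming the current nightly std::simd API (the packed_simd counterpart would be the analogous function taking m8x16):

#![feature(portable_simd)]

use std::simd::prelude::*;

// Deliberately not inlined so that the codegen of the boolean reduction can
// be inspected in isolation, e.g. by emitting asm at opt-level=3 and
// opt-level=2 and diffing the output.
#[inline(never)]
pub fn mask_all(m: mask8x16) -> bool {
    m.all()
}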
Comment 4•4 years ago (Assignee)
On aarch64 (M1), the results suggest there could be the same kind of branch prediction flip as on x86_64.
Additionally, decoding UTF-16BE to UTF-16 becomes slower and UTF-16LE to UTF-16 becomes faster. While it's somewhat interesting that same-endian gets better and opposite-endian gets worse when migrating from packed_simd to core_simd, neither is real-world-relevant enough to bother investigating.
Comment 5•4 years ago (Assignee)
Except on aarch64, the presumed branch prediction flip regresses the decode side while the encode side becomes faster.
Comment 6•4 years ago (Assignee)
As expected, the codegen for boolean reductions on thumbv7neon is bad.
Comment 7•4 years ago
Have you checked what these things look like in Firefox shippable builds (which have both PGO and LTO enabled)?
Comment 8•4 years ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #7)
> Have you checked what these things look like in Firefox shippable builds (which have both PGO and LTO enabled)?
No. I expect those to potentially flip branch predictions on x86_64 and aarch64. However, the 32-bit ARM codegen is so bad that there's no way PGO or LTO would make it OK.
Comment 9•3 months ago (Assignee)
This was fixed in bug 1882209.