Closed Bug 1719896 Opened 4 years ago Closed 3 months ago

Migrate encoding_rs from packed_simd to core::simd

Categories

(Core :: Internationalization, task)

Tracking

RESOLVED DUPLICATE of bug 1882209

People

(Reporter: hsivonen, Assigned: hsivonen)

References

Details

Attachments

(1 file)

Once core_simd becomes available as core::simd behind a feature gate and doesn't regress performance, migrate to it.

Currently, I have a port to core_simd, but it appears to severely regress performance on 32-bit ARM (Exynos 5). (I'm not yet sure why; the usual suspect is what any()/all() compile to.)

Also, on x86_64 (Haswell), the pattern of what gets faster and what gets slower suggests that the prediction for the branch after the boolean reduction in the "are these 16 UTF-16 code units all ASCII?" check flips to favor "they aren't all ASCII" and to disfavor "they are all ASCII".
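For concreteness, this is roughly the shape of that check with core::simd on a recent nightly (a sketch with a made-up function name, not encoding_rs's actual code; the feature gate and module paths have moved around between nightlies):

#![feature(portable_simd)]
use core::simd::prelude::*;

// Are these 16 UTF-16 code units (two 8-lane vectors) all ASCII?
// OR-ing the halves first leaves a single mask reduction, so the
// branch whose prediction seems to flip is the one taken on the
// result of all().
#[inline(always)]
fn is_ascii_16(first: Simd<u16, 8>, second: Simd<u16, 8>) -> bool {
    (first | second).simd_lt(Simd::splat(0x80)).all()
}

With packed_simd, the corresponding reduction is all() on an m16x8 mask; the codegen for that reduction is what feeds the branch in question.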

I do have a 32-bit ARM device (an rpi2). Can you please share what the performance regression is about and how to test for it? Thanks.

Flags: needinfo?(hsivonen)

Here's how these numbers are derived on Exynos 5 (Samsung Chromebook 2 running Crouton):

  1. rustup default nightly-2021-07-01
  2. Clone encoding_rs, safe_encoding_rs_mem, encoding_bench, and stdsimd so that they are siblings inside a common parent directory.
  3. Ensure that the current branch of encoding_rs is master.
  4. With encoding_bench as the working directory, run cargo bench --target thumbv7neon-unknown-linux-gnueabihf --features 'simd-accel self mem' > packed_simd.txt. (I actually ran this four times and merged the results, picking the minimum time for each benchmark, since noise can only make a benchmark run slower; for the merging I used https://github.com/hsivonen/cargo-benchcmp/commits/faster )
  5. Change encoding_rs to branch core_simd.
  6. Run step 4 again into core_simd.txt.
  7. cargo benchcmp --threshold 4 packed_simd.txt core_simd.txt
Flags: needinfo?(hsivonen)

The next step in actually investigating this would be to create a minimal non-inline function that calls all() on m8x16 with packed_simd and on mask8x16 with core_simd (once with opt_level=3 and once with opt_level=2) and to check the generated instructions.
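A minimal core_simd-side candidate could look something like the following (name and shape made up for illustration; the packed_simd counterpart would take an m8x16 and call all() on it):

#![feature(portable_simd)]
use core::simd::prelude::*;

// Deliberately not inlined so the reduction is easy to locate in the
// generated assembly (e.g. cargo rustc --release -- --emit=asm),
// compiled once with opt-level=3 and once with opt-level=2 for both
// thumbv7neon-unknown-linux-gnueabihf and x86_64.
#[inline(never)]
pub fn mask_all(m: Mask<i8, 16>) -> bool {
    // Mask<i8, 16> is what the mask8x16 alias names in core::simd.
    m.all()
}

Comparing the instructions this produces on thumbv7neon against what packed_simd's m8x16::all() produces should show whether the 32-bit ARM regression comes from the reduction itself.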

On aarch64 (M1), it looks from the results like there could be the same kind of branch prediction flip as on x86_64.

Additionally, decoding UTF-16BE to UTF-16 becomes slower and decoding UTF-16LE to UTF-16 becomes faster. While it's somewhat interesting that the same-endian case gets better and the opposite-endian case gets worse when migrating from packed_simd to core_simd, neither is relevant enough to real-world workloads to bother investigating.
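For background on why the two directions differ at all: on a little-endian target, the UTF-16BE decode path has to swap the bytes of each code unit after the SIMD load, while the UTF-16LE path can use the loaded vector as-is. A sketch of that extra step (illustrative only; this assumes a shift-based swap rather than whatever encoding_rs actually does):

#![feature(portable_simd)]
use core::simd::prelude::*;

// Swap the two bytes of every u16 lane; this is the extra work the
// big-endian path does on a little-endian target.
#[inline(always)]
fn swap_u16_lanes(v: Simd<u16, 8>) -> Simd<u16, 8> {
    (v << Simd::splat(8)) | (v >> Simd::splat(8))
}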

The exception is that on aarch64 the presumed branch prediction flip affects the decode side in a regressing way, while the encode side becomes faster.

Have you checked what these things look like in Firefox shippable builds (which have both PGO and LTO enabled)?

(In reply to Mike Hommey [:glandium] from comment #7)

> Have you checked what these things look like in Firefox shippable builds (which have both PGO and LTO enabled)?

No. I expect those to potentially flip branch predictions on x86_64 and aarch64. However, the 32-bit ARM codegen is so bad that there's no way PGO or LTO would make it OK.

This was fixed in bug 1882209.

Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Duplicate of bug: 1882209
Resolution: --- → DUPLICATE