Closed Bug 1887312 Opened 8 months ago Closed 1 month ago

ONNX benchmark for WASM(INT8) is 2x-3x slower in Nightly

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

Tracking

()

RESOLVED FIXED

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug, )

Details

Attachments

(3 files)

Go to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark
From the top right, select only "WASM(INT8)"
Click on "Start Benchmark"

Nightly: https://share.firefox.dev/3INsLfl
Chrome: https://share.firefox.dev/43uxOKY


FWIW, we are as fast as Chrome on WASM(FP16) and slightly faster on WASM(FP32)!

Attached file about:support
Flags: needinfo?(rhunt)
Severity: -- → N/A
Flags: needinfo?(rhunt)
Priority: -- → P3

I looked at the code, especially the SIMD code. There are lots of "strange" shuffles and permutations for which we generate inefficient code.

Permutations (as 8x16):

0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0
0 0 0 0  1 0 0 0  2 0 0 0  3 0 0 0
0 0 1 0  2 0 3 0  4 0 5 0  6 0 7 0
0 0 1 1  2 2 3 3  4 4 5 5  6 6 7 7
0 2 4 6  8 10 12 14  0 0 0 0  0 0 0 0
0 4 8 12  0 0 0 0  0 0 0 0  0 0 0 0
0 8 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
12 0 0 0  13 0 0 0  14 0 0 0  15 0 0 0
12 13 14 15  0 0 0 0  0 0 0 0  0 0 0 0
2 3 0 0  0 0 0 0  0 0 0 0  0 0 0 0
4 0 0 0  5 0 0 0  6 0 0 0  7 0 0 0
4 5 6 7  0 0 0 0  0 0 0 0  0 0 0 0
8 0 0 0  9 0 0 0  10 0 0 0  11 0 0 0
8 8 9 9  10 10 11 11  12 12 13 13  14 14 15 15
8 9 10 11  0 0 0 0  0 0 0 0  0 0 0 0
8 9 10 11  12 13 14 15  0 0 0 0  0 0 0 0

Shuffles:

0 1 2 3  0 1 2 3  0 1 2 3  16 17 18 19
0 1 2 3  16 17 18 19  0 1 2 3  0 1 2 3
0 1 2 3  8 9 10 11  16 17 18 19  24 25 26 27
0 1 4 5  8 9 12 13  16 17 20 21  24 25 28 29
0 2 4 6  8 10 12 14  16 18 20 22  24 26 28 30
0 4 8 12  16 20 24 28  0 0 0 0  0 0 0 0
4 5 6 7  20 21 22 23  24 25 26 27  28 29 30 31
8 9 10 11  20 21 22 23  24 25 26 27  28 29 30 31
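For reference, here is a minimal sketch of how masks like the ones above are read, assuming they are i8x16.shuffle-style byte masks: indices 0-15 pick a byte from the first input and 16-31 from the second. The helper names and sample inputs are made up for illustration, and treating "permutation" as "every index refers to the first input" is my assumption about the lists above, not something taken from the engine source.

```python
def i8x16_shuffle(a, b, mask):
    """Result lane i is byte mask[i] of the 32-byte concatenation a ++ b."""
    assert len(a) == len(b) == len(mask) == 16
    assert all(0 <= m < 32 for m in mask)
    both = list(a) + list(b)
    return [both[m] for m in mask]


def is_single_input(mask):
    """True if every index refers to the first input (assumed meaning of 'permutation')."""
    return all(m < 16 for m in mask)


# Hypothetical sample inputs, chosen so the source of each byte is easy to see.
a = list(range(100, 116))
b = list(range(200, 216))

# Shuffle mask from the list above: even-numbered bytes of both inputs.
even_bytes = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
# Permutation mask from the list above: each of the low 8 bytes duplicated.
dup_low = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7]

even_result = i8x16_shuffle(a, b, even_bytes)
dup_result = i8x16_shuffle(a, a, dup_low)
```

With these inputs, even_result holds the even bytes of a followed by the even bytes of b, and dup_result holds each of the low eight bytes of a twice.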

They may come from auto-vectorization logic (LLVM?) rather than from a direct translation of x86/ARM SIMD instructions. That's why it affects INT8 mode.

ARM permutations for 16x8 and 32x4 are inefficient as well.

We need to review codegen/masm so that it generates more efficient code for the shuffles produced by auto-vectorization, and then measure the performance difference again.
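One kind of analysis such a review could involve, sketched under assumptions: checking whether a byte mask really moves whole aligned 16-bit or 32-bit lanes, in which case a cheaper wide-lane operation may apply. The function name widen_mask is hypothetical and this is not taken from the actual codegen; it only illustrates the idea on masks from the lists above.

```python
def widen_mask(mask, lane_bytes):
    """Return the lane-level mask if the byte mask moves whole aligned lanes, else None."""
    assert len(mask) % lane_bytes == 0
    lanes = []
    for i in range(0, len(mask), lane_bytes):
        group = list(mask[i:i + lane_bytes])
        first = group[0]
        # The group must start on a lane boundary of the source...
        if first % lane_bytes != 0:
            return None
        # ...and copy that lane's bytes in order.
        if group != list(range(first, first + lane_bytes)):
            return None
        lanes.append(first // lane_bytes)
    return lanes


# Masks copied from the "Shuffles" list above.
m32 = [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]
m16 = [0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29]
m8 = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]

as_32x4 = widen_mask(m32, 4)   # whole 32-bit lanes
as_16x8 = widen_mask(m16, 2)   # whole 16-bit lanes, but not 32-bit ones
as_8x16 = widen_mask(m8, 2)    # a genuine byte-level shuffle
```

Here m32 reduces to a four-lane mask and m16 to an eight-lane mask, while m8 does not reduce at all, so it would need a byte-level strategy.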

See Also: → 1916442
Attached image Firefox Vs Chrome.png

Firefox now runs in 14000ms, which is almost as fast as Chrome here.

Quick bisection shows that bug 1918970 was the biggest contributor, cutting the time by more than half (37000ms -> 16000ms). Other recent regalloc patches from :jandem provided smaller improvements.

I think this bug is fixed. Marking dependency on bug 1918970.

Status: NEW → RESOLVED
Closed: 1 month ago
Depends on: 1918970
Resolution: --- → FIXED