Closed Bug 1887312 Opened 1 year ago Closed 1 year ago

ONNX benchmark for WASM(INT8) is 2x-3x slower in Nightly

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

RESOLVED FIXED

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug)

Attachments

(3 files)

Go to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark
At the top right, select only the "WASM(INT8)" option
Click on "Start Benchmark"

Nightly: https://share.firefox.dev/3INsLfl
Chrome: https://share.firefox.dev/43uxOKY


FWIW, we are as fast as Chrome on WASM(FP16) and slightly faster on WASM(FP32)!

Attached file about:support
Flags: needinfo?(rhunt)
Severity: -- → N/A
Flags: needinfo?(rhunt)
Priority: -- → P3

I looked at the code, especially the SIMD code. There are lots of "strange" shuffles and permutations for which we generate inefficient code.

Permutations (as 8x16):

0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0
0 0 0 0  1 0 0 0  2 0 0 0  3 0 0 0
0 0 1 0  2 0 3 0  4 0 5 0  6 0 7 0
0 0 1 1  2 2 3 3  4 4 5 5  6 6 7 7
0 2 4 6  8 10 12 14  0 0 0 0  0 0 0 0
0 4 8 12  0 0 0 0  0 0 0 0  0 0 0 0
0 8 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
12 0 0 0  13 0 0 0  14 0 0 0  15 0 0 0
12 13 14 15  0 0 0 0  0 0 0 0  0 0 0 0
2 3 0 0  0 0 0 0  0 0 0 0  0 0 0 0
4 0 0 0  5 0 0 0  6 0 0 0  7 0 0 0
4 5 6 7  0 0 0 0  0 0 0 0  0 0 0 0
8 0 0 0  9 0 0 0  10 0 0 0  11 0 0 0
8 8 9 9  10 10 11 11  12 12 13 13  14 14 15 15
8 9 10 11  0 0 0 0  0 0 0 0  0 0 0 0
8 9 10 11  12 13 14 15  0 0 0 0  0 0 0 0
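
For reference, a permutation here is a one-input byte shuffle: output byte i is a copy of input byte mask[i]. Below is a minimal Python sketch of those semantics, applied to the "0 2 4 6 ..." mask from the list above. The helper name permute_bytes is made up for illustration and is not SpiderMonkey code. The sketch treats the trailing 0 entries as real selections of byte 0; whether the producer actually cares about those lanes is not something the list tells us.

```python
def permute_bytes(src, mask):
    """Return 16 bytes where output byte i is src[mask[i]] (one-input shuffle)."""
    assert len(src) == 16 and len(mask) == 16
    assert all(0 <= m < 16 for m in mask)
    return [src[m] for m in mask]

# Mask taken from the list above: even-numbered bytes land in the low half.
mask = [0, 2, 4, 6, 8, 10, 12, 14, 0, 0, 0, 0, 0, 0, 0, 0]
src = list(range(100, 116))  # bytes 100..115, so values are easy to trace
result = permute_bytes(src, mask)
print(result)
```

With this input the low eight output bytes are the even-numbered source bytes, and the high eight are all copies of source byte 0.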

Shuffles:

0 1 2 3  0 1 2 3  0 1 2 3  16 17 18 19
0 1 2 3  16 17 18 19  0 1 2 3  0 1 2 3
0 1 2 3  8 9 10 11  16 17 18 19  24 25 26 27
0 1 4 5  8 9 12 13  16 17 20 21  24 25 28 29
0 2 4 6  8 10 12 14  16 18 20 22  24 26 28 30
0 4 8 12  16 20 24 28  0 0 0 0  0 0 0 0
4 5 6 7  20 21 22 23  24 25 26 27  28 29 30 31
8 9 10 11  20 21 22 23  24 25 26 27  28 29 30 31
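
The shuffles take two inputs: as in the WebAssembly i8x16.shuffle instruction, indices 0..15 select a byte from the first vector and 16..31 from the second. A minimal Python sketch of that, using the "0 2 4 6 ... 28 30" mask from the list above (shuffle_bytes is a hypothetical helper name, not an engine function):

```python
def shuffle_bytes(a, b, mask):
    """Two-input byte shuffle: indices 0..15 pick from a, 16..31 pick from b."""
    assert len(a) == 16 and len(b) == 16 and len(mask) == 16
    assert all(0 <= m < 32 for m in mask)
    both = list(a) + list(b)
    return [both[m] for m in mask]

a = list(range(0, 16))      # 0..15
b = list(range(100, 116))   # 100..115
# Mask from the list above: even-numbered bytes of a, then even-numbered bytes of b.
mask = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]
out = shuffle_bytes(a, b, mask)
print(out)
```

This mask keeps the low byte of every 16-bit lane of both inputs, which looks like a narrowing step. As far as I know, that is the same selection a single ARM64 UZP1 on byte lanes performs, so it is the sort of pattern where a dedicated instruction could replace a generic table lookup.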

They may come from auto-vectorization (LLVM?) logic rather than from a direct translation of x86/ARM SIMD instructions. That would explain why INT8 mode is affected.

ARM permutations for 16x8 and 32x4 are inefficient as well.

We need to review codegen/masm so that it generates more efficient code for the kind of shuffles produced by auto-vectorization, and then measure the performance difference again.
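
One possible first step in such a review is to classify each byte mask by the widest lane size at which it is still a plain lane shuffle, since wider-lane shuffles tend to have cheaper encodings. The Python sketch below does that for two of the shuffle masks listed earlier. The names widen_mask and widest_lane are hypothetical, and this is only an illustration of the idea, not the shuffle analysis the engine actually performs.

```python
def widen_mask(mask, lane_bytes):
    """Return lane indices if the byte mask moves whole aligned lanes, else None."""
    assert len(mask) % lane_bytes == 0
    lanes = []
    for i in range(0, len(mask), lane_bytes):
        group = mask[i:i + lane_bytes]
        first = group[0]
        # A lane is moved whole when it starts on a lane boundary and its
        # bytes are consecutive.
        if first % lane_bytes != 0:
            return None
        if group != list(range(first, first + lane_bytes)):
            return None
        lanes.append(first // lane_bytes)
    return lanes

def widest_lane(mask):
    """Return (lane_bytes, lane_indices) for the widest lane size that fits."""
    for lane_bytes in (8, 4, 2):
        lanes = widen_mask(mask, lane_bytes)
        if lanes is not None:
            return lane_bytes, lanes
    return 1, list(mask)

# Two shuffle masks from the lists above.
m32 = [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]
m16 = [0, 1, 4, 5, 8, 9, 12, 13, 16, 17, 20, 21, 24, 25, 28, 29]
print(widest_lane(m32))
print(widest_lane(m16))
```

The first mask turns out to be a 32-bit lane shuffle and the second a 16-bit one. Masks padded with repeated 0 entries, like many of the permutations above, do not widen under this check unless those lanes are treated as "don't care", which the sketch does not attempt.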

Attached image Firefox Vs Chrome.png

Firefox now runs in 14000ms, which is almost as fast as Chrome here.

Quick bisection shows that bug 1918970 was the biggest contributor, cutting the time by more than half (37000ms -> 16000ms). Other recent regalloc patches from :jandem provided smaller improvements.

I think this bug is fixed. Marking dependency on bug 1918970.

Status: NEW → RESOLVED
Closed: 1 year ago
Depends on: 1918970
Resolution: --- → FIXED