ONNX benchmark for WASM(INT8) is 2x-3x slower in Nightly
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
People
(Reporter: mayankleoboy1, Unassigned)
References
(Blocks 1 open bug)
Attachments
(3 files)
Go to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark
From the top right, select only the "WASM(INT8)" option
Click on "Start Benchmark"
Nightly: https://share.firefox.dev/3INsLfl
Chrome: https://share.firefox.dev/43uxOKY
FWIW, we are as fast as Chrome on WASM(FP16) and slightly faster on WASM(FP32)!
Comment 3•4 months ago
I looked at the generated code, especially the SIMD code. There are lots of "strange" shuffles and permutations for which we generate inefficient code.
Permutations (as 8x16):
0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0 2 0 0 0 3 0 0 0
0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 2 4 6 8 10 12 14 0 0 0 0 0 0 0 0
0 4 8 12 0 0 0 0 0 0 0 0 0 0 0 0
0 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12 0 0 0 13 0 0 0 14 0 0 0 15 0 0 0
12 13 14 15 0 0 0 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 5 0 0 0 6 0 0 0 7 0 0 0
4 5 6 7 0 0 0 0 0 0 0 0 0 0 0 0
8 0 0 0 9 0 0 0 10 0 0 0 11 0 0 0
8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15
8 9 10 11 0 0 0 0 0 0 0 0 0 0 0 0
8 9 10 11 12 13 14 15 0 0 0 0 0 0 0 0
Shuffles:
0 1 2 3 0 1 2 3 0 1 2 3 16 17 18 19
0 1 2 3 16 17 18 19 0 1 2 3 0 1 2 3
0 1 2 3 8 9 10 11 16 17 18 19 24 25 26 27
0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29
0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
0 4 8 12 16 20 24 28 0 0 0 0 0 0 0 0
4 5 6 7 20 21 22 23 24 25 26 27 28 29 30 31
8 9 10 11 20 21 22 23 24 25 26 27 28 29 30 31
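For readers unfamiliar with the notation, each row above can be read as the 16 lane indices of a WebAssembly `i8x16.shuffle`: an index of 0-15 selects a byte from the first operand and 16-31 selects a byte from the second. The sketch below models that in plain Python. It assumes that a "permutation" here means a shuffle that reads only one operand; the helper names are illustrative and are not taken from SpiderMonkey.

```python
def i8x16_shuffle(a, b, idx):
    """Model of Wasm i8x16.shuffle: select 16 bytes from a ++ b by index."""
    assert len(a) == len(b) == len(idx) == 16
    assert all(0 <= i < 32 for i in idx)
    src = list(a) + list(b)
    return [src[i] for i in idx]


def parse_mask(row):
    """Turn one row of the lists above into a list of lane indices."""
    return [int(tok) for tok in row.split()]


a = list(range(100, 116))  # first operand: bytes 100..115
b = list(range(200, 216))  # second operand: bytes 200..215

# One-input pattern from the permutation list: duplicate each low byte.
dup_low = i8x16_shuffle(a, a, parse_mask("0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7"))

# Two-input pattern from the shuffle list: even-numbered bytes of both operands.
even_bytes = i8x16_shuffle(
    a, b, parse_mask("0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30")
)

print(dup_low)
print(even_bytes)
```

Read this way, the first pattern looks like an interleave of a vector with itself and the second like a narrowing of two vectors, which fits the guess that a compiler's vectorizer emitted them rather than hand-written intrinsics.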
They may come from auto-vectorization (LLVM?) logic rather than from a direct translation of x86/ARM SIMD instructions. That's why it affects INT8 mode.
ARM permutations for 16x8 and 32x4 are inefficient as well.
We need to review codegen/masm to generate more efficient code for such shuffles generated by auto-vectorization, and measure the performance differences again.
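One possible first step for such a review is to check which byte-level masks are really shuffles of wider lanes, since those may map to 16x8 or 32x4 instructions instead of a generic byte shuffle. The following is a minimal sketch of that classification, not SpiderMonkey's actual analysis; `widen_mask` and `widest_lanes` are hypothetical helpers.

```python
def widen_mask(idx, lane_bytes):
    """Re-express a byte mask over lane_bytes-wide lanes, or return None
    if any group of bytes does not form one whole, aligned source lane."""
    lanes = []
    for k in range(0, len(idx), lane_bytes):
        group = idx[k:k + lane_bytes]
        first = group[0]
        expected = list(range(first, first + lane_bytes))
        if first % lane_bytes != 0 or group != expected:
            return None
        lanes.append(first // lane_bytes)
    return lanes


def widest_lanes(idx):
    """Return (lane width in bits, mask over lanes of that width),
    trying 64-bit, then 32-bit, then 16-bit lanes before falling back to bytes."""
    for lane_bytes in (8, 4, 2):
        lanes = widen_mask(idx, lane_bytes)
        if lanes is not None:
            return lane_bytes * 8, lanes
    return 8, list(idx)


# Masks copied from the shuffle list in this comment.
masks = [
    "0 1 2 3 8 9 10 11 16 17 18 19 24 25 26 27",
    "0 1 4 5 8 9 12 13 16 17 20 21 24 25 28 29",
    "0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30",
]
for row in masks:
    idx = [int(tok) for tok in row.split()]
    bits, lanes = widest_lanes(idx)
    print(f"{bits}-bit lanes: {lanes}")
```

With these inputs the first mask reduces to a 32x4 shuffle, the second to a 16x8 shuffle, and the third stays a true byte shuffle, so each would need its own code path in the backend.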
Reporter
Comment 5•1 month ago
Firefox now runs in 14000ms, which is almost as fast as Chrome here.
Quick bisection shows that bug 1918970 was the biggest contributor, cutting the time by more than half (37000ms -> 16000ms). Other recent regalloc patches from :jandem provided smaller improvements.
I think this bug is fixed. Marking dependency on bug 1918970.