Closed Bug 1887312 Opened 8 months ago Closed 1 month ago

ONNX benchmark for WASM(INT8) is 2x-3x slower in Nightly

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug,
URL
)

Details

Attachments

(3 files)

about:support 8 months ago Mayank Bansal 42.38 KB, text/plain		Details
Firefox Vs Chrome (Int8).png 8 months ago Mayank Bansal 313.20 KB, image/png		Details
Firefox Vs Chrome.png 1 month ago Mayank Bansal 304.99 KB, image/png		Details

Mayank Bansal

Reporter

Description

•

8 months ago

•

Edited

Go to https://huggingface.co/spaces/Xenova/webgpu-embedding-benchmark
From the top right, select only the "WASM(INT8)" option
Click on "Start Benchmark"

Nightly: https://share.firefox.dev/3INsLfl
Chrome: https://share.firefox.dev/43uxOKY

FWIW, we are as fast as Chrome on WASM(FP16) and slightly faster on WASM(FP32)!

Mayank Bansal

Reporter

Comment 1

•

8 months ago

Attached file about:support — Details

Mayank Bansal

Reporter

Comment 2

•

8 months ago

Attached image Firefox Vs Chrome (Int8).png — Details

Mayank Bansal

Reporter

Updated

•

8 months ago

Flags: needinfo?(rhunt)

Mayank Bansal

Reporter

Updated

•

7 months ago

Blocks: wasm-perf-gap

Steven DeTar [:sdetar]

Updated

•

4 months ago

Severity: -- → N/A

Flags: needinfo?(rhunt)

Priority: -- → P3

Yury Delendik (:yury)

Comment 3

•

4 months ago

I looked at the code, especially the SIMD code. There are lots of "strange" shuffles and permutations for which we generate inefficient code.

Permutations (as 8x16):

0 0 0 0  0 0 0 0  1 0 0 0  0 0 0 0
0 0 0 0  1 0 0 0  2 0 0 0  3 0 0 0
0 0 1 0  2 0 3 0  4 0 5 0  6 0 7 0
0 0 1 1  2 2 3 3  4 4 5 5  6 6 7 7
0 2 4 6  8 10 12 14  0 0 0 0  0 0 0 0
0 4 8 12  0 0 0 0  0 0 0 0  0 0 0 0
0 8 0 0  0 0 0 0  0 0 0 0  0 0 0 0
1 0 0 0  0 0 0 0  0 0 0 0  0 0 0 0
12 0 0 0  13 0 0 0  14 0 0 0  15 0 0 0
12 13 14 15  0 0 0 0  0 0 0 0  0 0 0 0
2 3 0 0  0 0 0 0  0 0 0 0  0 0 0 0
4 0 0 0  5 0 0 0  6 0 0 0  7 0 0 0
4 5 6 7  0 0 0 0  0 0 0 0  0 0 0 0
8 0 0 0  9 0 0 0  10 0 0 0  11 0 0 0
8 8 9 9  10 10 11 11  12 12 13 13  14 14 15 15
8 9 10 11  0 0 0 0  0 0 0 0  0 0 0 0
8 9 10 11  12 13 14 15  0 0 0 0  0 0 0 0

Shuffles:

0 1 2 3  0 1 2 3  0 1 2 3  16 17 18 19
0 1 2 3  16 17 18 19  0 1 2 3  0 1 2 3
0 1 2 3  8 9 10 11  16 17 18 19  24 25 26 27
0 1 4 5  8 9 12 13  16 17 20 21  24 25 28 29
0 2 4 6  8 10 12 14  16 18 20 22  24 26 28 30
0 4 8 12  16 20 24 28  0 0 0 0  0 0 0 0
4 5 6 7  20 21 22 23  24 25 26 27  28 29 30 31
8 9 10 11  20 21 22 23  24 25 26 27  28 29 30 31

They may come from auto-vectorization (LLVM?) logic rather than from a direct translation of x86/ARM SIMD instructions. That's why it affects INT8 mode.

ARM permutations for 16x8 and 32x4 are inefficient as well.

We need to review codegen/masm so that it generates more efficient code for the kind of shuffles that auto-vectorization produces, and then measure the performance difference again.
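For readers unfamiliar with the notation in the lists above, here is a minimal Python sketch of what one row means. It models the WebAssembly `i8x16.shuffle` semantics: each of the 16 output bytes is picked from the 32-byte concatenation of the two inputs, using lane indices 0-31. The helper names (`i8x16_shuffle`, `permute`) and the test values are made up for illustration, and the sketch assumes that "permutation" above means a shuffle that reads only one input. It says nothing about the machine code SpiderMonkey actually emits.

```python
def i8x16_shuffle(a, b, lanes):
    """Model of wasm i8x16.shuffle: output byte i is byte lanes[i] of a ++ b."""
    assert len(a) == 16 and len(b) == 16 and len(lanes) == 16
    assert all(0 <= lane < 32 for lane in lanes)
    src = list(a) + list(b)          # indices 0-15 read a, 16-31 read b
    return [src[lane] for lane in lanes]


def permute(a, lanes):
    """Hypothetical helper: a single-input shuffle (assumed meaning of 'permutation')."""
    assert all(0 <= lane < 16 for lane in lanes)
    return i8x16_shuffle(a, a, lanes)


# Illustrative inputs so that every byte is easy to trace back to its source.
a = list(range(100, 116))   # bytes 100..115
b = list(range(200, 216))   # bytes 200..215

# Shuffle row "0 2 4 6  8 10 12 14  16 18 20 22  24 26 28 30":
# keeps every even-numbered byte of both inputs.
even_bytes = i8x16_shuffle(a, b, list(range(0, 32, 2)))

# Shuffle row "0 1 2 3  8 9 10 11  16 17 18 19  24 25 26 27":
# keeps every even-numbered 32-bit lane of both inputs.
even_words = i8x16_shuffle(
    a, b, [0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27]
)

# Permutation row "0 0 1 1  2 2 3 3  4 4 5 5  6 6 7 7":
# duplicates each of the low eight bytes.
dup_low = permute(a, [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7])
```

Rows with this kind of regular structure can often be expressed with a dedicated narrowing, unpacking, or blend instruction on the target, which is presumably why recognizing them in codegen matters more than handling them through a generic byte shuffle.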

Matthew Gaudet (he/him) [:mgaudet]

Updated

•

2 months ago

Mayank Bansal

Reporter

Comment 4

•

1 month ago

Attached image Firefox Vs Chrome.png — Details

Mayank Bansal

Reporter

Comment 5

•

1 month ago

Firefox now runs in 14000ms, which is almost as fast as Chrome here.

Quick bisection shows that bug 1918970 was the biggest contributor, cutting the time by more than half (37000ms -> 16000ms). Other recent regalloc patches from :jandem provided smaller improvements.

I think this bug is fixed. Marking dependency on bug 1918970.
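
As a rough sanity check on the numbers in this comment, the following Python snippet works out the ratios. It uses only the three timings quoted above and assumes they all come from the same benchmark configuration; the variable names are illustrative.

```python
# Timings quoted in this comment, in milliseconds.
before_ms = 37000       # before bug 1918970 landed
after_fix_ms = 16000    # after bug 1918970 landed
current_ms = 14000      # with the later regalloc patches as well

# Fraction of the original time remaining after bug 1918970 alone.
remaining_fraction = after_fix_ms / before_ms

# Speedup from bug 1918970 alone, and overall speedup.
fix_speedup = before_ms / after_fix_ms
total_speedup = before_ms / current_ms

print(f"remaining after fix: {remaining_fraction:.1%}")
print(f"speedup from fix:    {fix_speedup:.2f}x")
print(f"overall speedup:     {total_speedup:.2f}x")
```

The overall speedup falls inside the 2x-3x gap named in the bug title, which is consistent with Firefox now being close to Chrome on this benchmark.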

Status: NEW → RESOLVED

Closed: 1 month ago

Depends on: 1918970

Resolution: --- → FIXED

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)
