Closed Bug 1552132 Opened 6 years ago Closed 4 years ago

Some devices run arm64 JS+wasm x10 slower than arm firefox or arm64 webview

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

67 Branch
ARM64
Android
defect

Tracking

()

RESOLVED FIXED
Performance Impact high
Tracking Status
firefox68 --- affected

People

(Reporter: colormatch, Unassigned)

References

Details

(Keywords: perf:responsiveness, Whiteboard: [arm64:m4] [geckoview:fenix:p2])

Attachments

(1 file)

Attached image 10x-slower.png

User Agent: Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0

Steps to reproduce:

On some devices, Fenix runs JavaScript calculations about 10x slower.

It is probably related to aarch64 build optimizations.
When an app is built with the arm AAR, there is no performance issue on the same device, but it is extremely slow when built with the aarch64 AAR.
(The aarch64 AAR was tested with the latest beta-channel and nightly builds up to v67, as v68 was not loading properly.)

On a device with Android 8, Snapdragon 625, 2 GHz, 4 GB RAM:

Fenix takes ~5.0 sec
App with built-in aarch64 AAR takes ~5.0 sec

Firefox 66 (stable) takes less than 0.5 sec
App with built-in arm AAR takes less than 0.5 sec

Meanwhile, Fenix on a device with Android 9, Snapdragon 632, 1.8 GHz, 3 GB RAM performs OK (and the app with the built-in aarch64 AAR runs fast).

However, Fenix on a device running Android 9 on Snapdragon 835 with 8 GB RAM takes ~4.0 seconds.

To reproduce:
Website: https://kinoseed.com/

Steps (see attached image):

  1. Load any image in "Load Image"
  2. Move a slider to modify image

Actual results:

Calculations run for about 5 seconds

Expected results:

Calculation should have run for less than 0.5 seconds

Lars, is there a bug for using the ARM64 Ion backend for wasm? Or are we waiting for Cranelift?

Flags: needinfo?(lhansen)
OS: All → Android
Priority: -- → P3
Hardware: Unspecified → ARM64
Summary: some devices run x10 slower javascript aarch64 → some devices run wasm x10 slower javascript aarch64
Whiteboard: [arm64:m4]

We are waiting for Cranelift; there are no Ion plans. Ping me privately re scheduling of that project.

10x is a lot; most Wasm benchmarks I've run do not slow down by that much when run in the baseline compiler relative to Ion. So we might also be seeing other things here that are worth investigating. This could be JS slowdown (arm64 relative to arm; both use Ion) and other system effects. That said, it's not impossible that the 10x slowdown is all wasm.

Really what we need here are two profiles of the same program in the two configurations on some of these devices.

Note the reporter has one device where the arm64 code seems to be performing well; this is a little worrisome because it means the slowdowns are not uniform across all devices. That should be investigated too.

Flags: needinfo?(lhansen)

(In reply to Lars T Hansen [:lth] from comment #2)

> We are waiting for Cranelift; there are no Ion plans. Ping me privately re scheduling of that project.
>
> 10x is a lot, most Wasm benchmarks I've run do not slow down by that much when run in the baseline compiler relative to Ion. So we might also be seeing other things here that are worth investigating. This could be JS slowdown (arm64 relative to arm; both use ion) and other system effects. That said, it's not impossible that the 10x slowdown is all wasm.

I'm afraid it is not all wasm.
On the example page, only the tab with tools like "brightness" uses some wasm, and only for part of the function.

On the problem devices, Firefox 66 and the WebView with the arm AAR perform well (almost instantly), but Fenix and the aarch64 AAR take ~7 seconds (which is more than a 10x performance decrease).

The 10x slowdown mentioned above was measured on a non-wasm function (such as moving the movie-style slider).

> Really what we need here are two profiles of the same program in the two configurations on some of these devices.
>
> Note the reporter has one device where the arm64 code seems to be performing well; this is a little worrisome because it means the slowdowns are not uniform across all devices. That should be investigated too.

If it helps, here are builds of the app with the aarch64 and arm AARs, using the beta channel, version 67.0.20190506235559.

https://kinoseed.com/app-arm-release.apk
https://kinoseed.com/app-aarch64-release.apk

App's GV loads this url: https://kinoseed.com/

Thanks for the quick feedback. I'll update the bug title to more clearly reflect what we're talking about.

Summary: some devices run wasm x10 slower javascript aarch64 → Some devices run arm64 JS+wasm x10 slower than arm firefox or arm64 webview

After the latest Fenix update, a quick test showed both devices (which previously showed different performance) performing on par, at ~3 seconds per operation.

I don't know whether the previous discrepancy was due to an A/B test, to one device somehow ending up with the 32-bit version, or to a testing mistake on my part, but as of now there is no discrepancy between the devices' performance.

(In reply to colormatch from comment #5)

> After latest Fenix update, a quick test showed both devices (which showed different performance) to be performing on par (at ~3seconds per operation).
>
> I don't know if the previous discrepancy was due to A/B test, for some reason may have ended up with 32bit version, or a mistake in testing on my part is to blame, but as of now, there is no discrepancy in the device's performance.

If you see this problem again, feel free to re-open this bug.

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME

Using Firefox Preview Version 1.0.1922 build: arm64-v8a

After loading an image in kinoseed.com:

  1. Turning the phone to landscape
  2. Pressing [ ] to enter full-screen (btw, there's a separate new bug, which cuts the view up to the notification for "full-screen")
  3. Adjust "Brightness"

Time for the brightness adjustment calculation on:

Android 8, Snapdragon 625, 2 GHz, 4 GB RAM
~2 seconds

Android 9, Snapdragon 632, 1.8 GHz, 3 GB RAM
~1 second

For comparison:
Using Firefox 67.0 on Android 8, Snapdragon 625, 2 GHz, 4 GB RAM
~0.2 seconds

Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---
Whiteboard: [arm64:m4] → [arm64:m4][qf]
Whiteboard: [arm64:m4][qf] → [arm64:m4][qf:p1:responsiveness]

NI for awareness

Flags: needinfo?(bbouvier)

Noted as something we can look at, later on, when we're seeking Cranelift benchmarks.

In the meanwhile, if it's JS, might be worth letting nbp & sstangl know about it.

Flags: needinfo?(bbouvier)

We should try to get a profile on this, though if it's in jit code it might be rather opaque. NI to nbp and sstangl for awareness

Flags: needinfo?(sstangl)
Flags: needinfo?(rjesup)
Flags: needinfo?(nicolas.b.pierron)

fyi:

Some of the corrections, like "adjusting brightness", use wasm, but those only change the small LUT.

The computational hit comes from the trilinear interpolation (of LUT values) and from updating the ImageData.data.buffer, and that's in JS.
Turning "Sharpen" off and then comparing the "SAVE" (image export) time of the image will allow you to compare just the JS part.
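For readers unfamiliar with the workload described here, the following is a minimal sketch of trilinear LUT interpolation over RGBA pixel data in plain JS. It is not the site's actual code: the function names, the LUT layout (a flat Float32Array of size^3 RGB triples with values in [0, 1]) and the LUT size are all assumptions made for illustration.

```javascript
// Hypothetical LUT layout: size^3 entries of 3 floats (r, g, b) in [0, 1],
// stored at index ((b * size + g) * size + r) * 3.
function makeIdentityLut(size) {
  const lut = new Float32Array(size * size * size * 3);
  for (let b = 0; b < size; b++) {
    for (let g = 0; g < size; g++) {
      for (let r = 0; r < size; r++) {
        const i = ((b * size + g) * size + r) * 3;
        lut[i] = r / (size - 1);
        lut[i + 1] = g / (size - 1);
        lut[i + 2] = b / (size - 1);
      }
    }
  }
  return lut;
}

function lerp(a, b, t) {
  return a + (b - a) * t;
}

// Sample the LUT at an 8-bit colour (r, g, b in 0..255), writing three
// floats in [0, 1] into `out`.
function sampleLut(lut, size, r, g, b, out) {
  const scale = (size - 1) / 255;
  const fr = r * scale, fg = g * scale, fb = b * scale;
  const r0 = Math.floor(fr), g0 = Math.floor(fg), b0 = Math.floor(fb);
  const r1 = Math.min(r0 + 1, size - 1);
  const g1 = Math.min(g0 + 1, size - 1);
  const b1 = Math.min(b0 + 1, size - 1);
  const tr = fr - r0, tg = fg - g0, tb = fb - b0;
  for (let c = 0; c < 3; c++) {
    const at = (ri, gi, bi) => lut[((bi * size + gi) * size + ri) * 3 + c];
    // Interpolate along r, then g, then b (8 lattice points in total).
    const c00 = lerp(at(r0, g0, b0), at(r1, g0, b0), tr);
    const c10 = lerp(at(r0, g1, b0), at(r1, g1, b0), tr);
    const c01 = lerp(at(r0, g0, b1), at(r1, g0, b1), tr);
    const c11 = lerp(at(r0, g1, b1), at(r1, g1, b1), tr);
    out[c] = lerp(lerp(c00, c10, tg), lerp(c01, c11, tg), tb);
  }
  return out;
}

// Apply the LUT in place to RGBA bytes shaped like ImageData.data.
function applyLut(pixels, lut, size) {
  const out = [0, 0, 0];
  for (let i = 0; i < pixels.length; i += 4) {
    sampleLut(lut, size, pixels[i], pixels[i + 1], pixels[i + 2], out);
    pixels[i] = Math.round(out[0] * 255);
    pixels[i + 1] = Math.round(out[1] * 255);
    pixels[i + 2] = Math.round(out[2] * 255);
    // Alpha (pixels[i + 3]) is left untouched.
  }
  return pixels;
}

// Usage: an identity LUT should leave the colours unchanged.
const demoPixels = new Uint8ClampedArray([13, 200, 77, 255]);
applyLut(demoPixels, makeIdentityLut(17), 17);
console.log(Array.from(demoPixels));
```

Each pixel touches eight LUT entries per channel, so a loop of this shape over a full camera image is the kind of hot JS path where a JIT regression would be very visible.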

Adding the [geckoview:fenix:p2] whiteboard tag because the Fenix team is tracking this issue: https://github.com/mozilla-mobile/fenix/issues/2566

Whiteboard: [arm64:m4][qf:p1:responsiveness] → [arm64:m4][qf:p1:responsiveness] [geckoview:fenix:p2]

The latest nightly builds seem to perform fast, just in time for the Google Play 64-bit requirement, which starts today.

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago → 6 years ago
Resolution: --- → WORKSFORME

Unfortunately, I spoke too soon.

The problem persists; however, it doesn't seem to be device dependent, as first thought!

I have two almost identical images from the camera roll (same settings, taken seconds apart). After one is loaded, it produces a 10x lag; when the other is loaded, it works well.

Without restarting the app, loading the different images produces consistent (undesirable) results: the code runs fast when one of them is loaded and lags when the other is loaded, which is really weird.

Again, this only happens with the GV native platform arm64-v8a, and only some images produce the problem.
I'll post a new bug report when I narrow down the problem.

Status: RESOLVED → UNCONFIRMED
Resolution: WORKSFORME → ---

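JS example code that severely impacts performance on aarch64 GV
(idata2 and la are not defined in this snippet):

// SLOW: mask, shift, multiply and divide in a single expression
data32 = new Uint32Array(idata2.data.buffer)
bm = 0xFF0000
sz = la.length-1
for (var yy = 0; yy < data32.length; ++yy) {
  px = data32[yy]
  blue = sz*((px & bm) >> 16)/255
}

// FAST: same arithmetic split into separate statements, with the mask
// and la.length-1 written inline instead of being read from bm and sz
data32 = new Uint32Array(idata2.data.buffer)
for (var yy = 0; yy < data32.length; ++yy) {
  px = data32[yy]
  blue = (px & 0xFF0000) >> 16
  blue *= la.length-1
  blue /= 255
}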

There was another issue when the computation runs in nested loops, something like:
blue = var1*j/255

Again, breaking the computation apart, or not using more than one variable in the expression (not fully tested), significantly helps performance.
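To compare the two loop shapes on a given build, a small timing harness along these lines might help. This is only a sketch: the synthetic pixel buffer, the value standing in for la.length, and the function names are assumptions, and it uses local variables where the snippets above use implicit globals, which may itself change how the JIT behaves.

```javascript
// Synthetic stand-ins for the page's data (not the real image or LUT).
const pixelCount = 1 << 20;
const data32 = new Uint32Array(pixelCount);
for (let i = 0; i < pixelCount; i++) {
  data32[i] = (i * 2654435761) >>> 0; // cheap deterministic pseudo-random fill
}
const lutLength = 33; // hypothetical value of la.length

// Shape 1: mask, shift, multiply and divide in one expression.
function singleExpression(data, sz) {
  const bm = 0xFF0000;
  let sum = 0;
  for (let i = 0; i < data.length; ++i) {
    const px = data[i];
    sum += sz * ((px & bm) >> 16) / 255;
  }
  return sum;
}

// Shape 2: the same arithmetic split into separate statements.
function splitSteps(data, sz) {
  let sum = 0;
  for (let i = 0; i < data.length; ++i) {
    const px = data[i];
    let blue = (px & 0xFF0000) >> 16;
    blue *= sz;
    blue /= 255;
    sum += blue;
  }
  return sum;
}

// Fall back to Date.now() where performance.now() is unavailable.
const now = typeof performance !== "undefined"
  ? () => performance.now()
  : () => Date.now();

function timeIt(label, fn) {
  const start = now();
  const result = fn();
  console.log(label + ": " + (now() - start).toFixed(1) + " ms");
  return result;
}

// Run each shape a few times so the JIT has a chance to tier up.
for (let round = 0; round < 3; round++) {
  timeIt("single expression", () => singleExpression(data32, lutLength - 1));
  timeIt("split steps", () => splitSteps(data32, lutLength - 1));
}
```

Accumulating into a returned sum keeps the loop body from being optimized away, and it also makes it easy to check that both shapes compute the same values.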

(In reply to Chris Peterson [:cpeterson] from comment #12)

> Adding the [geckoview:fenix:p2] whiteboard tag because the Fenix team is tracking this issue: https://github.com/mozilla-mobile/fenix/issues/2566

Should a new bug report be made?
It seems this is not a device-specific issue but an optimization problem affecting all devices
(unless the device-specific issue was fixed and this is something new).

Flags: needinfo?(sstangl)
Flags: needinfo?(rjesup)
Flags: needinfo?(nicolas.b.pierron)

re-NI'ing (removing sstangl)

Flags: needinfo?(tcampbell)
Flags: needinfo?(nicolas.b.pierron)

Chris, can you keep an eye on this as we look at turning on cranelift for arm64?

Flags: needinfo?(tcampbell)
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(cfallin)

:tcampbell, will do. Reading above, it's unclear to me how much of this is Wasm and how much is JS but we will certainly take a look once we enable in Nightly!

Flags: needinfo?(chris)
Component: General → JavaScript: WebAssembly
Product: GeckoView → Core

Is there still something to do here?

Flags: needinfo?(lhansen)

Oh, interesting. I'll take a look tomorrow. Ion for arm64 is live (and will likely ship in FF90), so wasm perf ought no longer to be a concern here.

Flags: needinfo?(lhansen)

This is super fast on my M1 MBP, so I will assume this is fixed by the new wasm optimizing backend for arm64.

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago → 4 years ago
Resolution: --- → FIXED

That said, disabling the optimizing jit doesn't make that much of a difference to perceived perf on the M1, so there may have been other updates that matter more, or even updates to the site. It does use Wasm, but the code I'm looking at doesn't even have a loop; it's just a couple of 10-parameter f32 and i32 math functions that are both straight-line code, presumably to be called from JS, and they don't call each other. For something like this, the overhead of calling from JS is going to matter.

(Shallow investigation only.)
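To illustrate the point about call overhead, here is a rough sketch that calls a tiny straight-line wasm function from a JS loop and compares it with doing the same arithmetic inline. The module is a hand-assembled two-parameter i32 add, used here only as a stand-in for the site's 10-parameter functions, and the function names are made up for the example.

```javascript
// Minimal wasm module exporting add(i32, i32) -> i32, written out as bytes.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,             // magic + version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,       // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                     // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00,       // export "add" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // body: local.get 0, local.get 1, i32.add
]);

const wasmInstance = new WebAssembly.Instance(new WebAssembly.Module(wasmBytes));
const add = wasmInstance.exports.add;

// Every iteration crosses the JS -> wasm boundary once.
function sumViaWasm(n) {
  let acc = 0;
  for (let i = 0; i < n; i++) {
    acc = add(acc, i);
  }
  return acc;
}

// Same i32 wrapping arithmetic, done inline in JS.
function sumInJs(n) {
  let acc = 0;
  for (let i = 0; i < n; i++) {
    acc = (acc + i) | 0;
  }
  return acc;
}

function timeMs(fn) {
  const start = Date.now();
  fn();
  return Date.now() - start;
}

const iterations = 1000000;
console.log("via wasm calls: " + timeMs(() => sumViaWasm(iterations)) + " ms");
console.log("inline JS:      " + timeMs(() => sumInJs(iterations)) + " ms");
```

When the wasm body is this small, the time is dominated by the boundary crossing rather than by the arithmetic, so the quality of the wasm compiler tier matters less than how cheaply the engine can make the call.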

Performance Impact: --- → P1
Whiteboard: [arm64:m4][qf:p1:responsiveness] [geckoview:fenix:p2] → [arm64:m4] [geckoview:fenix:p2]