Closed Bug 1691426 - Opened 4 years ago, Closed 4 years ago

11.42 - 15.21% booking ContentfulSpeedIndex / booking FirstVisualChange / booking fcp (android-hw-p2-8-0-android-aarch64-shippable) regression on push 0b260a20b5834dc1122445d6b9e4f6de68dbf5b8 (Wed February 3 2021)

Categories

(Core :: JavaScript: WebAssembly, defect, P1)

Tracking

RESOLVED FIXED
88 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox85 --- unaffected
firefox86 --- unaffected
firefox87 --- wontfix
firefox88 --- fixed

People

(Reporter: Bebe, Assigned: asumu)

References

(Regression)

Details

(Keywords: perf, perf-alert, regression)

Attachments

(1 file, 1 obsolete file)

Perfherder has detected a browsertime performance regression from push 0b260a20b5834dc1122445d6b9e4f6de68dbf5b8. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
15% booking fcp android-hw-p2-8-0-android-aarch64-shippable nocondprof warm webrender 202.46 -> 233.25
12% booking FirstVisualChange android-hw-p2-8-0-android-aarch64-shippable nocondprof warm webrender 237.71 -> 267.00
11% booking ContentfulSpeedIndex android-hw-p2-8-0-android-aarch64-shippable nocondprof warm webrender 268.88 -> 299.58
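
(For reference, the ratio column appears to be the relative change between the old and new values; for booking fcp, for example, (233.25 - 202.46) / 202.46 ≈ 0.152, i.e. roughly the 15% reported above.)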

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing please see our FAQ.

Flags: needinfo?(idimitriou)
Assignee: nobody → asumu
Flags: needinfo?(idimitriou)

Hi, I'm taking assignment of this issue as I pushed the patches that triggered the alert.

In the push that triggered this, I suspect the first patch (https://hg.mozilla.org/integration/autoland/rev/0b260a20b5834dc1122445d6b9e4f6de68dbf5b8) is the one that caused the regression, because it affects JS code more generally, while the other patch is Wasm-specific.

To test this, I've been running some browsertime tests with the patch reverted, and, if I'm running them correctly, they seem to regain some of the lost performance: https://treeherder.mozilla.org/perfherder/compare?originalProject=mozilla-central&originalRevision=2f575c3c4353b2c5adc27519e55ca4f3c125a1d2&newProject=try&newRevision=16568273aa9c8778c1912c958136f6239fa01cea&framework=13

A potential solution is to revert the patch along with some minor changes (to avoid breaking the other patches in the patchset). I think the patch can be reverted this way without breaking the current state of wasm exception handling, since it is needed mainly to accommodate a future feature that is part of bug 1690965.

If we need it we need it, regression or no...

Priority: -- → P1

Set release status flags based on info from the regressing bug 1335652

(In reply to Asumu Takikawa from comment #1)

Can you confirm whether the regression is related to increasing the size of ErrorObject, or to the extra initialization work done in ErrorObject::init? I'd be curious whether the extra field is pushing us into a different object size that is slower, or whether it's possible to not initialize this field in ErrorObject::init and rely on some default value.

(In reply to Ryan Hunt [:rhunt] from comment #4)

That's a good point; I hadn't tested that before, but I have now. I believe this perfherder comparison shows that removing just the initialization (the diff for the change) improves the booking tests: https://treeherder.mozilla.org/perfherder/compare?originalProject=mozilla-central&originalRevision=2f575c3c4353b2c5adc27519e55ca4f3c125a1d2&newProject=try&newRevision=cc363575fcec2a883ba77db93033a96e4441ff39&framework=13

In that case, we could leave the slot uninitialized. I believe it would work to have fromWasmTrap() check whether the slot is uninitialized (its contents pass isUndefined()), initialize it to false in that case, and then do the read. That way the initialization should only happen when errors pass through Wasm.
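
A rough sketch of that approach, as a C++ illustration (the slot name WASM_TRAP_SLOT and the setter name setFromWasmTrap() below are placeholders, not necessarily the identifiers the actual patch will use):

// Sketch only: ErrorObject::init no longer touches the reserved slot, so the
// flag accessors materialize it on demand.
bool ErrorObject::fromWasmTrap() {
  Value v = getReservedSlot(WASM_TRAP_SLOT);
  if (v.isUndefined()) {
    // First query: lazily default the flag to false so later reads see a
    // boolean value, then report "not a wasm trap".
    setReservedSlot(WASM_TRAP_SLOT, BooleanValue(false));
    return false;
  }
  return v.toBoolean();
}

void ErrorObject::setFromWasmTrap() {
  // Only errors that actually pass through Wasm pay for the slot write.
  setReservedSlot(WASM_TRAP_SLOT, BooleanValue(true));
}

That way plain JS error creation stays on the fast path, and the cost is confined to the Wasm trap machinery.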

This patch intends to fix a performance regression on some tests caused
by the overhead of initializing the extra wasm trap slot on
ErrorObjects. The slot initialization is delayed until the flag is
read or set.

I've attached a patch to address this bug. I believe it improves performance significantly (~9%) for at least one of the tests in the perfherder alert (the booking fcp android-hw-p2-8-0-android-aarch64-shippable opt nocondprof warm webrender case) and moderately (~5%) for the other two (booking FirstVisualChange and booking ContentfulSpeedIndex on the same platform and options).

This is based on this perfherder comparison: https://treeherder.mozilla.org/perfherder/compare?originalProject=mozilla-central&originalRevision=2f575c3c4353b2c5adc27519e55ca4f3c125a1d2&newProject=try&newRevision=8c1b3317be67b524efeb3a9aab13785e12db9d2f&framework=13 (looking at the browsertime tests filtered by booking)

I've tried to improve the latter two tests further, but I didn't see an obvious way to do so without breaking Wasm exception functionality (commenting out the code in WasmBuiltins.cpp that queries the ErrorObject trap flag does yield more speedup).

Ryan, will this problem impact FF88? If so, can we get this reviewed and landed before Monday?

Flags: needinfo?(rhunt)

Yeah, I just reviewed it and I think we should be able to land this soon.

Flags: needinfo?(rhunt)

I've discussed this with Asumu and we're having a hard time proving that any of this work fixes the regression. A perf run that he did with the current patch actually showed a slight regression [1]. Further tweaks to make this even lazier didn't seem to do the trick either [2]. I then did a perf run with the commit completely backed out, and that didn't seem to regress the test, but there was no clear improvement either [3].

At this point, I believe the safest course is to just land the backout patch conservatively and figure this out with more time.

[1] https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=6c316d5193c9ecfeef0450d5ce4a717e19960c0e&newProject=try&newRevision=7e6eca8149edabc51ea4f7fe472d1a8ae4773b97&framework=13
[2] https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=6c316d5193c9ecfeef0450d5ce4a717e19960c0e&newProject=try&newRevision=08912a43754bc63d359756869848b67e33688dc0&framework=13
[3] https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=f91503f7a542269c398e70a5d3e661d50afea5b3&newProject=try&newRevision=1eb049ffdbd4d16e6f0f61b3bf70e0e70c31369f&framework=13

This commit backs out the changes to add an isWasmTrap() flag on
ErrorObject, to fix a performance regression. Some form of the flag will
need to be added back, but for now this is the easiest fix for
release.

If backing it out did not improve perf, then, provided we believe that perfherder gives good results on the Pixel 2, the patch wasn't responsible for the regression in the first place. With that in mind, could it be that other patches in the set are to blame after all? Does backing them all out (assuming that's practical) restore performance?

(And it's not like we won't need the field in the structure. Let's take this up with, e.g., jorendorff + jonco as soon as the freeze is over and, absent any good ideas, plan to reland the original patch after that.)

The full backout of all exception-handling patches (required because they build upon the two patches in the regressing push) yields this perfherder result [1]. There are possibly 1-3% improvements, but confidence is low after 15 retriggers on both sides, so I would almost attribute it to noise.

So at this point, I really don't know if the original patches caused this regression.

[1] https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=f91503f7a542269c398e70a5d3e661d50afea5b3&newProject=try&newRevision=dcaf04d27734e384522694086b5acf5d63dfc0c4&framework=13

I did a perf comparison of the original regressing commits in autoland versus a commit with them immediately backed out, and was able to see the cited regression [1]. So something is magical about that spot in the tree, or about the commits that have happened since, that causes the backout to be effective only there.

Now that the soft freeze is over, I think we should try to land Asumu's patch that removes the slot and see if that makes a difference. If it doesn't, then we should just consider this an acceptable regression, if it even exists. We need this behavior for the wasm exception-handling proposal, and this is the most efficient way to add it.

[1] https://treeherder.mozilla.org/perfherder/compare?originalProject=try&originalRevision=c1ca5b3dc8afe626ec8b2965d0ffb0877e300847&newProject=try&newRevision=c9f611b375aabbc4c29b00aab08b3672e6925039&framework=13

Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/1b682d807c40 address perf regression for ErrorObject wasm trap flag r=rhunt
Attachment #9204378 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch

The patch landed in nightly, and beta is affected.
:asumu, is this bug important enough to require an uplift?
If not, please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(asumu)

After the patch landed, I've been looking at whether it had an effect. It seems inconclusive from just comparing the previous mozilla-central push to the one with the patch (it actually looks worse), but looking at the trends it seems more positive:

https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=0&highlightChangelogData=1&highlightedRevisions=916497e295fe&&highlightedRevisions=0b260a20b5834&highlightedRevisions=b422dd886e036&highlightedRevisions=287db0596e0b0&series=mozilla-central,2895566,1,13&series=autoland,2891495,1,13&timerange=2592000
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=916497e295fe&&highlightedRevisions=0b260a20b5834&highlightedRevisions=287db0596e0b0&highlightedRevisions=b422dd886e036&series=mozilla-central,2895651,1,13&series=autoland,2891544,1,13&timerange=2592000
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightedRevisions=916497e295fe&&highlightedRevisions=0b260a20b5834&highlightedRevisions=287db0596e0b0&highlightedRevisions=b422dd886e036&series=mozilla-central,2895649,1,13&series=autoland,2891542,1,13&timerange=2592000

(The highlighted revisions are the relevant ones, such as the revision from the original alert and the autoland/central pushes with the patch.)

These plots show the trends (with both mozilla-central and autoland plotted) for the three tests from this bug's alert. It does look like they have trended down around the time the patch landed (especially on autoland), but it's hard to tell exactly.

I'm not sure if that means it is important enough for an uplift. Do you have any thoughts on this, Ryan?

Flags: needinfo?(asumu) → needinfo?(rhunt)

Hmm, there does seem to be a slight improvement in the trend here. But it's slight, and this regression has been so tricky to figure out reliably that I think we should just let this be for now.

Flags: needinfo?(rhunt)

Ok thanks, that makes sense to me as well. I'll set status_firefox87 to wontfix now then.

Has Regression Range: --- → yes