306.06 - 6.43% outlook fnbpaint / twitch dcf + 25 more (Linux, OSX, Windows) regression on Wed December 8 2021
Categories
(Core :: DOM: Core & HTML, defect)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox95 | --- | unaffected |
firefox96 | --- | unaffected |
firefox97 | --- | affected |
People
(Reporter: alexandrui, Assigned: hsivonen)
References
(Regression)
Details
(Keywords: perf, perf-alert, regression)
Attachments
(1 obsolete file)
Perfherder has detected a browsertime performance regression from push e135fe933c4e31482e91c77617a27c04b3a44eef. As author of one of the patches included in that push, we need your help to address this regression.
Regressions:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
306% | outlook fnbpaint | linux1804-64-shippable-qr | warm webrender | 70.15 -> 284.83 |
256% | outlook dcf | macosx1015-64-shippable-qr | warm webrender | 76.62 -> 272.83 |
251% | outlook fnbpaint | macosx1015-64-shippable-qr | warm webrender | 78.50 -> 275.42 |
250% | outlook fcp | macosx1015-64-shippable-qr | warm webrender | 78.71 -> 275.67 |
220% | outlook dcf | linux1804-64-shippable-qr | warm webrender | 87.42 -> 280.12 |
218% | outlook fnbpaint | linux1804-64-shippable-qr | warm webrender | 89.29 -> 284.33 |
218% | outlook fcp | linux1804-64-shippable-qr | warm webrender | 89.33 -> 284.42 |
154% | outlook dcf | windows10-64-shippable-qr | warm webrender | 86.71 -> 220.12 |
148% | outlook fnbpaint | windows10-64-shippable-qr | warm webrender | 89.71 -> 222.50 |
148% | outlook fcp | windows10-64-shippable-qr | warm webrender | 89.71 -> 222.38 |
... | ... | ... | ... | ... |
35% | outlook loadtime | windows10-64-shippable-qr | warm webrender | 397.17 -> 535.08 |
17% | outlook loadtime | linux1804-64-shippable-qr | cold webrender | 1,091.46 -> 1,271.79 |
16% | outlook loadtime | macosx1015-64-shippable-qr | cold webrender | 1,216.54 -> 1,407.17 |
14% | outlook loadtime | windows10-64-shippable-qr | cold webrender | 1,033.46 -> 1,176.21 |
6% | twitch dcf | linux1804-64-shippable-qr | cold webrender | 626.06 -> 666.33 |
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
40% | twitch FirstVisualChange | macosx1015-64-shippable-qr | warm webrender | 78.33 -> 46.67 |
36% | twitch fcp | macosx1015-64-shippable-qr | warm webrender | 51.62 -> 33.17 |
31% | twitch FirstVisualChange | windows10-64-shippable-qr | warm webrender | 97.25 -> 67.00 |
29% | twitch fnbpaint | windows10-64-shippable-qr | warm webrender | 64.06 -> 45.62 |
29% | twitch fcp | windows10-64-shippable-qr | warm webrender | 59.27 -> 42.33 |
... | ... | ... | ... | ... |
2% | twitch loadtime | macosx1015-64-shippable-qr | warm webrender | 504.69 -> 492.67 |
Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.
For more information on performance sheriffing please see our FAQ.
Assignee | ||
Comment 1•3 years ago
|
||
First of all, I'd like to ask that the patch not be backed out, because the patch also has performance benefits for other pages and is meant to have performance benefits. Notably, the patch eliminates a whole reload on some pages.
I don't have access to what exactly Raptor loads in the outlook case, but looking at the live site, it indeed looks like the top-level page (but not the subdocuments) goes through a code path that the patch changed.
The top-level document doesn't declare the encoding on the HTTP layer but declares it via <meta>
very early on. Overall, this should result in the parser thread doing less work than previously, but the interaction between the parser thread and the main thread can be expected to differ.
Without actually looking at this in Pernosco, I expect the situation with the patch to result in flushing a small number of tree operations earlier than previously. That shouldn't cause layout to be started, though, so it's unclear to me what exactly in the flushing of operations causes the measured difference.
Perhaps the first flush starts a new timer delay for the next flush or something along those lines so that the next flush comes later than the single pre-patch flush encompassing the data from both post-patch flushes.
I'll investigate.
Comment 2•3 years ago
|
||
Set release status flags based on info from the regressing bug 1701828
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 3•3 years ago
|
||
I expect the patch for bug 1745142 to address this.
Assignee | ||
Comment 4•3 years ago
|
||
Assignee | ||
Comment 5•3 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #3)
I expect the patch for bug 1745142 to address this.
Assignee | ||
Comment 6•3 years ago
|
||
New idea: Maybe taking the encoding speculation from the queue shouldn't involve calling Interrupt()
and the interrupted state continues for too long.
Assignee | ||
Comment 7•3 years ago
|
||
Baseline reference on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3c1f9d4c32755a55697e3c42b2eb2cbfe52a42b1
Assignee | ||
Comment 8•3 years ago
|
||
Without calling Interrupt()
:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ba9ee7142b6f373acb8d012dcb2314a2309b264
Assignee | ||
Comment 9•3 years ago
|
||
I'll try combining with the patch for bug 1745142.
Assignee | ||
Comment 10•3 years ago
|
||
Assignee | ||
Comment 11•3 years ago
|
||
Eliminating parser continuation runnables and using MutexAutoUnlock
instead:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=59b2ccdb1dc122ee564d3e04f92dadc865e38672
Assignee | ||
Comment 12•3 years ago
|
||
Assignee | ||
Comment 13•3 years ago
|
||
Hmm. Disappointing results.
Updated•3 years ago
|
Assignee | ||
Comment 15•3 years ago
|
||
This isn't quite the root root cause, but the problem here is that the time from the parser thread flushing the queue that contains that StartLayout
operation got the time from flush on parser thread to StartLayout
executing on the main thread extended from 6 milliseconds to 41 milliseconds (on my workstation). This suggests that whatever the parser flushed to the main thread previously made the main thread busy enough not to process the next flush in a timely manner.
Assignee | ||
Comment 16•3 years ago
|
||
Assignee | ||
Comment 17•3 years ago
|
||
try candidate:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2ab1ca1d802bc215e234f9fdac6a057f8471db86
(In reply to Henri Sivonen (:hsivonen) from comment #15)
This isn't quite the root root cause, but the problem here is that the time from the parser thread flushing the queue that contains that
StartLayout
operation got the time from flush on parser thread toStartLayout
executing on the main thread extended from 6 milliseconds to 41 milliseconds (on my workstation). This suggests that whatever the parser flushed to the main thread previously made the main thread busy enough not to process the next flush in a timely manner.
It seems that the RunFlushLoop
throttling mechanism ends up throttling the flush that contains the StartLayout
operation. The attached patch does three things:
- If a tree op flush is throttled, speculative loads are flushed anyway.
- Throttling is bypassed if layout hasn't started.
- The first speculation dequeued after an encoding commitment is flushed right away.
Assignee | ||
Comment 18•3 years ago
|
||
(If this is fixed by bypassing throttling, we should find out as a follow-up, why the throttling activates what's supposed to be the frontmost page in the harness.)
Assignee | ||
Comment 19•3 years ago
|
||
Assignee | ||
Comment 20•3 years ago
|
||
Further examination suggests that it's not the throttling mechanism in the tree op executor. But something seems to be making the relevant flush runnable run after quite a delay relative to when it was posted to the event queue.
Assignee | ||
Comment 21•3 years ago
|
||
What I was seeing locally seems to have no relation to the metrics on try. Also, I also tried a different case locally, and it was quite noisy.
On the topic of whether this metric change is representative of Outlook specifically or a larger set of similar pages: Comment 0 shows an improvement on Twitch. As it happens, Outlook and Twitch are the two sites in our measurement set that don't declare the character encoding on the HTTP layer and, therefore, were affected by bug 1701828. That Twitch improved suggests that we don't need to worry about a problem on all sites of this type.
Notably, telemetry on time to non-blank paint doesn't show any cliff.
smaug pointed out to me that on the day the patch that regressed the fnbpaint measurement on Outlook here, SpeedIndex for Outlook improved considerably on December 8 when the patch landed. Since that metric is more important than fnbpaint, it doesn't make sense to back out.
Updated•3 years ago
|
Comment 22•3 years ago
|
||
It looks like before this patch, we somehow measured dnf and fnb "twice" - we measured two values, e.g. 215ms and 60ms, and then only took the latter value. After the patch, there's only one dnf / fnb value for the page load, and it seems to be roughly equal to the sum of the two old values.
Here's a profile from "before" which shows the double measurement: https://share.firefox.dev/3onT8ib
In the marker chart, there are two DOMContentFlushed markers and two FirstNonBlankPaint markers.
Here's a profile from "after": https://share.firefox.dev/3g5zouM
You can see only one marker each, and it is roughly equal to the sum of the markers in the "before" profiles.
These profiles come from a ./mach try fuzzy --geckoProfile
run, for the macosx1015-64-shippable-qr
configuration, which had the following "regression" detected:
| 256% | outlook dcf | macosx1015-64-shippable-qr | warm webrender | 76.62 -> 272.83 |
Assignee | ||
Comment 23•3 years ago
|
||
(In reply to Markus Stange [:mstange] from comment #22)
It looks like before this patch, we somehow measured dnf and fnb "twice" - we measured two values, e.g. 215ms and 60ms, and then only took the latter value. After the patch, there's only one dnf / fnb value for the page load, and it seems to be roughly equal to the sum of the two old values.
Thank you!
The most obvious explanation for this would be that bug 1701828 eliminated a reload scenario and the reload scenario could easily have lead to two measurements. However, 1) one would expect the first measurement to be smaller than the second one in that case and 2) at least the current live version of Outlook doesn't appear to have the conditions that would have triggered a reload before bug 1701828.
In any case, I'm resolving this based on the finding that this is a measurement error.
Comment 24•3 years ago
|
||
Thank you all for looking into this!
Description
•