Regression in FX_NEW_WINDOW_MS

RESOLVED FIXED in Firefox 57

Status

()

Firefox
General
P1
normal
RESOLVED FIXED
6 months ago
6 months ago

People

(Reporter: past, Assigned: eoger)

Tracking

(Blocks: 1 bug)

Trunk
Firefox 57
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox55 unaffected, firefox56 unaffected, firefox57 fixed)

Details

(Whiteboard: [photon-performance])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(2 attachments)

(Reporter)

Description

6 months ago
Telemetry shows a regression in the time it takes to open a new tab since Friday:

https://mzl.la/2vXkYFv

The regression is significant (takes us back to early-55 days) so we should figure out what caused it.
Blocks: 1363767
Whiteboard: [photon-performance]
(In reply to Panos Astithas [:past] (56 Regression Engineering Owner) (please ni?) from comment #0)
> Telemetry shows a regression in the time it takes to open a new tab since
> Friday:
> 

To be clear, this probe measures the time to open new windows, and not new tabs.

Updated

6 months ago
Whiteboard: [photon-performance] → [photon-performance] [triage]
Normally, this kind of bug would be hellish to try to diagnose, especially because tpaint (the window open talos test) didn't notice any regression.

I was, however, able to reproduce this regression locally. Here's what I did:

1) Get a baseline measurement of FX_NEW_WINDOW_MS on the 2017-08-17 Nightly. I used:

./mach mozregression --launch=2017-08-17 --pref "toolkit.telemetry.enabled:true"

Then opened a bunch of windows using Ctrl-N, and then opened up about:telemetry and looked at the FX_NEW_WINDOW_MS histogram. Here's what I got:

FX_NEW_WINDOW_MS
11 samples, average = 60.4, sum = 664

 22 |  0  0%
 37 |#########################  6  55%
 62 |#####################  5  45%
103 |  0  0%


Then I did the same thing for the 2017-08-19 build:

./mach mozregression --launch=2017-08-19 --pref "toolkit.telemetry.enabled:true"

FX_NEW_WINDOW_MS
18 samples, average = 99.6, sum = 1793

 37 |  0  0%
 62 |#########################  12  67%
103 |#############  6  33%
171 |  0  0%

The text-based histograms are probably making this a little difficult to notice, but there's been a definite shift towards the low 100's in the 2017-08-19th build as compared to the 2017-08-17th build.

I was able to reproduce this regression with multiple repetitions of the experiment.

2) Then, I ran the following mozregression command:

./mach mozregression --good=2017-08-17 --bad=2017-08-19 --pref "toolkit.telemetry.enabled:true"

and repeated the above experiment, using a fixed number of windows to manually open with Ctrl-N (I chose 7 arbitrarily).


Here's what I got:

 3:57.38 INFO: Narrowed inbound regression window from [6a8b37d4, ccffcdc9] (3 builds) to [6a8b37d4, b0f76b21] (2 builds) (~1 steps left)
 3:57.38 INFO: No more inbound revisions, bisection finished.
 3:57.38 INFO: Last good revision: 6a8b37d47733df46e639c885eda6616e66379711
 3:57.38 INFO: First bad revision: b0f76b216a196dc786c2853e3077746d8b772b2a
 3:57.38 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=6a8b37d47733df46e639c885eda6616e66379711&tochange=b0f76b216a196dc786c2853e3077746d8b772b2a

That's bug 1380377. A manual backout of that patch locally confirms that this improves the FX_NEW_WINDOW_MS probe by a significant degree for me locally.

So I think that's our best candidate for the regressor.
Blocks: 1380377
Bug 1380377 is simple enough that a backout for a few days would probably confirm or refute that it's the regressor.
(Assignee)

Comment 5

6 months ago
Created attachment 8899571 [details] [diff] [review]
not_a_fix.patch

I can reproduce this reliably following mconley's comment 3 instructions.
Funny thing is, dropping a <html:img> with a correct src attribute anywhere in the file """"fixes"""" the problem. Not really sure what is happening here. (see attached file).
What course of action can we take?
(In reply to Edouard Oger [:eoger] from comment #5)
> What course of action can we take?

I recommend the following:

1) Back out bug 1380377, and ensure that we see our FX_NEW_WINDOW_MS numbers bounce back
2) Put window.performance.mark markers around the start and end points for the FX_NEW_WINDOW_MS probe, and then use window.performance.measure to construct a measurement between those two markers. This will cause the time range being measured to show up in the Gecko Profiler timeline.
3) Capture a profile of opening a few windows without the patch, and then with the patch, and then manually compare the profiles looking for an explanation.

I can coach somebody through (2) and (3), but I think the highest priority is probably (1).

Updated

6 months ago
Flags: qe-verify-
Priority: -- → P2
Whiteboard: [photon-performance] [triage] → [photon-performance]
Comment hidden (mozreview-request)

Updated

6 months ago
Assignee: nobody → eoger
Status: NEW → ASSIGNED
Priority: P2 → P1
(Assignee)

Comment 10

6 months ago
Normal profile:
https://perfht.ml/2xpHfeB

Regressed profile:
https://perfht.ml/2xpGdz1

We can see that we are calling an expensive nsDocument::DispatchContentLoadedEvents method in the regressed profile, which might explain the slowdown. Can we ni? someone who might know what is happening here?
Merged backout:
https://hg.mozilla.org/mozilla-central/rev/5f5549876ea6

Does this resolve this bug and reopen 1380377?
Flags: needinfo?(eoger)
Status: ASSIGNED → RESOLVED
Last Resolved: 6 months ago
status-firefox55: --- → unaffected
status-firefox56: --- → unaffected
status-firefox57: affected → fixed
Resolution: --- → FIXED

Updated

6 months ago
Iteration: --- → 57.2 - Aug 29

Updated

6 months ago
Flags: needinfo?(eoger)

Updated

6 months ago
Target Milestone: --- → Firefox 57
(In reply to Edouard Oger [:eoger] from comment #10)
> Normal profile:
> https://perfht.ml/2xpHfeB
> 
> Regressed profile:
> https://perfht.ml/2xpGdz1
> 
> We can see that we are calling an expensive
> nsDocument::DispatchContentLoadedEvents method in the regressed profile,
> which might explain the slowdown. Can we ni? someone who might know what is
> happening here?

Comparing these 2 profiles, I don't see a big difference in the code that's running in them. The major difference I see is that in the "normal" profile the "window-open" measurement stops around when we start executing the 'load' handler of the new window, whereas in the 'regressed' profile the "window-open" measurement stops just before the 'MozAfterPaint' event.

This makes me wonder if it's possible that the FX_NEW_WINDOW_MS probe has been measuring incorrect data because the "document-shown" observer notification may not be what we want (after paint is more relevant).
You need to log in before you can comment on or make changes to this bug.