Regression in firstPaint around April 16th on Windows
Categories
(Toolkit :: Startup and Profile System, defect, P1)
Tracking
| | Tracking | Status |
| --- | --- | --- |
| firefox-esr60 | --- | unaffected |
| firefox66 | --- | unaffected |
| firefox67 | --- | unaffected |
| firefox68 | --- | wontfix |
People
(Reporter: alexical, Assigned: tnikkel)
References
Details
(Keywords: perf, regression, Whiteboard: [fxperf:p1])
Attachments
(1 file)
75.67 KB, image/png
There seems to be a regression on Windows, an improvement on Linux, and no change for OSX. The change is most dramatic at the 95th percentile, but it's noticeable in the median. It's unnoticeable at the 25th percentile, so I think this is likely due to some kind of disk IO.
It also seems to correspond to an improvement in SIMPLE_MEASURES_BLANKWINDOWSHOWN, and there seems to be either an improvement or no change in SIMPLE_MEASURES_DELAYEDSTARTUPFINISHED (difficult to tell because this last measure seems to have improved with bug 1538279.)
Looking into potential causes. Bug 1541798 lines up time-wise, but I can't think of a mechanism by which that could cause this regression.
Reporter
Comment 1•6 years ago
Hmm. This actually does line up with an increase in submissions for SIMPLE_MEASURES_BLANKWINDOWSHOWN, which makes me think bug 1541798 is plausible.
Reporter
Comment 3•6 years ago
I think bug 1541798 is ruled out, as the regression still shows up if I split out pings by whether they have the default theme or not.
The regression is certainly introduced in the later of the two builds for April 16th, though (20190416220148).
Reporter
Comment 4•6 years ago
This is certainly related to WebRender. The attached graph shows SIMPLE_MEASURES_FIRSTPAINT by build date, split out by gfx/features/webrender/status; green is 'available'. The times here are extremely high, around 300 seconds, which makes me think this could be from Bug 1543560. Sotaro, I'm not too familiar with all of the GPU process error handling, but could we now just be waiting where we would previously have noticed a GPU process crash? The other alternative is a deadlock, but I can't see how that would happen at first glance.
Reporter
Comment 5•6 years ago
(To be clear, the above graph is only for the 95th percentile on Windows. The median is mostly unaffected. You can see some impact in the 75th percentile though.)
Comment 6•6 years ago
Mmm, I do not expect that Bug 1543560 could be a direct cause of the problem.
When WebRender hits an error, RendererOGL::NotifyWebRenderError() is called from RendererOGL::UpdateAndRender():
https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RendererOGL.cpp#124
It ends up in GPUProcessManager::NotifyWebRenderError(), which disables WebRender in the UI process:
https://searchfox.org/mozilla-central/source/gfx/ipc/GPUProcessManager.cpp#435
The CompositorSession is then re-created the next time nsBaseWidget::GetLayerManager() is called.
Bug 1543560 changed WebRender error delivery from CompositorBridgeParent to CompositorManagerParent. Before the Bug 1543560 fix, there could be cases where the WebRender error was not delivered to GPUProcessManager, and cases where it caused an IPC error and crashed the UI process.
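For readers less familiar with this path, here is a heavily simplified, self-contained C++ sketch of the delivery chain described above. The types are bare stubs that only mirror the shape of the real Gecko classes (RendererOGL, GPUProcessManager); the IPC hop and the actual fallback logic are reduced to comments.

```cpp
// Illustrative stubs only -- the real classes live in gfx/webrender_bindings/
// and gfx/ipc/ and do far more than this. The point is the shape of the chain:
// RendererOGL (GPU process) -> IPC -> GPUProcessManager (UI process), which
// then disables WebRender so the next GetLayerManager() call falls back.
#include <cstdio>

enum class WebRenderError { INITIALIZE, RENDER };

// Stand-in for the UI-process singleton that manages the GPU process.
struct GPUProcessManager {
  static GPUProcessManager* Get() {
    static GPUProcessManager sSingleton;
    return &sSingleton;
  }
  void NotifyWebRenderError(WebRenderError aError) {
    // In the real code this disables WebRender via gfx feature flags; the
    // CompositorSession is then re-created the next time
    // nsBaseWidget::GetLayerManager() is called.
    printf("WebRender error %d: disabling WebRender, falling back\n",
           static_cast<int>(aError));
  }
};

// Stand-in for the GPU-process-side renderer.
struct RendererOGL {
  void UpdateAndRender() {
    bool renderFailed = true;  // pretend the driver reported an error
    if (renderFailed) {
      NotifyWebRenderError(WebRenderError::RENDER);
    }
  }
  void NotifyWebRenderError(WebRenderError aError) {
    // In the real code this hop crosses process boundaries over IPC
    // (CompositorManagerParent after bug 1543560, CompositorBridgeParent
    // before it); here it is a direct call.
    GPUProcessManager::Get()->NotifyWebRenderError(aError);
  }
};

int main() {
  RendererOGL renderer;
  renderer.UpdateAndRender();
  return 0;
}
```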
Comment 7•6 years ago
I found a problem: CompositorManagerParent::sInstance is not set in the GPU process, so CompositorManagerParent::NotifyWebRenderError() does nothing.
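A minimal sketch of the failure mode being described, under the assumption that the problem is simply a static instance pointer that never gets populated in the GPU process (stub types, not the real CompositorManagerParent):

```cpp
#include <cstdio>

// Stub illustrating the reported problem: a static singleton pointer that is
// never set in the GPU process, so the error notification silently no-ops
// there and nothing reaches GPUProcessManager.
struct CompositorManagerParent {
  static CompositorManagerParent* sInstance;  // not set in the GPU process
  static void NotifyWebRenderError() {
    if (!sInstance) {
      printf("sInstance is null; WebRender error notification dropped\n");
      return;
    }
    // ... forward the error to the UI process ...
  }
};

CompositorManagerParent* CompositorManagerParent::sInstance = nullptr;

int main() {
  CompositorManagerParent::NotifyWebRenderError();
  return 0;
}
```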
Comment 9•6 years ago
The Bug 1546298 fix does not seem to affect https://mzl.la/2GsVK7z, so Bug 1543560 does not appear to be related to this bug. NotifyWebRenderError() is called only during fallback from WebRender to the compositor because of a WebRender error, which happens only rarely.
Reporter
Comment 10•6 years ago
Hmm. Going to look deeper at my changes in bug 1538710 and see if that could be the culprit.
Reporter
Comment 11•6 years ago
Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
Comment 12•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #11)
> Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...

Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?
Reporter
Comment 13•6 years ago
(In reply to :Gijs (he/him) from comment #12)
> (In reply to Doug Thayer [:dthayer] from comment #11)
> > Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
>
> Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?

Can you clarify? If Windows could load the wrong version of a dll and the right version of other files (which would be necessary to report with the new build ID, I believe), then I think we would see general chaos after updates.

In general it could be possible for a regression to show up an update later, if it involved, say, a cache that we build in a different way, so that we would have to first build the cache and only see the changes on the subsequent run, but there shouldn't be anything in bug 1538710 like that.
Comment 14•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #13)
> (In reply to :Gijs (he/him) from comment #12)
> > (In reply to Doug Thayer [:dthayer] from comment #11)
> > > Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
> >
> > Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?
>
> Can you clarify? If Windows could load the wrong version of a dll and the right version of other files (which would be necessary to report with the new build ID, I believe), then I think we would see general chaos after updates.

Sorry, yeah, I meant something roughly like the following sequence of events:
- update to version X
- start version X
- we preload xul.dll version X as part of that, and then actually load it
- update to version X+1
- start version X+1
and then *waves hands vaguely* somehow having cached version X means we're slower at loading version X+1, because Windows or we don't do the requisite checks as to whether the file has changed? Or something?

> In general it could be possible for a regression to show up an update later, if it involved, say, a cache that we build in a different way, so that we would have to first build the cache and only see the changes on the subsequent run, but there shouldn't be anything in bug 1538710 like that.

Mm, I was thinking of bug 1538279, but then I guess we would have expected to see something sooner...
Reporter
Comment 15•6 years ago
(In reply to :Gijs (he/him) from comment #14)
> Mm, I was thinking of bug 1538279, but then I guess we would have expected to see something sooner...

Ah - yeah, I think bug 1538279 is ruled out because this is WebRender-specific and such a dramatic change (i.e., ~15s -> ~300s). I think something has to be completely broken under certain circumstances to explain this - not just slow.
Reporter
Comment 17•6 years ago
(In reply to Dave Townsend [:mossop] (he/him) from comment #16)
> Should we be tracking this for 68?

There's a chance it's just a reporting problem. Given the severity of the problem as reported in the data (>300s startups), I am baffled as to why we don't seem to have seen reports of Firefox effectively failing to start up.

But since we can't currently rule out a real user-visible problem, I think yes.
Comment 18•6 years ago
[Tracking Requested - why for this release]:
See comment 17.
Reporter
Comment 19•6 years ago
Hey Masayuki, per comment 17, I'm trying to assess whether it's possible that we're just not hitting [this line](https://searchfox.org/mozilla-central/rev/99a2a5a955960b0e58ceade1db1f7652d9db4ba1/view/nsViewManager.cpp#346), and thus reporting a long delay before first paint while still managing to actually paint before then. Since I know you've been doing work with the PresShell and this code is behind an `if` on mPresShell, do you think that's possible?
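For context, the relevant shape of that code looks roughly like the sketch below. This is a simplified stand-in, not the real nsViewManager; RecordFirstPaintTelemetry() is a hypothetical helper standing in for whatever actually records the firstPaint simple measure.

```cpp
#include <cstdio>

// Simplified stand-ins; the real code is view/nsViewManager.cpp. The shape
// that matters for the question above: the first-paint recording sits inside
// Refresh(), on the branch guarded by mPresShell, so if Refresh() is never
// reached (or mPresShell is null) the probe is simply never recorded.
struct PresShell {
  bool mIsFirstPaint = true;
  void Paint() { printf("painting\n"); }
};

struct nsViewManager {
  PresShell* mPresShell = nullptr;  // set from PresShell::Init(), cleared in Destroy()

  void Refresh() {
    if (mPresShell) {
      mPresShell->Paint();
      if (mPresShell->mIsFirstPaint) {
        mPresShell->mIsFirstPaint = false;
        RecordFirstPaintTelemetry();  // hypothetical helper standing in for the
                                      // real firstPaint simple-measure recording
      }
    }
  }

  void RecordFirstPaintTelemetry() { printf("firstPaint recorded\n"); }
};

int main() {
  PresShell shell;
  nsViewManager vm;
  vm.mPresShell = &shell;
  vm.Refresh();
  return 0;
}
```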
Comment 20•6 years ago
I'm not so familiar with the initialization of PresShell and related objects, though. nsViewManager::mPresShell is set only by nsViewManager::SetPresShell() (I mean that it's not in the CC either). That is called only by PresShell::Init() and PresShell::Destroy(). And PresShell::Init() is always called immediately after the instance is created, so mPresShell is always set to a non-nullptr value. So it seems that the line should be hit as you expected.
Reporter
Comment 21•6 years ago
Inspecting one of the pings with a slow (>200 seconds) firstPaint measure, it seems clear from the gfx-related histograms that the user is successfully viewing webpages and everything is visually working as normal. Apparently we are just somehow not hitting the initial firstPaint recording.
Reporter
Comment 22•6 years ago
Jeff, this came up in our meeting a few weeks ago and you seemed to imply that you guys would take over investigating it. Am I remembering that correctly? Just want to make sure someone has this in mind.
Comment 23•6 years ago
I don't think anyone's actively looking at this but we should probably find someone.
Comment 24•6 years ago
tnikkel will take a look; we will see if there is anything actionable.
Comment 25•6 years ago
(Showing up on the list of unowned tracked beta bugs, so assigning per comment 24 for next steps.)
Assignee
Comment 26•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #21)
> Inspecting one of the pings with a slow (>200 seconds) firstPaint measure, it seems clear from the gfx-related histograms that the user is successfully viewing webpages and everything is visually working as normal. Apparently we are just somehow not hitting the initial firstPaint recording.

Was the viewing of webpages and everything working normally that you observed happening before the 200 seconds were up (i.e. before the first paint measure was recorded)?
Reporter
Comment 27•6 years ago
Eugh, no, that's a fair point, and I'm not sure why I missed it. There are no measures that I know of which are timestamped in such a way as to demonstrate this. The only thing is that the other simpleMeasurement measures are all fine, but I don't think any of them actually require a successful paint. For example, delayedStartupFinished, which traditionally comes after firstPaint, comes before it in these pings, but I don't know of any solid reason why that couldn't happen while we legitimately haven't painted.
I think this puts us back to the puzzle of how ~5% of WR-enabled users could be seeing several-minute-long startups without any of them reporting a problem. So I would still say faulty reporting is the most attractive explanation.
Assignee
Comment 28•6 years ago
Regression range from comment 11:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a17f163841e956c10eb74cdbe27d7073facf6b7f&tochange=12a60898fdc12e4ab8735f228a77d6c1e6a600a9
The only thing that sticks out to me that hasn't been mentioned is bug 1531838.
I went and looked at a bunch of telemetry pings; one even had a first paint of 235699519 (~65 hours). I looked at pings with crazy times like that, and at ones with more reasonable 200-400s first paint times. In each ping I looked at:
- WR_SCENESWAP_TIME
- WR_RENDER_TIME
- WR_SCENEBUILD_TIME
- PAINT_BUILD_DISPLAYLIST_TIME
- CONTENT_PAINT_TIME
- COMPOSITE_TIME
- sessionRestoreInit
- createTopLevelWindow
- sessionRestored
In each case they all looked normal. There didn't seem to be any huge outliers. In most cases they actually seemed to be performing quite well. The worst I saw was a couple of composite times between 300-1000ms.
I looked to see if the number of total composites had any relation to the first paint time when firstpaint > 200000. It didn't seem like it.
Ideas for trying to track this down:
- record more values in nsViewManager::Refresh (where we record firstPaint now). We could record before we start the first paint, the duration of the first paint itself, and for good measure throw in a second recording at the first paint location to rule out data corruption (a rough sketch of this follows below).
- flip a global bool when we are doing the first paint and record more telemetry inside painting.
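A rough sketch of what the first idea could look like, with hypothetical probe names and a placeholder RecordSimpleMeasure() helper rather than the real Telemetry API:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical instrumentation sketch for the first idea above: record a
// timestamp before the first paint starts, the duration of the paint itself,
// and a second copy of the existing firstPaint value as a corruption check.
// RecordSimpleMeasure() and the probe names are placeholders, not real APIs.
static bool sIsFirstPaint = true;

static void RecordSimpleMeasure(const char* aName, long long aMs) {
  printf("%s = %lld ms\n", aName, aMs);
}

static long long MsSinceStartup() {
  using namespace std::chrono;
  static const auto start = steady_clock::now();
  return duration_cast<milliseconds>(steady_clock::now() - start).count();
}

static void DoPaint() { /* stand-in for FlushRendering + PresShell::Paint */ }

// Stand-in for the body of nsViewManager::Refresh.
void Refresh() {
  const bool isFirstPaint = sIsFirstPaint;
  if (isFirstPaint) {
    RecordSimpleMeasure("firstPaintStart", MsSinceStartup());
  }

  const auto paintStart = std::chrono::steady_clock::now();
  DoPaint();

  if (isFirstPaint) {
    sIsFirstPaint = false;
    const auto paintMs = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - paintStart).count();
    RecordSimpleMeasure("firstPaintDuration", paintMs);
    RecordSimpleMeasure("firstPaint", MsSinceStartup());   // existing probe
    RecordSimpleMeasure("firstPaint2", MsSinceStartup());  // duplicate, to rule out corruption
  }
}

int main() {
  Refresh();
  return 0;
}
```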
Comment 29•6 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #28)
> Ideas for trying to track this down:
> - record more values in nsViewManager::Refresh (where we record firstPaint now). We could record before we start the first paint, the duration of the first paint itself, and for good measure throw in a second recording at the first paint location to rule out data corruption.
> - flip a global bool when we are doing the first paint and record more telemetry inside painting.

Both of those seem like reasonable approaches.
Comment 30•6 years ago
Are we sure this can't be caused by something silly like Firefox opening in the background and the first paint only getting recorded when the window comes back into the foreground? 65 hours sounds an awful lot like somebody started Firefox on Friday before leaving work and it didn't paint until they got back in on Monday morning.
Assignee
Comment 31•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #30)
> Are we sure this can't be caused by something silly like Firefox opening in the background and the first paint only getting recorded when the window comes back into the foreground? 65 hours sounds an awful lot like somebody started Firefox on Friday before leaving work and it didn't paint until they got back in on Monday morning.

Perhaps, but then we would probably want to understand why it only started showing up (or increased in frequency) with WebRender, and with a very specific half-day regression range.
Assignee
Comment 32•6 years ago
I also compared the session length against the firstPaint value. Of 1600 pings, 64 had firstpaint > 200000, and all of those 64 had firstpaint <= sessionlength. So random corruption doesn't seem likely.
Assignee
Comment 33•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #19)
> Hey Masayuki, per comment 17, I'm trying to assess whether it's possible that we're just not hitting [this line](https://searchfox.org/mozilla-central/rev/99a2a5a955960b0e58ceade1db1f7652d9db4ba1/view/nsViewManager.cpp#346), and thus reporting a long delay before first paint, but still managing to actually paint before then.

Bingo! Turns out firstPaint is a totally broken measurement. If I comment out the FlushRendering and presShell->Paint call in nsViewManager::Refresh, I get a browser that appears to work just fine. The only problem I saw was the awesome bar drop-down just painting black. So it's possible to paint and have a usable browser before the firstPaint event.
I think nsViewManager::Refresh is only called in response to the OS sending us an "expose" event ("you should paint this part of your window"). All painting used to happen this way: if we knew some part of a webpage changed, we would tell the OS, and the OS in turn would send us an expose event. But that is no longer the case.
So this regression is probably actually an improvement, because it means we are doing less painting.
We should fix firstPaint asap though.
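A toy illustration of the two paths being contrasted here (stub types only, under the assumptions just described): the legacy invalidate/expose path that ends in nsViewManager::Refresh and records firstPaint, versus a compositor-driven paint that never goes through Refresh at all.

```cpp
#include <cstdio>

// Sketch of the two paint paths described above (stubs, not Gecko code).
// Path 1: the legacy "expose" path -- we invalidate a rect with the OS, the OS
//         sends WM_PAINT/expose back, and the widget calls Refresh(), which is
//         where firstPaint is recorded.
// Path 2: the compositor-driven path -- the refresh driver paints through the
//         compositor/WebRender without an expose event, so Refresh() (and the
//         firstPaint probe) may never run even though the window is painted.
struct ViewManager {
  void Refresh() { printf("Refresh: paint + record firstPaint\n"); }
};

struct Widget {
  ViewManager* mViewManager;
  void Invalidate() { printf("ask OS to repaint\n"); OnExposeFromOS(); }
  void OnExposeFromOS() { mViewManager->Refresh(); }  // path 1
};

struct RefreshDriver {
  void Tick() { printf("composite via WebRender; no expose, no Refresh\n"); }  // path 2
};

int main() {
  ViewManager vm;
  Widget w{&vm};
  w.Invalidate();  // legacy path: firstPaint gets recorded
  RefreshDriver rd;
  rd.Tick();       // modern path: window paints but firstPaint is never recorded
  return 0;
}
```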
Assignee
Comment 34•6 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #33)
> Bingo! Turns out firstPaint is a totally broken measurement. If I comment out the FlushRendering and presShell->Paint call in nsViewManager::Refresh, I get a browser that appears to work just fine. The only problem I saw was the awesome bar drop-down just painting black. So it's possible to paint and have a usable browser before the firstPaint event.

On Windows the awesomebar dropdown works fine with those lines commented out (I only saw the problem on macOS; I haven't tested Linux). So these pings are likely coming from fully functional sessions.
Assignee
Comment 35•6 years ago
I filed bug 1556568 to fix firstPaint.
Comment 36•6 years ago
Thanks for figuring this out! It seems like this is no longer a bug, and we'll fix firstPaint in bug 1556568.