Regression in firstPaint around April 16th on Windows
Categories
(Toolkit :: Startup and Profile System, defect, P1)
Tracking
| | Tracking | Status |
| --- | --- | --- |
| firefox-esr60 | --- | unaffected |
| firefox66 | --- | unaffected |
| firefox67 | --- | unaffected |
| firefox68 | --- | wontfix |
People
(Reporter: alexical, Assigned: tnikkel)
References
Details
(Keywords: perf, regression, Whiteboard: [fxperf:p1])
Attachments
(1 file)
75.67 KB, image/png
There seems to be a regression on Windows, an improvement on Linux, and no change for OSX. The change is most dramatic at the 95th percentile, but it's noticeable in the median. It's unnoticeable at the 25th percentile, so I think this is likely due to some kind of disk IO.
It also seems to correspond to an improvement in SIMPLE_MEASURES_BLANKWINDOWSHOWN, and there seems to be either an improvement or no change in SIMPLE_MEASURES_DELAYEDSTARTUPFINISHED (difficult to tell because this last measure seems to have improved with bug 1538279.)
Looking into potential causes. Bug 1541798 lines up time-wise, but I can't think of a mechanism by which that could cause this regression.
Reporter
Comment 1•6 years ago
Hmm. This actually does line up with an increase in submissions for SIMPLE_MEASURES_BLANKWINDOWSHOWN, which makes me think bug 1541798 is plausible.
Reporter
Comment 3•6 years ago
I think bug 1541798 is ruled out, as the regression still shows up if I split out pings by whether they have the default theme or not.
The regression is certainly introduced in the later of the two builds for April 16th, though (20190416220148).
Reporter
Comment 4•6 years ago
This is certainly related to WebRender. The attached graph shows SIMPLE_MEASURES_FIRSTPAINT by build date, split out by gfx/features/webrender/status; green is 'available'. The times here are extremely high, around 300 seconds, which makes me think this could be from Bug 1543560. Sotaro, I'm not too familiar with all of the GPU process error handling, but could we now just be waiting where we would previously have noticed a GPU process crash? The other alternative is a deadlock, but I can't see how that would happen at first glance.
Reporter
Comment 5•6 years ago
(To be clear, the above graph is only for the 95th percentile on Windows. The median is mostly unaffected. You can see some impact in the 75th percentile though.)
Comment 6•6 years ago
Mmm, I do not expect that Bug 1543560 could be a direct cause of the problem.
When WebRender hits an error, RendererOGL::NotifyWebRenderError() is called from RendererOGL::UpdateAndRender():
https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/RendererOGL.cpp#124
It ends up in GPUProcessManager::NotifyWebRenderError(), which disables WebRender in the UI process:
https://searchfox.org/mozilla-central/source/gfx/ipc/GPUProcessManager.cpp#435
The CompositorSession is then re-created the next time nsBaseWidget::GetLayerManager() is called.
Bug 1543560 changed WebRender error delivery from CompositorBridgeParent to CompositorManagerParent. Before the Bug 1543560 fix, there could be cases where the WebRender error was not delivered to GPUProcessManager, and cases where it caused an IPC error and crashed the UI process.
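For readers less familiar with this path, here is a heavily simplified, self-contained C++ sketch of the delivery chain described above. The types are bare stubs that only mirror the shape of the real Gecko classes (RendererOGL, GPUProcessManager); the IPC hop and the actual fallback logic are reduced to comments.

```cpp
// Illustrative stubs only -- the real classes live in gfx/webrender_bindings/
// and gfx/ipc/ and do far more than this. The point is the shape of the chain:
// RendererOGL (GPU process) -> IPC -> GPUProcessManager (UI process), which
// then disables WebRender so the next GetLayerManager() call falls back.
#include <cstdio>

enum class WebRenderError { INITIALIZE, RENDER };

// Stand-in for the UI-process singleton that manages the GPU process.
struct GPUProcessManager {
  static GPUProcessManager* Get() {
    static GPUProcessManager sSingleton;
    return &sSingleton;
  }
  void NotifyWebRenderError(WebRenderError aError) {
    // In the real code this disables WebRender via gfx feature flags; the
    // CompositorSession is then re-created the next time
    // nsBaseWidget::GetLayerManager() is called.
    printf("WebRender error %d: disabling WebRender, falling back\n",
           static_cast<int>(aError));
  }
};

// Stand-in for the GPU-process-side renderer.
struct RendererOGL {
  void UpdateAndRender() {
    bool renderFailed = true;  // pretend the driver reported an error
    if (renderFailed) {
      NotifyWebRenderError(WebRenderError::RENDER);
    }
  }
  void NotifyWebRenderError(WebRenderError aError) {
    // In the real code this hop crosses process boundaries over IPC
    // (CompositorManagerParent after bug 1543560, CompositorBridgeParent
    // before it); here it is a direct call.
    GPUProcessManager::Get()->NotifyWebRenderError(aError);
  }
};

int main() {
  RendererOGL renderer;
  renderer.UpdateAndRender();
  return 0;
}
```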
Comment 7•6 years ago
I found a problem: CompositorManagerParent::sInstance is not set in the GPU process, so CompositorManagerParent::NotifyWebRenderError() does nothing.
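A minimal sketch of the failure mode being described, under the assumption that the problem is simply a static instance pointer that never gets populated in the GPU process (stub types, not the real CompositorManagerParent):

```cpp
#include <cstdio>

// Stub illustrating the reported problem: a static singleton pointer that is
// never set in the GPU process, so the error notification silently no-ops
// there and nothing reaches GPUProcessManager.
struct CompositorManagerParent {
  static CompositorManagerParent* sInstance;  // not set in the GPU process
  static void NotifyWebRenderError() {
    if (!sInstance) {
      printf("sInstance is null; WebRender error notification dropped\n");
      return;
    }
    // ... forward the error to the UI process ...
  }
};

CompositorManagerParent* CompositorManagerParent::sInstance = nullptr;

int main() {
  CompositorManagerParent::NotifyWebRenderError();
  return 0;
}
```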
Comment 9•6 years ago
The Bug 1546298 fix does not seem to affect https://mzl.la/2GsVK7z, so Bug 1543560 does not appear to be related to this bug. NotifyWebRenderError() is called only during fallback from WebRender to the compositor because of a WebRender error, which happens only rarely.
Reporter
Comment 10•6 years ago
Hmm. Going to look deeper at my changes in bug 1538710 and see if that could be the culprit.
Reporter
Comment 11•6 years ago
Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
Comment 12•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #11)
> Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...

Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?
Reporter
Comment 13•6 years ago
(In reply to :Gijs (he/him) from comment #12)
> (In reply to Doug Thayer [:dthayer] from comment #11)
> > Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
>
> Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?

Can you clarify? If Windows could load the wrong version of a dll and the right version of other files (which would be necessary to report with the new build ID, I believe), then I think we would see general chaos after updates.

In general it could be possible for a regression to show up an update later, if it involved, say, a cache that we build in a different way, so that we would have to first build the cache and only see the changes on the subsequent run, but there shouldn't be anything in bug 1538710 like that.
Comment 14•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #13)
> (In reply to :Gijs (he/him) from comment #12)
> > (In reply to Doug Thayer [:dthayer] from comment #11)
> > > Eh, the problem with that idea is that the regression clearly begins with 20190416220148, and my changes landed in 20190416095014...
> >
> > Is it possible the regression would only show up after at least one app update after a run with your changes, e.g. because Windows still has a preloaded copy of the "wrong" xul.dll?
>
> Can you clarify? If Windows could load the wrong version of a dll and the right version of other files (which would be necessary to report with the new build ID, I believe), then I think we would see general chaos after updates.

Sorry, yeah, I meant something roughly like the following sequence of events:
- update to version X
- start version X
- we preload xul.dll version X as part of that, and then actually load it
- update to version X+1
- start version X+1
and then *waves hands vaguely* somehow having cached version X means we're slower at loading version X+1, because Windows or we don't do the requisite checks as to whether the file has changed? Or something?

> In general it could be possible for a regression to show up an update later, if it involved, say, a cache that we build in a different way, so that we would have to first build the cache and only see the changes on the subsequent run, but there shouldn't be anything in bug 1538710 like that.

Mm, I was thinking of bug 1538279, but then I guess we would have expected to see something sooner...
Reporter
Comment 15•6 years ago
(In reply to :Gijs (he/him) from comment #14)
> Mm, I was thinking of bug 1538279, but then I guess we would have expected to see something sooner...

Ah - yeah, I think bug 1538279 is ruled out because this is WebRender-specific and such a dramatic change (i.e., ~15s -> ~300s). I think something has to be completely broken under certain circumstances to explain this - not just slow.
Reporter
Comment 17•6 years ago
(In reply to Dave Townsend [:mossop] (he/him) from comment #16)
> Should we be tracking this for 68?

There's a chance it's just a reporting problem. Given the severity of the problem as reported in the data (>300s startups), I am baffled as to why we don't seem to have seen reports of Firefox effectively failing to start up.

But since we can't currently rule out a real user-visible problem, I think yes.
Comment 18•6 years ago
[Tracking Requested - why for this release]:
See comment 17.
Reporter
Comment 19•6 years ago
Hey Masayuki, per comment 17, I'm trying to assess whether it's possible that we're just not hitting [this line](https://searchfox.org/mozilla-central/rev/99a2a5a955960b0e58ceade1db1f7652d9db4ba1/view/nsViewManager.cpp#346), and thus reporting a long delay before first paint while still managing to actually paint before then. Since I know you've been doing work with the PresShell and this code is behind an `if` on mPresShell, do you think that's possible?
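For context, the relevant shape of that code looks roughly like the sketch below. This is a simplified stand-in, not the real nsViewManager; RecordFirstPaintTelemetry() is a hypothetical helper standing in for whatever actually records the firstPaint simple measure.

```cpp
#include <cstdio>

// Simplified stand-ins; the real code is view/nsViewManager.cpp. The shape
// that matters for the question above: the first-paint recording sits inside
// Refresh(), on the branch guarded by mPresShell, so if Refresh() is never
// reached (or mPresShell is null) the probe is simply never recorded.
struct PresShell {
  bool mIsFirstPaint = true;
  void Paint() { printf("painting\n"); }
};

struct nsViewManager {
  PresShell* mPresShell = nullptr;  // set from PresShell::Init(), cleared in Destroy()

  void Refresh() {
    if (mPresShell) {
      mPresShell->Paint();
      if (mPresShell->mIsFirstPaint) {
        mPresShell->mIsFirstPaint = false;
        RecordFirstPaintTelemetry();  // hypothetical helper standing in for the
                                      // real firstPaint simple-measure recording
      }
    }
  }

  void RecordFirstPaintTelemetry() { printf("firstPaint recorded\n"); }
};

int main() {
  PresShell shell;
  nsViewManager vm;
  vm.mPresShell = &shell;
  vm.Refresh();
  return 0;
}
```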
Comment 20•6 years ago
I'm not so familiar with the initialization of PresShell and related objects, though. nsViewManager::mPresShell is set only by nsViewManager::SetPresShell() (I mean that it's not in the CC either). That is called only by PresShell::Init() and PresShell::Destroy(). And PresShell::Init() is always called immediately after the instance is created, so mPresShell is always set to a non-nullptr value. So it seems that the line should be hit as you expected.
Reporter
Comment 21•6 years ago
Inspecting one of the pings with a slow (>200 seconds) firstPaint measure, it seems clear from the gfx-related histograms that the user is successfully viewing webpages and everything is visually working as normal. Apparently we are just somehow not hitting the initial firstPaint recording.
Reporter
Comment 22•6 years ago
Jeff, this came up in our meeting a few weeks ago and you seemed to imply that you guys would take over investigating it. Am I remembering that correctly? Just want to make sure someone has this in mind.
Comment 23•6 years ago
I don't think anyone's actively looking at this but we should probably find someone.
Comment 24•6 years ago
tnikkel will take a look; we will see if there is anything actionable.
Comment 25•6 years ago
(Showing up on the list of unowned tracked beta bugs, so assigning per comment 24 for next steps.)
Assignee
Comment 26•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #21)
> Inspecting one of the pings with a slow (>200 seconds) firstPaint measure, it seems clear from the gfx-related histograms that the user is successfully viewing webpages and everything is visually working as normal. Apparently we are just somehow not hitting the initial firstPaint recording.

Was the viewing of webpages and everything working normally that you observed happening before the 200 seconds were up (i.e. before the first paint measure was recorded)?
Reporter
Comment 27•6 years ago
Eugh, no, that's a fair point, and I'm not sure why I missed it. There are no measures that I know of which are timestamped in such a way as to demonstrate this. The only thing is that the other simpleMeasurement measures are all fine, but I don't think any of them actually require a successful paint. For example, delayedStartupFinished, which traditionally comes after firstPaint, comes before it in these pings, but I don't know of any solid reason why that couldn't happen while we legitimately haven't painted.
I think this puts us back to the puzzle of how ~5% of WR-enabled users could be seeing several-minute-long startups without any of them reporting a problem. So I would still say faulty reporting is the most attractive explanation.
Assignee
Comment 28•6 years ago
Regression range from comment 11:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a17f163841e956c10eb74cdbe27d7073facf6b7f&tochange=12a60898fdc12e4ab8735f228a77d6c1e6a600a9
The only thing that sticks out to me that hasn't been mentioned is bug 1531838.
I went and looked at a bunch of telemetry pings; one even had a first paint of 235699519 (~65 hours). I looked at pings with crazy times like that, and at ones with more reasonable 200-400s first paint times. In each ping I looked at:
- WR_SCENESWAP_TIME
- WR_RENDER_TIME
- WR_SCENEBUILD_TIME
- PAINT_BUILD_DISPLAYLIST_TIME
- CONTENT_PAINT_TIME
- COMPOSITE_TIME
- sessionRestoreInit
- createTopLevelWindow
- sessionRestored
In each case they all looked normal. There didn't seem to be any huge outliers. In most cases they actually seemed to be performing quite well. The worst I saw was a couple of composite times between 300-1000ms.
I looked to see if the number of total composites had any relation to the first paint time when firstpaint > 200000. It didn't seem like it.
Ideas for trying to track this down:
- record more values in nsViewManager::Refresh (where we record firstPaint now). We could record before we start the first paint, the duration of the first paint itself, and for good measure throw in a second recording at the first paint location to rule out data corruption (a rough sketch of this follows below).
- flip a global bool when we are doing the first paint and record more telemetry inside painting.
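A rough sketch of what the first idea could look like, with hypothetical probe names and a placeholder RecordSimpleMeasure() helper rather than the real Telemetry API:

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical instrumentation sketch for the first idea above: record a
// timestamp before the first paint starts, the duration of the paint itself,
// and a second copy of the existing firstPaint value as a corruption check.
// RecordSimpleMeasure() and the probe names are placeholders, not real APIs.
static bool sIsFirstPaint = true;

static void RecordSimpleMeasure(const char* aName, long long aMs) {
  printf("%s = %lld ms\n", aName, aMs);
}

static long long MsSinceStartup() {
  using namespace std::chrono;
  static const auto start = steady_clock::now();
  return duration_cast<milliseconds>(steady_clock::now() - start).count();
}

static void DoPaint() { /* stand-in for FlushRendering + PresShell::Paint */ }

// Stand-in for the body of nsViewManager::Refresh.
void Refresh() {
  const bool isFirstPaint = sIsFirstPaint;
  if (isFirstPaint) {
    RecordSimpleMeasure("firstPaintStart", MsSinceStartup());
  }

  const auto paintStart = std::chrono::steady_clock::now();
  DoPaint();

  if (isFirstPaint) {
    sIsFirstPaint = false;
    const auto paintMs = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - paintStart).count();
    RecordSimpleMeasure("firstPaintDuration", paintMs);
    RecordSimpleMeasure("firstPaint", MsSinceStartup());   // existing probe
    RecordSimpleMeasure("firstPaint2", MsSinceStartup());  // duplicate, to rule out corruption
  }
}

int main() {
  Refresh();
  return 0;
}
```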
Comment 29•6 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #28)
> Ideas for trying to track this down:
> - record more values in nsViewManager::Refresh (where we record firstPaint now). We could record before we start the first paint, the duration of the first paint itself, and for good measure throw in a second recording at the first paint location to rule out data corruption.
> - flip a global bool when we are doing the first paint and record more telemetry inside painting.

Both of those seem like reasonable approaches.
Comment 30•6 years ago
Are we sure this can't be caused by something silly like Firefox opening in the background and the first paint only getting recorded when the window comes back into the foreground? 65 hours sounds an awful lot like somebody started Firefox on Friday before leaving work and it didn't paint until they got back in on Monday morning.
Assignee
Comment 31•6 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #30)
> Are we sure this can't be caused by something silly like Firefox opening in the background and the first paint only getting recorded when the window comes back into the foreground? 65 hours sounds an awful lot like somebody started Firefox on Friday before leaving work and it didn't paint until they got back in on Monday morning.

Perhaps, but then we would probably want to understand why it only started showing up (or increased in frequency) with WebRender, and with a very specific half-day regression range.
Assignee
Comment 32•6 years ago
I also compared the session length against the firstPaint value. Of 1600 pings, 64 had firstpaint > 200000, and all of those 64 had firstpaint <= sessionlength. So random corruption doesn't seem likely.
Assignee
Comment 33•6 years ago
(In reply to Doug Thayer [:dthayer] from comment #19)
> Hey Masayuki, per comment 17, I'm trying to assess whether it's possible that we're just not hitting [this line](https://searchfox.org/mozilla-central/rev/99a2a5a955960b0e58ceade1db1f7652d9db4ba1/view/nsViewManager.cpp#346), and thus reporting a long delay before first paint, but still managing to actually paint before then.

Bingo! Turns out firstPaint is a totally broken measurement. If I comment out the FlushRendering and presShell->Paint call in nsViewManager::Refresh, I get a browser that appears to work just fine. The only problem I saw was the awesome bar drop-down just painting black. So it's possible to paint and have a usable browser before the firstPaint event.
I think nsViewManager::Refresh is only called in response to the OS sending us an "expose" event ("you should paint this part of your window"). All painting used to happen this way: if we knew some part of a webpage changed, we would tell the OS, and the OS in turn would send us an expose event. But that is no longer the case.
So this regression is probably actually an improvement, because it means we are doing less painting.
We should fix firstPaint asap though.
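A toy illustration of the two paths being contrasted here (stub types only, under the assumptions just described): the legacy invalidate/expose path that ends in nsViewManager::Refresh and records firstPaint, versus a compositor-driven paint that never goes through Refresh at all.

```cpp
#include <cstdio>

// Sketch of the two paint paths described above (stubs, not Gecko code).
// Path 1: the legacy "expose" path -- we invalidate a rect with the OS, the OS
//         sends WM_PAINT/expose back, and the widget calls Refresh(), which is
//         where firstPaint is recorded.
// Path 2: the compositor-driven path -- the refresh driver paints through the
//         compositor/WebRender without an expose event, so Refresh() (and the
//         firstPaint probe) may never run even though the window is painted.
struct ViewManager {
  void Refresh() { printf("Refresh: paint + record firstPaint\n"); }
};

struct Widget {
  ViewManager* mViewManager;
  void Invalidate() { printf("ask OS to repaint\n"); OnExposeFromOS(); }
  void OnExposeFromOS() { mViewManager->Refresh(); }  // path 1
};

struct RefreshDriver {
  void Tick() { printf("composite via WebRender; no expose, no Refresh\n"); }  // path 2
};

int main() {
  ViewManager vm;
  Widget w{&vm};
  w.Invalidate();  // legacy path: firstPaint gets recorded
  RefreshDriver rd;
  rd.Tick();       // modern path: window paints but firstPaint is never recorded
  return 0;
}
```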
Assignee
Comment 34•6 years ago
(In reply to Timothy Nikkel (:tnikkel) from comment #33)
> Bingo! Turns out firstPaint is a totally broken measurement. If I comment out the FlushRendering and presShell->Paint call in nsViewManager::Refresh, I get a browser that appears to work just fine. The only problem I saw was the awesome bar drop-down just painting black. So it's possible to paint and have a usable browser before the firstPaint event.

On Windows the awesomebar dropdown works fine with those lines commented out (I only saw the problem on macOS; I haven't tested Linux). So these pings are likely coming from fully functional sessions.
Assignee
Comment 35•6 years ago
I filed bug 1556568 to fix firstPaint.
Comment 36•6 years ago
Thanks for figuring this out! It seems like this is no longer a bug, and we'll fix firstPaint in bug 1556568.