Closed Bug 1373695 Opened 3 years ago Closed 2 years ago

4.62% tpaint (osx-10-10) regression on push 8dd7bd8a399a1d21cad7361765b8c85947b49be5 (Thu Jun 15 2017)

Categories

(Core :: DOM: Core & HTML, defect, P1)

56 Branch
defect

Tracking

()

RESOLVED INVALID
Tracking Status
firefox-esr52 --- unaffected
firefox55 --- unaffected
firefox56 --- affected
firefox57 --- affected

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, talos-regression)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=8dd7bd8a399a1d21cad7361765b8c85947b49be5

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

  5%  tpaint osx-10-10 opt e10s     301.29 -> 315.22


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=7330

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: Untriaged → DOM
Product: Firefox → Core
Version: 53 Branch → 56 Branch
:mystor, can you take a look at this regression as it is related to the patch from bug 1369627 based on retriggering and seeing a consistent pattern.
Flags: needinfo?(michael)
Can we take a look at the new numbers now that bug 1343728 is on inbound and see if the issue persists? I expect that the performance regression will disappear with these new patches :).

If it doesn't, I'll look into it more on Monday.
Flags: needinfo?(michael) → needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #3)
> unfortunately this didn't help:
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=1209600&series=%5Bmozilla-inbound,
> 8faafec2fe5cc222e3f6c8d5b053fa72b8c93c8f,1%5D&series=%5Bautoland,
> 8faafec2fe5cc222e3f6c8d5b053fa72b8c93c8f,0%5D&selected=%5Bmozilla-inbound,
> 8faafec2fe5cc222e3f6c8d5b053fa72b8c93c8f%5D
> 
> It was worth a try!

The patch I was holding out hope for ended up being backed out. I want to give it another try too :). I'll start working on getting a setup for profiling this on a mac as well however.
Assigning to Michael as he's looking into this.
Assignee: nobody → michael
Priority: -- → P1
On my local mac machine my latest patches for bug 1343728 do have about a 5%-ish performance improvement over current inbound, so I think that patch definitely helps.

I looked at some profiles of running this test locally, and I didn't see any of the code which the guilty patch changed running, except for one profile (which I unfortunately lost) where I noticed an unnecessary sync IPC being sent from the child process. I intend to keep an eye on the window opening code to see if I can find and remove that bug.
:mystor, can you update this bug?  I want to make sure we don't forget about this.
Flags: needinfo?(michael)
I just got an r+ on the bug which was blocking bug 1343728, so I'm going to try to land it again. If it sticks, we should look at the perf numbers again and see if we still have a problem.
Flags: needinfo?(michael)
The code which I was hoping would improve tpaint performance has landed, and it doesn't seem to be eliminating this macOS only regression from my reading of the talos numbers (although I must admit I have no idea how to read them).
ni? myself to look into this more.
Flags: needinfo?(michael)
I was talking to ehsan a few days ago. Apparently this test doesn't necessarily actually test the time to the first complete paint, but rather to the first composite which occurs. My change may actually have improved speed (by causing a the full composite to occur earlier, and avoiding an empty composite), but it would appear to be a performance regression because we were timing the same thing.

mconley, ehsan suggested you had run into something like this with tpaint in the past. Do you know if that might be what is going on here?
Flags: needinfo?(michael) → needinfo?(mconley)
(In reply to Michael Layzell [:mystor] from comment #11)
> mconley, ehsan suggested you had run into something like this with tpaint in
> the past. Do you know if that might be what is going on here?

It might be, but at this time, it's not super easy to tell, since the MozAfterPaint event that gets fired in the new window opened by tpaint doesn't tell us whether or not the rect that was invalidated is empty. That actually is something we're struggling with in bug 1371341, coincidentally.

Do you happen to know if it's straight-forward to get the rect exposed to the event that's fired in unprivileged content?
Flags: needinfo?(mconley) → needinfo?(michael)
(In reply to Mike Conley (:mconley) - Digging out of needinfo / review backlog. from comment #12)
> Do you happen to know if it's straight-forward to get the rect exposed to
> the event that's fired in unprivileged content?

I don't know anything about the MozAfterPaint event unfortunately. It looks like tnikkel added some information to the event recently in bug 1264798, so ni?-ing him.
Flags: needinfo?(michael) → needinfo?(tnikkel)
(In reply to Mike Conley (:mconley) - Digging out of needinfo / review backlog. from comment #12)
> Do you happen to know if it's straight-forward to get the rect exposed to
> the event that's fired in unprivileged content?

Seems like the same question is asked and answered in bug 1371341, comment 9 to comment 11.
Flags: needinfo?(tnikkel)
How must we conclude this? The metrics remained the same since this landed.
Flags: needinfo?(michael)
Flags: needinfo?(mconley)
From the profiles I've looked at I haven't seen any of the code which the regressing bug added being run in the profiles, however, the code which the regressing bug added _does_ cause the first non-empty paint to occur earlier (as the content process now knows its dimensions sooner, and thus paints sooner). I think this probably actually improved performance in some small way, but that the test was measuring the wrong thing.

I think the way to fix this test is probably in bug 1371341, so I'm going to make this block on that bug.

TL;DR, I don't think this is a real regression, but I can always be wrong.
Depends on: 1371341
Flags: needinfo?(michael)
I suggest once bug 1377251's last remaining patch is landed, we hack together a version of tpaint that uses the TIME_TO_NON_BLANK_PAINT entry and push to try with and without mystor's patch. If the scores are about the same, I think we should close this out.
Flags: needinfo?(mconley)
I see bug 1377251 is now resolved. Do you have the time to work on this hack to see if the regression persists?
Flags: needinfo?(mconley)
Hm. It's been a while since bug 1369627 landed, so a backout is going to be tricky to test. mystor, is there a simple way to "neuter" the changes from bug 1369627 so that their codepaths aren't hit?

Or have we reached and passed the point of no return here?
Flags: needinfo?(mconley) → needinfo?(michael)
We would need to remove dimensions from CreatedWindowinfo, http://searchfox.org/mozilla-central/rev/f0e4ae5f8c40ba742214e89aba3f554da0b89a33/dom/ipc/DOMTypes.ipdlh#121, stop calling RecvUpdateWindowDimensions here http://searchfox.org/mozilla-central/rev/f0e4ae5f8c40ba742214e89aba3f554da0b89a33/dom/ipc/ContentChild.cpp#1034, and delete the corresponding setters for dimensions on the sending side: http://searchfox.org/mozilla-central/rev/f0e4ae5f8c40ba742214e89aba3f554da0b89a33/dom/ipc/ContentParent.cpp#4706.

_however_ I don't think this would fully fix it.

I did try this a while ago, when I was working on bug 1343728, and it turns out from the small amount of testing which I did that if we don't send down the dimension information like this, tests start failing :-/. We might be able to get it stable enough to run talos though.
Flags: needinfo?(michael)
Is this only 10.10 regression?
(In reply to Jim Mathies [:jimm] from comment #21)
> Is this only 10.10 regression?

This regression doesn't hit outside of macOS, yes. I don't know if we run talos on other macOS versions.
Assignee: michael → nobody
I don't think we should keep this bug open. 

We don't really have a good/easy way to check, unfortunately, but this regression was probably caused by the problem fixed with bug 1377251. Namely, this patch causes the child to be able to do a non-empty paint faster (the dimensions are sent down faster), but does not run any code which should have any significant runtime. The most likely cause of this apparent regression is simply that we performed a non-empty paint faster.

Unless we want to put engineering effort into developing a version of the browser without the code changes which I made in that patch, which are necessary for the async window.open work, we can't really easily check this, so I think it's reasonable to close.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.