Closed Bug 1227275 Opened 9 years ago Closed 8 years ago

70-80% e10s sessionrestore regression on fx-team (v.45) on Nov 20, 2015 from push 924d421d766a

Categories

(Firefox :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Tracking Status
e10s + ---

People

(Reporter: jmaher, Assigned: mconley)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Talos has detected a Firefox performance regression from your commit 924d421d766a in bug 1209689.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=eff4131a3e4caf33761bd52cb9a5f2dd5601c4f1&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#sessionrestore.2Fsessionrestore_no_auto_restore

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux64,win64,win32 -u none -t other  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a sessionrestore

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Thursday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
oh, and the most important thing, the compare view:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=fx-team&originalRevision=a0db720c980e&newProject=fx-team&newRevision=924d421d766a

this will be useful when we actually get more data for each test job
Assignee: nobody → mconley
Flags: needinfo?(mconley)
Seems to place the blame squarely on https://hg.mozilla.org/mozilla-central/rev/243353ae9c08.
yes, that is the culprit
One interesting thing to note: this regression just makes the e10s results the same as that which we were seeing with non-e10s:

https://treeherder.mozilla.org/perf.html#/graphs?series=[mozilla-inbound,c22155824871783bb5d8f514b3a01b4c885034c0,1]&series=[mozilla-inbound,da512b523fc252e7eca6dd4a0dc5c217125f6948,1]

This makes me question whether this is really a regression, or we've just changed the way we're measuring the same thing.
I've figured out what's happening here - it's tied into what wlach is talking about in comment 5. Basically, the patch I identified in comment 3 changes things so that unrestored background tabs are non-remote browsers (just like in non-e10s mode), which means that they have to load scripts / initialize as soon as they're put into the DOM just like in single-process mode[1].

So this regression is actually us behaving slightly more like single-process Firefox for session restore (but a little bit better, since the selected browser loads remotely).

I still think this is worth fixing, since the ~70% win that loading the background tabs in remote browsers gave us is highly attractive and should be fought for, but I don't think it's worth backing out bug 1209689 for, since the e10s release criteria would not block us from releasing with the same session restore performance.

So what I suggest is that we put this bug into the e10s-perf meta bug, and have the e10s team tackle this as a "juicy perf win opportunity" that we can gain back with some thinking, but not a top priority.

Does that sound reasonable, jmaher?

[1]: Unlike with e10s, where they load scripts and initialize, but in the content process, which frees up the parent more.
Flags: needinfo?(jmaher)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #6)
> Does that sound reasonable, avih?

It does. How about you assign it to someone to hopefully reduce the chance it becomes priority Z?

Vladan, what say you?
Flags: needinfo?(jmaher) → needinfo?(vladan.bugzilla)
/me likes
Blocks: e10s-perf
Yes, this approach is fine by me
Flags: needinfo?(vladan.bugzilla)
Whiteboard: [talos_regression][e10s] → [talos_regression][e10s][MemShrink]
this is on aurora now!  :mconley, do you have any further information on this issue?
Flags: needinfo?(mconley)
(In reply to Joel Maher (:jmaher) from comment #10)
> this is on aurora now!  :mconley, do you have any further information on
> this issue?

Hey Joel,

Yeah, we're fine with this e10s-only regression (see comment 6). Basically, we're still better than non-e10s here, and we know where the slowdown is coming from. I'd love to optimize this in the future by making it so that background unrestored tabs are remote again (but un-crashable, or at least resilient to crashing), but that's not currently a high priority.

Does that clear it up?
Flags: needinfo?(mconley) → needinfo?(jmaher)
sounds great- should we leave this open?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #12)
> sounds great- should we leave this open?

The work that I describe in comment 11 could be done here, or in a separate bug, it doesn't really matter to me. Is this bug decorated in such a way as to cause noise on some of your dashboards? If so, I'm okay with WONTFIX'ing this, and I'll file a new bug.
Flags: needinfo?(jmaher)
we can leave this open!  thanks for clarifying.
Flags: needinfo?(jmaher)
Whiteboard: [talos_regression][e10s][MemShrink] → [talos_regression][e10s]
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #11)
> (In reply to Joel Maher (:jmaher) from comment #10)
> > this is on aurora now!  :mconley, do you have any further information on
> > this issue?
> 
> Hey Joel,
> 
> Yeah, we're fine with this e10s-only regression (see comment 6). Basically,
> we're still better than non-e10s here, and we know where the slowdown is
> coming from. I'd love to optimize this in the future by making it so that
> background unrestored tabs are remote again (but un-crashable, or at least
> resilient to crashing), but that's not currently a high priority.
> 
> Does that clear it up?

Mike, it looks like Bill has been working on a bunch of potential causes for crashes and hangs with his Fuzzer in bug 1240985

working with him might help get restored unloaded tabs back into the remote processor.
Flags: needinfo?(mconley)
(In reply to Danial Horton from comment #16)
> (In reply to Mike Conley (:mconley) - Needinfo me! from comment #11)
> > (In reply to Joel Maher (:jmaher) from comment #10)
> > > this is on aurora now!  :mconley, do you have any further information on
> > > this issue?
> > 
> > Hey Joel,
> > 
> > Yeah, we're fine with this e10s-only regression (see comment 6). Basically,
> > we're still better than non-e10s here, and we know where the slowdown is
> > coming from. I'd love to optimize this in the future by making it so that
> > background unrestored tabs are remote again (but un-crashable, or at least
> > resilient to crashing), but that's not currently a high priority.
> > 
> > Does that clear it up?
> 
> Mike, it looks like Bill has been working on a bunch of potential causes for
> crashes and hangs with his Fuzzer in bug 1240985
> 
> working with him might help get restored unloaded tabs back into the remote
> processor.

Yes, billm's work probably will reduce the number of tab crashes that users are likely to see. We still need to account for the background unrestored tab crash case, which is why made those tabs run in the parent process.

I actually wonder if bug 906076 might be the best way forward there instead, so that we don't bog down _either_ process while the tabs sit in the background.
Flags: needinfo?(mconley)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #17)
> I actually wonder if bug 906076 might be the best way forward there instead,
> so that we don't bog down _either_ process while the tabs sit in the
> background.

If I understand that bug correctly then it attempts to lazify the instantiation of various tab-related things. But lazification does little if addons try to access those properties that trigger instantiation.

In those cases it may still be useful to have them in the child process.
This is true. I wonder if it might be smarter to have them run in some kind of dummy process though so that a large number of unrestored background tabs won't bog down (or get crashed by) restored tabs. Then, when restored, we put them into an appropriate process.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
yeah, no, sorry Jim, I'm not letting you get away with this regression.

This should stay open until the lazy session restoring is implemented.

Joel said this can stay open till the talos regression is resolved one way or another.
I thought we agreed to take this one due to the new behavior. It's not blocking e10s no issue with it remaining open if Mike see's a need.
Flags: needinfo?(mconley)
(In reply to Danial Horton from comment #20)
> yeah, no, sorry Jim, I'm not letting you get away with this regression.

To be clear, this is not a regression from non-e10s. In fact, e10s performs equally if not slightly better than non-e10s on this benchmark.

What I've done is filed bug 1256472 to track the work to attempt to regain the session restore wins that we (unfortunately) had to give up.
Flags: needinfo?(mconley)
You need to log in before you can comment on or make changes to this bug.