Closed Bug 1464058 Opened 6 years ago Closed 6 years ago

51.6 - 159.95% sessionrestore_many_windows (windows10-64, windows10-64-qr) regression on push 4269e02632264e158ac9d7cfc12cb33ab0dc8bf5 (Thu May 24 2018)

Categories

(Firefox :: Session Restore, defect)

59 Branch
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX
Firefox 62
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- unaffected
firefox60 --- unaffected
firefox61 --- unaffected
firefox62 + wontfix
firefox63 + wontfix

People

(Reporter: igoldan, Assigned: jkt)

References

(Depends on 1 open bug)

Details

(Keywords: perf, regression, talos-regression)

Attachments

(2 files)

Talos has detected a Firefox performance regression from push:

https://hg.mozilla.org/integration/autoland/pushloghtml?changeset=4269e02632264e158ac9d7cfc12cb33ab0dc8bf5

As author of one of the patches included in that push, we need your help to address this regression.

Regressions:

160%  sessionrestore_many_windows windows10-64 pgo e10s stylo     834.42 -> 2,169.08
 52%  sessionrestore_many_windows windows10-64-qr opt e10s stylo  2,609.17 -> 3,955.42


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=13405

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
Component: General → Session Restore
Product: Testing → Firefox
Flags: needinfo?(jkt)
Hey so I haven't run this locally yet as I'm doing another build.

The code itself changed only a few lines:
1. entry.principalToInherit_base64 will potentially now be null instead of undefined
2. entry.triggeringPrincipal_base64 will potentially now be null instead of undefined
3. We check for no triggering principal and generate a debug log and null principal

1. and 2. I can't see there is a perf hit here that is significant, especially was we removed some other statements in exchange for this.
3. This makes more sense as the test doesn't have principals in the file: testing/talos/talos/startup_test/sessionrestore/profile-manywindows/sessionstore.js

This should actually be an edge case so I personally think it's reasonable we increased startup time here (I suspect creating more null principals is expensive). Are we sure that the previous code was even loading these tabs at all given the triggering principal check that was added in which prevented them from loading without a principal.

I personally think we should be serializing a principal into this test and reset the counters to get a more realistic profile.

> Please let us know your plans within 3 business days, or the offending patch(es) will be backed out!

I'm out of the office until Tuesday and I am not sure if I will be able to work on it before then.

Thanks
Jonathan
Flags: needinfo?(jkt)
(In reply to Jonathan Kingston [:jkt] from comment #1)
> Are we sure that the previous code was even loading these tabs
> at all given the triggering principal check that was added in which
> prevented them from loading without a principal.

If this is a question related to the test, I'm not able to answer that. :mdeboer, can you give more details for Jonathan?
Flags: needinfo?(mdeboer)
The test itself certainly doesn't have triggering principals, another patch I am working on made every tab crash without this change (as I'm asserting it's always set on tab creation also).

We should treat a missing principal as an edge case, I'm fine for it to be in the perf testing however this doesn't seem to be in the spirit of what we are checking here.

I'm still rebuilding a clean build that can run talos, however I'm 99% certain those tabs will be blank. This accounts for why the perf increase over the creation of null principals. This means we haven't been checking session restore perf at all.
Good point to update the profile that this test uses - it basically needs a fresh sessionstore.jsonlz4.

Now we _do_ need to think about what has happened when users restored their session after the Firefox update that required the principal data to be present. I mean, following the same logic this would mean that all these users would all have ended up with blank tabs, unresponsive tabs.
Also, wouldn't it make more sense that when the user hits the refresh button or Enter in the address bar, that the document load would be initiated with the current principal, including updating the navigation history entries in that case? Because right now the 'blank tabs' repeat the same thing with invalid principal data over and over again...
Flags: needinfo?(mdeboer)
> Now we _do_ need to think about what has happened when users restored their session after the Firefox update
> that required the principal data to be present.

This was multiple releases after we started writing out the principal data to the session store file, if I recall events correctly.

> that the document load would be initiated with the current principal

What current principal?

> I suspect creating more null principals is expensive

Creating nullprincipals is somewhat expensive (have to create a UUID for each one).  As you note, actually loading stuff is even more expensive.  ;)
(In reply to Boris Zbarsky [:bz] (no decent commit message means r-) from comment #6)
> This was multiple releases after we started writing out the principal data
> to the session store file, if I recall events correctly.

Ah true, I remember this too now. Thanks :)

> What current principal?

Déjà vu! We discussed this already way back when with :ckerschb, I think. Nothing to do here.
> Good point to update the profile that this test uses - it basically needs a fresh sessionstore.jsonlz4.

I'm not sure how these files got generated. Even loading the JS files doesn't seem to generate a lz4 file with the principals. Any tips?
Flags: needinfo?(mdeboer)
Actually I managed to get the lz4, any advice on how to get the json back out? (I assume we want this checked in also)
Attached file mozlz4
Cool that you could make it work! I was about to suggest to select and open each restored tab, so that they are fully loaded to get proper sessionstore data flushed to disk.

You can un-lz4 the file using the `mozlz4` script I posted above - which I got off a StackOverflow answer:

`mozlz4 -d < sessionstore.jsonlz4 > sessionstore.js`
Flags: needinfo?(mdeboer)
Tracking for 62 to make sure we follow up here.
:jkt What's the status here? Have you opened separate bugs for this? If so, please link them to this one.
Flags: needinfo?(jkt)
There is a test fail on try but it seems unrelated: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9880bd5aab92f829a8f59dcef30049950a42b7a1
Flags: needinfo?(jkt)
(In reply to Jonathan Kingston [:jkt] from comment #14)
> There is a test fail on try but it seems unrelated:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=9880bd5aab92f829a8f59dcef30049950a42b7a1

If you're referring to the T-e10s(x) failures from Windows 10, those are expected. Currently, their not functional (see bug 1451682 and bug 1439588 for reasons).
Hey :mythmon,

Did you manage to look at the patch at all? I think it would be great to get this land in this release.

Thanks
Jonathan
Flags: needinfo?(mcooper)
I don't think I'm the right person to be reviewing this patch. I think you meant mdeboer. I've ni?ed him instead.
Flags: needinfo?(mcooper) → needinfo?(mdeboer)
If this doesn't land before next Monday, please request uplift to beta 62 next week after the merge.
Assignee: nobody → jkt
Flags: needinfo?(jkt)
Comment on attachment 8984389 [details]
Bug 1464058 - Rebuild profile for session restore talos test to add principals to the storage.

https://reviewboard.mozilla.org/r/250214/#review257600

rs=me. Thanks Jonathan and my apologies for reviewing this after such a long delay...
Attachment #8984389 - Flags: review?(mdeboer) → review+
Jonathan, did you see the perma-orange of the xperf talos suite in your latest try run? Did you have time to check that out?
Flags: needinfo?(mdeboer)
Per comment 15, those are safe to ignore.
Pushed by mdeboer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe3e2484074b
Rebuild profile for session restore talos test to add principals to the storage. r=mikedeboer
https://hg.mozilla.org/mozilla-central/rev/fe3e2484074b
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 62
Thanks all!
Flags: needinfo?(jkt)
Ionut, can you please verify that the regression is fixed?
Flags: needinfo?(igoldan)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #25)
> Ionut, can you please verify that the regression is fixed?

I'm afraid very little has changed, since the push from comment 23. There are some *very* small improvements, on some of the builds. Still, even for these, I cannot confirm they're real. I'm reopening the bug.
Flags: needinfo?(igoldan)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hey :mikedeboer,

I'm really struggling to work out what this test is supposed to be doing. It currently doesn't use the mitm proxy so appears to fail to load the tab with "unable to connect" error pages, this likely is less deterministic than we might like.

The urls appear to be like "http://localhost:8080/?httpsurl=www.facebook.com/" Is this something the proxy should be handling?

When using the proxy I get the following errors:
400 Bad Request

HttpException('Invalid HTTP request form (expected: authority or absolute, got: relative)',)

It looks like we need to specify recordings / web pages that are in the proxy?

Thanks
Flags: needinfo?(mdeboer)
Emailing Mike with the question from comment 27.
(In reply to Jonathan Kingston [:jkt] from comment #27)
> Hey :mikedeboer,
> 
> I'm really struggling to work out what this test is supposed to be doing. It
> currently doesn't use the mitm proxy so appears to fail to load the tab with
> "unable to connect" error pages, this likely is less deterministic than we
> might like.
> 
> The urls appear to be like
> "http://localhost:8080/?httpsurl=www.facebook.com/" Is this something the
> proxy should be handling?
> 
> When using the proxy I get the following errors:
> 400 Bad Request
> 
> HttpException('Invalid HTTP request form (expected: authority or absolute,
> got: relative)',)
> 
> It looks like we need to specify recordings / web pages that are in the
> proxy?
> 
> Thanks

Sorry for the delay, Jonathan, I've been a bit caught up in, well, many things! :)

You're right. This test is essentially useless if we don't have proper proxy responses. However, relying on proxy responses for consistent perf measurement is problematic as well, but it's the best we have.

Thus, what needs to be done is to fix up the test to get proper responses and '200 OK' page loads. Until then it's safe to ignore this regression, because it doesn't signify trouble for product, but merely trouble for its own trustworthyness.
I'm not saying that this is part of your responsibilities at all, that'd be odd. Instead, it's mine as the owner of this test/ module.
Flags: needinfo?(mdeboer)
we should consider:
1) fixing the test asap (in the next month)
2) reducing the test to tier-2 (m-c only, no sheriffing)
3) turning off the test
4) other options?

not sure what would be best, :mikedeboer, do you have a preference?
Flags: needinfo?(mdeboer)
Let's try 1. I'll reach out to you when I've got time to work on it, to learn how to best fix the test.
Flags: needinfo?(mdeboer)
Mike, have you had a chance to look into fixing this test? Thanks!
Flags: needinfo?(mdeboer)
Liz, thanks for the ping. I haven't had a chance to look at this yet, which is caused by other things on my schedule taking priority over fixing this.
So I do want to fix this, but I can't give a guaranteed timeline for it, so we might want to look into disabling it in the interim.
Flags: needinfo?(mdeboer)
Joel, can you disable the test on beta for now, as discussed in comment 30 and 33? And maybe we can fix it in 63 (I'm leaving the bug tracked for 63.) Thanks!
Flags: needinfo?(jmaher)
:igoldan- can you pick this up sometime in the near future:
1) disable sessionrestore_many_windows on mozilla-beta (tweak a config and land it on mozilla-beta)
2) make sessionrestore_many_windows run on mozilla-central/try only for now; I assume this means breaking it out of the existing suite and creating a new job (yay taskcluster) that only run on mc/try.
Flags: needinfo?(jmaher) → needinfo?(igoldan)
Depends on: 1478295
Depends on: 1478297
(In reply to Joel Maher ( :jmaher ) (UTC+2) (PTO: back Aug 2) from comment #35)
> :igoldan- can you pick this up sometime in the near future:
> 1) disable sessionrestore_many_windows on mozilla-beta (tweak a config and
> land it on mozilla-beta)
> 2) make sessionrestore_many_windows run on mozilla-central/try only for now;
> I assume this means breaking it out of the existing suite and creating a new
> job (yay taskcluster) that only run on mc/try.

Yes, I've prepared bug 1478295 and bug 1478297 for that.
Flags: needinfo?(igoldan)
Ionut, can we close this now?
Flags: needinfo?(igoldan)
(In reply to Mike Taylor [:miketaylr] (62 Regression Engineering Owner) from comment #37)
> Ionut, can we close this now?

Yes, we can close it.
Flags: needinfo?(igoldan)
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: