Closed Bug 1337043 Opened 7 years ago Closed 7 years ago

4.6% tart (windows7-32) regression on push d2a01f76d88f836438078efabf9ca55c048b073e (Sun Feb 5 2017)

Categories

(Core Graveyard :: Plug-ins, defect, P2)

53 Branch
defect

Tracking

(firefox-esr52 wontfix, firefox53 wontfix, firefox54 wontfix, firefox55 fixed)

RESOLVED FIXED
mozilla55

People

(Reporter: jmaher, Assigned: Felipe)

References

(Depends on 1 open bug)

Details

(Keywords: perf, regression, talos-regression)

Attachments

(2 files, 1 obsolete file)

Talos has detected a Firefox performance regression from push d2a01f76d88f836438078efabf9ca55c048b073e. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

  5%  tart summary windows7-32 opt     7.4 -> 7.74


You can find links to graphs and comparison views for each of the above tests at: https://treeherder.mozilla.org/perf.html#/alerts?id=5026

On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests

For information on reproducing and debugging the regression, either on try or locally, see: https://wiki.mozilla.org/Buildbot/Talos/Running

*** Please let us know your plans within 3 business days, or the offending patch(es) will be backed out! ***

Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:bytesized, I see you authored this patch - it seems to have caused a regression in our tab animation test, where we test opening and closing new tabs.  Would you be able to look at this, figure out why your change is causing it, and see whether there is anything we can do to reduce or fix the regression?
Component: Untriaged → Plug-ins
Flags: needinfo?(ksteuber)
Product: Firefox → Core
This is the first time I have gotten a Talos regression, so excuse me if I am a bit confused. But how did you determine that this was due to my patch? I clicked the "treeherder alerts" link you gave, then looked at the pushlog [1], which doesn't seem to contain my patch. Am I missing something?

[1] https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=70253d6678af8e2f2b4386c77e825af9f600fe46&tochange=d2a01f76d88f836438078efabf9ca55c048b073e
Flags: needinfo?(ksteuber) → needinfo?(jmaher)
great question, and thanks for asking.  When we alert on performance issues, the algorithm is off on the exact revision about 40% of the time, due to scheduling, noise in the test, build issues, tree closures for failed builds/tests, or things like merges.  

In this case, I looked at the graph:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,0a88b4068b02d57cd2d4d7032bd2da45b7646b86,1,1%5D&selected=%5Bautoland,0a88b4068b02d57cd2d4d7032bd2da45b7646b86,NaN,NaN,1%5D

and then retriggered forward and backward a few revisions, filling in the holes left by the scheduling algorithms.

If you zoom in on the graph, you can see:
https://treeherder.mozilla.org/perf.html#/graphs?series=%5Bautoland,0a88b4068b02d57cd2d4d7032bd2da45b7646b86,1,1%5D&zoom=1486321499260.8838,1486332574452.155,7.048575021963581,8.051741582652998&selected=%5Bautoland,0a88b4068b02d57cd2d4d7032bd2da45b7646b86,167155,74792648,1%5D

after doing that you can see where we have consistently increased the higher end of the range of reported values.  I do make mistakes sometimes; when in doubt, push to try with a baseline and a backout and see what happens. Here is some try syntax to help out:
./mach try -b o -p win32 -u none -t svgr --rebuild-talos 5
Flags: needinfo?(jmaher)
So I reran the tests:

baseline: https://treeherder.mozilla.org/#/jobs?repo=try&revision=66d08b89402eae5df93ad7f8f7dd3f0da571d994&selectedJob=74905061
backout: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e964ed504c137fa503443ad689622588de603607&selectedJob=74905828
comparison: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=66d08b89402e&newProject=try&newRevision=e964ed504c137fa503443ad689622588de603607&framework=1&showOnlyImportant=0

To me it looks like there is no significant regression, but I am not sure because I don't really have experience in this area. This is consistent with my expectations: the code in that patch runs on startup or on certain pref changes and therefore should not impact tab opening/closing times.

Is my analysis correct?
Flags: needinfo?(jmaher)
This is a noisy test, and a couple of random data points in the wrong direction make this look bad.  I retriggered a handful of times and see a 2.5% improvement (more like 4% ignoring the outliers).  This has lower confidence because of the noisy results on the base revision.

I agree this is confusing, but I would rather look into why your changes might be causing this. :mconley helped write the tart test originally; he might have more insight into what tart does.
Flags: needinfo?(jmaher)
Flags: needinfo?(mconley)
(In reply to Joel Maher ( :jmaher) from comment #5)
> I agree this is confusing, but I would rather look into why your changes
> might be causing this. :mconley helped write the tart test originally; he
> might have more insight into what tart does.

Let's make sure to give credit where it's due - I didn't write the tart test, but I've patched it at times, and have definitely wrestled with it. :)

In general, the test works by starting up the browser, waiting for a few seconds, and then opening and closing tabs while measuring how long it takes to paint each frame of the tab opening / closing transition under various circumstances. It does a few other things, but that's the big picture.
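
To make that concrete, here is a minimal sketch of the kind of measurement involved - recording the interval between consecutive animation frames while a transition runs. This is an illustration only, not the actual TART harness, and the function name is made up:

// Illustrative only: measure frame-to-frame intervals for ~300ms of an
// animation using requestAnimationFrame. The real harness is more involved,
// but the core idea is the same.
function measureFrameIntervals(durationMs = 300) {
  return new Promise(resolve => {
    const intervals = [];
    let last = performance.now();
    const start = last;
    function onFrame(now) {
      intervals.push(now - last);
      last = now;
      if (now - start < durationMs) {
        requestAnimationFrame(onFrame);
      } else {
        resolve(intervals);  // e.g. report average/max per frame to the harness
      }
    }
    requestAnimationFrame(onFrame);
  });
}

// Usage: kick off a tab open/close transition, then:
// measureFrameIntervals().then(intervals => console.log(intervals));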

My guess is that the (per-window![1]) SafeBrowsing initialization, which kicks in while the browser is idle, is running during the animation, and you've added some extra things to write to the database.

You can try to prove this by removing the initialization of SafeBrowsing on a try push at [1] to see if the regression goes away. I can't see how else you could have affected TART.

Once you've proven that it's the writing-to-the-database that's hurting us, we have to decide how to proceed. There are a few things we can do:

1) Modify talos so that when it's setting up the profile for testing, it waits for the mozentries-update-finished observer notification before kicking off the test (see the sketch at the end of this comment)
2) Find a way of disabling SafeBrowsing initialization for Talos

I'd veer towards (1), personally, since I'm preeeeetty sure that's why we do that setting-up-of-the-profile stuff first - to get this kind of thing out of the way.

[1]: http://searchfox.org/mozilla-central/rev/f5077ad52f8b90183e73038869f6140f0afbf427/browser/base/content/browser.js#1261-1262 why does this need to happen after every new window has opened? If this initialization only needs to happen once, it should be moved to nsBrowserGlue.js.
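
For reference, a rough sketch of what option (1) could look like - waiting for that notification with a timeout fallback. This is only a sketch of the idea; where exactly it would hook into the Talos profile-warmup code is left open, and the helper name is made up:

// Sketch: resolve when SafeBrowsing's addMozEntries work reports completion,
// or after a timeout so the warmup can never hang.
const { utils: Cu } = Components;
const { Services } = Cu.import("resource://gre/modules/Services.jsm", {});
const { setTimeout } = Cu.import("resource://gre/modules/Timer.jsm", {});

function waitForMozEntriesUpdate(timeoutMs = 10000) {
  const topic = "mozentries-update-finished";
  return new Promise(resolve => {
    const observer = {
      observe() {
        Services.obs.removeObserver(observer, topic);
        resolve("finished");
      },
    };
    Services.obs.addObserver(observer, topic, false);
    setTimeout(() => resolve("timed-out"), timeoutMs);
  });
}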
Flags: needinfo?(mconley)
I was not aware that the SafeBrowsing module was initialized on a per window basis.

@francois Is that really necessary?
Flags: needinfo?(francois)
(My guess is that it's accidental)
As this regression stems from a change that is not directly related to the test but manifests here, we could pause our work and fix this now, or we could come back to it later. It's your call whether this is severe enough to warrant us stopping to fix it now.
Flags: needinfo?(jmaher)
The next merge day is March 6th; do you think we could have this fixed by then, so we don't have to track this on Aurora?
Flags: needinfo?(jmaher)
I can't guarantee that, given that this is something new to us. I'm also unclear why bug 779008 was filed and then set to P5 if the result is that collateral performance issues can crop up.

Is it confirmed that there is no clear owner of safe mode (as was mentioned in bug 1326245)? Because that would be the best assignee for this fix, I think.
Flags: needinfo?(jmaher)
I tried waiting in the profile warmup run for mozentries-update-finished, but locally it never fired:
https://hg.mozilla.org/try/rev/f36d068fb585e6052b63557c8a1bb03256498b7c

Still, a 5-second delay should help, but on try it doesn't appear to:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=1a33a2290f06&newProject=try&newRevision=4a2ad273d3010a3bf7f3da6f8fb6f6f054a67256&framework=1&filter=tart&showOnlyImportant=0

I am happy to fix talos, but maybe our understanding of this regression is wrong?
Flags: needinfo?(jmaher)
I am unclear on the status here. I tried waiting for mozentries-update-finished and never received an event, so either I did it wrong or that isn't the problem here.  Can we dig in and find a root cause?
Flags: needinfo?(ddurst)
Has anybody here done the step to confirm or refute my assumption in comment 6, that removing the initialization of the SafeBrowsing service eliminates this regression?
We haven't tested that theory yet (does make sense, tho). We've got to tackle the uplift first; we'll pick this up after that (hopefully just later this week).

I would definitely prefer confirmation of that before adjusting anything else.
Flags: needinfo?(ddurst)
(In reply to Kirk Steuber [:bytesized] from comment #7)
> I was not aware that the SafeBrowsing module was initialized on a per window
> basis.
>
> @francois Is that really necessary?

I wasn't aware either. I'm pretty sure it only needs to be initialized once at startup (i.e. in the first window).
Flags: needinfo?(francois)
Very mysterious. In cases like this, our best bet might be to do some try pushes that remove individual portions of the original patch to see which, if any, of them are involved in causing this regression.
@mconley I'm not sure how to do that. The original patch only touches a single function.
I assume you're referring to this patch: https://reviewboard.mozilla.org/r/107650/diff/5/#index_header

I would break this down into three individual try pushes:

1) Everything except the change to the tables
2) Everything except the change to the update string

I would also be interested in knowing the total running time difference for addMozEntries with and without your patch.
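
One way to take that measurement, sketched in isolation (the helper is hypothetical; in the real code you would record Date.now() at the top of addMozEntries and again in its database-update success callback):

// Standalone illustration of the timing approach: note Date.now() when the
// work starts and again when its async completion callback fires.
function timeAsyncWork(label, startWork) {
  const t0 = Date.now();
  return new Promise(resolve => {
    startWork(() => {
      const elapsed = Date.now() - t0;
      console.log(label + " took " + elapsed + " ms");
      resolve(elapsed);
    });
  });
}

// Stand-in usage; with the real code, "done" would be invoked from the
// update-success callback inside addMozEntries().
timeAsyncWork("addMozEntries", done => setTimeout(done, 500));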
Patch removed: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c9815ee1a0b0183329694257f44cfdfe5f710116
Patch without update string: https://treeherder.mozilla.org/#/jobs?repo=try&revision=3192ca0bb7a0de172375a84ef491ed0c82f231a4
Patch without table names: https://treeherder.mozilla.org/#/jobs?repo=try&revision=9397a118a19120f0033cb569e9fb7bcb306113a3
Full patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5c9e493bbbec9adc61a443a647f54f65a7bc6ec7&selectedJob=79205910

No patch -> Patch without update string: 0.57% increase
No patch -> Patch without table names: 4.93% increase
No patch -> Full patch: 5.18% increase

I am not sure that the patch without table names really changes anything (compared to the full patch). I think that the same tables get updated anyways.
These are the aggregated timestamps taken using |Date.now()| when |addMozEntries()| starts and again when the database has been updated (not when the function ends). This is my quick analysis of the data:

With Patch
    Average: 995.958333333ms
    Min: 750ms
    Max: 2240ms

Without Patch
    Average: 585.697916667ms
    Min: 414ms
    Max: 1311ms

With Patch -Tables
    Average: 978.59375ms
    Min: 749ms
    Max: 2547ms

With Patch -Update String
    Average: 560.177083333ms
    Min: 383ms
    Max: 1203ms
I forgot to mention that the averages I obtained are means. This is the analysis with medians:

With Patch
    Mean: 995.958333333ms
    Median: 965.5ms
    Min: 750ms
    Max: 2240ms

Without Patch
    Mean: 585.697916667ms
    Median: 559.5ms
    Min: 414ms
    Max: 1311ms

With Patch -Tables
    Mean: 978.59375ms
    Median: 911.0ms
    Min: 749ms
    Max: 2547ms

With Patch -Update String
    Mean: 560.177083333ms
    Median: 534.5ms
    Min: 383ms
    Max: 1203ms
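
For reference, the summary statistics above can be reproduced from the raw per-run timings with something like this (a hypothetical helper, not part of the harness):

// Compute mean, median, min and max over an array of elapsed times (in ms).
function summarize(times) {
  const sorted = [...times].sort((a, b) => a - b);
  const mid = sorted.length / 2;
  return {
    mean: times.reduce((sum, t) => sum + t, 0) / times.length,
    median: sorted.length % 2
      ? sorted[Math.floor(mid)]
      : (sorted[mid - 1] + sorted[mid]) / 2,
    min: sorted[0],
    max: sorted[sorted.length - 1],
  };
}

// e.g. summarize(withPatchTimes) should reproduce the "With Patch" row above.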
So if I understood the results correctly, the problem is the updateStream call with the longer `updates` string.

The docs for that say it allows incremental updates. I wonder if breaking up this updateStream into several calls would help, or if we can delay some of these further out.
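
To illustrate the first idea, here is a rough sketch of feeding the same data through several smaller updateStream calls instead of one big one. This is heavily hedged: the updateStream name follows how the classifier is driven today, the chunking-on-line-boundaries choice is a guess, and whether the update parser accepts it at all is exactly what a try push would have to show:

// Sketch only: split the update data on line boundaries and stream it in
// several smaller chunks rather than one large updateStream() call.
function streamUpdateInChunks(dbUpdater, updates, linesPerChunk = 200) {
  const lines = updates.split("\n");
  for (let i = 0; i < lines.length; i += linesPerChunk) {
    const chunk = lines.slice(i, i + linesPerChunk).join("\n") + "\n";
    dbUpdater.updateStream(chunk);
  }
}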
Following up here: it has been a couple of weeks; are there any updates?
Flags: needinfo?(ksteuber)
:mconley and I took a few tries at this and were unable to determine how this patch is affecting the test. Our final attempt placed markers at the beginning of |addMozEntries| and at the end of the database |updateSuccess| callback, and determined that all work for this function is done before the test actually starts.

I don't have a lot of time to dedicate to this, but to be honest, I have no idea how to look into it any further anyways. Do you have any suggestions?
Flags: needinfo?(ksteuber)
jmaher -- as Kirk noted, none of our tests shows the patch causing this. Felipe's idea in #c34 is worth looking into, because it's the only idea we have left.

I'll assign, as I don't want to back out anything that could jeopardize this study getting out.
Assignee: nobody → felipc
Priority: -- → P2
Getting some updated numbers to start out fresh. Baseline is based on m-c 800ba54a4bd5


Current m-c baseline:
https://hg.mozilla.org/try/rev/de0db1c63bc4545f6bf561c865fcee99913135c8

Current backout baseline:
https://hg.mozilla.org/try/rev/cd880b4bed68e2067698760caf129ab437beed55

Split updateStream into two:
https://hg.mozilla.org/try/rev/521ab45039c24829da5d3c71f7fb83c17ddf37c5
Let's see what happens if we skip classifying about: URLs

https://hg.mozilla.org/try/rev/9e39f4ef9249ba0dd655101b0a089407f3573f70
It seems that this has gotten rid of roughly half of the regression, if I'm reading things correctly. Tomorrow I'll put up a better, landable patch to see the effect this has on mozilla-central.

See: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=cd880b4bed68e2067698760caf129ab437beed55&newProject=try&newRevision=9e39f4ef9249ba0dd655101b0a089407f3573f70&framework=1&showOnlyImportant=0

2.3% regression here. Note that the regression is non-e10s only.

If you look at the sub-tests from some of the attempts here [e.g. 2], there are some red herrings around hi-dpi stuff (highlighted because they have higher confidence), but in absolute numbers what really makes a difference is the following two subtests: newtab-open-preload-no.half and newtab-open-preload-yes.half.

[2] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=cd880b4bed68e2067698760caf129ab437beed55&newProject=try&newRevision=9e39f4ef9249ba0dd655101b0a089407f3573f70&originalSignature=0a88b4068b02d57cd2d4d7032bd2da45b7646b86&newSignature=0a88b4068b02d57cd2d4d7032bd2da45b7646b86&framework=1

In other words, opening a lot of tabs. So it seems that the URL Classifier code pays a penalty linear in the number of existing tables, and since we're adding a bunch at once, it shows up. This was probably creeping up slowly as tables were added one by one for other features. (I really hope it doesn't pay such a tough penalty on the number of entries per table..)
This should probably be a follow-up perf bug/improvement to investigate outside of this bug.

nsChannelClassifier already has some code to avoid checking various types of URLs, but I believe "about:" pages are not being ignored right now. I'll run with some logging tomorrow to confirm. If that's the case, a patch to ignore about: should have a nice net effect across the browser. *crosses fingers*
There are already a bunch of criteria below for skipping URL classification, but it turns out that about: URLs don't match any of them. I tried finding a flag there to use but couldn't, so we might as well use the scheme.
Attachment #8845931 - Flags: review?(tnguyen)
Attachment #8845931 - Attachment is obsolete: true
Attachment #8845931 - Flags: review?(tnguyen)
Attachment #8845933 - Flags: review?(tnguyen)
Comment on attachment 8845933 [details] [diff] [review]
Don't bother classifying about: URLs

lgtm, but need redirect to Francois :)
Attachment #8845933 - Flags: review?(tnguyen)
Attachment #8845933 - Flags: review?(francois)
Attachment #8845933 - Flags: review+
Comment on attachment 8845933 [details] [diff] [review]
Don't bother classifying about: URLs

Review of attachment 8845933 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, but you should change the commit message to add r=francois to show it has been reviewed by a module peer.

::: netwerk/base/nsChannelClassifier.cpp
@@ +342,5 @@
>      // Don't bother checking certain types of URIs.
> +    bool isAbout = false;
> +    rv = uri->SchemeIs("about", &isAbout);
> +    NS_ENSURE_SUCCESS(rv, rv);
> +    if (isAbout) return NS_ERROR_UNEXPECTED;

That's a crappy error code, but that's apparently what we use everywhere in this function already, so let's keep it that way.
Attachment #8845933 - Flags: review?(francois) → review+
(In reply to François Marier [:francois] from comment #45)
> Comment on attachment 8845933 [details] [diff] [review]
> Don't bother classifying about: URLs
> 
> Review of attachment 8845933 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good, but you should change the commit message to add r=francois to
> show it has been reviewed by a module peer.

Will do.

> 
> ::: netwerk/base/nsChannelClassifier.cpp
> @@ +342,5 @@
> >      // Don't bother checking certain types of URIs.
> > +    bool isAbout = false;
> > +    rv = uri->SchemeIs("about", &isAbout);
> > +    NS_ENSURE_SUCCESS(rv, rv);
> > +    if (isAbout) return NS_ERROR_UNEXPECTED;
> 
> That's a crappy error code, but that's apparently what we use everywhere in
> this function already so let's keep it that way.

Agreed wholeheartedly :)

Also, NS_URIChainHasFlags looks a bit inefficient when called many times over to check flag by flag. It looks like low-hanging fruit to improve this a bit by getting the full flags of the uri and doing the masking manually, or by creating an NS_URIChainHasAnyOfTheseFlags. I might file a follow-up on that.
Pushed by felipc@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/985d9b83325f
Don't bother classifying about: URLs. r=tnguyen r=francois
Depends on: 1347043
See Bug 1347043 "Bug 1337043 cset 985d9b83325f causing incomplete startup if an about: page tab is to be restored, or an about: page is in session history"
backed out for causing bug 1347043 - Felipe, can you take a look, thanks
Flags: needinfo?(felipc)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/62dd2dbd728b
Backed out changeset 985d9b83325f for causing bug 1337043
(In reply to Pulsebot from comment #50)
> Backout by cbook@mozilla.com:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/62dd2dbd728b
> Backed out changeset 985d9b83325f for causing bug 1337043

Confirming that with builds following this backout, the problems noted in bug 1347043 no longer exist.
Hey aja, I pushed this to try again some days ago to get a newer build, and I tested it to try to reproduce the problem, but I couldn't. I tested on Win10 with both 32-bit and 64-bit builds.
Could you try again with these newer builds to see if this is still a problem?

https://archive.mozilla.org/pub/firefox/try-builds/felipc@gmail.com-95c53982eb07c8ab2985786870c66e9ba687c34a/

https://treeherder.mozilla.org/#/jobs?repo=try&revision=95c53982eb07c8ab2985786870c66e9ba687c34a
Flags: needinfo?(felipc) → needinfo?(ajarope)
An update here: Aja spent some time trying to figure out what causes the problem, and it turns out that it only reproduces on:
- older profiles
- with some combinations of add-ons (it looks like BetterPrivacy or HTTP/2 and SPDY Indicator alone, with an old profile, was enough to reproduce the bug)

I'll try doing the same with a profile of mine and see what happens
Aja: I tried on *Mac* with my long-time profile and "BetterPrivacy" + "Http/2 and SPDY Indicator", but still couldn't reproduce it. I'm thinking it's Windows specific.

If you're feeling adventurous, one suggestion is bisecting your prefs this way:

- go to your profile and edit the prefs.js file

- remove half of the prefs there (e.g., the top half), then run and see if the problem reproduces. If yes, keep halving to find the pref that causes this. If not, re-add what you removed and remove the bottom half. (A rough helper script for this halving step is sketched below.)
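
If it helps, here's a rough, hypothetical Node.js helper for that halving step (it assumes prefs.js is in the current directory and that the user_pref() lines are the only ones worth bisecting - back the file up first):

// Keep only the top or bottom half of the user_pref() lines in prefs.js.
// Usage: node halve-prefs.js top   (or: node halve-prefs.js bottom)
const fs = require("fs");
const lines = fs.readFileSync("prefs.js", "utf8").split("\n");
const prefs = lines.filter(l => l.startsWith("user_pref("));
const rest = lines.filter(l => !l.startsWith("user_pref("));
const half = Math.floor(prefs.length / 2);
const kept = process.argv[2] === "bottom" ? prefs.slice(half) : prefs.slice(0, half);
fs.writeFileSync("prefs.js", rest.concat(kept).join("\n") + "\n");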
Hey Aja, any success here?
So I sent an e-mail around asking more people to reproduce this problem, and no one could. I'm not sure I can hold this patch indefinitely, so I'm thinking of landing it on Nightly (since we're far away from a merge date) and seeing what happens. Hopefully it was a very strange corner case.
Felipe: Sorry for the delay... I've been away from home and my computer for quite some time.  Life happens.

I think it's probably a good idea to just try landing again, as I think non-MPC non-WebExtensions can no longer be installed anyway (I think that's true from what I've been reading over the last day or so... or is just about to happen).
In any event, these are LEGACY add-ons which I have a feeling won't be revived.
Flags: needinfo?(ajarope)
Pushed by felipc@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bb439487185f
Don't bother classifying about: URLs. r=tnguyen,francois
https://hg.mozilla.org/mozilla-central/rev/bb439487185f
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Looks like this could use Beta and ESR52 approval requests once we have confirmation that the regression is indeed fixed.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #60)
> Looks like this could use Beta and ESR52 approval requests once we have
> confirmation that the regression is indeed fixed.

Given the potential problem reported in comment 48, I'd prefer not to skip trains on this one.
I see a startup performance improvement from this patch:
== Change summary for alert #6813 (as of May 23 2017 20:01 UTC) ==

Improvements:

  3%  ts_paint windows7-32 opt e10s     1,045.67 -> 1,013.14
  2%  ts_paint windows8-64 opt e10s     866.00 -> 847.77
  2%  ts_paint windows7-32 opt e10s     1,029.33 -> 1,009.25

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=6813
(In reply to Joel Maher ( :jmaher) from comment #62)
> I see a startup performance improvement from this patch:
> == Change summary for alert #6813 (as of May 23 2017 20:01 UTC) ==
> 
> Improvements:
> 
>   3%  ts_paint windows7-32 opt e10s     1,045.67 -> 1,013.14
>   2%  ts_paint windows8-64 opt e10s     866.00 -> 847.77
>   2%  ts_paint windows7-32 opt e10s     1,029.33 -> 1,009.25
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=6813

This is likely due to no longer initializing NSS before the first paint. In my cold startup profiles on the quantum reference device, this was one of the 3 things triggering NSS initialization before first paint. The other two are:
- captive portal (bug 1367450)... this is likely disabled for Talos as it does a network request.
- sometimes search service initialization, triggered by Telemetry during startup (we fixed this in bug 1359031)
(In reply to Joel Maher ( :jmaher) from comment #62)
> I see a startup performance improvement from this patch:
> == Change summary for alert #6813 (as of May 23 2017 20:01 UTC) ==
> 
> Improvements:
> 
>   3%  ts_paint windows7-32 opt e10s     1,045.67 -> 1,013.14
>   2%  ts_paint windows8-64 opt e10s     866.00 -> 847.77
>   2%  ts_paint windows7-32 opt e10s     1,029.33 -> 1,009.25
> 
> For up to date results, see:
> https://treeherder.mozilla.org/perf.html#/alerts?id=6813


\o/
Depends on: 1380448
Product: Core → Core Graveyard