14% regression in ts paint on linux64 seen on fx-team

VERIFIED FIXED in Firefox 31

Status

defect
VERIFIED FIXED
5 years ago
10 months ago

People

(Reporter: jmaher, Assigned: gfritzsche)

Tracking

({perf, regression})

Trunk
Firefox 31
x86_64
Linux
Dependency tree / graph
Bug Flags:
firefox-backlog +

Firefox Tracking Flags

(firefox30 unaffected, firefox31- verified)

Details

(Whiteboard: [talos_regression] p=5 s=it-31c-30a-29b.3 [qa-])

Attachments

(5 attachments)

here is a view of the graph:
http://graphs.mozilla.org/graph.html#tests=[[83,132,35]]&sel=none&displayrange=7&datatype=running

here is a view on tbpl, where I see did some retriggers:
https://tbpl.mozilla.org/?tree=Fx-Team&fromchange=10a99a32304b&tochange=d22244fa46b4&jobname=Ubuntu%20HW%2012.04%20x64%20fx-team%20talos%20other

Unfortunately there were a few pushes that didn't run tests due to a bad push prior.  So we are missing details on 5 changesets.

If anybody has ideas of why linux 64 would have a large hit on ts (launching the browser), I would appreciate learning more about it.
Bug 989761 is not the culprit - it only messed with Windows Classic CSS.
Looking at the graph, it looks like the regression was introduced somewhere on fx-team between 57a393d69f32 and ddc22f087bec.

http://hg.mozilla.org/integration/fx-team/pushloghtml?fromchange=57a393d69f32&tochange=ddc22f087bec
I'd also suspect that bug 989609 is not the culprit, since that patch was backed out.
(In reply to Joel Maher (:jmaher) from comment #0)
> here is a view of the graph:
> http://graphs.mozilla.org/graph.html#tests=[[83,132,
> 35]]&sel=none&displayrange=7&datatype=running
> 
> here is a view on tbpl, where I see did some retriggers:
> https://tbpl.mozilla.org/?tree=Fx-
> Team&fromchange=10a99a32304b&tochange=d22244fa46b4&jobname=Ubuntu%20HW%2012.
> 04%20x64%20fx-team%20talos%20other
> 
> Unfortunately there were a few pushes that didn't run tests due to a bad
> push prior.  So we are missing details on 5 changesets.
> 
> If anybody has ideas of why linux 64 would have a large hit on ts (launching
> the browser), I would appreciate learning more about it.

Nothing else really jumps out at me. :/

Bisecting might be our only choice here, unless somebody else has another suggestion.
No longer blocks: 989609, 989761
That leaves a few suspects, I will wait until tomorrow to see if the original patch authors have ideas of what might be the problem. 

Then a series of pushes to try while backing out sound like a good idea :)
Summary: 14% regression in ts paint on linux64 seen on fx-team → 14% regression in ts paint on linux 32/64 seen on fx-team
Does anyone know if there are configs for the talos run outside of mozilla-central which might affect whether telemetry is enabled during this test?

The two most-likely pushes in this range are:

Benjamin Smedberg — Bug 986582 - Get rid of the toolkit.telemetry.enabledPreRelease pref and make the toolkit.telemetry.enabled pref do the right thing for beta users who are testing a final release build, r=rnewman
Benjamin Smedberg — Bug 992208 - Add Telemetry Experiments to the package so that they are actually used. Also a basic test that the service exists and can be created. r=gfritzsche

Of those I'd be pretty surprised if bug 992208 were at fault here, and mis-configured telemetry seems more likely.

Although if telemetry being enabled affects startup perf by 15%, that sucks also.

Did this show up on any other perf suites/OSes?
Blocks: 989609, 989761
Flags: needinfo?(jmaher)
Summary: 14% regression in ts paint on linux 32/64 seen on fx-team → 14% regression in ts paint on linux64 seen on fx-team
I need to see if other tests are affected- There are some other failures, but nothing as large as this.  We have different preferences on talos than we do for desktop tests- although many are the same.

Here is the preferences we use:
http://hg.mozilla.org/build/talos/file/tip/talos/PerfConfigurator.py#l237

for ts paint, we don't use other preferences, so this should be it.
Flags: needinfo?(jmaher)
Whiteboard: [talos_regression] → [talos_regression][Australis:P-]
The changes for bug 989626 are only seen when in-content preferences are loaded, not to mention that in-content preferences are currently disabled by default.
No longer blocks: 989626
No longer blocks: 989609, 989761
the code modified in bug 984015 is only used by PlacesTransactions.jsm that is currently unused (disabled behind a pref), so it's not executed in this case.
No longer blocks: 984015
jmaher said on IRC he was going to investigate/trigger some reruns. Hand it off to me if necessary later.
Assignee: nobody → jmaher
I don't believe we have any other tests/platforms which are failing, time to do some try pushes
backout 986582: https://tbpl.mozilla.org/?tree=Try&rev=4d0eaa96acb6
backout 992208: https://tbpl.mozilla.org/?tree=Try&rev=7d7aa86bee45

will do more tonight if these don't help
bug 992208 seems to be the culprit, when it was backed out (https://tbpl.mozilla.org/?tree=Try&rev=7d7aa86bee45), our numbers for ts_paint are <900, and in line with what we were at before the regression.

I assume it has to do with:
https://hg.mozilla.org/try/diff/7d7aa86bee45/browser/installer/package-manifest.in

:bsmedberg, any thoughts on this?  I am sort of surprised- maybe I am reading the numbers wrong?
Flags: needinfo?(benjamin)
That's awesomely weird. Basically what's happening is we're running this code on profile-after-change now when we weren't before:

http://hg.mozilla.org/mozilla-central/annotate/25aeb2bc79f2/browser/experiments/ExperimentsService.js#l31
to 
http://hg.mozilla.org/mozilla-central/annotate/25aeb2bc79f2/browser/experiments/Experiments.jsm#l355

Of the set of code running here, the only think I can see which might be a startup issue is the call to _startWatchingAddons which calls into AddonManager.addAddonListener/addInstallListener.

The only thing that's a little weird is that we don't seem to be checking gExperimentsEnabled anywhere in the main sequence of _loadFromCache/_main/_evaluateExperiments.
Assignee: jmaher → georg.fritzsche
Component: Talos → Client: Desktop
Flags: needinfo?(benjamin) → needinfo?(gps)
Product: Testing → Firefox Health Report
is there anyway the profile we use for talos could be affecting this?  I am open to trying out anything if needed.
All of the imports (https://hg.mozilla.org/mozilla-central/annotate/25aeb2bc79f2/browser/experiments/Experiments.jsm#l14) will also contribute to latency. Things like Metrics.jsm (which should be lazy loaded since it is only used for FHR) pull in a lot of sub-modules and services.

Unless we really need Experiments to be initialized ASAP, we should consider delay loading it until after session restore and possibly a few seconds after that.

FWIW, I don't think Experiments.jsm should be adding significantly new memory, CPU, or I/O to Firefox initialization. It's a thousand or so lines of JS + a few kb of read I/O to pull in state. However, timing is everything. It is likely initializing things a few seconds earlier than before and that's causing the regression. I'd be very interested in comparisons of usage 60s after startup to confirm or disprove my hypothesis: that will tell us if there is a deeper performance problem to investigate.
Flags: needinfo?(gps)
If there is an active experiment we certainly don't want to wait: the experiment should be activated as soon as we know that it's still applicable.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #17)
> If there is an active experiment we certainly don't want to wait: the
> experiment should be activated as soon as we know that it's still applicable.

That may require tighter Add-on Manager integration. This is a result of having experiment add-ons disabled by default. In our current world, we require the Experiments service to be initialized to activate experiments at start-up time. That has performance implications. Switching back to having the Experiments service disable an experiment if needed sometime after initialization would be better for perf. We could have the Add-on Manager refuse to activate an experiment add-on unless the Experiments XPCOM component is present to plug that hole.

Can we just delay experiment init today and fix the activation window as a followup?
Comment on attachment 8403559 [details] [diff] [review]
Attempt lazy loading, rev. 1

Review of attachment 8403559 [details] [diff] [review]:
-----------------------------------------------------------------

It will be interesting to see how much this moves the needle.
Attachment #8403559 - Flags: review?(gps) → review+
I'm currently checking through the code path that we hit early to properly honor the pref.
I assume async tasks triggered from "profile-after-change" don't contribute to this?
I have no idea about the async tasks and profile-after-change events.  Talos does a warmup run to initialize the new profile.  This is a simple launch of the browser to get some meta data and let the profile initialization take place outside of the performance numbers.
I dealt with Talos foo with FHR. See bug 828153 (which appears to be missing some of the context I have in my head - there must be another bug).

Anything that Experiments.jsm does (including async tasks) after profile-after-change will affect first paint. We've now introduced new code that runs very early after application startup and anything it does will delay first paint, session restore, and make early browser performance worse. Async or not, it is code that is executing and thus contending for system resources.

It is in the best interest of any new feature to delay *all* code execution until as far down the line as possible. If this is not possible, my recollection is you need sign-off from someone on the perf team (Vladan?) and that sign-off will only likely occur once there is satisfaction the performance impact is as minimal as possible. This means minimizing JSMs, compartments, I/O, etc.
Alright, that makes sense. I'll do try pushes tomorrow with some changes.
I'm not sure what effectively makes the most impact here, but it seems like we should definitely get lower impact for when experiments are preffed off (which should be what we worry about for release).
thanks gps and gfritzsche.
So, adding the XPCOM component manifest to the package manifest regressed performance. Since the XPCOM component (Experiments) is currently prefed off, this is a sign that the Experiments XPCOM component is implemented sub-optimally.

Recommendations:

1) We should not be loading Experiments.jsm if the preference is not enabled.
2) We should be calling Experiments.instance() if the preference is not enabled.
3) We should optionally move the preference observers from Experiments.jsm into ExperimentsService.js.

Tree rules dictate that 1bd9fa893bf8 should be backed out for regressing performance until a fix is in place. I recommend the patch be written quickly.
No longer blocks: 991034
Whiteboard: [talos_regression][Australis:P-] → [talos_regression]
Blocks: 994727
Flags: firefox-backlog?
Next things to check for with the feature turned on:
* timers around loading Experiments.jsm
* timers around loading the AddonManager
No longer blocks: 989046
try push with timing around imports and first accessing the Experiments instance:
https://tbpl.mozilla.org/?tree=Try&rev=64bd9134ae15
Comment on attachment 8404740 [details] [diff] [review]
Skip Experiments.jsm load if pref is off

This avoids any real impact if the feature is off, but means that we now require a restart to turn it on.
Attachment #8404740 - Flags: review?(benjamin)
Comment on attachment 8404740 [details] [diff] [review]
Skip Experiments.jsm load if pref is off

We *could* add pref observers here to do that, but I agree in the short term that doesn't matter.
Attachment #8404740 - Flags: review?(benjamin) → review+
Timing imports et al didn't reveal anything major:
* first access to Experiments.instance() is 8-9ms
* first import of Experiments.jsm is 5-6ms

Metrics.jsm is the only (sub-)import that stands out with 3-4ms.
Per IRC we are hitting the XPI database via AddonManager.getAddonsByTypes(). Caching of active addons avoids hitting it for addon activation for most startup scenarios.

We can avoid hitting the XPI DB by storing in a pref whether we have an active experiment.
If not, we can delay Experiments initalization until 30s after session restore
If we do, we initialize right away, triggering cache load, AM sync, ...
Flags: firefox-backlog? → firefox-backlog+
Whiteboard: [talos_regression] → [talos_regression][p=5]
Status: NEW → ASSIGNED
Whiteboard: [talos_regression][p=5] → [talos_regression] p=5 s=it-31c-30a-29b.2 [qa?]
QA will assume this bug is fixed based on the pre-existing tests. If this needs any special manual test coverage then please needinfo me.
Whiteboard: [talos_regression] p=5 s=it-31c-30a-29b.2 [qa?] → [talos_regression] p=5 s=it-31c-30a-29b.2 [qa-]
Whiteboard: [talos_regression] p=5 s=it-31c-30a-29b.2 [qa-] → [talos_regression] p=5 s=it-31c-30a-29b.3 [qa-]
this looks to be fixed according to graph server!  Is there a reason not to close  this bug?
The patch that was checked in was a bandaid which would regress when experiments are enabled by default again. This bug remains open to fix the performance issue when experiments are enabled by default.
Blocks: telex
Talos other looks fine, an xpcshell test fixup was needed to:
https://tbpl.mozilla.org/?tree=Try&rev=8d7d400b6a59
Comment on attachment 8406773 [details] [diff] [review]
Delay initializing Experiments if there is no experiment running

>+XPCOMUtils.defineLazyGetter(
>+  this, "gActiveExperiment", () => {
>+    try {
>+      let prefs = Services.prefs;
>+      return prefs.getBoolPref(PREF_ACTIVE_EXPERIMENT);
>+    } catch (e) {
>+      return false;
>+    }
>+  });

The try/catch here is unfortunate. Is that really better than using Preferences.jsm and .get(PREF_ACTIVE_EXPERIMENT, false)?
Attachment #8406773 - Flags: review?(benjamin) → review+
Attachment #8406774 - Flags: review?(benjamin) → review+
Attachment #8406775 - Flags: review?(benjamin) → review+
Keywords: leave-open
Status: RESOLVED → VERIFIED
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in before you can comment on or make changes to this bug.