Closed Bug 1230661 Opened 4 years ago Closed 4 years ago

4% win8 e10s ts_paint regression on fx-team (v.45) on dec 2, 2015 from revision 2d33fa3346b2

Categories

(Firefox :: Search, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 46
Tracking Status
e10s + ---
firefox45 + fixed
firefox46 --- fixed

People

(Reporter: jmaher, Assigned: florian)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][e10s])

Attachments

(2 files)

Talos has detected a Firefox performance regression from your commit 2d33fa3346b2e4bfd7720a9f1f19e4a59b84e143 in bug 1227045.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=2d33fa3346b2e4bfd7720a9f1f19e4a59b84e143&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#ts_paint

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p win64 -u none -t other  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a ts_paint

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page outlines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
            

here is a compareview to show the raw numbers on retrigger:
https://treeherder.allizom.org/perf.html#/compare?originalProject=fx-team&originalRevision=872043399490&newProject=fx-team&newRevision=2d33fa3346b2&showOnlyImportant=0&showOnlyConfident=1

there is a lot of noise in the compare view, but the ts_paint does show a distinct difference.
Summary: 4% win8 e10s ts_paint regression on fx-team (v.445) on dec 2, 2015 from revision 2d33fa3346b2 → 4% win8 e10s ts_paint regression on fx-team (v.45) on dec 2, 2015 from revision 2d33fa3346b2
it would be more likely if the regression is due to bug 1228627 or panorama removal than the search change... But I didn't check if those landed close enough, so I'm just guessing.

The only other possibility I may think of is that on startup we read currentEngine and that now causes us to write to disk... But it sounds unlikely, especially cause Ts runs multiple times but this may happen just once. Florian would know better.
ok looks like the suspect changes I mentioned landed far enough from this to be connected with the regression, so they are off the hook.
(In reply to Marco Bonardo [::mak] from comment #2)
> it would be more likely if the regression is due to bug 1228627 or panorama
> removal than the search change... But I didn't check if those landed close
> enough, so I'm just guessing.
> 
> The only other possibility I may think of is that on startup we read
> currentEngine and that now causes us to write to disk...

Any disk write of the cache file is delayed by a 1000ms timer, and the file is written asynchronously with OS.File. Also, if the tests are using a clean profile, we were already writing the cache file to disk 1000ms after the search service's initialization.

I don't see how that search change could regress performance, but I also don't know well what is being run for ts_paint.
Hmm... random guess: we may now be calling the currentEngine setter in a case when we were not before; this could send a notification that causes about:home to update/redraw something.
Thanks for the quick comments!

The backout on try does show the improvement, so it looks like we found the right two patches.  Now to figure out why that is the case.  It sounds like there are some ideas about what might be happening.

ts_paint shouldn't be going to about:home, it loads a simple page on the commandline when launching firefox:
https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/startup_test/tspaint_test.html

then again, it could be loading about:home, then the page requested.
It would probably be interesting to run this code locally with the search service logging enabled, to compare a log of before/after and see if there are obvious differences.
[Tracking Requested - why for this release]:
impacts fx45 release
this is on aurora now, :florian, can you update this bug of when this will be fixed?
Assignee: nobody → florian
Flags: needinfo?(florian)
(In reply to Joel Maher (:jmaher) from comment #9)
> this is on aurora now, :florian, can you update this bug of when this will
> be fixed?

I'll try to look at this early next week, but I'm not really sure how to get started on debugging this.

Running the test by following the instructions in comment 0 only gives me a set of numbers in the output. Is there an easy way to run the same test but with the console output of the Firefox process visible, so that I can use dump() to check if some code within the search service is reached?
Flags: needinfo?(florian) → needinfo?(jmaher)
when you run this locally, it creates a local.out and local.json.  this has the raw data.  local.json has more raw data, is parsable (this is what perfherder uses), but might take a bit to get used to.  local.out is exactly what would be sent to graph server.

Pick what you like, then you can run- save off a copy, run again, and compare the two files.

while running locally, it should print out all the output from the console, if not, add --debug flag to the command line and it will print out the console info.  Then dump(...) should work.  I know I depended on dump() recently while tracking down a common timeout in talos earlier this week.
Flags: needinfo?(jmaher)
I added some dumps in the currentEngine setter. A surprise there is that when running ts_paint, the setter is called 1 to 9 times, and this doesn't seem deterministic at all. I need to investigate more to decide if the way forward is to try to make this more deterministic, or to prevent the setter from being called at all by the getter when we are starting with a fresh profile and not customizing the search engine.
I looked at the stacks.

The code that's called consistently is BrowserGlue.prototype._syncSearchEngines from nsBrowserGlue.js (I'm looking forward to removing this code as part of my next wave or search service cleanups, likely during Q1).

Sometimes there's code from ContentSearch.jsm that is called from the notification fired by the currentEngine setter. This generates a message for the content process.

Sometimes there's also code from the searchbar (search.xml) that runs as a result of the notification from the currentEngine setter. This can lead to initializing the some of the searchbar visible UI elements twice: once from the notification, and once from the normal initialization code path that can happen later.

Rarely, there's code from BrowserUITelemetry.jsm that joins the party. I think this will happen only if the search service is already fully initialized when BrowserUITelemetry is getting initialized.

All of these are potential candidates for the perf hit, but the double initialization of the searchbar UI seems more likely to have a significant impact. I still think the safest solution is to avoid firing notifications from the search service's currentEngine setter during startup.
This changes the implementation of the currentEngine getter to avoid calling the setter unless it's really necessary. With this patch applied, the currentEngine setter is no longer called during ts_paint (and the currentEngine getter is still called 1 to 4 times... but that's definitely not a regression caused by my previous patch).

The change I had to make to the test_geodefaults.js test seems to indicate that even before my previous patch we were already calling the setter (and so updating the cache on disk) more often than needed.
Attachment #8701089 - Flags: review?(mak77)
Comment on attachment 8701089 [details] [diff] [review]
avoid calling the search service's currentEngine setter during startup, r=?

Please ignore the unrelated aboutDialog.xul changes.
great stuff!
Attachment #8701089 - Flags: review?(mak77) → review+
https://hg.mozilla.org/integration/fx-team/rev/a6a1f948a4665f48205e819c3defa7adf1de1b00
Bug 1230661 - avoid calling the search service's currentEngine setter during startup, r=mak.
Approval Request Comment
[Feature/regressing bug #]: bug 1227045
[User impact if declined]: startup performance regression
[Describe test coverage new/current, TreeHerder]: tested the fix on try.
[Risks and why]: low risk, as the code is well covered by unit tests.
[String/UUID change made/needed]: none.

(note: I would usually wait for the patch to reach mozilla-central before requesting approval, but I'll likely be off by the time this happens, so requesting now.)
Attachment #8701475 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/a6a1f948a466
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 46
Comment on attachment 8701475 [details] [diff] [review]
Patch (as landed)

Fix a perf regression, taking it.
Attachment #8701475 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
[bugday-20160323]

Status: RESOLVED,FIXED -> UNVERIFIED

Comments:
STR: Not clear.
Developer specific testing

Component: 
Name			Firefox
Version			46.0b9
Build ID		20160322075646
Update Channel	beta
User Agent		Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0
OS				Windows 7 SP1 x86_64

Expected Results: 
Developer specific testing

Actual Results: 
As expected
You need to log in before you can comment on or make changes to this bug.