Closed Bug 1262203 Opened 4 years ago Closed 4 years ago
.54% tp5o responsiveness (windowsxp) regression on push dc32b86990d5 (Wed Mar 30 2016)
Talos has detected a Firefox performance regression from push dc32b86990d5: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=6c7e6cf636af As author of one of the patches included in that push, we need your help to address this regression. This is a list of all known regressions and improvements related to the push: https://treeherder.mozilla.org/perf.html#/alerts?id=689 On the page above you can see an alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format. To learn more about the regressing test(s), please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#tp5 Reproducing and debugging the regression: If you would like to re-run this Talos test on a potential fix, use try with the following syntax: try: -b o -p win32 -u none -t tp5o-e10s[Windows XP] --rebuild 5 # add "mozharness: --spsProfile" to generate profile data (we suggest --rebuild 5 to be more confident in the results) To run the test locally and do a more in-depth investigation, first set up a local Talos environment: https://wiki.mozilla.lorg/Buildbot/Talos/Running#Running_locally_-_Source_Code Then run the following command from the directory where you set up Talos: talos --develop -e [path]/firefox -a tp5o --e10s Making a decision: As the patch author we need your feedback to help us handle this regression. *** Please let us know your plans by Friday, or the offending patch(es) will be backed out! *** Our wiki page outlines the common responses and expectations: https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
:terrence, I see that :jonco is on pto for another week- he authored all the patches in the push, and you had reviewed them all. Could you help weigh in here on why we are seeing the responsiveness regress on winxp e10s only? I did many retriggers of this push vs the previous push on the tree and have a compare view: https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=cbf86e5e944f&newProject=mozilla-inbound&newRevision=6c7e6cf636af&framework=1&filter=e10s no other issues showing up that I can see. If you need me to bisect it down on try, I would be happy to- I thought I would ask in case something obvious stands out.
Product: Firefox → Core
I'm honestly pretty surprised to see any of this regressing anything. 6c7e6cf636af - Only adds code to DEBUG builds, so probably not implicated. dbbe7c7e305d - Again only changes some assertions around, so also not implicated. 457e7d5b1136 - Traces one new edge in some shapes, but only for very large objects. This is a core GC invariant that we were violating and getting lucky on. This is required for compacting Shapes. The |table| member should be in the same cache line as |this|, so I doubt it's the regression. 24b56ce6d8db - Changes the order of the members in a frequently used GC data structure. This not the most frequently touched field and it should have improved the alignment characteristics of the more accessed fields. That said, I see a couple things we could be doing better here.
thanks Terrence- want me to push to try with/without 24b56ce6d8db and see if this changes at all? do you have a few things you want to try out to optimize 24b56ce6d8db slightly?
(In reply to Joel Maher (:jmaher) from comment #3) > thanks Terrence- want me to push to try with/without 24b56ce6d8db and see if > this changes at all? do you have a few things you want to try out to > optimize 24b56ce6d8db slightly? Yes, please! It's sometimes still possible to see what might have caused a 20% regression, but on modern machines it's basically impossible to tell if something is going to cause a 5% regression. Maybe the compiler got confused and causes the pipeline to stall now; maybe we went slightly outside the bounds of intel's optimal caching strategy; maybe that slow divide is the culprit, but maybe it doesn't matter because we're already waiting on a prior load; maybe we changed the instruction's address and now our icache is shared with something that's also hot and we keep getting evicted. I've even heard of one case where just moving a block of code below its usage "won" a 15% speedup. There is just really no telling. At least not without vTune and a free afternoon. Unless there's major algorithmic change in the patch *and* the regression is larger than 20%, it's probably going to be nearly impossible to track down the exact cause and even if you do, it's almost certainly going to be some aspect of the platform that is out of your control anyway.
Reordering the object initialization writes to be in order won us 2% on Early Boyer (which create ~2B objects). Maybe if this test is dominated by nursery GC's that tenure most things? I seriously doubt this will do anything, but it's good hygiene in any case.
Attachment #8738245 - Flags: review?(sphink)
Comment on attachment 8738245 [details] [diff] [review] do_relocation_writes_in_order-v0.diff Review of attachment 8738245 [details] [diff] [review]: ----------------------------------------------------------------- Hopefully something is doing this automatically -- if not the compiler, then the hardware. But anything could happen, and you're right, this is better hygiene.
Attachment #8738245 - Flags: review?(sphink) → review+
3 pushes to try: https://email@example.com&fromchange=3065885fc201&tochange=d6fdb966d939 1) baseline (tip of inbound) 2) baseline + backout of patch 3) baseline + above patch I will probably get this data sorted out tomorrow morning.
all 3 try pushes are not showing any improvement- they fall on the same range of the regressed responsiveness number. I realized I didn't backout the right patch, so here are two backouts (1 each of the non debug only changes): https://firstname.lastname@example.org&fromchange=e52220ff7dc4&tochange=2097ad115305 a few hours and we should have results.
backing out 457e7d5b1136 seems to fix this regression. this was the patch which most likely wasn't the suspect.
(In reply to Joel Maher (:jmaher) from comment #9) > backing out 457e7d5b1136 seems to fix this regression. this was the patch > which most likely wasn't the suspect. Well, that's pretty unfortunate, as that's also the only patch fixing a real bug. :-(
https://hg.mozilla.org/integration/mozilla-inbound/rev/b548578a1980f2da63e4ca9c4b41b55f17be53de Bug 1262203 - Do GC relocation writes in order; r=sfink
:terrence, given the fact that the other patch from revision 457e7d5b1136 was the problem, do you know of anything else we can do here?
I don't see anything obvious, maybe Jon will have some ideas.
Status: RESOLVED → REOPENED
Flags: needinfo?(terrence) → needinfo?(jcoppeard)
Resolution: FIXED → ---
Here's a patch to skip tracing any shape table if we've arrived at a base shape by tracing a shape. I think this should do the trick and cut out most shape table tracing. Currently waiting for talos runs to confirm this. Terrence, hopefully this isn't too horrible!
Attachment #8740068 - Flags: review?(terrence)
Comment on attachment 8740068 [details] [diff] [review] bug1262203-skip-shape-table-tracing Review of attachment 8740068 [details] [diff] [review]: ----------------------------------------------------------------- Yeah, that's pretty reasonable.
Attachment #8740068 - Flags: review?(terrence) → review+
Status: REOPENED → RESOLVED
Closed: 4 years ago → 4 years ago
Resolution: --- → FIXED
verified this is fixed.
You need to log in before you can comment on or make changes to this bug.