Closed Bug 1188510 Opened 5 years ago Closed 3 months ago

3-5% Linux*/Win* damp/tps regression on Mozilla-Inbound-Non-PGO on July 27, 2015 from push 1fbefe335299

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: jmaher, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression])

Talos has detected a Firefox performance regression from commit 1fbefe335299:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=1fbefe335299

This is a list of regressions/improvements:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=e039b166890f&newProject=mozilla-inbound&newRevision=1fbefe335299

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#DAMP

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux,win64,win32 -u none -t g2  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a damp

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Friday, or the offending patch will be backed out! ***

Our wiki page oulines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
this was a checkin-needed bundle of patches landed by a sheriff.  Ideally we can push each to try, maybe there is a chance that we can pick our best candidate or two?

:bgrins, it looks like netmonitor suffered a lot:
https://treeherder.allizom.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=e039b166890f&newProject=mozilla-inbound&newRevision=1fbefe335299&originalSignature=e45b6c2d9826f236d63f5c810074472611535279&newSignature=e45b6c2d9826f236d63f5c810074472611535279

any thoughts on what in this range might be the culprit:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?changeset=1fbefe335299
Flags: needinfo?(bgrinstead)
I don't think this is bug 1187399 because that just adds something that isn't even used yet.
thanks Nick, that seems logical.
No longer blocks: 1187399
Bug 1181976 is just a string caption change so I'd be very surprise if it was it but I'll keep an eye out.
I would guess it is the bug 1148593 patch.
There's also bug 1152893.
the winner is revision 01675d584873 from bug 1148593!
No longer blocks: 1181976
Component: Talos → DOM
Product: Testing → Core
:tromey, it looks like one of your two patches is the winner.  

You can see the raw data of what is regressed by following links here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=e039b166890f&newProject=try&newRevision=7f9b250f597d&hideMinorChanges=1

this appears to be all windows platforms and linux64, although windows 8 seems to be the largest damp regression and a higher tps regression.

questions for you:
* we should try to understand this and fix it as much as possible, could you weigh in here on your thoughts?
* if we believe we can reduce/fix the regression, what is the general timeline for doing so?
Flags: needinfo?(ttromey)
(In reply to Joel Maher (:jmaher) from comment #8)
> :tromey, it looks like one of your two patches is the winner.  
> 
> You can see the raw data of what is regressed by following links here:
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-
> inbound&originalRevision=e039b166890f&newProject=try&newRevision=7f9b250f597d
> &hideMinorChanges=1
> 
> this appears to be all windows platforms and linux64, although windows 8
> seems to be the largest damp regression and a higher tps regression.
> 
> questions for you:
> * we should try to understand this and fix it as much as possible, could you
> weigh in here on your thoughts?
> * if we believe we can reduce/fix the regression, what is the general
> timeline for doing so?

I believe that tromey will be unavailable for the remainder of the week.  From what I understand, this regression is the same set of problems in Bug 1152893.

Based on a conversation with fitzgen, here's what I gathered.  There are two things going on that probably cause the regression:

1. Capturing the stack that will become the async stack for some later call
* This got sped up by his work in Bug 1028418, if you are doing it a lot during the same JS run-to-completion.  Which DAMP probably isn't doing, I'm not sure about TPS.

2. Trimming the length of the stack so that we don't get out-of-memory.
* This is pretty expensive because it means you have to actually copy the n frames you want to keep around (rather than just hold a strong ref to the whole stack, and its async stack, and the async stack's async stack, etc)

So we need to figure out how to fix part 2.  Will leave the ni? open - I'm not sure of the possibility of coming up with a meaningful fix for that before the merge day on 8-10.
Blocks: async-stacks
FWIW IIUC then the async stacks pref is only set on Nightly, so it might not be too bad to let that commit ship with the uplift.
(In reply to Joel Maher (:jmaher) from comment #8)

> questions for you:
> * we should try to understand this and fix it as much as possible, could you
> weigh in here on your thoughts?

Brian was correct, this is bug 1152893.

> * if we believe we can reduce/fix the regression, what is the general
> timeline for doing so?

Nick was correct that this is disabled on release.
https://dxr.mozilla.org/mozilla-central/source/modules/libpref/init/all.js?from=all.js&case=true#1100

If the current state is unacceptable I would propose disabling the pref in the short term.
Flags: needinfo?(ttromey)
(In reply to Tom Tromey :tromey from comment #11)
> If the current state is unacceptable I would propose disabling the pref in
> the short term.

It's not as much as unacceptable as asking for your opinion on this.

E.g. what is the value from not disabling it in the short term? If there's a big value in keeping it enabled, then maybe we should.
:tromey, I think :avih was asking you for your advice on benefits of leaving it in vs disabling it temporarily.
Flags: needinfo?(ttromey)
(In reply to Joel Maher (:jmaher) from comment #13)
> :tromey, I think :avih was asking you for your advice on benefits of leaving
> it in vs disabling it temporarily.

Sorry about that.

(In reply to Avi Halachmi (:avih) from comment #12)

> It's not as much as unacceptable as asking for your opinion on this.
> 
> E.g. what is the value from not disabling it in the short term? If there's a
> big value in keeping it enabled, then maybe we should.

The benefit is that it gives more useful stack traces in some situations;
in particular it can associate asynchronous events with their cause.  This can
be handy for debugging.

Also, as Nick said, it is disabled on release.

So I tend to think that the upsides outweigh the downsides.
Flags: needinfo?(ttromey)
Component: DOM → DOM: Core & HTML

Closing as a 4y old DAMP regression is probably not going to be picked up any time soon.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.