2% Linux 32 v8_7/sessionrestore/a11y regression on Mozilla-Inbound-Non-PGO on June 03, 2015 from push 012638ffaacc

RESOLVED FIXED in Firefox 41

Status

()

RESOLVED FIXED
4 years ago
3 years ago

People

(Reporter: jmaher, Assigned: terrence)

Tracking

Trunk
mozilla42
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox41 fixed, firefox42 fixed)

Details

(Whiteboard: [talos_regression])

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

4 years ago
Talos has detected a Firefox performance regression from your commit 012638ffaacc in bug 1153382.  We need you to address this regression.

This is a list of all known regressions and improvements related to your bug:
http://alertmanager.allizom.org:8080/alerts.html?rev=012638ffaacc&showAll=1

On the page above you can see Talos alert for each affected platform as well as a link to a graph showing the history of scores for this test. There is also a link to a treeherder page showing the Talos jobs in a pushlog format.

To learn more about the regressing test, please see: https://wiki.mozilla.org/Buildbot/Talos/Tests#V8.2C_version_7

Reproducing and debugging the regression:
If you would like to re-run this Talos test on a potential fix, use try with the following syntax:
try: -b o -p linux -u none -t dromaeojs  # add "mozharness: --spsProfile" to generate profile data

To run the test locally and do a more in-depth investigation, first set up a local Talos environment:
https://wiki.mozilla.org/Buildbot/Talos/Running#Running_locally_-_Source_Code

Then run the following command from the directory where you set up Talos:
talos --develop -e <path>/firefox -a v8_7

Making a decision:
As the patch author we need your feedback to help us handle this regression.
*** Please let us know your plans by Monday, or the offending patch will be backed out! ***

Our wiki page oulines the common responses and expectations:
https://wiki.mozilla.org/Buildbot/Talos/RegressionBugsHandling
(Reporter)

Comment 1

4 years ago
I had done some retriggers on tbpl:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&fromchange=3d316d59eace&tochange=834479b06849&filter-searchStr=Ubuntu%20HW%2012.04%20mozilla-inbound%20talos%20dromaeojs

the numbers do drop for this push.

:terrence, can you take a look at this and help determine if we can fix this or need to live with it?
Flags: needinfo?(terrence)
(Reporter)

Comment 3

4 years ago
I don't understand from the referenced bug how this will or won't affect the performance regression.  Could you explain it a bit better.
(Reporter)

Comment 4

4 years ago
I see a11y regressions on pgo from this push, likewise session restore tests.  

:terrence, if I understand bug 1153382 fully, next wednesday you plan to make the code you added debug only (i.e. we won't see issues with talos)?
Flags: needinfo?(terrence)
Summary: 2% Linux 32 v8_7 regression on Mozilla-Inbound-Non-PGO on June 03, 2015 from push 012638ffaacc → 2% Linux 32 v8_7/sessionrestore/a11y regression on Mozilla-Inbound-Non-PGO on June 03, 2015 from push 012638ffaacc
(Assignee)

Comment 5

4 years ago
Yes.
Flags: needinfo?(terrence)
(Reporter)

Comment 6

4 years ago
has this been backed out?  I don't see any improvements/etc. on graph server.
(Reporter)

Comment 7

4 years ago
has this been backed out?
Flags: needinfo?(terrence)
(Assignee)

Comment 8

4 years ago
It turned up a can of worms last week, so I haven't backed out yet. It's part of --enable-js-crash-diagnostics, so we don't have to worry about it uplifting, regardless of when it gets backed out.
Flags: needinfo?(terrence)
(Reporter)

Comment 9

3 years ago
it looks like this might have made it to mozilla-aurora?  I don't have a lot of other talos regressions for sesion_restore_test and we had a nice hefty one.  It is hard to tell.

:terrence, can you confirm that this is not on Aurora?
Flags: needinfo?(terrence)
(Assignee)

Comment 10

3 years ago
Well, it certainly shouldn't be enabled. However, in poking about I found that recently releng has morphed the meaning of "nightly" to mean "this is a CI build", so now all builds on TreeHerder, even mozilla-release are built with the --enable-go-slow flag. Luckily, actual release builds are not yet made from this config, so we're not shipping it, but the intention is to start shipping from TH.

So, I guess that the regression you are seeing is "real" in the sense that it is present and that the safety has been removed on a major footgun. I think at this point our best bet is probably to make poisoning debug-only. Poisoning does make nightly failures more reliable, but it's not clear how useful this is, given that we can't debug them.
Flags: needinfo?(terrence)
(Assignee)

Comment 11

3 years ago
Uhg, actually compartment checking is also under the js-crash-diagnostics flag, and that's much more important. Seems like we're going to need to find a way to make this nightly only again.
(Reporter)

Comment 12

3 years ago
is there a timeline for this?  maybe a way I could turn this off for a try run and test to see what impact this has on the large pile of regressions we have on aurora.
Flags: needinfo?(terrence)
(Reporter)

Comment 13

3 years ago
Terrence, I assumed this would have been resolved last week- what help do you need?
(Assignee)

Comment 14

3 years ago
(In reply to Joel Maher (:jmaher) from comment #12)
> is there a timeline for this?  maybe a way I could turn this off for a try
> run and test to see what impact this has on the large pile of regressions we
> have on aurora.

Oh, you can just run firefox or the shell with JSGC_DISABLE_POISONING=true to disable the poisoning and see the actual performance.
Flags: needinfo?(terrence)
(Assignee)

Comment 15

3 years ago
Created attachment 8632111 [details] [diff] [review]
remove_js_crash_diagnostics_and_switch_to_nightly-v0.diff

After much discussion, we think this is probably the least bad solution.

This patch does:
* Remove --enable-js-diagnostics
* Keep JS_CRASH_DIAGNOSTICS define, but also enable if NIGHTLY_RELEASE
* Use memset for poisoning in release
* Enable JS_EXTRA_POISON in DEBUG builds
Assignee: nobody → terrence
Status: NEW → ASSIGNED
Attachment #8632111 - Flags: review?(sphink)
(Assignee)

Comment 16

3 years ago
And a try build to make sure I didn't typo anything or cause new timeouts with JS_EXTRA_POISON:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2b99071f8955
(Assignee)

Comment 17

3 years ago
Created attachment 8632155 [details] [diff] [review]
remove_js_crash_diagnostics_and_switch_to_nightly-v0.diff

And let's upload the patch for the change, not the patch for the try run.
Attachment #8632111 - Attachment is obsolete: true
Attachment #8632111 - Flags: review?(sphink)
Attachment #8632155 - Flags: review?(sphink)
(Reporter)

Comment 18

3 years ago
on try server I verified that we do in fact see the improvements with this disabled on aurora:

try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=67595ccaf3b8

aurora:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=f89d2e9d776a&filter-searchStr=linux%20talos

our comparison tools assume try is always non-pgo.  In this case aurora is pgo only and I ensured my try push was pgo, so you can compare the new values on the 'opt' with the old values in 'pgo' and it matches up:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-aurora&originalRevision=f89d2e9d776a&newProject=try&newRevision=67595ccaf3b8


Thanks for getting a patch up, I look forward to seeing this live and resolved!
Comment on attachment 8632155 [details] [diff] [review]
remove_js_crash_diagnostics_and_switch_to_nightly-v0.diff

Review of attachment 8632155 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsutil.h
@@ +303,4 @@
>  
> +    // Without a valid Value tag, a poisoned Value may look like a valid
> +    // floating point number. To ensure that we crash more readily when
> +    // observing a poised Value, we make the poison an invalid ObjectValue.

*poisoned Value

@@ +304,5 @@
> +    // Without a valid Value tag, a poisoned Value may look like a valid
> +    // floating point number. To ensure that we crash more readily when
> +    // observing a poised Value, we make the poison an invalid ObjectValue.
> +    // Unfortunately, this adds about 2% more overhead, so we can only
> +    // enable it in debug.

That makes me wonder if we'd still see the 2% loss if this function were templatized on a constant value, and kept the obj value cached. But maybe it's the slop handling that causes the slowdown?

@@ +339,5 @@
>  #else
>  # define JS_POISON(p, val, size) ((void) 0)
>  #endif
>  
> +/* Enable even more poisoing in purely debug builds. */

*poisoning
Attachment #8632155 - Flags: review?(sphink) → review+
This has as a consequence that a normal spidermonkey build (cd js/src; autoconf2.13; ./configure --disable-debug --enable-optimize; make) have poisoning enabled.

Is that the intention? It is already very hard to do measurements. And now by default the shell also doesn't give representable (with release) numbers. This was already an issue with nightly builds. Now we also have it in the shell ...

(In the meantime I've committed https://github.com/h4writer/arewefastyet/commit/d2fedf0774c531838ee6878812fa3030819dcf3b that sets JSGC_DISABLE_POISONING)


Adding nbp, since this also affects our FFOS benchmark scores!
https://hg.mozilla.org/mozilla-central/rev/10e53b335c48
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
status-firefox42: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
(Reporter)

Comment 23

3 years ago
Comment on attachment 8632155 [details] [diff] [review]
remove_js_crash_diagnostics_and_switch_to_nightly-v0.diff

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]:
[Describe test coverage new/current, TreeHerder]:
[Risks and why]: 
[String/UUID change made/needed]:
Attachment #8632155 - Flags: approval-mozilla-aurora?
Joel, could you please provide more info on the patch so I can determine the risk involved in uplifting to Aurora?

Approval Request Comment
[Feature/regressing bug #]: ???
[User impact if declined]: ???
[Describe test coverage new/current, TreeHerder]: ???
[Risks and why]: ???
[String/UUID change made/needed]: ???
Flags: needinfo?(jmaher)
(Reporter)

Comment 25

3 years ago
I am afraid I need to pass onto :terrence, this is just a fix for a series of performance regressions which is now on aurora.

:terrence, could you please provide more info on the patch so :ritu can determine the risk involved in uplifting to Aurora?
Flags: needinfo?(jmaher) → needinfo?(terrence)
(Assignee)

Comment 26

3 years ago
(In reply to Ritu Kothari (:ritu) from comment #24)
> Joel, could you please provide more info on the patch so I can determine the
> risk involved in uplifting to Aurora?
> 
Approval Request Comment
[Feature/regressing bug #]: Unknown. Someone decided it would be a good idea to start treating the "nightly" config as "everything built by infra". This caused some diagnostics code which is supposed to only run on Firefox Nightly to be uplifted to all branches.
[User impact if declined]: An approximate 10% slowdown across the board if this hits a release.
[Describe test coverage new/current, TreeHerder]: Extensive. Both on and off modes are tested in 40+ releases.
[Risks and why]: None. This is just changing when we enable some code to restore things to the way they used to be.
[String/UUID change made/needed]: None.
Flags: needinfo?(terrence)
Comment on attachment 8632155 [details] [diff] [review]
remove_js_crash_diagnostics_and_switch_to_nightly-v0.diff

This seems like a backout, approved.
Attachment #8632155 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(Assignee)

Comment 29

3 years ago
(In reply to Hannes Verschore [:h4writer] from comment #21)
> This has as a consequence that a normal spidermonkey build (cd js/src;
> autoconf2.13; ./configure --disable-debug --enable-optimize; make) have
> poisoning enabled.
> 
> Is that the intention? It is already very hard to do measurements. And now
> by default the shell also doesn't give representable (with release) numbers.
> This was already an issue with nightly builds. Now we also have it in the
> shell ...

No, that was not intentional. I did not expect MOZ_NIGHTLY to be set in shell builds. I'll find something additional to mask out opt shell builds.
You need to log in before you can comment on or make changes to this bug.