710296 - turn on responsiveness for linux in Talos

Reporter

Description

•

13 years ago

I notice this comment at http://hg.mozilla.org/build/talos/file/705a12151bf2/talos/ttest.py#l256 : # ignore responsiveness tests on linux until we fix Bug 697555 bug 697555 is now fixed. Can we remove this code and turn on responsiveness for linux again?

(not currently active) Ted Mielczarek

Comment 1

•

13 years ago

No, that bug only wound up dealing with the OS X side of things. There's still something broken with Linux where it crashes on shutdown.

Benoit Girard (:BenWa)

Updated

•

13 years ago

Blocks: 683229

(not currently active) Ted Mielczarek

Comment 2

•

13 years ago

We should try this after the patch in bug 730282 lands, since it fixes a potential crash on Linux/OSX.

Depends on: 730282

Joel Maher ( :jmaher ) (UTC -8)

Comment 3

•

13 years ago

out of 40 runs, I had 2 crashes: https://tbpl.mozilla.org/?tree=Try&rev=c60b10defc46

Joel Maher ( :jmaher ) (UTC -8)

Comment 4

•

9 years ago

do we have a need to do this? If so, I would like to get someone looking at this, otherwise with so many years gone by, this isn't much of a priority.

Flags: needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 5

•

9 years ago

I would ask the e10s folks. If they want it then yes. If they are ambivalent or don't want it, then let's just WONTFIX this.

Flags: needinfo?(ted)

Joel Maher ( :jmaher ) (UTC -8)

Comment 6

•

9 years ago

:blassey, can you determine if we want to look into this, or help find someone who could have that discussion and drive a decision?

Flags: needinfo?(blassey.bugs)

Jim Mathies [:jimm]

Updated

•

9 years ago

Flags: needinfo?(blassey.bugs) → needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 7

•

9 years ago

Where are the results for these tests? I'm looking through tp5 results on mc and I'm not seeing them. We currently use telemetry to check for regressions in main thread jank. Having a 'canary in the coal mine' talos test that measures this is useful.

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 8

•

9 years ago

it appears we don't run these at all- my memory isn't the best, but here is some historical data: http://graphs.mozilla.org/graph.html#tests=[[275,1,49],[275,1,47],[275,1,31],[275,1,25]]&sel=none&displayrange=365&datatype=geo it appears we turned this off, I suspect it was that a few reasons: 1) failure to collect data often: https://bugzilla.mozilla.org/show_bug.cgi?id=1174160 2) interest in this, but nothing conclusive: https://bugzilla.mozilla.org/show_bug.cgi?id=1170241 3) we accidentally disabled this 5.5 months ago I couldn't find any reason why this was disabled specifically. If this is useful we would need to: * turn it on * ensure it is stable and doesn't fail to collect, hang jobs * find out how to not post 8000 data points, ideally just 1 * ensure the results are normal, not random so if we get a regression it is worthwhile to track down. :jimm, what are your thoughts here?

Flags: needinfo?(jmaher)

Jim Mathies [:jimm]

Comment 9

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #8) > it appears we don't run these at all- my memory isn't the best, but here is > some historical data: > http://graphs.mozilla.org/graph.html#tests=[[275,1,49],[275,1,47],[275,1,31], > [275,1,25]]&sel=none&displayrange=365&datatype=geo > > it appears we turned this off, I suspect it was that a few reasons: > 1) failure to collect data often: > https://bugzilla.mozilla.org/show_bug.cgi?id=1174160 > 2) interest in this, but nothing conclusive: > https://bugzilla.mozilla.org/show_bug.cgi?id=1170241 > 3) we accidentally disabled this 5.5 months ago > > I couldn't find any reason why this was disabled specifically. If this is > useful we would need to: > * turn it on > * ensure it is stable and doesn't fail to collect, hang jobs > * find out how to not post 8000 data points, ideally just 1 > * ensure the results are normal, not random so if we get a regression it is > worthwhile to track down. > > :jimm, what are your thoughts here? This sounds like a useful test to me. If someone landed code that janked the main browser thread it'd be nice to know it. I'm not sure how often that happens but a test to be sure it doesn't would be useful. Especially now that we are doing multiple processes where one process might spend (too much) time blocked on another.

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 10

•

9 years ago

After reading it over, this kinds feels like a dupe of bug 710296.

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

9 years ago

this is sounding like a useful thing- did you mean to link another bug as a possible dupe? At the very least I can turn this on for a try run and see what works/doesn't

Jim Mathies [:jimm]

Comment 12

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #11) > this is sounding like a useful thing- did you mean to link another bug as a > possible dupe? > > At the very least I can turn this on for a try run and see what works/doesn't Whoops, I meant bug 1170241. We had a discussion there last spring about the usefulness of this test in tracking regressions under e10s. My opinion hasn't changed since then, I think we should keep it and improve on it over time. :)

Joel Maher ( :jmaher ) (UTC -8)

Comment 13

•

9 years ago

ok, I turned this on for all platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6193e1d7008 it seems to be recording data, I am going to get this fix pushed and see how stable the data is and work on tracking it. Is it useful to track e10s and non e10s?

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 14

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #13) > ok, I turned this on for all platforms: > https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6193e1d7008 > > it seems to be recording data, I am going to get this fix pushed and see how > stable the data is and work on tracking it. Is it useful to track e10s and > non e10s? Yes!

Flags: needinfo?(jmathies)

Jim Mathies [:jimm]

Comment 15

•

9 years ago

I wonder if tp5o_responsiveness_paint is broken under e10s? the delta there is huge.

(not currently active) Ted Mielczarek

Comment 16

•

9 years ago

The actual data collection here is pretty simple--it just fires tracer events into the native event queue and times how long they take to get handled. This is only done in the chrome process, so I could believe that in certain scenarios e10s would have way lower results, where the content process is doing the bulk of the work and not tying up the chrome process event loop.

Joel Maher ( :jmaher ) (UTC -8)

Comment 17

•

9 years ago

yeah, we would need a custom measurement for e10s if we wanted to measure something outside of the chrome process.

Joel Maher ( :jmaher ) (UTC -8)

Comment 18

•

9 years ago

osx is quite crazy, but windows and linux seem to be stable.

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

•

9 years ago

Attached file MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss — Details

Review commit: https://reviewboard.mozilla.org/r/33483/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/33483/

Attachment #8715485 - Flags: review?(j.parkouss)

Joel Maher ( :jmaher ) (UTC -8)

Comment 20

•

9 years ago

and a corresponding try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e43d82e4b369 this will ignore osx, but turn on responsiveness for linux/win* on tp5.

Julien Pagès (:parkouss)

Comment 21

•

9 years ago

Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss https://reviewboard.mozilla.org/r/33483/#review30187 ::: testing/talos/talos/config.py:326 (Diff revision 1) > + Unless I'm missing something, I think you can bring back 'responsiveness' in GLOBAL_OVERRIDES, and that should allow to remove this block of code (it will be handled automatically with the "for key in GLOBAL_OVERRIDES" above). ::: testing/talos/talos/ttest.py:93 (Diff revision 1) > - if test_config.get('responsiveness') and \ > + if test_config.get('responsiveness', False) and \ False is not required here, you can remove it.

Attachment #8715485 - Flags: review?(j.parkouss)

Joel Maher ( :jmaher ) (UTC -8)

Comment 22

•

9 years ago

I actually found out that we were always failing to run responsiveness as the global_configs set it to False because it is a cli arg with a False default value. This overrides the value set in test.py (True for tp5o). So we accidentally turned this off with the update to config.py/etc.

Julien Pagès (:parkouss)

Comment 23

•

9 years ago

Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss (In reply to Joel Maher (:jmaher) from comment #22) > I actually found out that we were always failing to run responsiveness as > the global_configs set it to False because it is a cli arg with a False > default value. This overrides the value set in test.py (True for tp5o). > > So we accidentally turned this off with the update to config.py/etc. Oh ok ! Nevermind then, r+. Maybe add a comment to explain that ? We should probably try to improve this weird arg parsing logic (again), but that is another story.

Attachment #8715485 - Flags: review+

Joel Maher ( :jmaher ) (UTC -8)

Comment 24

•

9 years ago

Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss Review request updated; see interdiff: https://reviewboard.mozilla.org/r/33483/diff/1-2/

Attachment #8715485 - Flags: review+ → review?(j.parkouss)

Julien Pagès (:parkouss)

Comment 25

•

9 years ago

Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss https://reviewboard.mozilla.org/r/33483/#review30287 Thanks for looking deeper into this! This looks fine to me, no more --responsiveness parameter which is automatically set on tests that requires it. (e.g. tp5o).

Attachment #8715485 - Flags: review?(j.parkouss) → review+

Pulsebot

Comment 26

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/d9520249a4a0

Carsten Book [:Tomcat]

Comment 27

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/d9520249a4a0

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox47: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla47

William Lachance (:wlach)

Comment 28

•

9 years ago

This wierdly seems to have "regressed" tp5o % cpu on win7. Any ideas? https://treeherder.allizom.org/perf.html#/alerts?id=20 It doesn't make any sense, but looking at the graph the new behaviour unmistakably started with this push. Unless something else is going on....

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 29

•

9 years ago

yes, this would be fine as we are doing more cpu and work to record and dump the output.

Flags: needinfo?(jmaher)

William Lachance (:wlach)

Comment 30

•

9 years ago

But that should happen on *linux* not windows. This makes no sense, on :Callek's suggestion I'm doing a retrigger on a older revision off a windows xp job which exhibits this behavior: https://treeherder.allizom.org/perf.html#/graphs?series=[mozilla-inbound,1fc0f3ae22619bd040312c9c7dc14e77247d7939,1]&highlightedRevisions=633d74f817b2 Retriggers here: https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=751e2bc571ab&filter-searchStr=windows%207%20talos&selectedJob=18442797

William Lachance (:wlach)

Comment 31

•

9 years ago

Ok that seemed to give the old set of values. I wonder if this changeset actually did somehow change the windows numbers? Doing a bunch of retriggers to confirm, results to be reported here: https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=865b01fa7947&newProject=mozilla-inbound&newRevision=d9520249a4a0&framework=1&filter=%25&showOnlyImportant=0

Joel Maher ( :jmaher ) (UTC -8)

Comment 32

•

9 years ago

we had an error and accidentally turned this off for windows, so this fix turned it on for linux and back on for windows. osx is still disabled.

Avi Halachmi (:avih)

Comment 33

•

9 years ago

(In reply to Jim Mathies [:jimm] from comment #15) > I wonder if tp5o_responsiveness_paint is broken under e10s? the delta there > is huge. Reading the description here https://wiki.mozilla.org/Buildbot/Talos/Tests#Responsiveness Suggests that the test result value is: > return sum([float(x)*float(x) / 1000000.0 for x in val_list]) This seem to me to depend on the number of collected events, and it's not normalized for the number of events. So less events -> much "better" result. Maybe that's the main culprit here?

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Blocks: 1244255

Joel Maher ( :jmaher ) (UTC -8)

Comment 34

•

9 years ago

we see the tp5o memory/cpu issue in pgo on aurora now: https://treeherder.mozilla.org/perf.html#/alerts?id=320