Closed Bug 710296 Opened 13 years ago Closed 9 years ago

turn on responsiveness for linux in Talos

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox47 fixed)

RESOLVED FIXED
mozilla47
Tracking Status
firefox47 --- fixed

People

(Reporter: k0scist, Unassigned)

References

Details

Attachments

(1 file)

I notice this comment at http://hg.mozilla.org/build/talos/file/705a12151bf2/talos/ttest.py#l256 : # ignore responsiveness tests on linux until we fix Bug 697555 bug 697555 is now fixed. Can we remove this code and turn on responsiveness for linux again?
No, that bug only wound up dealing with the OS X side of things. There's still something broken with Linux where it crashes on shutdown.
Blocks: 683229
We should try this after the patch in bug 730282 lands, since it fixes a potential crash on Linux/OSX.
Depends on: 730282
do we have a need to do this? If so, I would like to get someone looking at this, otherwise with so many years gone by, this isn't much of a priority.
Flags: needinfo?(ted)
I would ask the e10s folks. If they want it then yes. If they are ambivalent or don't want it, then let's just WONTFIX this.
Flags: needinfo?(ted)
:blassey, can you determine if we want to look into this, or help find someone who could have that discussion and drive a decision?
Flags: needinfo?(blassey.bugs)
Flags: needinfo?(blassey.bugs) → needinfo?(jmathies)
Where are the results for these tests? I'm looking through tp5 results on mc and I'm not seeing them. We currently use telemetry to check for regressions in main thread jank. Having a 'canary in the coal mine' talos test that measures this is useful.
Flags: needinfo?(jmaher)
it appears we don't run these at all- my memory isn't the best, but here is some historical data: http://graphs.mozilla.org/graph.html#tests=[[275,1,49],[275,1,47],[275,1,31],[275,1,25]]&sel=none&displayrange=365&datatype=geo it appears we turned this off, I suspect it was that a few reasons: 1) failure to collect data often: https://bugzilla.mozilla.org/show_bug.cgi?id=1174160 2) interest in this, but nothing conclusive: https://bugzilla.mozilla.org/show_bug.cgi?id=1170241 3) we accidentally disabled this 5.5 months ago I couldn't find any reason why this was disabled specifically. If this is useful we would need to: * turn it on * ensure it is stable and doesn't fail to collect, hang jobs * find out how to not post 8000 data points, ideally just 1 * ensure the results are normal, not random so if we get a regression it is worthwhile to track down. :jimm, what are your thoughts here?
Flags: needinfo?(jmaher)
(In reply to Joel Maher (:jmaher) from comment #8) > it appears we don't run these at all- my memory isn't the best, but here is > some historical data: > http://graphs.mozilla.org/graph.html#tests=[[275,1,49],[275,1,47],[275,1,31], > [275,1,25]]&sel=none&displayrange=365&datatype=geo > > it appears we turned this off, I suspect it was that a few reasons: > 1) failure to collect data often: > https://bugzilla.mozilla.org/show_bug.cgi?id=1174160 > 2) interest in this, but nothing conclusive: > https://bugzilla.mozilla.org/show_bug.cgi?id=1170241 > 3) we accidentally disabled this 5.5 months ago > > I couldn't find any reason why this was disabled specifically. If this is > useful we would need to: > * turn it on > * ensure it is stable and doesn't fail to collect, hang jobs > * find out how to not post 8000 data points, ideally just 1 > * ensure the results are normal, not random so if we get a regression it is > worthwhile to track down. > > :jimm, what are your thoughts here? This sounds like a useful test to me. If someone landed code that janked the main browser thread it'd be nice to know it. I'm not sure how often that happens but a test to be sure it doesn't would be useful. Especially now that we are doing multiple processes where one process might spend (too much) time blocked on another.
Flags: needinfo?(jmathies)
After reading it over, this kinds feels like a dupe of bug 710296.
this is sounding like a useful thing- did you mean to link another bug as a possible dupe? At the very least I can turn this on for a try run and see what works/doesn't
(In reply to Joel Maher (:jmaher) from comment #11) > this is sounding like a useful thing- did you mean to link another bug as a > possible dupe? > > At the very least I can turn this on for a try run and see what works/doesn't Whoops, I meant bug 1170241. We had a discussion there last spring about the usefulness of this test in tracking regressions under e10s. My opinion hasn't changed since then, I think we should keep it and improve on it over time. :)
ok, I turned this on for all platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6193e1d7008 it seems to be recording data, I am going to get this fix pushed and see how stable the data is and work on tracking it. Is it useful to track e10s and non e10s?
Flags: needinfo?(jmathies)
(In reply to Joel Maher (:jmaher) from comment #13) > ok, I turned this on for all platforms: > https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6193e1d7008 > > it seems to be recording data, I am going to get this fix pushed and see how > stable the data is and work on tracking it. Is it useful to track e10s and > non e10s? Yes!
Flags: needinfo?(jmathies)
I wonder if tp5o_responsiveness_paint is broken under e10s? the delta there is huge.
The actual data collection here is pretty simple--it just fires tracer events into the native event queue and times how long they take to get handled. This is only done in the chrome process, so I could believe that in certain scenarios e10s would have way lower results, where the content process is doing the bulk of the work and not tying up the chrome process event loop.
yeah, we would need a custom measurement for e10s if we wanted to measure something outside of the chrome process.
osx is quite crazy, but windows and linux seem to be stable.
and a corresponding try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=e43d82e4b369 this will ignore osx, but turn on responsiveness for linux/win* on tp5.
Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss https://reviewboard.mozilla.org/r/33483/#review30187 ::: testing/talos/talos/config.py:326 (Diff revision 1) > + Unless I'm missing something, I think you can bring back 'responsiveness' in GLOBAL_OVERRIDES, and that should allow to remove this block of code (it will be handled automatically with the "for key in GLOBAL_OVERRIDES" above). ::: testing/talos/talos/ttest.py:93 (Diff revision 1) > - if test_config.get('responsiveness') and \ > + if test_config.get('responsiveness', False) and \ False is not required here, you can remove it.
Attachment #8715485 - Flags: review?(j.parkouss)
I actually found out that we were always failing to run responsiveness as the global_configs set it to False because it is a cli arg with a False default value. This overrides the value set in test.py (True for tp5o). So we accidentally turned this off with the update to config.py/etc.
Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss (In reply to Joel Maher (:jmaher) from comment #22) > I actually found out that we were always failing to run responsiveness as > the global_configs set it to False because it is a cli arg with a False > default value. This overrides the value set in test.py (True for tp5o). > > So we accidentally turned this off with the update to config.py/etc. Oh ok ! Nevermind then, r+. Maybe add a comment to explain that ? We should probably try to improve this weird arg parsing logic (again), but that is another story.
Attachment #8715485 - Flags: review+
Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss Review request updated; see interdiff: https://reviewboard.mozilla.org/r/33483/diff/1-2/
Attachment #8715485 - Flags: review+ → review?(j.parkouss)
Comment on attachment 8715485 [details] MozReview Request: Bug 710296 - turn on responsiveness for linux in Talos. r?parkouss https://reviewboard.mozilla.org/r/33483/#review30287 Thanks for looking deeper into this! This looks fine to me, no more --responsiveness parameter which is automatically set on tests that requires it. (e.g. tp5o).
Attachment #8715485 - Flags: review?(j.parkouss) → review+
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
This wierdly seems to have "regressed" tp5o % cpu on win7. Any ideas? https://treeherder.allizom.org/perf.html#/alerts?id=20 It doesn't make any sense, but looking at the graph the new behaviour unmistakably started with this push. Unless something else is going on....
Flags: needinfo?(jmaher)
yes, this would be fine as we are doing more cpu and work to record and dump the output.
Flags: needinfo?(jmaher)
But that should happen on *linux* not windows. This makes no sense, on :Callek's suggestion I'm doing a retrigger on a older revision off a windows xp job which exhibits this behavior: https://treeherder.allizom.org/perf.html#/graphs?series=[mozilla-inbound,1fc0f3ae22619bd040312c9c7dc14e77247d7939,1]&highlightedRevisions=633d74f817b2 Retriggers here: https://treeherder.allizom.org/#/jobs?repo=mozilla-inbound&revision=751e2bc571ab&filter-searchStr=windows%207%20talos&selectedJob=18442797
Ok that seemed to give the old set of values. I wonder if this changeset actually did somehow change the windows numbers? Doing a bunch of retriggers to confirm, results to be reported here: https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=865b01fa7947&newProject=mozilla-inbound&newRevision=d9520249a4a0&framework=1&filter=%25&showOnlyImportant=0
we had an error and accidentally turned this off for windows, so this fix turned it on for linux and back on for windows. osx is still disabled.
(In reply to Jim Mathies [:jimm] from comment #15) > I wonder if tp5o_responsiveness_paint is broken under e10s? the delta there > is huge. Reading the description here https://wiki.mozilla.org/Buildbot/Talos/Tests#Responsiveness Suggests that the test result value is: > return sum([float(x)*float(x) / 1000000.0 for x in val_list]) This seem to me to depend on the number of collected events, and it's not normalized for the number of events. So less events -> much "better" result. Maybe that's the main culprit here?
we see the tp5o memory/cpu issue in pgo on aurora now: https://treeherder.mozilla.org/perf.html#/alerts?id=320
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: