Closed Bug 1352664 Opened 7 years ago Closed 7 years ago

Intermittent e10s toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Sanity check (parent process: 0): totalSystemTime of component is not impossibly larger than that of process: 64097 <= 61005 - false == true - JS frame :: chrome:/

Categories

(Toolkit :: Performance Monitoring, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: aryx, Assigned: Gijs)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1230018 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=87991592&repo=autoland

16:41:45     INFO - TEST-PASS | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Searching by title, the frames don't show up in the list of components - true == true - 
16:41:45     INFO - Searching for window title 'Main frame for test browser_compartments.js 0.845897513513775' in [] (I hope to find it)
16:41:45     INFO - Searching by title, we didn't find the main frame
16:41:45     INFO - Console message: [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: "http://example.com/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.html?test=0.0453446058246062" line: 0}]
16:41:45     INFO - Setting titles
16:41:45     INFO - Buffered messages finished
16:41:45     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Sanity check (parent process: 0): totalSystemTime of component is not impossibly larger than that of process: 64097 <= 61005 - false == true - JS frame :: chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js :: ok :: line 85
16:41:45     INFO - Stack trace:
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:ok:85
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:leq:88
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:monotinicity_tester/frameCheck<:168
16:41:45     INFO -     setInterval handler*monotinicity_tester@chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:206:18
16:41:45     INFO -     test@chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:234:5
16:41:45     INFO -     TaskImpl_run@resource://gre/modules/Task.jsm:319:42
16:41:45     INFO -     process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:922:23
16:41:45     INFO -     walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:806:7
16:41:45     INFO -     Promise*scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:739:11
16:41:45     INFO -     schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:770:7
16:41:45     INFO -     completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:707:7
16:41:45     INFO -     receiveMessage@resource://testing-common/ContentTask.jsm:113:9
16:41:45     INFO - Titles set
16:41:45     INFO - Searching for frame title 'Subframe for test browser_compartments.js 0.9269153926364134' in [] (I hope not to find it)
16:41:45     INFO - TEST-PASS | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Searching by title, the frames don't show up in the list of components - true == true -
48 failures in the last week so far...this picked up frequency, all on osx debug :)
Whiteboard: [stockwell needswork]
yay for someone retriggering:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=osx%20debug%20mochitest-e10s-browser-chrome&tochange=77f5615436bb7be1279c367eb188bb5c3c3c6c97&fromchange=f6bba83342c4a2bf8888b623f40c0e8aaf38a0f7&group_state=expanded&selectedJob=89065460

this looks to be related to bug 1310150 based on 3 failures in 16 runs and 0 failures in 16 runs on the 3 previous commits.

:mattn, do you have ideas on why this would be happening on your change which looks to be eslint related comments?
Blocks: 1310150
Flags: needinfo?(MattN+bmo)
(In reply to Joel Maher ( :jmaher) from comment #3)
> yay for someone retriggering:
> https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-
> searchStr=osx%20debug%20mochitest-e10s-browser-
> chrome&tochange=77f5615436bb7be1279c367eb188bb5c3c3c6c97&fromchange=f6bba8334
> 2c4a2bf8888b623f40c0e8aaf38a0f7&group_state=expanded&selectedJob=89065460
> 
> this looks to be related to bug 1310150 based on 3 failures in 16 runs and 0
> failures in 16 runs on the 3 previous commits.
> 
> :mattn, do you have ideas on why this would be happening on your change
> which looks to be eslint related comments?

This makes no sense, there are no code changes in this commit. :-\

If I had to guess, I would have thought this was bug 1309946, which is a few revs down from there and actually looks related.
I agree with Gijs' analysis.
No longer blocks: 1310150
Flags: needinfo?(MattN+bmo)
let me do more retriggers- I was confused by this as well.
:gijs, you were right!  bug 1309946 looks to be the culprit and it makes sense since the test was adjusted there.  Any thoughts on how to fix this?
Blocks: 1309946
Flags: needinfo?(gijskruitbosch+bugs)
This assert fails:

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js#169

The code looks like this:

    // Sanity check on components data.
    let map = new Map();
    for (let item of snapshot.componentsData) {
      for (let [probe, k] of [
        ["jank", "totalUserTime"],
        ["jank", "totalSystemTime"],
        ["cpow", "totalCPOWTime"]
      ]) {
        // Note that we cannot expect components data to be always smaller
        // than process data, as `getrusage` & co are not monotonic.
        SilentAssert.leq(item[probe][k], 3 * snapshot.processData[probe][k],
          `Sanity check (${name}): ${k} of component is not impossibly larger than that of process`);
      }
...

I don't understand why this would ever happen. Feels like a bug to me. How can the totalSystemTime we measure for a single component be larger than 3 times the totalSystemTime we measure for the entire process? Yoric?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dteller)
(In reply to :Gijs from comment #9)
> I don't understand why this would ever happen. Feels like a bug to me. How
> can the totalSystemTime we measure for a single component be larger than 3
> times the totalSystemTime we measure for the entire process? Yoric?

... because time measures are not reliable and sometimes, time flows backwards. Of course, different kinds of times may flow in different directions at the same time. Fun, isn't it? The work on performance monitoring has been a non-stop nightmare dealing with this kind of issues.
Flags: needinfo?(dteller)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #11)
> (In reply to :Gijs from comment #9)
> > I don't understand why this would ever happen. Feels like a bug to me. How
> > can the totalSystemTime we measure for a single component be larger than 3
> > times the totalSystemTime we measure for the entire process? Yoric?
> 
> ... because time measures are not reliable and sometimes, time flows
> backwards. Of course, different kinds of times may flow in different
> directions at the same time. Fun, isn't it? The work on performance
> monitoring has been a non-stop nightmare dealing with this kind of issues.

Uh, OK... should we just disable the test? Is there something we can do to debug? Under what circumstances does time (according to the perf monitoring system / computer) flow backwards?
Flags: needinfo?(dteller)
(In reply to :Gijs (away until Tuesday 18) from comment #12)
> Uh, OK... should we just disable the test? Is there something we can do to
> debug? Under what circumstances does time (according to the perf monitoring
> system / computer) flow backwards?

Unspecified by the OS, but typically, each CPU (or even each core, depending on the architecture) has a different clock, so this may happen whenever the OS reschedules the thread/process to another CPU/core. Also, I believe the clocks readjust themselves to each other from time to time, unpredictably. So, in practice, whenever.
Flags: needinfo?(dteller)
ouch, this is failing a lot more, I will let :gijs get back and caught up- will ping on Wednesday/Thursday to see if we should disable this temporarily or finish up a fix.
Assignee: nobody → gijskruitbosch+bugs
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #14)
> (In reply to :Gijs (away until Tuesday 18) from comment #12)
> > Uh, OK... should we just disable the test? Is there something we can do to
> > debug? Under what circumstances does time (according to the perf monitoring
> > system / computer) flow backwards?
> 
> Unspecified by the OS, but typically, each CPU (or even each core, depending
> on the architecture) has a different clock, so this may happen whenever the
> OS reschedules the thread/process to another CPU/core. Also, I believe the
> clocks readjust themselves to each other from time to time, unpredictably.
> So, in practice, whenever.

Actually, I think the problem here is with the code and/or test, not with clocks. AIUI the snapshot code includes both parent and child process time, here: 

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/perfmonitoring/PerformanceStats.jsm#847-860

Which would explain why the time for any given component in the child process could easily surpass the total time for the parent process, because they're in different processes. So if the parent is idle and the child isn't, the child time is going to be (much) larger.

I don't think we can make meaningful assertions about the totalSystemTime of the parent process against the totalSystemTime of a component in the child process. Am I missing something?
Flags: needinfo?(dteller)
While the clock issues exist, it's entirely possible that I missed an unrelated problem. I don't have time to check today, but I'll try to check by the end of the week.
Flags: needinfo?(dteller)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #18)
> While the clock issues exist, it's entirely possible that I missed an
> unrelated problem. I don't have time to check today, but I'll try to check
> by the end of the week.

OK, but given the failure rate, we should kill the intermittent here and that will give us breathing space to look into this properly. I'll put up a patch to simply bump the limit and ask you for a rubberstamp.
Thanks for working on this bug, I think a temporary fix/disabling this week sounds like a reasonable solution so we can fix this right.
Comment on attachment 8859578 [details]
Bug 1352664 - temporarily increase threshold for per-compartment timing checks to avoid intermittent orange,

https://reviewboard.mozilla.org/r/131582/#review135044

r=me, but please mention in the commit message that it's a temporary workaround
Attachment #8859578 - Flags: review?(dteller) → review+
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ba94dfa2c7f3
temporarily increase threshold for per-compartment timing checks to avoid intermittent orange, r=Yoric
https://hg.mozilla.org/mozilla-central/rev/ba94dfa2c7f3
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell needswork] → [stockwell fixed]
Blocks: 1359002
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: