Intermittent e10s toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Sanity check (parent process: 0): totalSystemTime of component is not impossibly larger than that of process: 64097 <= 61005 - false == true - JS frame :: chrome:/

RESOLVED FIXED in Firefox 55

Status

()

Toolkit
Performance Monitoring
P5
normal
RESOLVED FIXED
9 months ago
2 months ago

People

(Reporter: aryx, Assigned: Gijs)

Tracking

(Blocks: 1 bug, {intermittent-failure})

Trunk
mozilla55
intermittent-failure
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [stockwell fixed])

MozReview Requests

()

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

+++ This bug was initially created as a clone of Bug #1230018 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=87991592&repo=autoland

16:41:45     INFO - TEST-PASS | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Searching by title, the frames don't show up in the list of components - true == true - 
16:41:45     INFO - Searching for window title 'Main frame for test browser_compartments.js 0.845897513513775' in [] (I hope to find it)
16:41:45     INFO - Searching by title, we didn't find the main frame
16:41:45     INFO - Console message: [JavaScript Error: "The character encoding of the HTML document was not declared. The document will render with garbled text in some browser configurations if the document contains characters from outside the US-ASCII range. The character encoding of the page must be declared in the document or in the transfer protocol." {file: "http://example.com/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.html?test=0.0453446058246062" line: 0}]
16:41:45     INFO - Setting titles
16:41:45     INFO - Buffered messages finished
16:41:45     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Sanity check (parent process: 0): totalSystemTime of component is not impossibly larger than that of process: 64097 <= 61005 - false == true - JS frame :: chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js :: ok :: line 85
16:41:45     INFO - Stack trace:
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:ok:85
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:leq:88
16:41:45     INFO -     chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:monotinicity_tester/frameCheck<:168
16:41:45     INFO -     setInterval handler*monotinicity_tester@chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:206:18
16:41:45     INFO -     test@chrome://mochitests/content/browser/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js:234:5
16:41:45     INFO -     TaskImpl_run@resource://gre/modules/Task.jsm:319:42
16:41:45     INFO -     process@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:922:23
16:41:45     INFO -     walkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:806:7
16:41:45     INFO -     Promise*scheduleWalkerLoop@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:739:11
16:41:45     INFO -     schedulePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:770:7
16:41:45     INFO -     completePromise@resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:707:7
16:41:45     INFO -     receiveMessage@resource://testing-common/ContentTask.jsm:113:9
16:41:45     INFO - Titles set
16:41:45     INFO - Searching for frame title 'Subframe for test browser_compartments.js 0.9269153926364134' in [] (I hope not to find it)
16:41:45     INFO - TEST-PASS | toolkit/components/perfmonitoring/tests/browser/browser_compartments.js | Searching by title, the frames don't show up in the list of components - true == true -

Comment 1

8 months ago
6 failures in 845 pushes (0.007 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 3
* mozilla-inbound: 2
* mozilla-central: 1

Platform breakdown:
* osx-10-10: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-03-27&endday=2017-04-02&tree=all
48 failures in the last week so far...this picked up frequency, all on osx debug :)
Whiteboard: [stockwell needswork]
yay for someone retriggering:
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=osx%20debug%20mochitest-e10s-browser-chrome&tochange=77f5615436bb7be1279c367eb188bb5c3c3c6c97&fromchange=f6bba83342c4a2bf8888b623f40c0e8aaf38a0f7&group_state=expanded&selectedJob=89065460

this looks to be related to bug 1310150 based on 3 failures in 16 runs and 0 failures in 16 runs on the 3 previous commits.

:mattn, do you have ideas on why this would be happening on your change which looks to be eslint related comments?
Blocks: 1310150
Flags: needinfo?(MattN+bmo)
(Assignee)

Comment 4

8 months ago
(In reply to Joel Maher ( :jmaher) from comment #3)
> yay for someone retriggering:
> https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-
> searchStr=osx%20debug%20mochitest-e10s-browser-
> chrome&tochange=77f5615436bb7be1279c367eb188bb5c3c3c6c97&fromchange=f6bba8334
> 2c4a2bf8888b623f40c0e8aaf38a0f7&group_state=expanded&selectedJob=89065460
> 
> this looks to be related to bug 1310150 based on 3 failures in 16 runs and 0
> failures in 16 runs on the 3 previous commits.
> 
> :mattn, do you have ideas on why this would be happening on your change
> which looks to be eslint related comments?

This makes no sense, there are no code changes in this commit. :-\

If I had to guess, I would have thought this was bug 1309946, which is a few revs down from there and actually looks related.
I agree with Gijs' analysis.
No longer blocks: 1310150
Flags: needinfo?(MattN+bmo)
let me do more retriggers- I was confused by this as well.
:gijs, you were right!  bug 1309946 looks to be the culprit and it makes sense since the test was adjusted there.  Any thoughts on how to fix this?
Blocks: 1309946
Flags: needinfo?(gijskruitbosch+bugs)

Comment 8

8 months ago
54 failures in 867 pushes (0.062 failures/push) were associated with this bug in the last 7 days. 

This is the #30 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 26
* mozilla-inbound: 19
* try: 3
* mozilla-central: 3
* graphics: 2
* oak: 1

Platform breakdown:
* osx-10-10: 54

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-04-03&endday=2017-04-09&tree=all
(Assignee)

Comment 9

8 months ago
This assert fails:

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/perfmonitoring/tests/browser/browser_compartments.js#169

The code looks like this:

    // Sanity check on components data.
    let map = new Map();
    for (let item of snapshot.componentsData) {
      for (let [probe, k] of [
        ["jank", "totalUserTime"],
        ["jank", "totalSystemTime"],
        ["cpow", "totalCPOWTime"]
      ]) {
        // Note that we cannot expect components data to be always smaller
        // than process data, as `getrusage` & co are not monotonic.
        SilentAssert.leq(item[probe][k], 3 * snapshot.processData[probe][k],
          `Sanity check (${name}): ${k} of component is not impossibly larger than that of process`);
      }
...

I don't understand why this would ever happen. Feels like a bug to me. How can the totalSystemTime we measure for a single component be larger than 3 times the totalSystemTime we measure for the entire process? Yoric?
Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(dteller)

Comment 10

8 months ago
15 failures in 154 pushes (0.097 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 10
* mozilla-inbound: 3
* mozilla-central: 1
* graphics: 1

Platform breakdown:
* osx-10-10: 15

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-04-12&endday=2017-04-12&tree=all
(In reply to :Gijs from comment #9)
> I don't understand why this would ever happen. Feels like a bug to me. How
> can the totalSystemTime we measure for a single component be larger than 3
> times the totalSystemTime we measure for the entire process? Yoric?

... because time measures are not reliable and sometimes, time flows backwards. Of course, different kinds of times may flow in different directions at the same time. Fun, isn't it? The work on performance monitoring has been a non-stop nightmare dealing with this kind of issues.
Flags: needinfo?(dteller)
(Assignee)

Comment 12

8 months ago
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #11)
> (In reply to :Gijs from comment #9)
> > I don't understand why this would ever happen. Feels like a bug to me. How
> > can the totalSystemTime we measure for a single component be larger than 3
> > times the totalSystemTime we measure for the entire process? Yoric?
> 
> ... because time measures are not reliable and sometimes, time flows
> backwards. Of course, different kinds of times may flow in different
> directions at the same time. Fun, isn't it? The work on performance
> monitoring has been a non-stop nightmare dealing with this kind of issues.

Uh, OK... should we just disable the test? Is there something we can do to debug? Under what circumstances does time (according to the perf monitoring system / computer) flow backwards?
Flags: needinfo?(dteller)

Comment 13

8 months ago
18 failures in 205 pushes (0.088 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* autoland: 11
* mozilla-inbound: 6
* mozilla-central: 1

Platform breakdown:
* osx-10-10: 18

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-04-13&endday=2017-04-13&tree=all
(In reply to :Gijs (away until Tuesday 18) from comment #12)
> Uh, OK... should we just disable the test? Is there something we can do to
> debug? Under what circumstances does time (according to the perf monitoring
> system / computer) flow backwards?

Unspecified by the OS, but typically, each CPU (or even each core, depending on the architecture) has a different clock, so this may happen whenever the OS reschedules the thread/process to another CPU/core. Also, I believe the clocks readjust themselves to each other from time to time, unpredictably. So, in practice, whenever.
Flags: needinfo?(dteller)

Comment 15

8 months ago
66 failures in 894 pushes (0.074 failures/push) were associated with this bug in the last 7 days. 

This is the #18 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* autoland: 40
* mozilla-inbound: 17
* graphics: 5
* mozilla-central: 4

Platform breakdown:
* osx-10-10: 66

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-04-10&endday=2017-04-16&tree=all
ouch, this is failing a lot more, I will let :gijs get back and caught up- will ping on Wednesday/Thursday to see if we should disable this temporarily or finish up a fix.
(Assignee)

Updated

8 months ago
Assignee: nobody → gijskruitbosch+bugs
(Assignee)

Comment 17

8 months ago
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #14)
> (In reply to :Gijs (away until Tuesday 18) from comment #12)
> > Uh, OK... should we just disable the test? Is there something we can do to
> > debug? Under what circumstances does time (according to the perf monitoring
> > system / computer) flow backwards?
> 
> Unspecified by the OS, but typically, each CPU (or even each core, depending
> on the architecture) has a different clock, so this may happen whenever the
> OS reschedules the thread/process to another CPU/core. Also, I believe the
> clocks readjust themselves to each other from time to time, unpredictably.
> So, in practice, whenever.

Actually, I think the problem here is with the code and/or test, not with clocks. AIUI the snapshot code includes both parent and child process time, here: 

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/perfmonitoring/PerformanceStats.jsm#847-860

Which would explain why the time for any given component in the child process could easily surpass the total time for the parent process, because they're in different processes. So if the parent is idle and the child isn't, the child time is going to be (much) larger.

I don't think we can make meaningful assertions about the totalSystemTime of the parent process against the totalSystemTime of a component in the child process. Am I missing something?
Flags: needinfo?(dteller)
While the clock issues exist, it's entirely possible that I missed an unrelated problem. I don't have time to check today, but I'll try to check by the end of the week.
Flags: needinfo?(dteller)
Comment hidden (mozreview-request)
(Assignee)

Comment 20

8 months ago
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #18)
> While the clock issues exist, it's entirely possible that I missed an
> unrelated problem. I don't have time to check today, but I'll try to check
> by the end of the week.

OK, but given the failure rate, we should kill the intermittent here and that will give us breathing space to look into this properly. I'll put up a patch to simply bump the limit and ask you for a rubberstamp.
Thanks for working on this bug, I think a temporary fix/disabling this week sounds like a reasonable solution so we can fix this right.

Comment 22

8 months ago
mozreview-review
Comment on attachment 8859578 [details]
Bug 1352664 - temporarily increase threshold for per-compartment timing checks to avoid intermittent orange,

https://reviewboard.mozilla.org/r/131582/#review135044

r=me, but please mention in the commit message that it's a temporary workaround
Attachment #8859578 - Flags: review?(dteller) → review+
Comment hidden (mozreview-request)

Comment 24

8 months ago
Pushed by gijskruitbosch@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/ba94dfa2c7f3
temporarily increase threshold for per-compartment timing checks to avoid intermittent orange, r=Yoric

Comment 25

8 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ba94dfa2c7f3
Status: NEW → RESOLVED
Last Resolved: 8 months ago
status-firefox55: affected → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55

Comment 26

8 months ago
26 failures in 817 pushes (0.032 failures/push) were associated with this bug in the last 7 days. 

This is the #42 most frequent failure this week.  

Repository breakdown:
* autoland: 14
* mozilla-inbound: 10
* mozilla-central: 2

Platform breakdown:
* osx-10-10: 26

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-04-17&endday=2017-04-23&tree=all
Whiteboard: [stockwell needswork] → [stockwell fixed]
(Assignee)

Updated

8 months ago
Blocks: 1359002

Comment 27

6 months ago
1 failures in 814 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 1

Platform breakdown:
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-06-12&endday=2017-06-18&tree=all

Comment 28

3 months ago
1 failures in 939 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 1

Platform breakdown:
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-08-28&endday=2017-09-03&tree=all

Comment 29

2 months ago
1 failures in 947 pushes (0.001 failures/push) were associated with this bug in the last 7 days.    

Repository breakdown:
* mozilla-esr52: 1

Platform breakdown:
* osx-10-10: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1352664&startday=2017-10-09&endday=2017-10-15&tree=all
You need to log in before you can comment on or make changes to this bug.