Closed Bug 1447643 Opened 2 years ago Closed 11 months ago

Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase

Categories

(Core :: JavaScript: GC, defect, P5, critical)

defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: kats)

References

(Depends on 1 open bug)

Details

(Keywords: crash, intermittent-failure)

Crash Data

Attachments

(1 file)

Filed by: ncsoregi [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=169407754&repo=mozilla-central

https://queue.taskcluster.net/v1/task/AcyqZ_9NTuezoIto7Jn5oQ/runs/0/artifacts/public/logs/live_backing.log

11:25:37     INFO -  TEST-START | xpcshell.ini:toolkit/components/extensions/test/xpcshell/test_ext_content_security_policy.js
11:25:38     INFO -  TEST-PASS | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_contentscript.js | took 244944ms
11:25:38     INFO -  mozcrash Saved minidump as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp
11:25:38     INFO -  mozcrash Saved app info as Z:\task_1521630280\build\blobber_upload_dir\6df75ff5-3f80-4ff9-8612-6391034909ff.extra
11:25:38  WARNING -  PROCESS-CRASH | xpcshell-remote.ini:toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const)]
11:25:38     INFO -  Crash dump filename: c:\users\genericworker\appdata\local\temp\xpc-other-nrhzta\6df75ff5-3f80-4ff9-8612-6391034909ff.dmp
11:25:38     INFO -  Operating system: Windows NT
11:25:38     INFO -                    10.0.15063
11:25:38     INFO -  CPU: amd64
11:25:38     INFO -       family 6 model 63 stepping 2
11:25:38     INFO -       8 CPUs
11:25:38     INFO -  GPU: UNKNOWN
11:25:38     INFO -  Crash reason:  EXCEPTION_BREAKPOINT
11:25:38     INFO -  Crash address: 0x7fffca0d8c39
11:25:38     INFO -  Assertion: Unknown assertion type 0x00000000
11:25:38     INFO -  Process uptime: 27 seconds
11:25:38     INFO -  Thread 0 (crashed)
11:25:38     INFO -   0  xul.dll!static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase::LIMIT,mozilla::BaseTimeDuration<mozilla::TimeDurationValueCalculator> > & const) [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 873 + 0x0]
11:25:38     INFO -      rax = 0x0000000000000000   rdx = 0x0000000000000000
11:25:38     INFO -      rcx = 0x00000000ffffffff   rbx = 0x00007fffd7a1a989
11:25:38     INFO -      rsi = 0x00007fffd7930979   rdi = 0x00007ff801cf8d20
11:25:38     INFO -      rbp = 0x00007fffdca316a0   rsp = 0x000000793c3fdc80
11:25:38     INFO -       r8 = 0x000000793c3fc638    r9 = 0x0000023204e38183
11:25:38     INFO -      r10 = 0x0000000000000000   r11 = 0x000000793c3fdb00
11:25:38     INFO -      r12 = 0x000000793c3fdfb8   r13 = 0x000000000000001b
11:25:38     INFO -      r14 = 0x00007fffdca2af60   r15 = 0x0000023205b55570
11:25:38     INFO -      rip = 0x00007fffca0d8c39
11:25:38     INFO -      Found by: given as instruction pointer in context
11:25:38     INFO -   1  xul.dll!js::gcstats::Statistics::endSlice() [Statistics.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1049 + 0x9]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe2c0   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffca0d7bf8
11:25:38     INFO -      Found by: call frame info
11:25:38     INFO -   2  xul.dll!js::gc::GCRuntime::gcCycle(bool,js::SliceBudget &,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7407 + 0x32]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe3a0   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffc9fc9659
11:25:38     INFO -      Found by: call frame info
11:25:38     INFO -   3  xul.dll!js::gc::GCRuntime::collect(bool,js::SliceBudget,JS::gcreason::Reason) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7538 + 0x14]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe4b0   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffc9fcb754
11:25:38     INFO -      Found by: call frame info
11:25:38     INFO -   4  xul.dll!js::gc::GCRuntime::gcSlice(JS::gcreason::Reason,__int64) [GC.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 7627 + 0x2e]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe610   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffc9fcc2dc
11:25:38     INFO -      Found by: call frame info
11:25:38     INFO -   5  xul.dll!nsJSContext::GarbageCollectNow(JS::gcreason::Reason,nsJSContext::IsIncremental,nsJSContext::IsShrinking,__int64) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1225 + 0xd]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe680   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffc06cfba8
11:25:38     INFO -      Found by: call frame info
11:25:38     INFO -   6  xul.dll!InterSliceGCRunnerFired(mozilla::TimeStamp,void *) [nsJSEnvironment.cpp:f4ddf30ecf57e0d4bc5dac5707c6c153ffb9c726 : 1834 + 0x1d]
11:25:38     INFO -      rbx = 0x00007fffd7a1a989   rbp = 0x00007fffdca316a0
11:25:38     INFO -      rsp = 0x000000793c3fe720   r12 = 0x000000793c3fdfb8
11:25:38     INFO -      r13 = 0x000000000000001b   r14 = 0x00007fffdca2af60
11:25:38     INFO -      r15 = 0x0000023205b55570   rip = 0x00007fffc06d96c2
11:25:38     INFO -      Found by: call frame info
Pretty sure this isn't actually the right component, but its less wrong than Add-ons Manager
Component: Add-ons Manager → WebExtensions: Request Handling
This is a GC assertion failure:

11:25:39     INFO -  PID 8960 | Parent Sweep time = 187.338ms with 2.396ms remaining, child Sweep Shape time 2.887ms
11:25:39     INFO -  PID 8960 | Assertion failure: ok (Inconsistent time data), at z:/build/build/src/js/src/gc/Statistics.cpp:873
Component: WebExtensions: Request Handling → JavaScript: GC
Product: Toolkit → Core
Duplicate of this bug: 1450175
Duplicate of this bug: 1446149
Duplicate of this bug: 1451905
Duplicate of this bug: 1452217
Duplicate of this bug: 1434237
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → INCOMPLETE
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: Last yearLast year
Resolution: --- → INCOMPLETE
Summary: Intermittent toolkit/components/extensions/test/xpcshell/test_ext_webRequest_auth.js | application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase → Intermittent application crashed [@ static js::gcstats::PhaseKind LongestPhaseSelfTimeInMajorGC(const class mozilla::EnumeratedArray<js::gcstats::Phase,js::gcstats::Phase
Duplicate of this bug: 1489396
Duplicate of this bug: 1494106
Duplicate of this bug: 1495536
Duplicate of this bug: 1502261
Duplicate of this bug: 1502504
Duplicate of this bug: 1503090
Comment on attachment 9020763 [details] [diff] [review]
Drop assertion

Redirecting to GC triage owner due to lack of movement
Attachment #9020763 - Flags: review?(sphink) → review?(jcoppeard)
Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and is also showing up in some of the logs for these failures.
See Also: → 1306896
Comment on attachment 9020763 [details] [diff] [review]
Drop assertion

Review of attachment 9020763 [details] [diff] [review]:
-----------------------------------------------------------------

I don't mind turning off this assertion for now, but I'd rather we only did this on Windows (nearly every failure is on Windows).

If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN ?
Attachment #9020763 - Flags: review?(jcoppeard) → review+
(In reply to Jon Coppeard (:jonco) from comment #49)
> Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> is also showing up in some of the logs for these failures.

Indeed, that's a good point. I took a quick look at that assertion failure and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing non-monotonic values, but I don't see how that would happen since the low-res time just comes from GetTickCount64 on Windows, which should be monotonic.

(In reply to Jon Coppeard (:jonco) from comment #50)
> I don't mind turning off this assertion for now, but I'd rather we only did
> this on Windows (nearly every failure is on Windows).
> 
> If you want to do this, can you instead wrap the assertion in #ifndef XP_WIN
> ?

Thanks, will do.
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c916195bb47d
Remove frequently failing assertion. r=jonco
https://hg.mozilla.org/mozilla-central/rev/c916195bb47d
Status: REOPENED → RESOLVED
Closed: Last year11 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51)
> (In reply to Jon Coppeard (:jonco) from comment #49)
> > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> > is also showing up in some of the logs for these failures.
> 
> Indeed, that's a good point. I took a quick look at that assertion failure
> and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing
> non-monotonic values, but I don't see how that would happen since the
> low-res time just comes from GetTickCount64 on Windows, which should be
> monotonic.

Yeah, that was the purpose of this assertion. The GC statistics code doesn't actually care that much about this problem (it'll result in a smattering of weird telemetry results, but that's about it). But I had heard that other code also depends on this timer being monotonic, and the belief was that all remaining issues (all on Windows) had been fixed. So I added an assert to help track it down. When I saw that it was firing, I also found that the logic was pretty complex, so it couldn't immediately be assumed to be the clock going backward. I added a bunch of information to make that easier to track down, but then never circled back to really look at it. I'm going to needinfo Nathan here in case this is of use to him; otherwise, it may as well just be removed where it's problematic (#ifdef XP_WIN).

nfroyd, there should be a bunch of log files with fairly verbose diagnostic output that at least tries to point a finger at what's happening.
Flags: needinfo?(nfroyd)
(In reply to Steve Fink [:sfink] [:s:] from comment #57)
> nfroyd, there should be a bunch of log files with fairly verbose diagnostic
> output that at least tries to point a finger at what's happening.

Where are those log files?

This sounds vaguely like bug 1487778, but we "fixed" that two months ago by adding monotonicity guards in the TimeStamp code.  So I'm not completely sure what the next step is.  Honza fixed the aforementioned bug, perhaps the log files might inspire a similar fix someplace else?
Flags: needinfo?(nfroyd)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #51)
> (In reply to Jon Coppeard (:jonco) from comment #49)
> > Bug 1306896 seems like the same kind of issue (inconsistent timestamps) and
> > is also showing up in some of the logs for these failures.
> 
> Indeed, that's a good point. I took a quick look at that assertion failure
> and left a comment in the bug. AFAICT it's likely TimeStamp::Now() producing
> non-monotonic values, but I don't see how that would happen since the
> low-res time just comes from GetTickCount64 on Windows, which should be
> monotonic.

As Nathan points in comment 58, I've only fixed monotonicity of QueryPerformanceCounter (the hi-res version) = TimeStamp::Now.    The low res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic on a single thread.  It's a software based counter, not HW, and is under control of the OS.
(In reply to Nathan Froyd [:froydnj] from comment #58)
> (In reply to Steve Fink [:sfink] [:s:] from comment #57)
> > nfroyd, there should be a bunch of log files with fairly verbose diagnostic
> > output that at least tries to point a finger at what's happening.
> 
> Where are those log files?

Bug 1400153 (the log messages all have that bug number in them now).

> This sounds vaguely like bug 1487778, but we "fixed" that two months ago by
> adding monotonicity guards in the TimeStamp code.  So I'm not completely
> sure what the next step is.  Honza fixed the aforementioned bug, perhaps the
> log files might inspire a similar fix someplace else?

(In reply to Honza Bambas (:mayhemer) from comment #59)
> As Nathan points in comment 58, I've only fixed monotonicity of
> QueryPerformanceCounter (the hi-res version) = TimeStamp::Now.    The low
> res version (TimeStamp::NowLoRes) is using GTC64, which AFAIK is monotonic
> on a single thread.  It's a software based counter, not HW, and is under
> control of the OS.

Hm. It looks like we're using a function called ReallyNow(), to bypass record-and-replay, which calls NowUnfuzzed(), which calls NowUnfuzzed(aHighResolution=true), which on Windows calls NowInternal(aHighResolution=true), which does something depending on sUseQPC (that's probably your stuff?)

Ok, so looking at an example log https://treeherder.mozilla.org/logviewer.html#?job_id=209485576&repo=mozilla-inbound&lineNumber=70490 I see

Parent Mark During Sweeping time = 499.000ms with 498.890ms remaining, child Mark Gray time 499.000ms
Assertion failure: ok (Inconsistent time data; see bug 1400153), at z:/build/build/src/js/src/gc/Statistics.cpp:981

That first line comes from

    if (selfTimes[parent] < childTime) {
        fprintf(stderr,
                "Parent %s time = %.3fms with %.3fms remaining, child %s time %.3fms\n",
                phases[parent].name,
                times[parent].ToMilliseconds(),
                selfTimes[parent].ToMilliseconds(),
                phases[child].name,
                childTime.ToMilliseconds());
    }

which means that we have a little piece of the time recording tree where the parent node (labeled "Mark During Sweeping") was calculated to take 499.000ms total (end timestamp minus start timestamp), but it has a child "Mark Gray" that took all of that 499.000ms plus at least one other child that took 1.11ms. Children are actions that run within the scope of their parents (as in, a parent starts, then a child starts and eventually ends, perhaps more children run, then finally the parent ends -- but the whole thing *can* be repeated and the totals accumulated, though the inequalities should still hold.) SWEEP_MARK ("Mark During Sweeping") has a whole bunch of potential children https://searchfox.org/mozilla-central/source/js/src/gc/GenerateStatsPhases.py#103 but we're noticing the imbalance here on the 2nd to last possible child.

This is a little indirect. For other assertion sites, I have more specific logging. Let me look at some others... huh. Every single one of them is detected in the same place (line 981). I would have expected, or at least hoped, that a real problem would show up in one of the other sites... and since you disabled this specific check for Windows, I see no bug 1400153 failures in the last 21 days. So it is possible that this check (currently on line 981) is bogus, given that none of the other checks are seeing anything? This may be bad logic after all. Though I don't know why it would only happen on Windows?

The possibility of having multiple runs of these phases is annoying, because it makes it harder to simply record start and end times and report them in these crashes. I have an assertion that when a parent ends, it checks to be sure its recorded end is >= all of its children's. And the same for start times. I don't understand how this can happen if those constraints hold.

Wow, this log https://treeherder.mozilla.org/logviewer.html#?job_id=207654318&repo=autoland&lineNumber=5395 is off by 8ms! All these failures are in LongestPhaseSelfTimeInMajorGC(), which is called for both the regular case and a funky case where we're recording parallel times in another thread. But that 8ms case is the non-parallel one.

If I really wanted to track this down, I guess I'd create an in-memory rolling log of all start and end times and dump it out on failure. Sadly, I'm not feeling that motivated at the moment; the code already handles these failures gracefully in non-debug builds, and it doesn't happen often enough to be worth a lot of effort.
You need to log in before you can comment on or make changes to this bug.