Closed
Bug 1487778
Opened 6 years ago
Closed 6 years ago
Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))
Categories
(Core :: DOM: Core & HTML, defect, P2)
Tracking
()
RESOLVED
FIXED
mozilla65
People
(Reporter: marcia, Assigned: mayhemer)
References
(Depends on 1 open bug)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(1 file, 3 obsolete files)
1.39 KB,
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is report bp-df5061ee-f87c-4cf5-8c0a-870270180831. ============================================================= Seen while looking at nightly crash data - crashes started using 20180821100053: Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=77433149bfdcc82287acaed94b942818785b556e&tochange=7c96ad3ab6731287646616610e159447c69d6beb. 20 crashes/6 installs so far and Win 10/Win 7 are affected. Top 10 frames of crashing thread: 0 xul.dll void mozilla::dom::TimeoutExecutor::MaybeExecute dom/base/TimeoutExecutor.cpp:172 1 xul.dll mozilla::dom::TimeoutExecutor::Notify dom/base/TimeoutExecutor.cpp:240 2 xul.dll nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:704 3 xul.dll nsTimerEvent::Run xpcom/threads/TimerThread.cpp:297 4 xul.dll mozilla::ThrottledEventQueue::Inner::Executor::Run xpcom/threads/ThrottledEventQueue.cpp:72 5 xul.dll mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:337 6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1167 7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519 8 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:125 9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:318 =============================================================
Comment 1•6 years ago
|
||
Any ideas, Ehsan? It looks like this crash has been happening for a while, in low volume, so perhaps this is just a fluke spike. I looked at the regression range in comment 0, and nothing popped out at me.
Flags: needinfo?(ehsan)
Priority: -- → P2
Comment 2•6 years ago
|
||
This is the diagnostic assert that bug 1378394 added firing: https://hg.mozilla.org/mozilla-central/rev/c0cdf2fd1fdf#l1.45 As far as I can tell (NIing farre to double check), that assertion is essentially checking that our TimeStamp::Now() function is monotonic. Turns out that it is not, on Windows at least. :-( That means the bug is almost certainly somewhere in this function: <https://searchfox.org/mozilla-central/rev/5a18fb5aeeec99f1ca1c36a697082c221189a3b9/mozglue/misc/TimeStamp_windows.cpp#303>. This function computes the difference of two TimeStamp objects and is what operator<= uses under the hood. If that is where the problem is, we need Honza since I believe he is one of the few who understands how that code works...
Blocks: 1378394
Flags: needinfo?(ehsan) → needinfo?(afarre)
Comment 3•6 years ago
|
||
Good catch. I didn't notice that in the crash. I did a search in Socorro for crashes that have this signature. I see a few instances with the signature [@ mozilla::dom::TimeoutManager::RunTimeout ] that look like a Linux-specific version: a677b7e2-ee72-4605-bc84-9379e0180904 I also see one instance of an older crash (bug 1398934), so maybe this is a signature change? b781fef3-bdcc-4f51-aeca-20e680180829
Crash Signature: [@ mozilla::dom::TimeoutExecutor::MaybeExecute]
[@ void mozilla::dom::TimeoutExecutor::MaybeExecute] → [@ mozilla::dom::TimeoutExecutor::MaybeExecute]
[@ void mozilla::dom::TimeoutExecutor::MaybeExecute] [@ mozilla::dom::TimeoutManager::RunTimeout ]
Comment 4•6 years ago
|
||
That should be bp-a677b7e2-ee72-4605-bc84-9379e0180904 and bp-b781fef3-bdcc-4f51-aeca-20e680180829
Comment 5•6 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #4) > That should be > bp-a677b7e2-ee72-4605-bc84-9379e0180904 > and > bp-b781fef3-bdcc-4f51-aeca-20e680180829 More of the same! But the fact that it also happens on Linux throws the theory in the last paragraph of comment 2 out of the window...
Comment 6•6 years ago
|
||
Yeah, that TimeStamp::Now() isn't monotonic is what I concluded as well. The code leading up to the call to TimeoutManager::RescheduleTimeout is clean enough to make that out.
Flags: needinfo?(afarre)
Comment 7•6 years ago
|
||
That is *terrible* if that's true. :-(
Updated•6 years ago
|
Comment 8•6 years ago
|
||
Honza, do you have any clue what's going on here? See esp comment 2
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 9•6 years ago
|
||
Yep, it's on my radar already. I refreshed my memory by looking at the code. There are following 3 scenarios: 1. the processor reports the TSC counter as stable between processor cores that makes us use the QueryPerformanceCounter() (QPC) function w/o any further monotonizing protection since we believe it is monotonic ; we never use GetTickCount64() (GTC) values for the <= and '-' operators then 2. the processor reports the TSC as unstable, but it behaves in a stable way for the whole time of the browser session we still use QueryPerformanceCounter() but we protect it (thread safely) for being monotonic ; we never use GetTickCount64 for difference calculation 3. the processor reports the TSC as unstable and we detect QueryPerformanceCounter() be too erratic -> we switch from using QueryPerformanceCounter to GetTickCount64 at any moment when inside the <= operator this still should be monotonic: we protect QueryPerformanceCounter() and *assume* that GetTickCount64() is monotonic too. if the switch happens between the two lines at [1] we will use values captured by GetTickCount64() in respective moments (note that we ALWAYS capture the GTC value in Now()) Based on that: - if we hit this in scenario 1, QueryPerformanceCounter() jumps regardless the announced TSC stability - if we hit this in scenario 3, GetTickCount64() is not monotonic and we need to use something else or protect it too - we can't hit this in scenario 2, unless there is some bug with synchronization (maybe So, in the changelog I can see we have switched to clang. clang could reveal some mistake in sync we have, but I honestly don't see anything wrong. Note that the 'switch' is defined as: static bool volatile sUseQPC = true; which is likely wrong and should be atomic. still, this flag incorrect atomicity should not effect result of the '-' operator as we base the calculation on result of (presumably) monotonic values. To sum: - let's change the monotonicity forcing for QPC even though the TSC is reported stable to find out if QPC behavior is wrong with TSC stable=1. - if it fixes the issue, I would spend some time on finding out if this could be some clang interference - if not, the bug is likely in gettickcount64 or clang anyway ; hence, let's protect GTC as well and see - if none of it helps, then clang likely does something weird here... [1] https://searchfox.org/mozilla-central/rev/de7676288a78b70d2b9927c79493adbf294faad5/dom/base/TimeoutManager.cpp#826,931
Flags: needinfo?(honzab.moz)
Assignee | ||
Updated•6 years ago
|
Assignee: nobody → honzab.moz
Assignee | ||
Comment 10•6 years ago
|
||
Comment 11•6 years ago
|
||
Comment on attachment 9009564 [details] Bug 1487778 - Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj Nathan Froyd [:froydnj] has approved the revision.
Attachment #9009564 -
Flags: review+
Assignee | ||
Comment 12•6 years ago
|
||
Updated•6 years ago
|
Attachment #9009902 -
Attachment is obsolete: true
Assignee | ||
Updated•6 years ago
|
Attachment #9009564 -
Attachment is obsolete: true
Assignee | ||
Updated•6 years ago
|
Keywords: checkin-needed
Assignee | ||
Updated•6 years ago
|
Keywords: leave-open
Comment 14•6 years ago
|
||
Pushed by ebalazs@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/ada6a14794e9 Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj
Keywords: checkin-needed
Comment 15•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ada6a14794e9
Assignee | ||
Comment 16•6 years ago
|
||
The crash is gone, the patch is currently on Beta, in 64. I can see crash reports for 63 or 64 prior landing the patch only. The last crash on record: https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6-d16c50180920 Ehsan, Nathan, looks like we want to make the monotonicity enforcement become a standard. What do you think? It means an overhead, tho...
Flags: needinfo?(nfroyd)
Flags: needinfo?(ehsan)
Comment 17•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16) > The crash is gone, the patch is currently on Beta, in 64. I can see crash > reports for 63 or 64 prior landing the patch only. > > The last crash on record: > https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6- > d16c50180920 Great! > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > become a standard. What do you think? It means an overhead, tho... Can you please explain what you mean by this? I'm not sure if I understand...
Flags: needinfo?(ehsan)
Comment 18•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16) > The crash is gone, the patch is currently on Beta, in 64. I can see crash > reports for 63 or 64 prior landing the patch only. I am still seeing a few crashes with 64 beta while the patch landed during the 64 nightly cycle. ex: https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8-2dcba0181116
status-firefox64:
--- → affected
status-firefox-esr60:
--- → affected
Assignee | ||
Comment 19•6 years ago
|
||
(In reply to Pascal Chevrel:pascalc from comment #18) > (In reply to Honza Bambas (:mayhemer) from comment #16) > > The crash is gone, the patch is currently on Beta, in 64. I can see crash > > reports for 63 or 64 prior landing the patch only. > > I am still seeing a few crashes with 64 beta while the patch landed during > the 64 nightly cycle. ex: > https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8- > 2dcba0181116 It's a different crash: EXCEPTION_PRIV_INSTRUCTION != MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow) I can see few other types of crashes in this code, but none of them is the one we are dealing with here since comment 2. (In reply to :Ehsan Akhgari from comment #17) > > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > > become a standard. What do you think? It means an overhead, tho... > > Can you please explain what you mean by this? I'm not sure if I > understand... I've landed this change only as a test to confirm the theory: https://bugzilla.mozilla.org/attachment.cgi?id=9009904&action=diff#a/mozglue/misc/TimeStamp_windows.cpp_sec2 We may want to turn it from `if (true || !sHasStableTSC) {` to just running the code all the time (actually, to clean up the code).
Summary: Crash in mozilla::dom::TimeoutExecutor::MaybeExecute → Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))
Comment 20•6 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16) > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > become a standard. What do you think? It means an overhead, tho... If I understand correctly, the monotonicity check is compensating for bad hardware and/or bad software, which seems like a perfectly reasonable thing to keep around long-term.
Flags: needinfo?(nfroyd)
Assignee | ||
Comment 21•6 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #20) > (In reply to Honza Bambas (:mayhemer) from comment #16) > > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > > become a standard. What do you think? It means an overhead, tho... > > If I understand correctly, the monotonicity check is compensating for bad > hardware and/or bad software, which seems like a perfectly reasonable thing > to keep around long-term. Yes, it just adds a synchronization overhead. I will write a patch.
Comment 22•6 years ago
|
||
Yes, makes sense to me as well. Thank you for the extreme diligence here!
Assignee | ||
Comment 23•6 years ago
|
||
Experiment turned to a permanent thing.
Attachment #9009904 -
Attachment is obsolete: true
Attachment #9027481 -
Flags: review?(nfroyd)
Updated•6 years ago
|
Attachment #9027481 -
Flags: review?(nfroyd) → review+
Comment 25•6 years ago
|
||
Pushed by ncsoregi@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c766f7eaa9f9 Protect QueryPerformanceCounter against going backward, r=froydnj
Keywords: checkin-needed
Comment 26•6 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/c766f7eaa9f9
Status: NEW → RESOLVED
Closed: 6 years ago
status-firefox65:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Updated•6 years ago
|
Comment 27•6 years ago
|
||
Please nominate this for Beta approval when you get a chance.
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 28•6 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #27) > Please nominate this for Beta approval when you get a chance. This patch this bug has eventually been fixed with is only a cleanup. We already have exactly the same behavior on Beta. No need to uplift.
Flags: needinfo?(honzab.moz)
Updated•6 years ago
|
Comment 29•6 years ago
|
||
Honza, we were discussing this issue over lunch today. We were wondering if instead of doing this as a global fix, we should address this only for the CPUs that were affected by the original problem. Looking at this aggregate view of the original crashers we saw here <https://crash-stats.mozilla.com/signature/?product=Firefox&reason=~EXCEPTION_BREAKPOINT&signature=mozilla%3A%3Adom%3A%3ATimeoutExecutor%3A%3AMaybeExecute&date=%3E%3D2018-06-01T08%3A51%3A00.000Z&date=%3C2018-12-18T07%3A51%3A27.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#aggregations>, it seems that the majority of the crashers are coming from "family 21 model 2" CPUs. Some googling around shows that this CPU model belongs to some AMD chips (e.g. https://en.wikichip.org/wiki/WikiChip:model_numbers) which is consistent with the manufacturer of chips where we've seen readtsc instability on before. Perhaps we should restrict the fix here only to family 21 model 2 CPUs?
Flags: needinfo?(odvarko)
Comment 30•6 years ago
|
||
I think you want :mayhemer, not :Honza.
Assignee | ||
Comment 31•6 years ago
|
||
Yes, this seems reasonable and I believe I've seen something similar in chromium. My intention was to have a general fix, but that apparently is not good enough. Thanks for digging so much into this! Are you willing to file a bug for this?
Flags: needinfo?(odvarko)
Comment 32•6 years ago
|
||
Yes, certainly! Bug 1515155. (Apologies for mistaking you...)
Depends on: 1515155
Updated•6 years ago
|
Component: DOM → DOM: Core & HTML
You need to log in
before you can comment on or make changes to this bug.
Description
•