Closed Bug 1487778 Opened 6 years ago Closed 6 years ago

Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))

Categories

(Core :: DOM: Core & HTML, defect, P2)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: marcia, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is report bp-df5061ee-f87c-4cf5-8c0a-870270180831. ============================================================= Seen while looking at nightly crash data - crashes started using 20180821100053: Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=77433149bfdcc82287acaed94b942818785b556e&tochange=7c96ad3ab6731287646616610e159447c69d6beb. 20 crashes/6 installs so far and Win 10/Win 7 are affected. Top 10 frames of crashing thread: 0 xul.dll void mozilla::dom::TimeoutExecutor::MaybeExecute dom/base/TimeoutExecutor.cpp:172 1 xul.dll mozilla::dom::TimeoutExecutor::Notify dom/base/TimeoutExecutor.cpp:240 2 xul.dll nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:704 3 xul.dll nsTimerEvent::Run xpcom/threads/TimerThread.cpp:297 4 xul.dll mozilla::ThrottledEventQueue::Inner::Executor::Run xpcom/threads/ThrottledEventQueue.cpp:72 5 xul.dll mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:337 6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1167 7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519 8 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:125 9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:318 =============================================================
Any ideas, Ehsan? It looks like this crash has been happening for a while, in low volume, so perhaps this is just a fluke spike. I looked at the regression range in comment 0, and nothing popped out at me.
Flags: needinfo?(ehsan)
Priority: -- → P2
This is the diagnostic assert that bug 1378394 added firing: https://hg.mozilla.org/mozilla-central/rev/c0cdf2fd1fdf#l1.45 As far as I can tell (NIing farre to double check), that assertion is essentially checking that our TimeStamp::Now() function is monotonic. Turns out that it is not, on Windows at least. :-( That means the bug is almost certainly somewhere in this function: <https://searchfox.org/mozilla-central/rev/5a18fb5aeeec99f1ca1c36a697082c221189a3b9/mozglue/misc/TimeStamp_windows.cpp#303>. This function computes the difference of two TimeStamp objects and is what operator<= uses under the hood. If that is where the problem is, we need Honza since I believe he is one of the few who understands how that code works...
Blocks: 1378394
Flags: needinfo?(ehsan) → needinfo?(afarre)
Good catch. I didn't notice that in the crash. I did a search in Socorro for crashes that have this signature. I see a few instances with the signature [@ mozilla::dom::TimeoutManager::RunTimeout ] that look like a Linux-specific version: a677b7e2-ee72-4605-bc84-9379e0180904 I also see one instance of an older crash (bug 1398934), so maybe this is a signature change? b781fef3-bdcc-4f51-aeca-20e680180829
Crash Signature: [@ mozilla::dom::TimeoutExecutor::MaybeExecute] [@ void mozilla::dom::TimeoutExecutor::MaybeExecute] → [@ mozilla::dom::TimeoutExecutor::MaybeExecute] [@ void mozilla::dom::TimeoutExecutor::MaybeExecute] [@ mozilla::dom::TimeoutManager::RunTimeout ]
(In reply to Andrew McCreight [:mccr8] from comment #4) > That should be > bp-a677b7e2-ee72-4605-bc84-9379e0180904 > and > bp-b781fef3-bdcc-4f51-aeca-20e680180829 More of the same! But the fact that it also happens on Linux throws the theory in the last paragraph of comment 2 out of the window...
Yeah, that TimeStamp::Now() isn't monotonic is what I concluded as well. The code leading up to the call to TimeoutManager::RescheduleTimeout is clean enough to make that out.
Flags: needinfo?(afarre)
That is *terrible* if that's true. :-(
Honza, do you have any clue what's going on here? See esp comment 2
Flags: needinfo?(honzab.moz)
Yep, it's on my radar already. I refreshed my memory by looking at the code. There are following 3 scenarios: 1. the processor reports the TSC counter as stable between processor cores that makes us use the QueryPerformanceCounter() (QPC) function w/o any further monotonizing protection since we believe it is monotonic ; we never use GetTickCount64() (GTC) values for the <= and '-' operators then 2. the processor reports the TSC as unstable, but it behaves in a stable way for the whole time of the browser session we still use QueryPerformanceCounter() but we protect it (thread safely) for being monotonic ; we never use GetTickCount64 for difference calculation 3. the processor reports the TSC as unstable and we detect QueryPerformanceCounter() be too erratic -> we switch from using QueryPerformanceCounter to GetTickCount64 at any moment when inside the <= operator this still should be monotonic: we protect QueryPerformanceCounter() and *assume* that GetTickCount64() is monotonic too. if the switch happens between the two lines at [1] we will use values captured by GetTickCount64() in respective moments (note that we ALWAYS capture the GTC value in Now()) Based on that: - if we hit this in scenario 1, QueryPerformanceCounter() jumps regardless the announced TSC stability - if we hit this in scenario 3, GetTickCount64() is not monotonic and we need to use something else or protect it too - we can't hit this in scenario 2, unless there is some bug with synchronization (maybe So, in the changelog I can see we have switched to clang. clang could reveal some mistake in sync we have, but I honestly don't see anything wrong. Note that the 'switch' is defined as: static bool volatile sUseQPC = true; which is likely wrong and should be atomic. still, this flag incorrect atomicity should not effect result of the '-' operator as we base the calculation on result of (presumably) monotonic values. To sum: - let's change the monotonicity forcing for QPC even though the TSC is reported stable to find out if QPC behavior is wrong with TSC stable=1. - if it fixes the issue, I would spend some time on finding out if this could be some clang interference - if not, the bug is likely in gettickcount64 or clang anyway ; hence, let's protect GTC as well and see - if none of it helps, then clang likely does something weird here... [1] https://searchfox.org/mozilla-central/rev/de7676288a78b70d2b9927c79493adbf294faad5/dom/base/TimeoutManager.cpp#826,931
Flags: needinfo?(honzab.moz)
Assignee: nobody → honzab.moz
Comment on attachment 9009564 [details] Bug 1487778 - Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj Nathan Froyd [:froydnj] has approved the revision.
Attachment #9009564 - Flags: review+
Attachment #9009902 - Attachment is obsolete: true
Attachment #9009564 - Attachment is obsolete: true
Attached patch v1 (obsolete) — Splinter Review
Good old way!
Attachment #9009904 - Flags: review+
Keywords: checkin-needed
Keywords: leave-open
Pushed by ebalazs@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/ada6a14794e9 Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj
Keywords: checkin-needed
The crash is gone, the patch is currently on Beta, in 64. I can see crash reports for 63 or 64 prior landing the patch only. The last crash on record: https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6-d16c50180920 Ehsan, Nathan, looks like we want to make the monotonicity enforcement become a standard. What do you think? It means an overhead, tho...
Flags: needinfo?(nfroyd)
Flags: needinfo?(ehsan)
(In reply to Honza Bambas (:mayhemer) from comment #16) > The crash is gone, the patch is currently on Beta, in 64. I can see crash > reports for 63 or 64 prior landing the patch only. > > The last crash on record: > https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6- > d16c50180920 Great! > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > become a standard. What do you think? It means an overhead, tho... Can you please explain what you mean by this? I'm not sure if I understand...
Flags: needinfo?(ehsan)
(In reply to Honza Bambas (:mayhemer) from comment #16) > The crash is gone, the patch is currently on Beta, in 64. I can see crash > reports for 63 or 64 prior landing the patch only. I am still seeing a few crashes with 64 beta while the patch landed during the 64 nightly cycle. ex: https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8-2dcba0181116
(In reply to Pascal Chevrel:pascalc from comment #18) > (In reply to Honza Bambas (:mayhemer) from comment #16) > > The crash is gone, the patch is currently on Beta, in 64. I can see crash > > reports for 63 or 64 prior landing the patch only. > > I am still seeing a few crashes with 64 beta while the patch landed during > the 64 nightly cycle. ex: > https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8- > 2dcba0181116 It's a different crash: EXCEPTION_PRIV_INSTRUCTION != MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow) I can see few other types of crashes in this code, but none of them is the one we are dealing with here since comment 2. (In reply to :Ehsan Akhgari from comment #17) > > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > > become a standard. What do you think? It means an overhead, tho... > > Can you please explain what you mean by this? I'm not sure if I > understand... I've landed this change only as a test to confirm the theory: https://bugzilla.mozilla.org/attachment.cgi?id=9009904&action=diff#a/mozglue/misc/TimeStamp_windows.cpp_sec2 We may want to turn it from `if (true || !sHasStableTSC) {` to just running the code all the time (actually, to clean up the code).
Summary: Crash in mozilla::dom::TimeoutExecutor::MaybeExecute → Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))
(In reply to Honza Bambas (:mayhemer) from comment #16) > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > become a standard. What do you think? It means an overhead, tho... If I understand correctly, the monotonicity check is compensating for bad hardware and/or bad software, which seems like a perfectly reasonable thing to keep around long-term.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #20) > (In reply to Honza Bambas (:mayhemer) from comment #16) > > Ehsan, Nathan, looks like we want to make the monotonicity enforcement > > become a standard. What do you think? It means an overhead, tho... > > If I understand correctly, the monotonicity check is compensating for bad > hardware and/or bad software, which seems like a perfectly reasonable thing > to keep around long-term. Yes, it just adds a synchronization overhead. I will write a patch.
Yes, makes sense to me as well. Thank you for the extreme diligence here!
Attached patch v2 (protect QPC)Splinter Review
Experiment turned to a permanent thing.
Attachment #9009904 - Attachment is obsolete: true
Attachment #9027481 - Flags: review?(nfroyd)
Attachment #9027481 - Flags: review?(nfroyd) → review+
Thanks.
Pushed by ncsoregi@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c766f7eaa9f9 Protect QueryPerformanceCounter against going backward, r=froydnj
Keywords: checkin-needed
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Please nominate this for Beta approval when you get a chance.
Flags: needinfo?(honzab.moz)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #27) > Please nominate this for Beta approval when you get a chance. This patch this bug has eventually been fixed with is only a cleanup. We already have exactly the same behavior on Beta. No need to uplift.
Flags: needinfo?(honzab.moz)
Honza, we were discussing this issue over lunch today. We were wondering if instead of doing this as a global fix, we should address this only for the CPUs that were affected by the original problem. Looking at this aggregate view of the original crashers we saw here <https://crash-stats.mozilla.com/signature/?product=Firefox&reason=~EXCEPTION_BREAKPOINT&signature=mozilla%3A%3Adom%3A%3ATimeoutExecutor%3A%3AMaybeExecute&date=%3E%3D2018-06-01T08%3A51%3A00.000Z&date=%3C2018-12-18T07%3A51%3A27.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#aggregations>, it seems that the majority of the crashers are coming from "family 21 model 2" CPUs. Some googling around shows that this CPU model belongs to some AMD chips (e.g. https://en.wikichip.org/wiki/WikiChip:model_numbers) which is consistent with the manufacturer of chips where we've seen readtsc instability on before. Perhaps we should restrict the fix here only to family 21 model 2 CPUs?
Flags: needinfo?(odvarko)
I think you want :mayhemer, not :Honza.
Yes, this seems reasonable and I believe I've seen something similar in chromium. My intention was to have a general fix, but that apparently is not good enough. Thanks for digging so much into this! Are you willing to file a bug for this?
Flags: needinfo?(odvarko)
Yes, certainly! Bug 1515155. (Apologies for mistaking you...)
Depends on: 1515155
Component: DOM → DOM: Core & HTML
See Also: → 1671141
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: