Closed Bug 1487778 Opened 6 years ago Closed 6 years ago

Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))

Categories

(Core :: DOM: Core & HTML, defect, P2)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox-esr60 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fixed

People

(Reporter: marcia, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file, 3 obsolete files)

This bug was filed from the Socorro interface and is
report bp-df5061ee-f87c-4cf5-8c0a-870270180831.
=============================================================

Seen while looking at nightly crash data - crashes started using 20180821100053: 

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=77433149bfdcc82287acaed94b942818785b556e&tochange=7c96ad3ab6731287646616610e159447c69d6beb. 20 crashes/6 installs so far and Win 10/Win 7 are affected.

Top 10 frames of crashing thread:

0 xul.dll void mozilla::dom::TimeoutExecutor::MaybeExecute dom/base/TimeoutExecutor.cpp:172
1 xul.dll mozilla::dom::TimeoutExecutor::Notify dom/base/TimeoutExecutor.cpp:240
2 xul.dll nsTimerImpl::Fire xpcom/threads/nsTimerImpl.cpp:704
3 xul.dll nsTimerEvent::Run xpcom/threads/TimerThread.cpp:297
4 xul.dll mozilla::ThrottledEventQueue::Inner::Executor::Run xpcom/threads/ThrottledEventQueue.cpp:72
5 xul.dll mozilla::SchedulerGroup::Runnable::Run xpcom/threads/SchedulerGroup.cpp:337
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1167
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:519
8 xul.dll mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:125
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:318

=============================================================
Any ideas, Ehsan? It looks like this crash has been happening for a while, in low volume, so perhaps this is just a fluke spike. I looked at the regression range in comment 0, and nothing popped out at me.
Flags: needinfo?(ehsan)
Priority: -- → P2
This is the diagnostic assert that bug 1378394 added firing: https://hg.mozilla.org/mozilla-central/rev/c0cdf2fd1fdf#l1.45

As far as I can tell (NIing farre to double check), that assertion is essentially checking that our TimeStamp::Now() function is monotonic.  Turns out that it is not, on Windows at least.  :-(

That means the bug is almost certainly somewhere in this function: <https://searchfox.org/mozilla-central/rev/5a18fb5aeeec99f1ca1c36a697082c221189a3b9/mozglue/misc/TimeStamp_windows.cpp#303>.  This function computes the difference of two TimeStamp objects and is what operator<= uses under the hood.  If that is where the problem is, we need Honza since I believe he is one of the few who understands how that code works...
Blocks: 1378394
Flags: needinfo?(ehsan) → needinfo?(afarre)
Good catch. I didn't notice that in the crash.

I did a search in Socorro for crashes that have this signature. I see a few instances with the signature [@ mozilla::dom::TimeoutManager::RunTimeout ] that look like a Linux-specific version: a677b7e2-ee72-4605-bc84-9379e0180904

I also see one instance of an older crash (bug 1398934), so maybe this is a signature change? b781fef3-bdcc-4f51-aeca-20e680180829
Crash Signature: [@ mozilla::dom::TimeoutExecutor::MaybeExecute] [@ void mozilla::dom::TimeoutExecutor::MaybeExecute] → [@ mozilla::dom::TimeoutExecutor::MaybeExecute] [@ void mozilla::dom::TimeoutExecutor::MaybeExecute] [@ mozilla::dom::TimeoutManager::RunTimeout ]
(In reply to Andrew McCreight [:mccr8] from comment #4)
> That should be
>   bp-a677b7e2-ee72-4605-bc84-9379e0180904
> and
>   bp-b781fef3-bdcc-4f51-aeca-20e680180829

More of the same!

But the fact that it also happens on Linux throws the theory in the last paragraph of comment 2 out of the window...
Yeah, that TimeStamp::Now() isn't monotonic is what I concluded as well. The code leading up to the call to TimeoutManager::RescheduleTimeout is clean enough to make that out.
Flags: needinfo?(afarre)
That is *terrible* if that's true.  :-(
Honza, do you have any clue what's going on here?  See esp comment 2
Flags: needinfo?(honzab.moz)
Yep, it's on my radar already.  I refreshed my memory by looking at the code.  

There are following 3 scenarios:

1. the processor reports the TSC counter as stable between processor cores

that makes us use the QueryPerformanceCounter() (QPC) function w/o any further monotonizing protection since we believe it is monotonic ; we never use GetTickCount64() (GTC) values for the <= and '-' operators then


2. the processor reports the TSC as unstable, but it behaves in a stable way for the whole time of the browser session

we still use QueryPerformanceCounter() but we protect it (thread safely) for being monotonic ; we never use GetTickCount64 for difference calculation


3. the processor reports the TSC as unstable and we detect QueryPerformanceCounter() be too erratic -> we switch from using QueryPerformanceCounter to GetTickCount64 at any moment when inside the <= operator

this still should be monotonic: we protect QueryPerformanceCounter() and *assume* that GetTickCount64() is monotonic too.  if the switch happens between the two lines at [1] we will use values captured by GetTickCount64() in respective moments (note that we ALWAYS capture the GTC value in Now())



Based on that:
- if we hit this in scenario 1, QueryPerformanceCounter() jumps regardless the announced TSC stability
- if we hit this in scenario 3, GetTickCount64() is not monotonic and we need to use something else or protect it too
- we can't hit this in scenario 2, unless there is some bug with synchronization (maybe 


So, in the changelog I can see we have switched to clang.  clang could reveal some mistake in sync we have, but I honestly don't see anything wrong.

Note that the 'switch' is defined as:
static bool volatile sUseQPC = true;

which is likely wrong and should be atomic.  still, this flag incorrect atomicity should not effect result of the '-' operator as we base the calculation on result of (presumably) monotonic values.


To sum:
- let's change the monotonicity forcing for QPC even though the TSC is reported stable to find out if QPC behavior is wrong with TSC stable=1.
- if it fixes the issue, I would spend some time on finding out if this could be some clang interference
- if not, the bug is likely in gettickcount64 or clang anyway ; hence, let's protect GTC as well and see
- if none of it helps, then clang likely does something weird here...



[1] https://searchfox.org/mozilla-central/rev/de7676288a78b70d2b9927c79493adbf294faad5/dom/base/TimeoutManager.cpp#826,931
Flags: needinfo?(honzab.moz)
Assignee: nobody → honzab.moz
Comment on attachment 9009564 [details]
Bug 1487778 - Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj

Nathan Froyd [:froydnj] has approved the revision.
Attachment #9009564 - Flags: review+
Attachment #9009902 - Attachment is obsolete: true
Attachment #9009564 - Attachment is obsolete: true
Attached patch v1 (obsolete) — Splinter Review
Good old way!
Attachment #9009904 - Flags: review+
Keywords: checkin-needed
Keywords: leave-open
Pushed by ebalazs@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ada6a14794e9
Sentinel QueryPerformanceCounter results for monotonicity even when TSC is reported stable via CPU features, r=froydnj
Keywords: checkin-needed
The crash is gone, the patch is currently on Beta, in 64.  I can see crash reports for 63 or 64 prior landing the patch only.

The last crash on record: https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6-d16c50180920

Ehsan, Nathan, looks like we want to make the monotonicity enforcement become a standard.  What do you think?  It means an overhead, tho...
Flags: needinfo?(nfroyd)
Flags: needinfo?(ehsan)
(In reply to Honza Bambas (:mayhemer) from comment #16)
> The crash is gone, the patch is currently on Beta, in 64.  I can see crash
> reports for 63 or 64 prior landing the patch only.
> 
> The last crash on record:
> https://crash-stats.mozilla.com/report/index/899c564a-9816-4dc2-83c6-
> d16c50180920

Great!

> Ehsan, Nathan, looks like we want to make the monotonicity enforcement
> become a standard.  What do you think?  It means an overhead, tho...

Can you please explain what you mean by this?  I'm not sure if I understand...
Flags: needinfo?(ehsan)
(In reply to Honza Bambas (:mayhemer) from comment #16)
> The crash is gone, the patch is currently on Beta, in 64.  I can see crash
> reports for 63 or 64 prior landing the patch only.

I am still seeing a few crashes with 64 beta while the patch landed during the 64 nightly cycle. ex:
https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8-2dcba0181116
(In reply to Pascal Chevrel:pascalc from comment #18)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > The crash is gone, the patch is currently on Beta, in 64.  I can see crash
> > reports for 63 or 64 prior landing the patch only.
> 
> I am still seeing a few crashes with 64 beta while the patch landed during
> the 64 nightly cycle. ex:
> https://crash-stats.mozilla.com/report/index/d9d31997-b58a-449d-b1d8-
> 2dcba0181116

It's a different crash: EXCEPTION_PRIV_INSTRUCTION != MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow)

I can see few other types of crashes in this code, but none of them is the one we are dealing with here since comment 2.

(In reply to :Ehsan Akhgari from comment #17)
> > Ehsan, Nathan, looks like we want to make the monotonicity enforcement
> > become a standard.  What do you think?  It means an overhead, tho...
> 
> Can you please explain what you mean by this?  I'm not sure if I
> understand...

I've landed this change only as a test to confirm the theory:
https://bugzilla.mozilla.org/attachment.cgi?id=9009904&action=diff#a/mozglue/misc/TimeStamp_windows.cpp_sec2

We may want to turn it from `if (true || !sHasStableTSC) {` to just running the code all the time (actually, to clean up the code).
Summary: Crash in mozilla::dom::TimeoutExecutor::MaybeExecute → Crash in mozilla::dom::TimeoutExecutor::MaybeExecute (MOZ_RELEASE_ASSERT(aLastCallbackTime <= aCurrentNow))
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Ehsan, Nathan, looks like we want to make the monotonicity enforcement
> become a standard.  What do you think?  It means an overhead, tho...

If I understand correctly, the monotonicity check is compensating for bad hardware and/or bad software, which seems like a perfectly reasonable thing to keep around long-term.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #20)
> (In reply to Honza Bambas (:mayhemer) from comment #16)
> > Ehsan, Nathan, looks like we want to make the monotonicity enforcement
> > become a standard.  What do you think?  It means an overhead, tho...
> 
> If I understand correctly, the monotonicity check is compensating for bad
> hardware and/or bad software, which seems like a perfectly reasonable thing
> to keep around long-term.

Yes, it just adds a synchronization overhead.  I will write a patch.
Yes, makes sense to me as well.  Thank you for the extreme diligence here!
Attached patch v2 (protect QPC)Splinter Review
Experiment turned to a permanent thing.
Attachment #9009904 - Attachment is obsolete: true
Attachment #9027481 - Flags: review?(nfroyd)
Attachment #9027481 - Flags: review?(nfroyd) → review+
Thanks.
Pushed by ncsoregi@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c766f7eaa9f9
Protect QueryPerformanceCounter against going backward, r=froydnj
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c766f7eaa9f9
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla65
Please nominate this for Beta approval when you get a chance.
Flags: needinfo?(honzab.moz)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #27)
> Please nominate this for Beta approval when you get a chance.

This patch this bug has eventually been fixed with is only a cleanup.  We already have exactly the same behavior on Beta.  No need to uplift.
Flags: needinfo?(honzab.moz)
Honza, we were discussing this issue over lunch today.  We were wondering if instead of doing this as a global fix, we should address this only for the CPUs that were affected by the original problem.

Looking at this aggregate view of the original crashers we saw here <https://crash-stats.mozilla.com/signature/?product=Firefox&reason=~EXCEPTION_BREAKPOINT&signature=mozilla%3A%3Adom%3A%3ATimeoutExecutor%3A%3AMaybeExecute&date=%3E%3D2018-06-01T08%3A51%3A00.000Z&date=%3C2018-12-18T07%3A51%3A27.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#aggregations>, it seems that the majority of the crashers are coming from "family 21 model 2" CPUs.

Some googling around shows that this CPU model belongs to some AMD chips (e.g. https://en.wikichip.org/wiki/WikiChip:model_numbers) which is consistent with the manufacturer of chips where we've seen readtsc instability on before.

Perhaps we should restrict the fix here only to family 21 model 2 CPUs?
Flags: needinfo?(odvarko)
I think you want :mayhemer, not :Honza.
Yes, this seems reasonable and I believe I've seen something similar in chromium.  My intention was to have a general fix, but that apparently is not good enough.  Thanks for digging so much into this!

Are you willing to file a bug for this?
Flags: needinfo?(odvarko)
Yes, certainly!  Bug 1515155.  (Apologies for mistaking you...)
Depends on: 1515155
Component: DOM → DOM: Core & HTML
See Also: → 1671141
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: