Closed Bug 1402717 Opened 7 years ago Closed 7 months ago

Intermittent Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:674

Categories

(Core :: DOM: Core & HTML, defect, P3)

defect

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Filed by: archaeopteryx [at] coole-files.de https://treeherder.mozilla.org/logviewer.html#?job_id=132956649&repo=autoland https://queue.taskcluster.net/v1/task/RV3Wl-dUQd-6mv8uGI9XAw/runs/0/artifacts/public/logs/live_backing.log 20:19:57 INFO - 1034 INFO None1035 INFO TEST-START | toolkit/content/tests/chrome/test_closemenu_attribute.xul 20:19:57 INFO - GECKO(2240) | ++DOMWINDOW == 64 (17B58800) [pid = 2240] [serial = 156] [outer = 1677EC00] 20:19:57 INFO - GECKO(2240) | firstTimeStamp 1.000000, mTimers[0]->Timeout() 2.000000, initialFirstTimer 0EFCB620, current first 0DBACFA0 20:19:57 INFO - GECKO(2240) | Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:663 20:19:57 INFO - GECKO(2240) | #01: TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp,unsigned int) [xpcom/threads/TimerThread.cpp:663] 20:19:57 INFO - GECKO(2240) | #02: NS_GetTimerDeadlineHintOnCurrentThread(mozilla::TimeStamp,unsigned int) [xpcom/threads/nsTimerImpl.cpp:54] 20:19:57 INFO - GECKO(2240) | #03: mozilla::MainThreadIdlePeriod::GetIdlePeriodHint(mozilla::TimeStamp *) [xpcom/threads/MainThreadIdlePeriod.cpp:33] 20:19:57 INFO - GECKO(2240) | #04: mozilla::PrioritizedEventQueue<mozilla::EventQueue>::GetIdleDeadline() [xpcom/threads/PrioritizedEventQueue.cpp:90] 20:19:57 INFO - GECKO(2240) | #05: mozilla::PrioritizedEventQueue<mozilla::EventQueue>::HasReadyEvent(mozilla::BaseAutoLock<mozilla::Mutex> const &) [xpcom/threads/PrioritizedEventQueue.cpp:291] 20:19:57 INFO - GECKO(2240) | #06: mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::EventQueue> >::HasPendingEvent() [xpcom/threads/ThreadEventQueue.cpp:152] 20:19:57 INFO - GECKO(2240) | #07: nsThread::HasPendingEvents(bool *) [xpcom/threads/nsThread.cpp:818] 20:19:57 INFO - GECKO(2240) | #08: hasPendingEvents(nsIThread *) [xpcom/threads/nsThreadUtils.cpp:479] 20:19:57 INFO - GECKO(2240) | #09: NS_HasPendingEvents(nsIThread *) [xpcom/threads/nsThreadUtils.cpp:497] 20:19:57 INFO - GECKO(2240) | #10: nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal *,bool) [widget/nsBaseAppShell.cpp:283] 20:19:57 INFO - GECKO(2240) | #11: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:955] 20:19:58 INFO - GECKO(2240) | #12: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/threads/nsThreadUtils.cpp:521] 20:19:58 INFO - GECKO(2240) | #13: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:97] 20:19:58 INFO - GECKO(2240) | #14: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:326] 20:19:58 INFO - GECKO(2240) | #15: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:320] 20:19:58 INFO - GECKO(2240) | #16: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:300] 20:19:58 INFO - GECKO(2240) | #17: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:160] 20:19:58 INFO - GECKO(2240) | #18: nsAppShell::Run() [widget/windows/nsAppShell.cpp:230] 20:19:58 INFO - GECKO(2240) | #19: nsAppStartup::Run() [toolkit/components/startup/nsAppStartup.cpp:289] 20:19:58 INFO - GECKO(2240) | #20: XREMain::XRE_mainRun() [toolkit/xre/nsAppRunner.cpp:4701] 20:19:58 INFO - GECKO(2240) | #21: XREMain::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4865] 20:19:58 INFO - GECKO(2240) | #22: XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/nsAppRunner.cpp:4960] 20:19:58 INFO - GECKO(2240) | #23: mozilla::BootstrapImpl::XRE_main(int,char * * const,mozilla::BootstrapConfig const &) [toolkit/xre/Bootstrap.cpp:45] 20:19:58 INFO - GECKO(2240) | #24: do_main [browser/app/nsBrowserApp.cpp:237] 20:19:58 INFO - GECKO(2240) | #25: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:309] 20:19:58 INFO - GECKO(2240) | #26: wmain [toolkit/xre/nsWindowsWMain.cpp:118] 20:19:58 INFO - GECKO(2240) | #27: __scrt_common_main_seh [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:253] 20:19:58 INFO - GECKO(2240) | #28: kernel32.dll + 0x53c45 20:19:58 INFO - GECKO(2240) | #29: ntdll.dll + 0x637f5 20:19:58 INFO - GECKO(2240) | #30: ntdll.dll + 0x637c8 20:19:58 INFO - TEST-INFO | Main app process: exit 1
So, while looking into this, I noticed the horrorshow that underpins all of the timestamp comparison logic on win32: https://dxr.mozilla.org/mozilla-central/source/mozglue/misc/TimeStamp_windows.cpp?q=TimeStamp_windows.cpp&redirect_type=direct#303 Depending on certain error conditions, this function can *modify how subsequent TimeStampValues are created*, here: https://dxr.mozilla.org/mozilla-central/source/mozglue/misc/TimeStamp_windows.cpp#357 This function essentially boils down to comparing two TimeStampValues by |mGTC| if either does not support QPC, and in most other cases uses QPC (it is more complex than this, which muddies the water enough that the problem is non-obvious). If QPC is disabled, this code *does not impose a well ordering on TimeStampValues* anymore!!! Consider the following win32 TimeStampValues: A : { mGTC : 3, mQPC : 1, mHasQPC : true } B : { mGTC : 1, mQPC : 3, mHasQPC : true } C : { mGTC : 2, mQPC : 2, mHasQPC : false <--- Created after QPC was deemed unreliable, and disabled } Because both A and B use QPC, A < B. Because C does _not_ use QPC, B < C. Because C does _not_ use QPC, C < A. Ooooops. This could definitely invalidate the timer queue, and lead to the error we see here. I'm seeing if disabling QPC clears this up.
Also: note that the above comparison logic means that high precision timestamps cannot be compared to non-high precision timestamps in a consistent fashion.
needinfoing Honza, since it looks like he's been the one working with this win32 GTC/QPC code: I've looked into this some more, and it seems more reasonable to consider a scenario like the following: A : { mGTC : 1, mQPC : 2, mHasQPC : true } B : { mGTC : 1, mQPC : 3, mHasQPC : true } C : { mGTC : 1, mQPC : 4, mHasQPC : false <--- Created after QPC was deemed unreliable, and disabled } So we have B is incomparable (neither < or >) to C, and C is incomparable to A, but A < B, which is a violation of strict weak ordering, which means that stl sorting algorithms will go off the rails. So I've been thinking about ways to get ourselves out of this mess. It isn't pretty at all. We seem to have two broad ways to approach this: 1. Make the comparison logic impose a strict weak ordering. or 2. Accept that the comparison logic does not impose a strict weak ordering, and compensate for it across the codebase. Personally, I think 1) is the better option here. 2) sounds like a nightmare, and will require ongoing effort to clean up when people trip over this gotcha. So how do we restore a strict weak ordering? Presumably QPC is important, or we would not have devoted so much code to its implementation. Feel free to correct me if we don't _really_ need it, in which case this becomes an easy fix. If my understanding is correct, QPC is valuable because it is more precise than GTC, but it is inaccurate over long durations. This makes it seem like QPC only matters when comparing two timestamps that are close together, and makes it seem like trying to rectify long-term drift in QPC is a waste of effort. If we can assume that A.QPC == B.QPC (high precision tick counts are equal) implies A.GTC - B.GTC < 100ms, the following will get us a strict weak ordering: 1. Record GTC and QPC for each timestamp. 2. When calculating the elapsed time between two timestamps, if GTC differs by more than 100ms worth of ticks, use that, otherwise use QPC. If A.QPC == B.QPC doesn't imply anything about the difference between A.GTC and B.GTC, I would argue that at least one of these measures is useless, and we're better off just going with one. Also, we would need to do something about TimeStampValue::operator+=(const int64_t aOther) and TimeStampValue::operator-=(const int64_t aOther); these can cause cases where QPC is equal, but GTC differs by a lot (whatever the drift is between QPC and GTC).
Flags: needinfo?(honzab.moz)
Another option is to toss in a "tie breaker" when comparing a TimeStampValue with QPC to one without, in cases where their GTC values are the same. Here's a try push with that modification: https://treeherder.mozilla.org/#/jobs?repo=try&revision=67ef58b1a7a4322de51be688f68c74e9eef3ba4b&selectedJob=141136692 I've retriggered win7 debug jsreftest (a common place for this to occur) a couple hundred times, and could not reproduce. Getting a simpler approach (like I described in comment 12) working might be hard; since we use TimeStampValue::operator+=(const int64_t aOther) and TimeStampValue::operator-=(const int64_t aOther) in the timer code, we're upsetting a lot of stuff.
(In reply to Byron Campen [:bwc] from comment #12) > needinfoing Honza, since it looks like he's been the one working with this > win32 GTC/QPC code: > > I've looked into this some more, and it seems more reasonable to consider a > scenario like the following: > > A : { > mGTC : 1, > mQPC : 2, > mHasQPC : true > } > > B : { > mGTC : 1, > mQPC : 3, > mHasQPC : true > } > > C : { > mGTC : 1, > mQPC : 4, > mHasQPC : false <--- Created after QPC was deemed unreliable, and disabled > } FYI, a timestamp like this can't exist. If mHasQPC is false then mQPC is 0, but I get the point. > > So we have B is incomparable (neither < or >) to C, and C is incomparable to > A, Not sure what you mean by incomparable in this case. If any of the two timestamps being compared is missing QPC value we still compare but use the GTC value (lower but more reliable resolution). > but A < B, which is a violation of strict weak ordering, which means that > stl sorting algorithms will go off the rails. Got it! Hmm... have to think more about this... In general, only thing you can assume about two timestamps taken one by one (A and B) is that A <= B, never A < B! We could also keep a global atomic counter on each captured timestamp value (a fourth member) that would override the < operator result when there is a dispute. > > So I've been thinking about ways to get ourselves out of this mess. It isn't > pretty at all. We seem to have two broad ways to approach this: > > 1. Make the comparison logic impose a strict weak ordering. > > or > > 2. Accept that the comparison logic does not impose a strict weak ordering, > and compensate for it across the codebase. > > Personally, I think 1) is the better option here. 2) sounds like a > nightmare, and will require ongoing effort to clean up when people trip over > this gotcha. > > So how do we restore a strict weak ordering? Presumably QPC is important, or > we would not have devoted so much code to its implementation. Feel free to > correct me if we don't _really_ need it, in which case this becomes an easy > fix. It's the best resolution vs performance (by means of CPU cycles spent in it) timer api available in modern (or less modern) versions of windows and hardware. > > If my understanding is correct, QPC is valuable because it is more precise > than GTC, but it is inaccurate over long durations. On some specific hardware/os combinations or when S3+ processor states are involved (GTC shifts differently than QPC after wakeup) > This makes it seem like > QPC only matters when comparing two timestamps that are close together, and > makes it seem like trying to rectify long-term drift in QPC is a waste of > effort. We may want to measure time durations up to 1 minute, so not sure if that is "close" or not.. > If we can assume that A.QPC == B.QPC (high precision tick counts are > equal) implies A.GTC - B.GTC < 100ms, the following will get us a strict > weak ordering: > > 1. Record GTC and QPC for each timestamp. > 2. When calculating the elapsed time between two timestamps, if GTC differs > by more than 100ms worth of ticks, use that, otherwise use QPC. Hmm.. that was my very first tough too long ago. Though, 100ms is way too short, I definitely want a high precision even for way longer intervals (several seconds). The problem with QPC is that it sometimes can jump forward or backwards unexpectedly. The code I've added (doesn't mean it's the best way to try to correct QPC!!) is trying to keep the QPC skewing in some narrowed range. Maybe the whole auto-tuning idea is just wrong and we should do something else here. Maybe when TSC is not advertised as stable just use GTC (or some other timer API that may be sensitive to timeBeginPeriod, see https://bugzilla.mozilla.org/show_bug.cgi?id=676349#c1 > > If A.QPC == B.QPC doesn't imply anything about the difference between A.GTC > and B.GTC, I would argue that at least one of these measures is useless, and > we're better off just going with one. > > Also, we would need to do something about TimeStampValue::operator+=(const > int64_t aOther) and TimeStampValue::operator-=(const int64_t aOther); these > can cause cases where QPC is equal, but GTC differs by a lot (whatever the > drift is between QPC and GTC).
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #14) > (In reply to Byron Campen [:bwc] from comment #12) > > If my understanding is correct, QPC is valuable because it is more precise > > than GTC, but it is inaccurate over long durations. > > On some specific hardware/os combinations or when S3+ processor states are > involved (GTC shifts differently than QPC after wakeup) GTC's shift is considered more correct in this case, I'm guessing? > We could also keep a global atomic counter on each captured timestamp value (a fourth member) that would override the < operator result when there is a dispute. I thought about this some. Things get weird when you use TimeStampValue::operator+=(const > int64_t aOther) or TimeStampValue::operator-=(const int64_t aOther). But we may need something like this anyway; w3c says that two subsequent calls to setTimeout() with the same duration must fire in the order that they are scheduled. If we are in a situation where QPC is disabled, the TimeStampValues created are likely to compare equal, and the data structure inside TimerThread is not stable...
I wonder what would happen if we used a composite tick count. Use the upper bits of GTC, the lower bits of QPC, and then a sequence number. As long as we have some filtering in place to ensure that GTC and QPC are monotonic (we do have this for QPC already), no two values will compare equal, and even if QPC just stops advancing we will get reasonable durations. Or maybe we could build the sequence number into the QPC bits itself (ie; instead of having |PerformanceCounter| ensure that QPC doesn't decrease, we have it ensure that the value it returns increases by at least one). That would avoid the need for a separate sequence number.
Flags: needinfo?(honzab.moz)
(In reply to Byron Campen [:bwc] from comment #16) > I wonder what would happen if we used a composite tick count. Use the upper > bits of GTC, the lower bits of QPC, And how exactly would you do that when those two values are completely different? > and then a sequence number. As long as > we have some filtering in place to ensure that GTC and QPC are monotonic (we > do have this for QPC already), no two values will compare equal, and even if > QPC just stops advancing we will get reasonable durations. > > Or maybe we could build the sequence number into the QPC bits itself (ie; > instead of having |PerformanceCounter| ensure that QPC doesn't decrease, we > have it ensure that the value it returns increases by at least one). That > would avoid the need for a separate sequence number. I'm not a big fan of that too. (In reply to Byron Campen [:bwc] from comment #15) > (In reply to Honza Bambas (:mayhemer) from comment #14) > > (In reply to Byron Campen [:bwc] from comment #12) > > > If my understanding is correct, QPC is valuable because it is more precise > > > than GTC, but it is inaccurate over long durations. > > > > On some specific hardware/os combinations or when S3+ processor states are > > involved (GTC shifts differently than QPC after wakeup) > > GTC's shift is considered more correct in this case, I'm guessing? Yep, GTC is known to be consistent (doesn't go backward and reflects the actual time, even though it may not update every 15.6ms, it's discretely linear and monotonic) > > > We could also keep a global atomic counter on each captured timestamp value (a fourth member) that would override the < operator result when there is a dispute. > > I thought about this some. Things get weird when you use > TimeStampValue::operator+=(const > > int64_t aOther) or TimeStampValue::operator-=(const int64_t aOther). But we may need something like this anyway; w3c says that two subsequent calls to setTimeout() with the same duration must fire in the order that they are scheduled. If we are in a situation where QPC is disabled, the TimeStampValues created are likely to compare equal, and the data structure inside TimerThread is not stable... Good point, OTOH the global counter would be used only in cases two timestamps being compared were equal. The only goal here is to keep sorting consistent and for two different timestamps never be ==. Hence, if every new instance of TimeStamp is assigned new counter value (++) every time, regardless if it's coming from Now()/NowLoRes() or is a result of a calculation, we ensure that the compare op will always have the same result for the two. That's enough.
Flags: needinfo?(honzab.moz)
I think I have something that will work, and massively simplify things. Stay tuned.
I'm pretty sure this is related to bug 1367182, which seems to happen when TimeStamps are not consistent between parent and content processes.
See Also: → 1367182
Ok, so here's an interesting result. I tried modifying the timestamp logic on windows to use a single composite timestamp based on GTC, QPC, and a little drift correction. It increased the frequency of bug 1367182 a _lot_. https://treeherder.mozilla.org/#/jobs?repo=try&revision=a5b1f4f8feb7b77b281cf4bd85af666eb8cbd93f&selectedJob=141735691 So, I tried removing that assert. https://treeherder.mozilla.org/#/jobs?repo=try&revision=18adb6a9735c5566d1315b60ef55a8c9009cb3ea&selectedJob=141755197 Now I see a number of other pre-existing intermittents becoming more frequent. My hypothesis is that we have lots of timestamp-consistency intermittents caused when the parent and content process generate TimeStampValues in different ways (ie; one of them has disabled QPC due to drift, but the other has not). My composite timestamp code does drift correction as needed, which will cause inconsistency to arise between processes more frequently, which could explain why these intermittents become more frequent. The bad news is no matter how careful we are to ensure timestamp consistency in a single process, if we do any kind of drift correction/compensation when creating timestamps, it is going to lead to inter-process inconsistencies in how timestamps are created.
Flags: needinfo?(honzab.moz)
I tried using timeGetTime, which is supposed to have a little better resolution than GetTickCount, but that hit the same intermittents. Maybe we could generate all timestamps in the parent process's memory space, and share that memory with the child processes? Ugh. We may just have to make the small tweak from comment 13, wait a decade, and hope that QueryPerformanceCounter becomes more reliable... :(
I would more think of replacing QPC with timeGetTime, not GTC. We could also do a following thing: - use QPC+GTC when TSC is reported stable (this can be count on to my knowledge, we already do that) - use timeGetTime+GTC when not - use only GTC when NowLoRes is used (since both QPC and timeGetTime are a bit expensive to call and a resolution of 16+ms is sufficient for the case) - remove the auto-tune logic completely (leaving GTC only to allow compare of NowLoRes() and Now())
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #23) > I would more think of replacing QPC with timeGetTime, not GTC. > > We could also do a following thing: > - use QPC+GTC when TSC is reported stable (this can be count on to my > knowledge, we already do that) > - use timeGetTime+GTC when not > - use only GTC when NowLoRes is used (since both QPC and timeGetTime are a > bit expensive to call and a resolution of 16+ms is sufficient for the case) > - remove the auto-tune logic completely (leaving GTC only to allow compare > of NowLoRes() and Now()) I tried TGT + GTC, and that was not granular enough to prevent an orangefest; TGT is not significantly better than GTC it seems. We would need a serial number for sure if we did that, but then we would have duplicate serial numbers from timestamps received over IPC. I think we should make the sort algorithm in TimerThread stable again. That would avoid this excessive need for granularity, possibly giving us more options.
Needed to retrigger the debug build so that the windows 7 debug tests can run... check back.
Flags: needinfo?(docfaraday)
Ok, with some additional experimentation, it seems that the inter-process drift we're seeing is largely due to the fact that GetTickCount is inconsistent between processes. Using just GetTickCount (or timeGetTime) we occasionally see the content process freaking out because timestamps from the parent appear to come from the future. Increasing the threshold for tolerable GTC/QPC inconsistency and making correction more conservative seems to avoid the test failures. I am going to experiment with this further, because it is a huge simplification that could fix additional intermittents.
Flags: needinfo?(docfaraday)
Priority: -- → P3
Summary: Intermittent GECKO(2240) | Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:663 → Intermittent GECKO(2240) | Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:674
Component: DOM → DOM: Core & HTML
Summary: Intermittent GECKO(2240) | Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:674 → Intermittent Assertion failure: firstTimeStamp == mTimers[0]->Timeout(), at z:/build/build/src/xpcom/threads/TimerThread.cpp:674
See Also: → 1622359
Severity: normal → S3
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago7 months ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: