Open Bug 1371438 Opened 3 years ago Updated 8 months ago

Intermittent test_reftests_with_caret.html | application crashed [@ TimerThread::FindNextFireTimeForCurrentThread(mozilla::TimeStamp,unsigned int)]

Categories

(Core :: XPCOM, defect, critical)

defect
Not set
critical

Tracking

()

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, leave-open)

Crash Data

Attachments

(1 file, 1 obsolete file)

Component: Editor → Selection
Looks like the safety check we put here is failing:

http://searchfox.org/mozilla-central/source/xpcom/threads/TimerThread.cpp#646

Perhaps the issue is that the binary heap does not have a stable sort and there are two nsITimers with the same mTimeout.

What do you think Olli?
Blocks: 1368493
Flags: needinfo?(bugs)
Component: Selection → XPCOM
But we compare TimeStamps, not Timers.
This is puzzling.
Flags: needinfo?(bugs)
Oh, we do indeed.  It would be interesting to know the state of the list when that assert failed.
Perhaps we should use std::priority_queue:

http://en.cppreference.com/w/cpp/container/priority_queue

It uses a binary heap underneath, but handles all the push_heap/pop_heap stuff manually.  That would let this method just use a normal iterator without worrying about corrupting the list.

(Sorry, I was not aware of priority_queue before just now.)
Oh, I don't think priority_queue exposes iterators.  Sorry.
Attached patch timer_thread_debug.diff (obsolete) — Splinter Review
May need to add more debug code, but hopefully this would reveal something.
Attachment #8876213 - Flags: review?(bkelly)
Comment on attachment 8876213 [details] [diff] [review]
timer_thread_debug.diff

Review of attachment 8876213 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/threads/TimerThread.cpp
@@ +648,5 @@
> +#ifdef DEBUG
> +  if (!mTimers.IsEmpty()) {
> +    if (firstTimeStamp != mTimers[0]->Timeout()) {
> +      TimeStamp now = TimeStamp::Now();
> +      printf("firstTimeStamp %f, mTimers[0]->Timeout() %f, "

Should this be printf_stderr?
Attachment #8876213 - Flags: review?(bkelly) → review+
Keywords: leave-open
Attached patch printf_stderrSplinter Review
Attachment #8876213 - Attachment is obsolete: true
Pushed by opettay@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fdd4f87cc0e
investigate why assertion fails on Windows, r=bkelly
Duplicate of this bug: 1371104
Our extra printf has been in for a while.  Doesn't seem like big changes when this impacts:

  firstTimeStamp 2.000000, mTimers[0]->Timeout() 1.000000, initialFirstTimer 0EEE04C0, current first 0F7F63C0

From:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=133455725&lineNumber=28509

What do you guys think is the next step?  It definitely seems like our windows TimeStamp code just sometimes reorders things on our automation windows VM test machines.  Could we force these to just avoid the high performance counters from the start so we don't have to suffer the switch in the middle of a test?

For example, see bug 1391693 comment 101 for another place this is issue showed up for me.
Flags: needinfo?(honzab.moz)
Flags: needinfo?(bugs)
I'm not familiar enough with Windows timestamps to understand what is buggy there.
Flags: needinfo?(bugs)
(In reply to Ben Kelly [:bkelly] from comment #20)
> Our extra printf has been in for a while.  Doesn't seem like big changes
> when this impacts:
> 
>   firstTimeStamp 2.000000, mTimers[0]->Timeout() 1.000000, initialFirstTimer
> 0EEE04C0, current first 0F7F63C0

If you are getting 1ms difference then I'm afraid it's not caused by the GetTickCount() fallback.  If timestamp has already switched to use it, |now| would only contain the 16ms resolution value.  Hence, calculation of a duration (= something_captured_earlier - now) cannot be 2 or 1 ms.  It would be 0 or 16 ms, because we would subtract two GetTickCount() captured values.

I more suspect that this is because the jitter of QueryPerformaceCounter (the high res counter we use) is too small to engage the fallback switch - it doesn't reach the threshold, but is large enough to cause these assertion failures.

It could also be that TSC is reported stable by the VM but in reality it isn't.  In that case we don't even bother detecting jitter to potentially fallback.

> 
> From:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> central&job_id=133455725&lineNumber=28509
> 
> What do you guys think is the next step?  It definitely seems like our
> windows TimeStamp code just sometimes reorders things on our automation
> windows VM test machines.  Could we force these to just avoid the high
> performance counters from the start so we don't have to suffer the switch in
> the middle of a test?

I would rather go the opposite way first - let's have an env var (TimeStamp is used to early for prefs) to control this. 0 - default behavior as now (fallback detection), 1 - force high res all the time, 2 - force low res all the time.

Let's run the tests with hi-res forced for awhile to eliminate the fallback bit.

I can file a bug if you agree.

> 
> For example, see bug 1391693 comment 101 for another place this is issue
> showed up for me.
Flags: needinfo?(honzab.moz) → needinfo?(bkelly)
And to add - I don't know exactly how the two timestamps you are comparing are captured, but to be honest, I think the assertion that fails here is a bit naive.  Two different timestamps regardless if captured one by one simply may have some delay between them.  You should add a reasonable threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are capturing these two timestamps, or even just one in close succession, on Windows the thread scheduling may introduce delays.
(In reply to Honza Bambas (:mayhemer) from comment #23)
> And to add - I don't know exactly how the two timestamps you are comparing
> are captured, but to be honest, I think the assertion that fails here is a
> bit naive.  Two different timestamps regardless if captured one by one
> simply may have some delay between them.  You should add a reasonable
> threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are
> capturing these two timestamps, or even just one in close succession, on
> Windows the thread scheduling may introduce delays.

The assertion is basically checking:

1. Capture t1 with TimeStamp::Now()
2. Capture t2 with TimeStamp::Now()
3. Sort t1 and t2 into a list with soonest first using TimeStamp::operator<(), etc.  Result is [t1, t2]
4. Some time later re-sort the list.  Assert that first element is still t1.

The assertion that is failing is that t2 is now sorting before t1.

I don't see how this could happen if TimeStamp values were stable.  Also, it only happens on automation VM windows machines.

If any fuzzing needs to be done I think TimeStamp::operator<() should be taking care of that, no?
Flags: needinfo?(bkelly)
(In reply to Honza Bambas (:mayhemer) from comment #22)
> > What do you guys think is the next step?  It definitely seems like our
> > windows TimeStamp code just sometimes reorders things on our automation
> > windows VM test machines.  Could we force these to just avoid the high
> > performance counters from the start so we don't have to suffer the switch in
> > the middle of a test?
> 
> I would rather go the opposite way first - let's have an env var (TimeStamp
> is used to early for prefs) to control this. 0 - default behavior as now
> (fallback detection), 1 - force high res all the time, 2 - force low res all
> the time.
> 
> Let's run the tests with hi-res forced for awhile to eliminate the fallback
> bit.

I think this would be great to try.
(In reply to Ben Kelly [:bkelly] from comment #24)
> (In reply to Honza Bambas (:mayhemer) from comment #23)
> > And to add - I don't know exactly how the two timestamps you are comparing
> > are captured, but to be honest, I think the assertion that fails here is a
> > bit naive.  Two different timestamps regardless if captured one by one
> > simply may have some delay between them.  You should add a reasonable
> > threshold, like (a - b).ToMillis() < 2 or so.  Note that if two threads are
> > capturing these two timestamps, or even just one in close succession, on
> > Windows the thread scheduling may introduce delays.
> 
> The assertion is basically checking:
> 
> 1. Capture t1 with TimeStamp::Now()
> 2. Capture t2 with TimeStamp::Now()
> 3. Sort t1 and t2 into a list with soonest first using
> TimeStamp::operator<(), etc.  Result is [t1, t2]
> 4. Some time later re-sort the list.  Assert that first element is still t1.
> 
> The assertion that is failing is that t2 is now sorting before t1.

Thanks for this explanation.  This was the missing bit.

But, when [1] shows 1 and 2 ms, we have not fell back to GetTickCount64 yet.  It would return 0 or 16ms difference, see [2] - we return difference of values (inside TimeStamp) captured with GetTickCount64 only, which SHOULD have only ~15.6 ms resolution.  This is a software counter (inside windows kernel) that is updated on context switch, to my knowledge, insensitive to timeBeginPeriod.

[1] https://dxr.mozilla.org/mozilla-central/rev/19b32a138d08f73961df878a29de6f0aad441683/xpcom/threads/TimerThread.cpp#656
[2] https://dxr.mozilla.org/mozilla-central/rev/19b32a138d08f73961df878a29de6f0aad441683/mozglue/misc/TimeStamp_windows.cpp#329

> 
> I don't see how this could happen if TimeStamp values were stable.  Also, it
> only happens on automation VM windows machines.
> 
> If any fuzzing needs to be done I think TimeStamp::operator<() should be
> taking care of that, no?

I'll file a bug and try to find time to write a patch to control the qpc/gtc/auto-detect usage.
Depends on: 1406386
You need to log in before you can comment on or make changes to this bug.