Closed Bug 836869 Opened 7 years ago Closed 7 years ago

Make QueryPerformanceCounter bad leap detection heuristic smarter

Categories

(Core :: XPCOM, defect)

All
Windows 7
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

Attachments

(1 file)

It would be better to stop using QPC only after bigger number of detected unwanted leaps.  Now we just disable on the first occurrence.  The code in bug 822490 allow us to be smarter.
I had a valuable chat with Avi today on IRC, this bug raises its priority.

Now the heuristic is too restrictive.  Since the main problem is to setup the correct thresholds I would like to find some way around how to set them.

The right question then is: how large jittery of QPC are we willing to accept?

The main consideration is for animation and the refresh driver.


To explain the current code:

  When we call TimeStamp::Now() we snapshot both GetTickCount() [fast, 15.6ms resolution] and QueryPerformanceCounter() [slow, potentially unreliable, <1ms resolution].  When two stamps are subtracted we count skew (difference of GTC and QPC) for both time stamps on both sides of the - operator.  When in reasonable threshold, we accept QPC values as source for the actual duration calculation.  Otherwise we use GTC, and under some other simple conditions, we may even completely disable QPC.


And now the suggestion:

  Count two durations, one from GTC values and other from QPC values.  It's giving the same number as subtracting two skews actually, but it's way more easier to imagine what the numbers mean.

  For animations with 60fps the interval of two frames is ca 16.6ms.  (How can this be fluent on windows w/o using BeginTimePeriod() is actually a mystery for me :) - I don't know how our animation code actually works, tho).  

  Here is my attempt to find the threshold values: say we don't want more then one "mistake" in the interval per ca 5s.  So, when we detect 3 "mistakes" during 10s, we don't want to use QPC since it's jittering too much for intended purpose.

  Now we need to define the "mistake": going backward or forward more then by duration of two (or more? way more?) frames then expected.  Sounds reasonable ?

  For longer periods (5s+) the current algorithm can be preserved, only collect penalty score that decreases with good checks.  On some penalty threshold disable QPC.



To illustrate behavior of the current code, from bug 749894, comment 129, these are values that disables QPC:

QPC check after 109ms with overflow 37.5589ms, adjusted trend per interval is 689.1545ms
QPC check after 110ms with overflow 26.8460ms, adjusted trend per interval is 488.1087ms
QPC check after 109ms with overflow 21.4024ms, adjusted trend per interval is 392.7045ms
QPC check after 114ms with overflow 8.1027ms, adjusted trend per interval is 141.5117ms

QPC check after 328ms with overflow 15.8965ms, adjusted trend per interval is 96.9301ms
QPC check after 18ms with overflow 0.9308ms, adjusted trend per interval is 102.8896ms

This log is produced by the TimeStamp - operator.

|check after| is GTC of the first stamp minus GTC of the second stamp, i.e. the actual duration counted from GTC values.

|overflow| is how much absolute value of skew difference went over 4*15.6ms.  The 4*15.6ms threshold is something I found out empirically in the past.  64ms, i.e. about 4 frames in 60fps video.

|adjusted trend per interval| is just normalization of |overflow| as if the check has been made after 2000ms.  I.e. for the first case numbers: 96.9301ms = (2000ms / 328ms) * 15.8965ms.

Accepted trend is 50ms and when checked in less then 2000ms -> disable QPC.

The first case (actually all first 4 cases) is interesting:
- duration 109ms
- actual overflow = 64+38 = 102ms
- based on that, it looks like QPC didn't either move forward at all or moved two times faster (the former is more probable)

It really shows there is something wrong with it and we should disable it.
Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
From your blog post: http://www.janbambas.cz/usage-of-queryperformancecounter-in-gecko/ (good post btw ;) ):
> - QueryPerformanceCounter may jump against expected monotonicity
> when a thread switches a processor core (happens quite often,
> by the way) ; solution can be sticking one’s thread to a single core.
> This, though, happens on minority of systems like 64-bit processors
> running 64-bit Windows XP.
> - QueryPerformanceCounter may not exactly reflect time while a machine
> has been suspended ; or simply may not increase as expected.

And probably also: some platforms/HW actually limit us to 15.6ms resolution (or some other fixed arbitrary value)?

Are these the only noise origins we know of? How about we try to classify the noise we're getting (at the code), so we should only fall back to the totally generic detection and compensation path when we can't recognize the source of the discrepancy? (a totally generic one would, of course, be great, when it's good enough)

E.g. we can probably detect resume from suspend, maybe also core-switch, etc. It should also probably be possible to detect that the best resolution we're seeing is around 15.6ms. Maybe even issue a 200ms test at a thread when starting the browser to really see what can we get from this system.

Could there be other noise origins which pollute our samples? event queue load, for instance? (just guessing, got no actual reason to suspect it).

Bottom line, instead of applying heuristics to the generic approach, it might prove more valuable to apply heuristics (or actual detection) to figure out the source of the discrepancy, and then just compensate as appropriate for that specific noise source.
Depends on: 749894
No longer depends on: 749894
(In reply to Avi Halachmi (:avih) from comment #2)
> From your blog post:
> http://www.janbambas.cz/usage-of-queryperformancecounter-in-gecko/ (good
> post btw ;) ):
> > - QueryPerformanceCounter may jump against expected monotonicity
> > when a thread switches a processor core (happens quite often,
> > by the way) ; solution can be sticking one’s thread to a single core.
> > This, though, happens on minority of systems like 64-bit processors
> > running 64-bit Windows XP.
> > - QueryPerformanceCounter may not exactly reflect time while a machine
> > has been suspended ; or simply may not increase as expected.
> 
> And probably also: some platforms/HW actually limit us to 15.6ms resolution
> (or some other fixed arbitrary value)?

I've never heard about this, but I don't say it's not possible.  However, I would more expect the QueryPerformanceFrequency to fail in that case.  That bypasses QPC completely since it's not available on the platform at all.

> 
> Are these the only noise origins we know of? How about we try to classify
> the noise we're getting (at the code), so we should only fall back to the
> totally generic detection and compensation path when we can't recognize the
> source of the discrepancy? (a totally generic one would, of course, be
> great, when it's good enough)

My dream :) not enough data to make it real tho.  IMO, we should establish a real-example boundary we are able to accept.  When too many leaps out of that boundary has been detected, only then say QPC is shitty.

> 
> E.g. we can probably detect resume from suspend, maybe also core-switch,
> etc. 

Crazy step back, sorry, I have an experience with this.

Core switching API horribly slow.  I was experimenting with having a skew array for all cores.  The current code abandoned the skew technique completely.

The reason is that wake detection has some huge latency after the actual wake up.  It's an async main thread event.  The current code solves this just by not using longer duration based on GTC when comparing two stamps to disable QPC globally.  This is perfect and simple solution to the wake up issue.  And, this bounds our upper limit for the "too long duration for QPC turn-off decision" duration.  Sleeping a laptop for less then 3 seconds is a reasonably working use case (I sometimes do it to workaround some mouse driver bugs on my ancient lenovo).

> It should also probably be possible to detect that the best resolution
> we're seeing is around 15.6ms. Maybe even issue a 200ms test at a thread
> when starting the browser to really see what can we get from this system.

Hmmm... possible, but wastes resources.  I don't believe personally that QPC may give such bad resolution.  However, some testing code and telemetry is in place here.  I wanted to do this a long time ago anyway.

> 
> Could there be other noise origins which pollute our samples? event queue
> load, for instance? (just guessing, got no actual reason to suspect it).

Hard to say.  I don't know windows thread scheduling that well.

> 
> Bottom line, instead of applying heuristics to the generic approach, it
> might prove more valuable to apply heuristics (or actual detection) to
> figure out the source of the discrepancy, and then just compensate as
> appropriate for that specific noise source.

Can you be more specific?  I on the other hand want to keep the code simple and don't waste (for sure startup) resources to detect bad QPC.  IMO it self-detects during use.
Attached patch v1Splinter Review
Not sure who should review this.

Changes to the fault detection algorithm:
- there are now 3 meaningful (more human) parameters to set:
  - fault tolerance interval: for how long tops we accept error(s) of QPC to happen
  - failure count during tolerance interval: how many errors during the interval 
  - failure threshold: difference of duration measured using QPC and duration measured using GTC we consider as a failure 

- initially I've set those up as 4 errors per 5 seconds, threshold 50ms
- there is still a basic threshold we must allow because of GTC very low resolution and laggy behavior (its ceil(4 * 15.6) = 64ms) ; this has been just slightly adjusted in this patch
- QPC can be disabled only when failures are found for timestamps that are less then 2 seconds each from other (engaged because of standby and general QPC vs GTC inaccuracy during long runs)
Attachment #724728 - Flags: review?(tellrob)
Attachment #724728 - Flags: review?(avihpit)
I think we should still try to understand where do the errors come from, especially on cases other than resume from suspend.

(In reply to Honza Bambas (:mayhemer) from comment #3)
> Can you be more specific?  I on the other hand want to keep the code simple
> and don't waste (for sure startup) resources to detect bad QPC.  IMO it
> self-detects during use.

TBH, I was mostly throwing out ideas without concrete knowledge to back them up, hoping that with your domain knowledge and experience you might be able to pick up something useful.

The only direction which I do actually suspect is event queue load. There are evidence which suggest that the timer filter sometimes uses inaccurate timestamp inputs for its compensation mechanism, which seems to happen more when the event queue is loaded, or right after it got held up by a long handler.

I don't know if this happens because Timestamp::Now() is inaccurate, or because it reads the timestamp too late.

If it's because Timestamp::Now() is inaccurate, then the event queue hypothesis might explain the inaccurate readings which may lead to disabling QPC. This is also somewhat supported by the fact that QPC gets disabled more on debug builds than on opt builds, as debug builds are likely to be slower, and thus loaded event queue would cause bigger reading errors than on OPT builds.

Another possible cause which I don't recall mentioned is the CPU frequency (multiplier/throttling), which typically changes quite a lot, and even every few ms.

Knowing the cause won't necessarily change the solution, but at least we might be able to know better what we're fighting with.
Comment on attachment 724728 [details] [diff] [review]
v1

I think review from Rob is enough on this patch.
Attachment #724728 - Flags: review?(avihpit)
Depends on: 850846
FYI, jrmuizel added logs of QPC, GTC, and skew (=GTC-QPC) values, and collected them from a single M2 run on windows 7 and XP:

Try build: https://tbpl.mozilla.org/?tree=Try&rev=c3b64b565a3a
Patch: https://hg.mozilla.org/try/rev/c3b64b565a3a

Log of that M2 on XP: https://tbpl.mozilla.org/php/getParsedLog.php?id=20634175&tree=Try&full=1 (truncated at 50M bytes).

Graph of skew/number-of-calls-to-Timestamp::Now() on the windows 7 run of that M2: http://people.mozilla.com/~jmuizelaar/win7.png

Same as the above for the XP-opt run: http://people.mozilla.com/~jmuizelaar/xp.png

Different view of the XP data: http://people.mozilla.com/~jmuizelaar/gtc.png

Notes:
- For the XP run, there seems to be an extremely high correlation between number of calls to Now() and the drift. I can conclude from that that either:
  a. Calls to Now() affect the drift very predictably on average. Or...
  b. That M2 run calls Now() at an extremely steady average rate, and thus steadily samples the changing drift.

- |b.| could be the result of profiler samples (even if they're actually no-op). If it's not |b|, then I find it hard to believe that a random run generates such a steady average-intervals of Now() calls to explains the high correlation to the drift, and I can't think a reason other than |a|.

- QPC seems to occasionally toggle between 0 and some proper value. Skew follows with toggle between big positive/negative values.

- If ignoring the samples where QPC is 0, the QPC value appears to be always strictly increasing (I didn't try to find jumps backward).
(In reply to Avi Halachmi (:avih) from comment #8)
> FYI, jrmuizel added logs of QPC, GTC, and skew (=GTC-QPC) values, and
> collected them from a single M2 run on windows 7 and XP:
> 
> Try build: https://tbpl.mozilla.org/?tree=Try&rev=c3b64b565a3a
> Patch: https://hg.mozilla.org/try/rev/c3b64b565a3a
> 
> Log of that M2 on XP:
> https://tbpl.mozilla.org/php/getParsedLog.php?id=20634175&tree=Try&full=1
> (truncated at 50M bytes).
> 
> Graph of skew/number-of-calls-to-Timestamp::Now() on the windows 7 run of
> that M2: http://people.mozilla.com/~jmuizelaar/win7.png
> 
> Same as the above for the XP-opt run:
> http://people.mozilla.com/~jmuizelaar/xp.png
> 
> Different view of the XP data: http://people.mozilla.com/~jmuizelaar/gtc.png
> 
> Notes:
> - For the XP run, there seems to be an extremely high correlation between
> number of calls to Now() and the drift. I can conclude from that that either:

I'd expect this is because different clock sources are being used and we call
TimeStamp::Now() a lot. The drift shouldn't be a problem as the current code should handle it. The mystery is the cause of the big changes that we see.

> - QPC seems to occasionally toggle between 0 and some proper value. Skew
> follows with toggle between big positive/negative values.

This is when we do low-res time stamps.

> - If ignoring the samples where QPC is 0, the QPC value appears to be always
> strictly increasing (I didn't try to find jumps backward).

This means QPC is mostly working
Some more interesting facts about the xp graph:

1. The skew is jumping in 7.5ms (GTC interval/2) amounts. I don't have a good explanation for this behaviour.

2. We are using the ACPI PM timer (PMT) for QPC
Comment on attachment 724728 [details] [diff] [review]
v1

I like the approach and if your testing shows that it works well in practice then it looks good to me. I didn't take a close look at all the code but I trust it's all correct.
Attachment #724728 - Flags: review?(tellrob) → review+
I forgot to answer, sorry:

(In reply to Avi Halachmi (:avih) from comment #6)
> I think we should still try to understand where do the errors come from,
> especially on cases other than resume from suspend.

QPC is a hardware based counter that differs with each processor and chipset, it's simply known to be faulty in some cases.  Those are bugs in processors.

> If it's because Timestamp::Now() is inaccurate, then the event queue
> hypothesis might explain the inaccurate readings which may lead to disabling
> QPC. 

I'm not sure I follow.

> This is also somewhat supported by the fact that QPC gets disabled more
> on debug builds than on opt builds, as debug builds are likely to be slower,
> and thus loaded event queue would cause bigger reading errors than on OPT
> builds.

I may try to investigate how much the values differ in debug build and opt builds.  This is also dependent on how "close" GTC and QPC are called.  If a debug code involves any context switches, it may be more vulnerable.

>   a. Calls to Now() affect the drift very predictably on average. Or...

By drift you mean how the skew differs in time?  Skew was changing only 

> - QPC seems to occasionally toggle between 0 and some proper value. Skew
> follows with toggle between big positive/negative values.

This is interesting, but I'd like to see the code you used to discover this.
(In reply to Honza Bambas (:mayhemer) from comment #12)
> By drift you mean how the skew differs in time?  Skew was changing only 

Hit enter too soon..

Skew was changing only occasionally in the old code - every 4 or 5 seconds IIRC.  In the new code, there is no concept of keeping a skew value that needs to be maintained.

If you mean how GTC and QPC differ during the time, it's observable on any machine.  The skew change in time is small, but it is always there and I don't expect that for long times (several hours) skew would be stable.
(In reply to Rob Arnold [:robarnold] from comment #11)
> Comment on attachment 724728 [details] [diff] [review]
> v1
> 
> I like the approach and if your testing shows that it works well in practice
> then it looks good to me. I didn't take a close look at all the code but I
> trust it's all correct.

Thanks for the review.  I was checking the logic by tempering the frequency conversion logic to make a "feel" QPC is not accurate.  The logs showed it works well.
Comment on attachment 724728 [details] [diff] [review]
v1

Benjamin, last time I've committed changes to this code w/o your attention.  Do you want to check on this?
Attachment #724728 - Flags: superreview?(benjamin)
Comment on attachment 724728 [details] [diff] [review]
v1

I'm going to trust robarnold on this one. You break it you own it! ;-)
Attachment #724728 - Flags: superreview?(benjamin) → superreview+
https://hg.mozilla.org/mozilla-central/rev/83be5398e907
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Blocks: 873110
See Also: → 880289
You need to log in before you can comment on or make changes to this bug.