Closed Bug 822490 Opened 7 years ago Closed 7 years ago

Radically simplify windows TimeStamp implementation

Categories

(Core :: XPCOM, defect)

All
Windows XP
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla21

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Depends on 1 open bug)

Details

Attachments

(1 file, 3 obsolete files)

After analyzing bug 765215 I found out that the calibration code is too inaccurate, fragile and needs a lot of synchronization, flags and is very non-deterministic.  


I want to use the fact we never directly use TimeStamp::Now() value but we always differ two timestamps (epoch and now).  Based on that I want TimeStamp::Now() remember both QPC and GTC values.  When we count the difference (override for the - operator) I want to do the following:

- compare skew (=QPC-GTC) of both timestamps
- if the difference of the two skews falls under some threshold, use QPCa - QPCb as a result for TimeDuration
- if not, use GTCa - GTCb as a result
- also, if the skew differs too much but we've recorded both stamps in some short time each after other (say 4 seconds) then switch a flag to stop using QPC at all since it is not reliable

Simple and perfectly working w/o any locks at all, handling also the wake up issue.

I will leave TSC and WIN64 api for GetTickCount as an optimization.


-----------------------------------------------

The main reason to transit this way is a big issue after waking up when GTC and QPC skew may radically change and we may jump in the TimeStamp::Now() result back.  To explain more in detail:

- we record a skew, QPC and GTC are stable
- we sleep the machine
- we wake up the machine after say 2 minutes
- we access TimeStamp::Now() which detects that QPC and GTC are too far each from other -> falls back to use GTC + skew ; the skew determined previously
- we detect wake up and force recalibration
- we change the skew to a radically different number and start using QPC again for the result
=> but the different between GTC + previous skew and QPC is going back too much, so we for a long time (may be hours!) return the same value from TimeStamp::Now() that is doing simply |return MAX(now, lastResult);|.  lastResult is the last GTC + skew while QPC is much smaller number then |GTC + skew|, since QPC has been reset or simply jumped far from its expected monotonicity
I think this is a good idea.
Blocks: 827287
Attached patch v1 (obsolete) — Splinter Review
- implemented according the description
- when TSC is unstable, we use QPC values when two stamps are up to 2 seconds each from other (according GTC) ; for longer times hi precision is not needed and GTC is used
- when two time stamps have been taken within a two seconds interval, we do skew difference check as in the current implementation ; if out of bounds, QPC is disabled just by flipping a flag
- this approach doesn't need any special handling of suspend/wakeup (if shorter then 2 seconds, which in 99.99% cases is) since potentially unstable QPC is used and checked only for shorter intervals
Attachment #706167 - Flags: review?(tellrob)
Attached patch v1.1 (obsolete) — Splinter Review
This one is better (fixed hard failure interval calculation, between else).

Try: https://tbpl.mozilla.org/?tree=Try&rev=c4626c0482db

- implemented according the description
- we do QPC/GTC skew difference check when subtracting two time stamps
- when out of bound, GTC is used
- when out of bound AND the two time stamps have been taken within a two seconds interval, QPC is completely disabled (just by flipping a flag)
- this approach doesn't need any special handling of suspend/wakeup since GTC will be preferred over potentially unstable QPC automatically
Attachment #706167 - Attachment is obsolete: true
Attachment #706167 - Flags: review?(tellrob)
Attachment #706202 - Flags: review?(tellrob)
Comment on attachment 706202 [details] [diff] [review]
v1.1

Review of attachment 706202 [details] [diff] [review]:
-----------------------------------------------------------------

I have some questions about this but I like the idea (I think this is what Microsoft wanted us to do originally when we started using QPC).

"I love this plan! I'm excited to be a part of it! LET'S DO IT!"

::: xpcom/ds/TimeStamp.h
@@ +159,5 @@
>  
>      return TimeDuration::FromTicks(int64_t(aTicks));
>    }
>  
> +  // Duration, result is implementation-specific difference of two TimeStamps

Hmm, someone else should probably sign off on this change (sr time?)

::: xpcom/ds/TimeStamp_windows.cpp
@@ +61,5 @@
> +static const uint32_t kQPCHardFailureDetectionInterval = 2000;
> +
> +// On every use of QPC values we check the overflow of skew difference of the
> +// two stamps doesn't go over this number of milliseconds.  Both timer
> +// functions jitter so we have to have some limit.  The value is based on tests.

Which tests? Can you attach them to the bug?

@@ +133,5 @@
> +// ----------------------------------------------------------------------------
> +
> +// Initially true, set to false when QPC is found unstable and never
> +// returns back to true since that time.
> +bool volatile sUseQPC = true;

These two should be static, right?

@@ +248,5 @@
>    // jump by 32 ms in two steps, therefor use the ceiling value.
>    sUnderrunThreshold =
>      LONGLONG((-2) * ticksPerGetTickCountResolutionCeiling);
>  
> +  // QPC should go no further then 2 * GTC resolution.

nit (while you're here...): s/then/than

@@ +410,5 @@
> +  if (aDuration < sQPCHardFailureDetectionInterval) {
> +    // Interval between the two time stamps is very short, consider
> +    // QPC as unstable and disable it completely.
> +    sUseQPC = false;
> +    LOG(("TimeStamp: QPC disabled"));

So if we ever compare two timestamps that are too close together, QPC gets turned off for everyone?
(In reply to Rob Arnold [:robarnold] from comment #4)
> Comment on attachment 706202 [details] [diff] [review]
> v1.1
> 
> Review of attachment 706202 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I have some questions about this but I like the idea (I think this is what
> Microsoft wanted us to do originally when we started using QPC).
> 
> "I love this plan! I'm excited to be a part of it! LET'S DO IT!"

Wish I remember where the idea came to me :)

> 
> ::: xpcom/ds/TimeStamp.h
> @@ +159,5 @@
> >  
> >      return TimeDuration::FromTicks(int64_t(aTicks));
> >    }
> >  
> > +  // Duration, result is implementation-specific difference of two TimeStamps
> 
> Hmm, someone else should probably sign off on this change (sr time?)

I'm just changing the comment, but the value has already been like this.

> 
> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +61,5 @@
> > +static const uint32_t kQPCHardFailureDetectionInterval = 2000;
> > +
> > +// On every use of QPC values we check the overflow of skew difference of the
> > +// two stamps doesn't go over this number of milliseconds.  Both timer
> > +// functions jitter so we have to have some limit.  The value is based on tests.
> 
> Which tests? Can you attach them to the bug?

Unfortunatelly not.  Those were just little console programs exercising the GTC and QPC APIs on my old intel based winxp machine.  It was very empiric to watch what happens, and the numbers are still more guesses then exact numbers.  However, in real usage it showed up to be more or less tuned.  Also, I no longer posses that xp machine to test with unfortunately.

> @@ +410,5 @@
> > +  if (aDuration < sQPCHardFailureDetectionInterval) {
> > +    // Interval between the two time stamps is very short, consider
> > +    // QPC as unstable and disable it completely.
> > +    sUseQPC = false;
> > +    LOG(("TimeStamp: QPC disabled"));
> 
> So if we ever compare two timestamps that are too close together, QPC gets
> turned off for everyone?

No.  If the difference goes too far over the basic limit *and* we are under 2 seconds duration of the two comparing time stamps, then we disable QPC completely.

The calculation is following:  

SKEW1 = QPC1 - GTC1
SKEW2 = QPC2 - GTC2
DIFF = SKEW1 - SKEW2 // can jitter since GTC increases in 15.6 ms intervals (usually)
IS_POTENTIALLY_OVERFLOWING = DIFF not in [-32, 32] // 32 is simplified value of default GTC time increment * 2
OVERFLOW = ABS(DIFF) - 32  **
IS_OVERFLOWING = OVERFLOW > 100ms

if (IS_OVERFLOWING && ABS(GTC1 - GTC2) < 2s) -> bye bye QPC

 **  Before this line DIFF is linearly adjusted as the two time stamps would be taken with 2 seconds delay each after other.  Thus, when you compare two stamps one taken at 10000ms and other at 30000ms, and DIFF is, say, 250ms, we do:  DIFF = (2000 / (30000 - 10000)) * DIFF , giving DIFF = 25ms.  The 100ms overflow per 2000ms is something like giving 5% tolerance to QPC jitter.


Except some issues with overflow calculation (that can for large numbers internally overflow and potentially give wrong numbers, according my logging - should be fixed even in the current version) this works well.

However, I'm thinking of some penalty system here while also tightening the limit to some 2% only.  Say 3 problems in 10 seconds mean to stop using QPC, but here I would guess since I don't have a machine to test with.

Also, usually problems with QPC are on exotic systems like 64bit winxp on AMD processors.  On my win7/full intel machine QPC works well.  On my old intel core laptop with win8 it has usual wakeup issues, that this code perfectly detects.
(In reply to Rob Arnold [:robarnold] from comment #4)
> Comment on attachment 706202 [details] [diff] [review]
> v1.1
> > +  // Duration, result is implementation-specific difference of two TimeStamps
> 
> Hmm, someone else should probably sign off on this change (sr time?)

As I'm thinking of it...
1) the comment was misplaced all the time
2) this is just an internal variable that hold the time in "some" form/units

The important thing is that ToSeconds() will convert this internal value to represent actual seconds.  What mValue holds doesn't at all matter.


Also, I'm not sure I really need to define my own copy constructor and the assignment operator.  What do you think?  Implicit will do the same I believe.
Comment on attachment 706202 [details] [diff] [review]
v1.1

Review of attachment 706202 [details] [diff] [review]:
-----------------------------------------------------------------

I'd prefer to see a more lenient way of disabling QPC but that could be in a separate patch. I worry that since arithmetic is allowed on timestamps, it is possible to erroneously trigger the heuristic.

::: xpcom/ds/TimeStamp_windows.h
@@ +27,5 @@
> +  bool CheckQPC(int64_t aDuration, const TimeStampValue &aOther) const;
> +
> +public:
> +  struct _SomethingVeryRandomHere;
> +  TimeStampValue(_SomethingVeryRandomHere* nullValue);

I should have asked about this...some new Mozilla convention I'm unaware of?

@@ +28,5 @@
> +
> +public:
> +  struct _SomethingVeryRandomHere;
> +  TimeStampValue(_SomethingVeryRandomHere* nullValue);
> +  TimeStampValue(const TimeStampValue &aOther);

Yeah, your implementation here is the same as the compiler's.
Attachment #706202 - Flags: review?(tellrob) → review+
(In reply to Rob Arnold [:robarnold] from comment #7)
> Comment on attachment 706202 [details] [diff] [review]
> v1.1
> 
> Review of attachment 706202 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'd prefer to see a more lenient way of disabling QPC but that could be in a
> separate patch. I worry that since arithmetic is allowed on timestamps, it
> is possible to erroneously trigger the heuristic.

I'll file a bug.  Landing this is similar to what we have now.

> 
> ::: xpcom/ds/TimeStamp_windows.h
> @@ +27,5 @@
> > +  bool CheckQPC(int64_t aDuration, const TimeStampValue &aOther) const;
> > +
> > +public:
> > +  struct _SomethingVeryRandomHere;
> > +  TimeStampValue(_SomethingVeryRandomHere* nullValue);
> 
> I should have asked about this...some new Mozilla convention I'm unaware of?

This is copied from TimeDuration constructor.

> 
> @@ +28,5 @@
> > +
> > +public:
> > +  struct _SomethingVeryRandomHere;
> > +  TimeStampValue(_SomethingVeryRandomHere* nullValue);
> > +  TimeStampValue(const TimeStampValue &aOther);
> 
> Yeah, your implementation here is the same as the compiler's.

Removed in the next patch version.

Thanks.
Attached patch v1.2 (obsolete) — Splinter Review
- appropriate globals turned to static
- using implicit copy ctor and = op for TimeStampValue
- using std::abs where appropriate
- fixed log (s/%d/%llu/) that falsely led me to think of an arithmetic bug

All over the review comments are nits, so I just carrying forward r+.

I'll be testing this a short time on my 2 core old machine to check logs and other stuff works.
Attachment #706202 - Attachment is obsolete: true
Attachment #708714 - Flags: review+
Blocks: 836869
(In reply to Honza Bambas (:mayhemer) from comment #9)
> I'll be testing this a short time on my 2 core old machine to check logs and
> other stuff works.

The algorithm as is now may be too sensitive.  I got some jitter that went 10ms over the threshold for two stamps just 14ms each from other.  This cannot happen in the current code, so we would regress.  I will have to fix bug 836869 inside this one, or just make thresholds more benevolent before landing.
Attached patch 1.3Splinter Review
- raise of the tolerance threshold from 32ms to 64ms +- ; actually, the timeIncrement * 4
- added few more logs for TSC, timeIncrement and QPC frequecy

I'm not sure whether this even needs a rereview...
Attachment #708714 - Attachment is obsolete: true
Attachment #711958 - Flags: review?(tellrob)
Comment on attachment 711958 [details] [diff] [review]
1.3

Review of attachment 711958 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't look at the interdiff but it seems fine based on what you've said.
Attachment #711958 - Flags: review?(tellrob) → review+
https://hg.mozilla.org/mozilla-central/rev/604e819f63d7
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Depends on: 840911
I landed a fixup for GCC that requires cstdlib included for std::abs:

https://hg.mozilla.org/integration/mozilla-inbound/rev/d3e16515f630
(In reply to Philip Chee from comment #16)
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/d3e16515f630
> I already have a reviewed patch in Bug 840911 :P

I didn't notice it, sorry about that.
Mochitest 3 test "performance_now" seems to have increased frequency of intermittent failures, possibly as a result of this bug. Some analysis on bug 749894 suggests that on the try servers, we're getting "QPC Disabled" at about 25% of the tests (which can sometimes fail the test, but not always).

Need to decide if the "QPC disabled" case should pass of fail that test, and adjust the test accordingly.
Depends on: 749894
(In reply to Avi Halachmi (:avih) from comment #19)
> ... Some analysis on
> bug 749894 suggests that on the try servers, we're getting "QPC Disabled" at
> about 25% of the tests ...

On windows XP, with higher percentage at XP-debug builds (than XP-opt).
Although I think this patch looks fine, I was surprised to find it when reading TimeStamp hg logs. In the future, please ask an XPCOM peer for review or at least ask us to delegate the review.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #21)
> Although I think this patch looks fine, I was surprised to find it when
> reading TimeStamp hg logs. In the future, please ask an XPCOM peer for
> review or at least ask us to delegate the review.

Ups, didn't cross my mind.  Do you want to check on the patch now and make me some changes?
Nono, like I said, the patch looks fine. I just wish I had been aware of it ;-)
You need to log in before you can comment on or make changes to this bug.