Last Comment Bug 676349 - Implement high-resolution platform timers for the Windows platform
: Implement high-resolution platform timers for the Windows platform
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: XPCOM (show other bugs)
: Other Branch
: All Windows 7
: -- enhancement with 2 votes (vote)
: mozilla11
Assigned To: Honza Bambas (:mayhemer)
:
:
Mentors:
Depends on: 765215
Blocks: 707776 707781 707782 707783 640796 673105
  Show dependency treegraph
 
Reported: 2011-08-03 12:28 PDT by Honza Bambas (:mayhemer)
Modified: 2012-12-22 06:54 PST (History)
28 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
WIP v0.1 (13.66 KB, patch)
2011-08-17 11:15 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v0.9 (15.66 KB, patch)
2011-08-24 18:10 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v0.9 (17.88 KB, patch)
2011-08-29 05:35 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v1 (22.76 KB, patch)
2011-09-03 07:09 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v1.1 (40.62 KB, patch)
2011-09-11 08:50 PDT, Honza Bambas (:mayhemer)
tellrob: review-
Details | Diff | Splinter Review
v2 - simplified (12.22 KB, patch)
2011-09-29 08:50 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v2.WIP1 (15.91 KB, patch)
2011-09-29 16:14 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v3 (19.49 KB, patch)
2011-10-31 09:46 PDT, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v3.1 (20.14 KB, patch)
2011-11-22 11:28 PST, Honza Bambas (:mayhemer)
no flags Details | Diff | Splinter Review
v3.2 (21.42 KB, patch)
2011-11-29 13:56 PST, Honza Bambas (:mayhemer)
tellrob: review+
honzab.moz: checkin+
Details | Diff | Splinter Review

Description Honza Bambas (:mayhemer) 2011-08-03 12:28:14 PDT
As described.

Carrying out comments from the original bug, stolen for Mac OSX ;)

Rob Arnold:
Some platforms have buggy QPC implementations. Correctness may not be guaranteed across suspend/hibernate as well, even on Vista. Have you looked at the implementation of PRMJ_Now or Chromium's TimeStamp implementation?

This (QPC) approach doesn't address the monotonicity requirement. PRMJ_Now doesn't either. I think what you want is something that uses the same core as PRMJ_Now (the self-correcting/calibration) but w/o the truncation to microseconds and conversion to an actual date. For the clock-skew detection, you could use the multimedia timer since that is monotonic (I think). PRMJ_Now also should work with using the multimedia timer as its guide so you might want to rewrite it to use the MM timer (perhaps even sharing the same calibration data as TimeStamp::Now so that they're more consistent).


Christian :Biesinger
Shouldn't this live in a TimeStamp_windows.cpp file, similar to TimeStamp_posix.cpp?
Comment 1 Honza Bambas (:mayhemer) 2011-08-14 17:06:06 PDT
Summary of timers API on the Windows platform:

- DWORD GetTickCount():
  - proven as stable (no leaps forward or backward, we currently use it to implement PR_GetTickCount())
  - resolution ~15.6 ms, or precisely TimeIncrement of GetSystemTimeAdjustment (regardless the TimeAdjustmentDisabled value)
  - not sensitive to timeBeginPeriod setting on all platforms
  - cheap to call


- "LONGLONG" getSystemTimeAsFileTime()
  - resolution depends on the timeBeginPeriod call:
    - on Windows 7 (no sure about Vista) it is sensitive to this setting and may give up to 1ms resolution
    - on Windows XP: no affect, still ~15.6 ms
  - cheap to call, same or bit worse then GetTickCount


- DWORD timeGetTime()
  - gives the same resolution as getSystemTime* functions, also leaps at the same moment
  - on all platforms (XP,7) is sensitive to timeBeginPeriod setting
  - is not that cheap to call


- "LONGLONG" QueryPerformanceCounter()
  - always gives 1ms resolution, it is the best resolution timer available on Windows
  - tests show it might peek every 32 leaps of GetTickCount by 15ms on AMD Athlon 64 when running Process Explorer
  - reported to be able to leap in any direction on MP systems with older Windows platforms because of not correct sync between cores/processors when a thread changes the core to run on: http://support.microsoft.com/kb/895980 ; can be worked around on Windows XP by setting thread affinity to only a single core/processor when calling QueryPerformanceCounter
  - reported to be able to "leap" forward on some platforms: http://support.microsoft.com/kb/274323
  - reported to be able to peek on Athlon X2 processors: http://www.gamedev.net/topic/370281-queryperformancecounter-athlonx2/
  - it is very expensive to call on my two WinXP machines, not that much on i7 with Win7


- timeBeginPeriod/timeEndPeriod()
  - changes the global system setting
  - when called with '1' as an argument, it changes the context switch scheduling to happen every 1 ms
  - this makes WaitFor* and some of the functions above work with 1 ms resolution
  - tests show the call doesn't have immediate affect
  - has bad impact on performance
  - has bad impact on battery life
  - has bad impact on power saving
  - in general it is not recommended to switch to 1ms period for longer times then few/tens of milliseconds ; should be used only for dedicated apps that play media or games, or only for shorts moments when really necessary


The Chromium implementation:

- class TimeTicks is mirror of our TimeStamp class
- method TimeTicks::HighResNow(): /* expensive to call */, using QPC if available and not on a platform considered faulty (currently AMD family 15), otherwise fall back to use of timeGetTime
- method TimeTicks::Now() using timeGetTime, with 15.6ms resolution when not configured by timeBeginPeriod
- when a delayed task (as I understand, it is a timer schedule) is queued and the delay is in 2*15.6ms, then for a short time they switch to timeBeginPeriod(1)

- class Time working with the wall clock time
- method Time::Now(): based on GetSystemTimeAsFileTime + relative TimeTicks::Now() (not HighResNow()), resynchronized every minute; that allows 1ms precise wall clock time when set by timeBeginPeriod(1) on all platforms


My plan:

I would like to use GetTickCount to check if QueryPerformanceCounter works well and not use timeBeginPeriod.  Last QPC value should be kept per processor, however information about current processor is available only on Win7 and there seems not to be problems with multiple cores/processors sync (I have to double check the latter).  Initially I want to record the difference between GTC and QPC for each processor (during first query or initially in a background thread) and watch for a difference of QPC from GTC on each following query.  If found larger then 15.6 + 1.5 ms, adjust the difference by the currently detected difference - 15.6 ms.  This way if QPC just goes in different speed from GTC we will never have edge of more then 2 ms and keep sync.  If the difference is large because of QPC leap then we sync this way back to at least 15.6 ms precision quickly but of course loose 1 ms resolution.

I currently play with a little test program that tries to prove the concept.  It still needs some adjustments, but mostly does what I expect.
Comment 2 Rob Arnold [:robarnold] 2011-08-14 17:28:48 PDT
That looks like a good writeup of the various timing functions. Two things to note: GetTickCount is much faster than GetSystemTimeAsFileTime because it reads from a shared page of memory whereas the latter is a system call. GetTickCount also wraps around at around 49 days. You can use GetTickCount64 but that's not available on XP.

What you want to do sounds pretty similar to I did in PRMJ_Now. How do you intend to handle timezone changes or the user manually adjusting their clock?
Comment 3 Honza Bambas (:mayhemer) 2011-08-15 06:36:02 PDT
(In reply to Rob Arnold [:robarnold] from comment #2)
> That looks like a good writeup of the various timing functions. Two things
> to note: GetTickCount is much faster than GetSystemTimeAsFileTime because it
> reads from a shared page of memory whereas the latter is a system call.

Good to know.  Tests prove that.

> GetTickCount also wraps around at around 49 days. You can use GetTickCount64
> but that's not available on XP.

I think it is OK to just check for GTC go backwards and add 1 << 32 to a "hi part" of the result and convert that way to 64 bits.  That is what Chromium does two with their RolloverProtectedNow.  I don't think we will call our TimeStamp::Now() or ::HiResNow() - I would suggest to have this method too - less then every 49 days to detect it.

> 
> What you want to do sounds pretty similar to I did in PRMJ_Now. How do you
> intend to handle timezone changes or the user manually adjusting their clock?

GetTickCount and QueryPerformanceCounter are both immune to it.  The goal of this bug is to have provider for ticks, not for wall clock time.
Comment 4 Rob Arnold [:robarnold] 2011-08-15 11:05:33 PDT
(In reply to Honza Bambas (:mayhemer) from comment #3)
> > 
> > What you want to do sounds pretty similar to I did in PRMJ_Now. How do you
> > intend to handle timezone changes or the user manually adjusting their clock?
> 
> GetTickCount and QueryPerformanceCounter are both immune to it.  The goal of
> this bug is to have provider for ticks, not for wall clock time.

I mistakenly thought part of the goal of this work was to synchronize Date.now with the native code timing functions. This makes your job much easier then :)
Comment 5 Honza Bambas (:mayhemer) 2011-08-17 11:15:44 PDT
Created attachment 553846 [details] [diff] [review]
WIP v0.1

This is a WIP patch turning TimeStamp::Now() be based on QueryPerformanceCounter (QPC).

- as a calibrator/check I use GetTickCount (GTC) that is cheap to call, stable and independent on time*Period setting
- QPC result is adjusted to match GTC within GTC's resolution
- at the start or after a larger deviation from GTC we recalibrate ; it is an expensive up to 15.6ms operation that determines exact distance (skew) between GTC and QPC
- QPC skew (and also the calibration) is kept (done) per-processor to avoid QPC leaps when switching between different processors or cores with different values of RDT (being out of sync) ; this is questionable, though
- the distribution of my adjusted-QPC and GTC values are mostly perfect, except the distance 0.0 - 0.6 ms where I experience some drop ; I want to make this a bit more better
- I unfortunately didn't get results of Talos from my try run, so I don't know how large impact using QPC for every TimeStamp::Now() call has on the performance ; if we figure out there is a large regression we may want to have a new HiResNow() method
Comment 6 Jeff Muizelaar [:jrmuizel] 2011-08-17 11:41:00 PDT
(In reply to Honza Bambas (:mayhemer) from comment #5)
> Created attachment 553846 [details] [diff] [review]

It would be nice if we didn't have to pay the cost of SetThreadAffinityMask on every system. Having to always call QueryPerformanceCounter on the same processor takes away one of the big advantages of QueryPerformanceCounter over rdtsc.
Comment 7 Honza Bambas (:mayhemer) 2011-08-17 11:55:56 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> (In reply to Honza Bambas (:mayhemer) from comment #5)
> > Created attachment 553846 [details] [diff] [review]
> 
> It would be nice if we didn't have to pay the cost of SetThreadAffinityMask
> on every system. Having to always call QueryPerformanceCounter on the same
> processor takes away one of the big advantages of QueryPerformanceCounter
> over rdtsc.

Yes.  This is one questionable thing in the patch.  But as you can see, I first check on what processor we are and then "lock" to it.  I didn't so far measure how expensive that it, those will be my next steps unless somebody says "affinity and multiple cores is really wrong because ...".  On XP the current processor cannot be determined, though, at least in a cheap way suitable for production.  My patch forces use of core 0 on XP.  We may avoid call to affinity/get-current-processor by determining the number of processors is 0 (there is a function available for it since XP, AFAIK).
Comment 8 Jeff Muizelaar [:jrmuizel] 2011-08-17 12:03:29 PDT
(In reply to Honza Bambas (:mayhemer) from comment #7)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #6)
> > (In reply to Honza Bambas (:mayhemer) from comment #5)
> > > Created attachment 553846 [details] [diff] [review]
> > 
> > It would be nice if we didn't have to pay the cost of SetThreadAffinityMask
> > on every system. Having to always call QueryPerformanceCounter on the same
> > processor takes away one of the big advantages of QueryPerformanceCounter
> > over rdtsc.
> 
> Yes.  This is one questionable thing in the patch.  But as you can see, I
> first check on what processor we are and then "lock" to it.  I didn't so far
> measure how expensive that it, those will be my next steps unless somebody
> says "affinity and multiple cores is really wrong because ...".  On XP the
> current processor cannot be determined, though, at least in a cheap way
> suitable for production.  My patch forces use of core 0 on XP.  We may avoid
> call to affinity/get-current-processor by determining the number of
> processors is 0 (there is a function available for it since XP, AFAIK).

Ah, didn't even realize that you were locking to a particular core. I don't think that's an acceptable thing to do. Forcing FF to run on a particular core seems user hostile, plus we could get into a bad situation where a bunch of our threads end up forced on to the same core. Chrome manages without SetAffinity so I think we should try to as well.
Comment 9 Honza Bambas (:mayhemer) 2011-08-17 12:11:44 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #8)
> Ah, didn't even realize that you were locking to a particular core. 

Note: only for the time of read from QPC, more or less.

> I don't
> think that's an acceptable thing to do. Forcing FF to run on a particular
> core seems user hostile, plus we could get into a bad situation where a
> bunch of our threads end up forced on to the same core. 

That happens for a very short period of time, but I would like to go w/o it too.

> Chrome manages
> without SetAffinity so I think we should try to as well.

They forbid QPC usage for AMD family 15 processors.

Other solution could be to check for the processor before QPC and after, I was doing this with the first tests.  If different, then reload QPC.  Test showed it might happen, but rarely.  Hard to say if a switch there-and-back can happen during QPC call what would get QPC for a different core we believe we have the value from.
Comment 10 Jeff Muizelaar [:jrmuizel] 2011-08-17 20:23:50 PDT
(In reply to Honza Bambas (:mayhemer) from comment #9)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #8)
> > Ah, didn't even realize that you were locking to a particular core. 
> 
> Note: only for the time of read from QPC, more or less.
> 
> > I don't
> > think that's an acceptable thing to do. Forcing FF to run on a particular
> > core seems user hostile, plus we could get into a bad situation where a
> > bunch of our threads end up forced on to the same core. 
> 
> That happens for a very short period of time, but I would like to go w/o it
> too.

I misread the code assuming a zero mask meant run on all the processors. So this isn't as bad as I thought.

> > Chrome manages
> > without SetAffinity so I think we should try to as well.
> 
> They forbid QPC usage for AMD family 15 processors.

Maybe we should only use SetAffinity on those processors. In fact it might be worth having two implementations. One where we trust QPC to give the same result across processors and one where we don't. This would let one of them be much simpler and higher performance.

> Other solution could be to check for the processor before QPC and after, I
> was doing this with the first tests.  If different, then reload QPC.  Test
> showed it might happen, but rarely.  Hard to say if a switch there-and-back
> can happen during QPC call what would get QPC for a different core we
> believe we have the value from.

It seems like this approach could race. i.e. the switch could happen after you check which processor you're on and the call to QPC.
Comment 11 Honza Bambas (:mayhemer) 2011-08-18 13:28:36 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #10)
> One where we trust QPC to give the same
> result across processors and one where we don't. This would let one of them
> be much simpler and higher performance.
> 

Agree.  But how to decide?  And also I don't think the reason they block is just the core sync problem.  IMO it is because QPC may go forward and then back on that processors.  We are protected from this by GetTickCount checks.

> It seems like this approach could race. i.e. the switch could happen after
> you check which processor you're on and the call to QPC.

My idea was:

do
 p1 = GetCurrentProcessor()
 QueryPerformanceCounter()
 p2 = GetCurrentProcessor()
while (p1 != p2)

This of course still might be racy..
Comment 12 Honza Bambas (:mayhemer) 2011-08-24 18:10:40 PDT
Created attachment 555611 [details] [diff] [review]
v0.9

- simplified version w/o the periodical adjustment by small amounts of time
- the limit for re-calibration is to go 32 ms over GTC value or be 15.6*2 under the GTC value at any time
- after a longer time (10s) from the last read of Now() we adjust skew of QPC from GTC to 0, again at any time ; this puts QPC closer back to GTC w/o affecting the high resolution ; I have observed QPC overruns or underruns GTC after a longer period, that is probably caused by not very precise representation of the frequency
- we sync on all cores using threads started for each core at the start ; needed to prevent start up regression caused by usage of Now() early on an uncalibrated core
- we now have to implementations: optimized one for a single processor system, one for multi-core/processor

Please excuse any formal mistakes, my system drive has failed and the patch's not been quit polished.  Now I'm leaving for 4 days, so I rather submit the patch as is to let you start looking at it.

Try push: http://tbpl.allizom.org/?tree=Try&usebuildbot=1&rev=e0db070e33d4
Comment 13 Honza Bambas (:mayhemer) 2011-08-29 05:35:38 PDT
Created attachment 556521 [details] [diff] [review]
v0.9

This is the right patch, last one was not qref'ed.
Comment 14 Honza Bambas (:mayhemer) 2011-09-03 07:09:03 PDT
Created attachment 558062 [details] [diff] [review]
v1

So, I have made more research on performance impact of using SetThreadAffinityMask.

- On Win7, thanks fast implementation of QPC, costs more or less the same time as call to GTC, it regresses performance ca 90x (i.e. { GetCurrentProcessor(); SetAffinity(one-core); QPC(); SetAffinity(all-cores); } costs 90 times more time then just calling QPC() alone )

- On WinXP, thanks slow implementation of QPC, it regresses performance ca 2x.  As XP is the most problematic platform, it makes sense to use affinity masking there.  I also have found this as a real solution in other projects with practical prove of fixing most problems with hiccups, e.g. [1].

- Then, on Win7, doing { GetCurrentProcessor(); QPC(); GetCurrentProcessor(); } regresses ca 10x against call of QPC() alone ; but we may do more research on Win7/Vista behavior of QPC and potentially use just a single threaded optimized version.


Based on that I turned to use just QPC surrounded with two NtGetCurrentProcessorNumber() calls, when that method is available in the system (only platform where it is not present is XP).  If the numbers equal, we got value from a core the numbers indicate with very high probability.  On MP system with XP I have to use affinity mask to core 0.  On non-XP systems having NtGetCurrentProcessorNumber() I set affinity ONLY when recalibrating a single core, expected to be quit rare operation.

I do a background calibration on all cores using parallel threads bound to each core.  That happens at startup time and when a larger deviation has been detected 10 seconds or more after TimeStamp::Now() has not been used the last time.

I'm not against having TimeStamp::HiResNow() method.


Try run: https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=f3644b751a83


[1] http://www.ogre3d.org/forums/viewtopic.php?p=270250
Comment 15 Rob Arnold [:robarnold] 2011-09-03 21:55:49 PDT
Comment on attachment 558062 [details] [diff] [review]
v1

Review of attachment 558062 [details] [diff] [review]:
-----------------------------------------------------------------

I have not reviewed the entirety of this but I would like to see some revisions before continuing. Since there are (at least?) two implementations in this file, I think it would be helpful to more clearly separate them using classes/namespaces/files/whatever appeals to you. As written, it's difficult for me to know when reading a function which implementation will use it and when reading a variable, what its meaning is. Some, like sFrequencyPerSecond are shared between implementations and not well commented. Overall I would like more comments but perhaps a restructuring of the code will make the structure more clear. It should be apparent, either via the structure of the code or via comments, what the implementations are and on which platforms/under what conditions they are used.

::: xpcom/ds/TimeStamp_windows.cpp
@@ +62,5 @@
> +#endif /* PR_LOGGING */
> +
> +// Estimate of the smallest duration of time we can measure.
> +static PRUint64 sResolution;
> +static PRUint64 sResolutionSigDigs;

THis needs a comment

@@ +74,5 @@
> +
> +// Calibration intervals, using signed int to allow aritmetics.
> +// Values kept in tick of the Performance Coutner x 1000.
> +
> +// Negative treshold for QPC being under GTC to fully recalibrate

nit: threshold not treshold (this typo is made consistently in this file)

@@ +82,5 @@
> +// Limit for MP skew consolidation
> +static PRInt64  sConsolidationThreshold;
> +// If the last result has been obtained after this period then any
> +// deviation of QPC from GTC will cause full recalibration
> +static PRUint64 sForceRecalibratePeriod;

What units are all of these in? ms? ns?

@@ +92,5 @@
> +// GetTickCount 64 bit resolution logic.
> +static PRUint64 sGetTickCountHighPart = 0;
> +static DWORD    sLastGetTickCount = 0;
> +
> +// Skews betweeen QueryPerformanceCounter and GetTickCount on each processor

This comment seems like it's in the wrong place

@@ +124,5 @@
> +TickCount64()
> +{
> +  DWORD gtc = ::GetTickCount();
> +  // It may happen that when accessing GTC on multiple threads the results
> +  // may differ (GTC value may be lower on one thead then on other).  

It may be lower due to running before the others right around the overflow moment?

@@ +130,5 @@
> +  // a significant difference.
> +  EnterCriticalSection(&sGetTickCountOverlapLock);
> +  if (sLastGetTickCount > gtc && (sLastGetTickCount - gtc) > (1ULL << 31))
> +    sGetTickCountHighPart += 1ULL << 32;
> +  PRUint64 result = sGetTickCountHighPart | PRUint64(sLastGetTickCount = gtc);

The assignment here is slick but confusing. I didn't catch it the first time through. Might be best to split it into its own line.

@@ +139,5 @@
> +// All following logic works with ticks of the Performance Counter x 1000
> +// for preserving full possible resolution of the Performance Frequency.
> +// GetTickCount values in milliseconds are multiplied with frequency per 
> +// second.  Therefor we need to multiply QPC value by 1000 to have the same
> +// units to allow simple arithmentics with both QPC and GTC.

arithmetic?

@@ +618,5 @@
> +    return NS_OK;
> +
> +  gInitialized = PR_TRUE;
> +
> +  InitializeCriticalSectionAndSpinCount(&sGetTickCountOverlapLock, 4096);

How is this constant determined? I think this may be ignored on some version of Windows but I'm not sure
Comment 16 Jeff Muizelaar [:jrmuizel] 2011-09-06 13:43:38 PDT
(In reply to Honza Bambas (:mayhemer) from comment #14)
> Created attachment 558062 [details] [diff] [review]
> v1


How does the performance of this implementation of TimeStamp::Now() compare to PR_Now()?
Comment 17 Jeff Muizelaar [:jrmuizel] 2011-09-06 13:48:02 PDT
(In reply to Honza Bambas (:mayhemer) from comment #14)
> I'm not against having TimeStamp::HiResNow() method.

I don't really like this, as it's not obvious which method to call when. One option to consider would be a LowResTimeStamp class that stores it's result in 32 bits. This would help us not regress memory usage in some places that are currently using PR_Now()
Comment 18 Honza Bambas (:mayhemer) 2011-09-11 08:35:28 PDT
Thanks for the first review.

(In reply to Rob Arnold [:robarnold] from comment #15)
> Comment on attachment 558062 [details] [diff] [review]
> v1
> 
> Review of attachment 558062 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I have not reviewed the entirety of this but I would like to see some
> revisions before continuing. Since there are (at least?) two implementations
> in this file, 

Actually 3, and what more, one of them has an option.  I have added a comment to one of the new files in an upcoming patch to explain.

> I think it would be helpful to more clearly separate them
> using classes/namespaces/files/whatever appeals to you. 

I separated then using files, but maybe more adjustment to the structure will be needed.

> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +62,5 @@
> > +#endif /* PR_LOGGING */
> > +
> > +// Estimate of the smallest duration of time we can measure.
> > +static PRUint64 sResolution;
> > +static PRUint64 sResolutionSigDigs;
> 
> THis needs a comment

Not sure what comment you want.  This is copied from other implementations of TimeStamp where no more comments are added.

> nit: threshold not treshold (this typo is made consistently in this file)

Thanks.

> What units are all of these in? ms? ns?

I have added a comment to a visible place, introducing an artificial [mt] unit.  See the new patch for details.

> @@ +618,5 @@
> > +    return NS_OK;
> > +
> > +  gInitialized = PR_TRUE;
> > +
> > +  InitializeCriticalSectionAndSpinCount(&sGetTickCountOverlapLock, 4096);
> 
> How is this constant determined? I think this may be ignored on some version
> of Windows but I'm not sure

Hmm.. it has been taken from your patch for PRMJ_Now().  So you probably know better how you have determined them then me.
Comment 19 Honza Bambas (:mayhemer) 2011-09-11 08:50:26 PDT
Created attachment 559769 [details] [diff] [review]
v1.1

Updated to first review comments.

There are 4 new files now:
1. TimeStamp_windows.h : keeps definition of classes for each individual implementation
2. TimeStamp_windows.cpp : keeps impl for TimeStamp and TimeDuration and shared code of base classes defined in the new header + a code to decide based on the system config what class with what configuration to use (TimeStamp::Startup())
3. TimeStamp_windows_QPC_1P.cpp : impl of a single core version of the calibrator
4. TimeStamp_windows_QPC_MP.cpp : impl of a multi core version
Comment 20 Honza Bambas (:mayhemer) 2011-09-11 12:58:36 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #17)
> (In reply to Honza Bambas (:mayhemer) from comment #14)
> > I'm not against having TimeStamp::HiResNow() method.
> 
> I don't really like this, as it's not obvious which method to call when. One
> option to consider would be a LowResTimeStamp class that stores it's result
> in 32 bits. This would help us not regress memory usage in some places that
> are currently using PR_Now()

PR_Now() result is 64 bit.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #16)
> (In reply to Honza Bambas (:mayhemer) from comment #14)
> > Created attachment 558062 [details] [diff] [review]
> > v1
> 
> 
> How does the performance of this implementation of TimeStamp::Now() compare
> to PR_Now()?

Simply measured 100'000'000 calls:

Configuration               PR_Now      QPC (my patch)

Win7 64, i7 (8 cores)       14s         21s
WinXP 32, Athlon XP 64      49s         200s
WinXP 32, Core 2 Duo        41s         471s

On dual core machine with XP the regression is caused by affinity locking.  If there are significant calls we could drop back to GetTickCount on that platform.  QueryPerformanceCounter is expected to be slow on XP.  Therefor I suggested to have HiResNow().  Chromium has the same solution for this.
Comment 21 Honza Bambas (:mayhemer) 2011-09-20 11:50:40 PDT
Comment on attachment 559769 [details] [diff] [review]
v1.1

Jeff, could you please add your review comments, since Rob is currently busy?  I'd still like to let him do the final review, but any comments from you will be useful as well.

Thanks.
Comment 22 Jeff Muizelaar [:jrmuizel] 2011-09-21 16:51:15 PDT
(In reply to Honza Bambas (:mayhemer) from comment #20)
> On dual core machine with XP the regression is caused by affinity locking. 
> If there are significant calls we could drop back to GetTickCount on that
> platform.  QueryPerformanceCounter is expected to be slow on XP.  Therefor I
> suggested to have HiResNow().  Chromium has the same solution for this.

It sort of sucks that TimeStamp::Nnow() would be slower than PR_Now(). How does the performance of raw QPC compare to PR_Now()?
Comment 23 Jeff Muizelaar [:jrmuizel] 2011-09-27 17:53:34 PDT
Instead of comparing with PR_Now() we should actually be comparing with PR_IntervalNow() i.e. the current TimeStamp::Now() implementation.
Comment 24 Rob Arnold [:robarnold] 2011-09-27 19:30:38 PDT
Comment on attachment 559769 [details] [diff] [review]
v1.1

Review of attachment 559769 [details] [diff] [review]:
-----------------------------------------------------------------

I haven't gone through everything - this code is still tricky to read but that could just be the complexity of the problem.

::: xpcom/ds/TimeStamp_windows.cpp
@@ +79,5 @@
> +// ----------------------------------------------------------------------------
> +// GetTickCount shared code
> +// ----------------------------------------------------------------------------
> +
> +TimeStamp_windows_base::TimeStamp_windows_base()

Does it make sense for this base implementation to be in a different file than the TimeStamp class implementation?

@@ +190,5 @@
> +                          &TimeIncrement,
> +                          &TimeAdjustmentDisabled);
> +
> +  // 100ns resolution
> +  PRInt64 TicksPerGetTickCountResolution = 

style nit: locals shouldn't start with capital letters

@@ +223,5 @@
> +  LARGE_INTEGER pc;
> +  ULONG processor1 = 0, processor2 = 0;
> +  do {
> +    // This is cheap check of the core we read QPC value on ; it is not 100%
> +    // reliable but gives better performance then using SetThreadAffinityMask

How did you test SetTHreadAffinityMask? Did you ask it to switch the thread to its current processor or did you force it to processor 0?

@@ +249,5 @@
> +}
> +
> +// Return calibrated GTC and QPC values in [mt].  Calibration means we read
> +// QPC value right after GTC has been increased for its interval.  This gives
> +// the best didfference of QPC and GTC value.

nit: difference

@@ +272,5 @@
> +
> +  // 10 total trials is arbitrary: what we're trying to avoid by
> +  // looping is getting unlucky and being interrupted by a context
> +  // switch or signal, or being bitten by paging/cache effects
> +  for (int i = 0; minres != 0 && i < 9; ++i) {

Seems like this should be a named constant

@@ +279,5 @@
> +
> +    PRUint64 candidate = (end - start);
> +    if (candidate < minres)
> +      minres = candidate;
> +  }

Should this be a do...while loop?

@@ +309,5 @@
> +{
> +  if (sResolution)
> +    return;
> +
> +  sResolution = sInstance->ResolutionNs();

Is this code threadsafe? Does it need to be?

@@ +354,5 @@
> +    TimeStamp::Shutdown();
> +  }
> +};
> +
> +static TimeStampInitialization initOnce;

Static constructors are strongly frowned upon (perhaps even forbidden) since they hurt startup performance. Is there a way to avoid this?

@@ +399,5 @@
> +    // NtGetCurrentProcessorNumber is not available on Windows XP.  
> +    // Let's fall back to use only core 0 when accessing
> +    // QueryPerformanceCounter. QueryPerformanceCounter implementation on 
> +    // XP is already quit slow, using SetThreadAffinityMask then doesn't
> +    // impact relative prerformance significantly.

Since we're using an undocumented API, does it make sense to also assert and/or log if we hit this path and are not on vista+?

::: xpcom/ds/TimeStamp_windows.h
@@ +98,5 @@
> + * can be used as an end-class or as a base class for QueryPerformanceCounter 
> + * individual implementations ; see derivations bellow.
> + * ----------------------------------------------------------------------------
> + */
> +class TimeStamp_windows_base {

Mozilla style is to not use _ in class names. Something like TimeStampWindowsBase would be good.

@@ +319,5 @@
> +  /**
> +   * Skews betweeen QueryPerformanceCounter and GetTickCount on each processor.
> +   * Kept in [mt].
> +   */
> +  PRInt64  mSkew[64];

64 comes from...?

::: xpcom/ds/TimeStamp_windows_1P.cpp
@@ +52,5 @@
> +TimeStamp_windows_QPC_1P::TimeStamp_windows_QPC_1P(PRUint64 frequency, 
> +                                                   bool affinity)
> +  : TimeStamp_windows_QPC(frequency)
> +  , mForceAffinity(affinity)
> +  , mSkewMask(0)

use false instead of 0

@@ +69,5 @@
> +  PRUint64 result;
> +
> +  EnterCriticalSection(&mTimeStampLock);
> +
> +  DWORD_PTR prevMask = 0;

use NULL instead of 0

@@ +70,5 @@
> +
> +  EnterCriticalSection(&mTimeStampLock);
> +
> +  DWORD_PTR prevMask = 0;
> +  HANDLE thread = NULL;

use INVALID_HANDLE (or INVALID_HANDLE_VALUE...I forget which one works) instead of NULL

@@ +77,5 @@
> +    prevMask = ::SetThreadAffinityMask(thread, 1);
> +  }
> +
> +  PRUint64 TC;
> +  PRUint64 PC;

can you call these gtc & qpc? makes it much clearer what they mean

::: xpcom/ds/TimeStamp_windows_MP.cpp
@@ +64,5 @@
> +//
> +// The result is in ticks of the Performance Counter, adjusted
> +// not to significantly deviate from GetTickCount and be monotonic.
> +// Retun value translated to ms will be in interval
> +// <GetTickCount - 2 * ~15.6, GetTickCount + 2 * 16>

Can you use the formal mathematical interval notation to indicate the precise limits? i.e. use [] and () instead of <>
Comment 25 Honza Bambas (:mayhemer) 2011-09-28 06:35:10 PDT
(In reply to Rob Arnold [:robarnold] from comment #24)
> Comment on attachment 559769 [details] [diff] [review] [diff] [details] [review]
> v1.1
> 
> Review of attachment 559769 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> I haven't gone through everything - this code is still tricky to read but
> that could just be the complexity of the problem.
> 
> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +79,5 @@
> > +// ----------------------------------------------------------------------------
> > +// GetTickCount shared code
> > +// ----------------------------------------------------------------------------
> > +
> > +TimeStamp_windows_base::TimeStamp_windows_base()
> 
> Does it make sense for this base implementation to be in a different file
> than the TimeStamp class implementation?

I was also thinking about that.  I'll try how the structure is going to look like.

> @@ +223,5 @@
> > +  LARGE_INTEGER pc;
> > +  ULONG processor1 = 0, processor2 = 0;
> > +  do {
> > +    // This is cheap check of the core we read QPC value on ; it is not 100%
> > +    // reliable but gives better performance then using SetThreadAffinityMask
> 
> How did you test SetTHreadAffinityMask? Did you ask it to switch the thread
> to its current processor or did you force it to processor 0?

I have a little C++ console program that loops calls to these functions and measure the time.  Switching to a core a thread is already running on takes the same time (on my i7/Win7 machine) as switching to any other core.  Penalty is you have to read the processor number you are on that also costs something.

Therefor I decided to stick with core 0 on WinXP.

As I recall, calls to GetTickCount vs SetThreadAffinityMask+QueryPerformanceCounter+SetThreadAffinityMask on WinXP is about 1:20 :(

> @@ +272,5 @@
> > +
> > +  // 10 total trials is arbitrary: what we're trying to avoid by
> > +  // looping is getting unlucky and being interrupted by a context
> > +  // switch or signal, or being bitten by paging/cache effects
> > +  for (int i = 0; minres != 0 && i < 9; ++i) {
> 
> Seems like this should be a named constant

This code is copied from other implementations.  Should I fix them too?

> 
> @@ +279,5 @@
> > +
> > +    PRUint64 candidate = (end - start);
> > +    if (candidate < minres)
> > +      minres = candidate;
> > +  }
> 
> Should this be a do...while loop?

No.  It could be altered with minres = min(minres, candidate).  Why a loop?

> 
> @@ +309,5 @@
> > +{
> > +  if (sResolution)
> > +    return;
> > +
> > +  sResolution = sInstance->ResolutionNs();
> 
> Is this code threadsafe? Does it need to be?

Good point.  It probably has to be.  I do lazy init (that could be invoked on any thread), because doing it at the start up time may significantly affect Ts because we need to calibrate first before we measure the resolution.

> 
> @@ +354,5 @@
> > +    TimeStamp::Shutdown();
> > +  }
> > +};
> > +
> > +static TimeStampInitialization initOnce;
> 
> Static constructors are strongly frowned upon (perhaps even forbidden) since
> they hurt startup performance. Is there a way to avoid this?

This is also copied from other implementations.  If this needs to be fixed, then let's fix it in a different bug for all implementations.

> 
> @@ +399,5 @@
> > +    // NtGetCurrentProcessorNumber is not available on Windows XP.  
> > +    // Let's fall back to use only core 0 when accessing
> > +    // QueryPerformanceCounter. QueryPerformanceCounter implementation on 
> > +    // XP is already quit slow, using SetThreadAffinityMask then doesn't
> > +    // impact relative prerformance significantly.
> 
> Since we're using an undocumented API, 

Which one that is?  NtGetCurrentProcessorNumber is just not supported on Win XP.

> does it make sense to also assert
> and/or log if we hit this path and are not on vista+?

Personally, I would rather fall back to use GetTickCount on this platform (when no QPC is avail) because we will slow XP system down significantly.

> @@ +319,5 @@
> > +  /**
> > +   * Skews betweeen QueryPerformanceCounter and GetTickCount on each processor.
> > +   * Kept in [mt].
> > +   */
> > +  PRInt64  mSkew[64];
> 
> 64 comes from...?

Needs a little explanation (and needs a comment in the code too):
- 64 is the maximum number of processors the NtGetCurrentProcessorNumber method supports as well as the maximum number of processor SetThreadAffinityMask is capable to work with
- SetThreadAffinityMask is using DWORD_PTR for keeping and passing the mask ; this type is unsigned type that has the same length as void* pointer has, i.e. in 32bit build it is 32 bits long and in 64bit build it is 64 bits long, there is comment in the file about this
- It is apparent that 32bits systems can only support 32 processors and 64bits 64 processors
- However, if a system has more then 64 processors and is capable of supporting them all (I have to recheck which version of Win those are) then we have to use processor groups (then the number of processor is Group:Processor) ; there is an API that gets/sets a processor group and GetCurrentProcessorNumber/SetThreadAffinityMask function then work on that group and the number of a processor in that group

> > +  DWORD_PTR prevMask = 0;
> 
> use NULL instead of 0

No, see above comments.

> 
> @@ +70,5 @@
> > +
> > +  EnterCriticalSection(&mTimeStampLock);
> > +
> > +  DWORD_PTR prevMask = 0;
> > +  HANDLE thread = NULL;
> 
> use INVALID_HANDLE (or INVALID_HANDLE_VALUE...I forget which one works)
> instead of NULL

Hmm.. I think NULL is better here.  It is the value that CreateThread may return on failure, so assiging NULL to HANDLE is OK, IMO.

> > +// <GetTickCount - 2 * ~15.6, GetTickCount + 2 * 16>
> 
> Can you use the formal mathematical interval notation to indicate the
> precise limits? i.e. use [] and () instead of <>

This seems to be locale specific.  I'll use what you suggest.

Rob, thanks for your comments, however, the most important, except these details, is your opinion on the main logic (skews per processor/calibration/adjustment/performance impact).  I have read carefully your implementation for PRMJ_Now() and it is clear you are an expert on using windows timers APIs.
Comment 26 Jeff Muizelaar [:jrmuizel] 2011-09-28 08:00:11 PDT
Also, worth adding to the patch is more documentation about the rationale for using GetTickCount() instead of timeGetTime(). The reasons for this decision aren't really clear to me.
Comment 27 Honza Bambas (:mayhemer) 2011-09-28 08:20:27 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #23)
> Instead of comparing with PR_Now() we should actually be comparing with
> PR_IntervalNow() i.e. the current TimeStamp::Now() implementation.

Win7/Intel Core i7 (7 cores, using 2 calls to NtGetCurrentProcessor to determine the core)
  10 times slower
  My patch: 2112.13491589ms
  Rollover tick count: 213.32809626ms

WinXP/Althlon XP 64 4000+ (single processor, using just call to QueryPerformanceCounter)
  62 times slower
  My patch: 20186.8236474ms
  Rollover tick count: 323.05415752ms

WinXP/Intel Core 2 Duo mobile (2 cores, using affinity lock to core 0)
  211 times slower
  My patch: 334327,88189560ms
  Rollover tick count: 1581,87590881ms
Comment 28 Honza Bambas (:mayhemer) 2011-09-28 08:41:23 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #26)
> Also, worth adding to the patch is more documentation about the rationale
> for using GetTickCount() instead of timeGetTime(). The reasons for this
> decision aren't really clear to me.

I'll add this to the patch as well:

First, see comment 2, and then also some pieces from comment 1 - here is an outline:

GetTickCount is very fast to call and also not sensitive to time*Period settings.  Therefor I consider it a stable value useful for calibration and further checking.  It also has been proved.  timeGetTime needs linking to Winmm.dll and is slower to call and also sensitive to time*Period settings and I observed some jitter in the result too.
Comment 29 Rob Arnold [:robarnold] 2011-09-28 08:46:29 PDT
(In reply to Honza Bambas (:mayhemer) from comment #25)
> > How did you test SetTHreadAffinityMask? Did you ask it to switch the thread
> > to its current processor or did you force it to processor 0?
> 
> I have a little C++ console program that loops calls to these functions and
> measure the time.  Switching to a core a thread is already running on takes
> the same time (on my i7/Win7 machine) as switching to any other core. 
> Penalty is you have to read the processor number you are on that also costs
> something.

I would add a brief comment about this then.

> > @@ +272,5 @@
> > > +
> > > +  // 10 total trials is arbitrary: what we're trying to avoid by
> > > +  // looping is getting unlucky and being interrupted by a context
> > > +  // switch or signal, or being bitten by paging/cache effects
> > > +  for (int i = 0; minres != 0 && i < 9; ++i) {
> > 
> > Seems like this should be a named constant
> 
> This code is copied from other implementations.  Should I fix them too?

Yes, but in another patch/bug
 
> > @@ +279,5 @@
> > > +
> > > +    PRUint64 candidate = (end - start);
> > > +    if (candidate < minres)
> > > +      minres = candidate;
> > > +  }
> > 
> > Should this be a do...while loop?
> 
> No.  It could be altered with minres = min(minres, candidate).  Why a loop?

I meant for the comment to apply to the loop (I am still getting used to splinter)

> > Static constructors are strongly frowned upon (perhaps even forbidden) since
> > they hurt startup performance. Is there a way to avoid this?
> 
> This is also copied from other implementations.  If this needs to be fixed,
> then let's fix it in a different bug for all implementations.

Ok. File a bug and cc taras?

> > @@ +399,5 @@
> > > +    // NtGetCurrentProcessorNumber is not available on Windows XP.  
> > > +    // Let's fall back to use only core 0 when accessing
> > > +    // QueryPerformanceCounter. QueryPerformanceCounter implementation on 
> > > +    // XP is already quit slow, using SetThreadAffinityMask then doesn't
> > > +    // impact relative prerformance significantly.
> > 
> > Since we're using an undocumented API, 
> 
> Which one that is?  NtGetCurrentProcessorNumber is just not supported on Win
> XP.

I worry about it not being supported on a future version of Windows.
 
> > does it make sense to also assert
> > and/or log if we hit this path and are not on vista+?
> 
> Personally, I would rather fall back to use GetTickCount on this platform
> (when no QPC is avail) because we will slow XP system down significantly.

Err, what I meant to write was "should we assert and/or log if we hit this path and are not on xp?"

> > @@ +319,5 @@
> > > +  /**
> > > +   * Skews betweeen QueryPerformanceCounter and GetTickCount on each processor.
> > > +   * Kept in [mt].
> > > +   */
> > > +  PRInt64  mSkew[64];
> > 
> > 64 comes from...?
> 
> Needs a little explanation (and needs a comment in the code too):
> - 64 is the maximum number of processors the NtGetCurrentProcessorNumber
> method supports as well as the maximum number of processor
> SetThreadAffinityMask is capable to work with
> - SetThreadAffinityMask is using DWORD_PTR for keeping and passing the mask
> ; this type is unsigned type that has the same length as void* pointer has,
> i.e. in 32bit build it is 32 bits long and in 64bit build it is 64 bits
> long, there is comment in the file about this
> - It is apparent that 32bits systems can only support 32 processors and
> 64bits 64 processors
> - However, if a system has more then 64 processors and is capable of
> supporting them all (I have to recheck which version of Win those are) then
> we have to use processor groups (then the number of processor is
> Group:Processor) ; there is an API that gets/sets a processor group and
> GetCurrentProcessorNumber/SetThreadAffinityMask function then work on that
> group and the number of a processor in that group

I'm not too worried about systems with > 64 processors but we should name this constant and perhaps compute this constant from sizeof DWORD_PTR and/or comment it

> > > +  DWORD_PTR prevMask = 0;
> > 
> > use NULL instead of 0
> 
> No, see above comments.

I mistook it for a pointer type, sorry.

> > @@ +70,5 @@
> > > +
> > > +  EnterCriticalSection(&mTimeStampLock);
> > > +
> > > +  DWORD_PTR prevMask = 0;
> > > +  HANDLE thread = NULL;
> > 
> > use INVALID_HANDLE (or INVALID_HANDLE_VALUE...I forget which one works)
> > instead of NULL
> 
> Hmm.. I think NULL is better here.  It is the value that CreateThread may
> return on failure, so assiging NULL to HANDLE is OK, IMO.

After reading http://blogs.msdn.com/b/oldnewthing/archive/2004/03/02/82639.aspx and the documentation, it's not clear to me if NULL Or INVALID_HANDLE_VALUE is correct. But since GetCurrentThread is cheap (it just returns a pseudo-handle, not even a real one), perhaps we should just always call it.

> 
> > > +// <GetTickCount - 2 * ~15.6, GetTickCount + 2 * 16>
> > 
> > Can you use the formal mathematical interval notation to indicate the
> > precise limits? i.e. use [] and () instead of <>
> 
> This seems to be locale specific.  I'll use what you suggest.

Oh! I had no idea. I think it is still better to indicate the precise limits.

> Rob, thanks for your comments, however, the most important, except these
> details, is your opinion on the main logic (skews per
> processor/calibration/adjustment/performance impact).  I have read carefully
> your implementation for PRMJ_Now() and it is clear you are an expert on
> using windows timers APIs.

This is the part that will take me longer because the logic here is so complex. I am giving this other feedback now because I think it will make it easier for me to review the core logic.
Comment 30 Honza Bambas (:mayhemer) 2011-09-28 09:07:10 PDT
(In reply to Honza Bambas (:mayhemer) from comment #27)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #23)
> > Instead of comparing with PR_Now() we should actually be comparing with
> > PR_IntervalNow() i.e. the current TimeStamp::Now() implementation.
> 
> Win7/Intel Core i7 (7 cores, using 2 calls to NtGetCurrentProcessor to
> determine the core)
>   10 times slower
>   My patch: 2112.13491589ms
>   Rollover tick count: 213.32809626ms
> 
> WinXP/Althlon XP 64 4000+ (single processor, using just call to
> QueryPerformanceCounter)
>   62 times slower
>   My patch: 20186.8236474ms
>   Rollover tick count: 323.05415752ms
> 
> WinXP/Intel Core 2 Duo mobile (2 cores, using affinity lock to core 0)
>   211 times slower
>   My patch: 334327,88189560ms
>   Rollover tick count: 1581,87590881ms

To explain: the time is the elapsed number of milliseconds for 100'000'000 loops of call to my implementation of Now() and old implementation of Now().

BTW, on Core 2 Duo processor I also have seen following result:
My patch: 76149,21030466ms
Rollover tick count: 1450,05300953ms

Means 52 times slower.
Comment 31 Jeff Muizelaar [:jrmuizel] 2011-09-28 11:18:50 PDT
Comment on attachment 559769 [details] [diff] [review]
v1.1

>+// Function protecting GetTickCount result from rolling over, result is in ms
>+PRUint64
>+TimeStamp_windows_base::TickCount64()
>+{
>+  DWORD gtc = ::GetTickCount();
>+
>+  // It may happen that when accessing GTC on multiple threads the results
>+  // may differ (GTC value may be lower due to running before the others 
>+  // right around the overflow moment).  That falsely shifts the high part.
>+  // Easiest solution is to check for a significant difference.
>+
>+  EnterCriticalSection(&mGetTickCountOverlapLock);
>+  if (mLastGetTickCount > gtc && (mLastGetTickCount - gtc) > (1ULL << 31))
>+    mGetTickCountHighPart += 1ULL << 32;
>+
>+  mLastGetTickCount = gtc;
>+
>+  PRUint64 result = mGetTickCountHighPart | PRUint64(gtc);
>+  LeaveCriticalSection(&mGetTickCountOverlapLock);
>+
>+  return result;
>+}

This should be implemented using cmpxchg8b instead of locking.
Comment 32 Honza Bambas (:mayhemer) 2011-09-28 12:33:05 PDT
(In reply to Jeff Muizelaar [:jrmuizel] from comment #31)
> This should be implemented using cmpxchg8b instead of locking.

Can you please provide an outline how?
Comment 33 Jeff Muizelaar [:jrmuizel] 2011-09-28 12:56:04 PDT
(In reply to Honza Bambas (:mayhemer) from comment #32)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #31)
> > This should be implemented using cmpxchg8b instead of locking.
> 
> Can you please provide an outline how?

https://bugzilla.mozilla.org/show_bug.cgi?id=484921#c3
Comment 34 Honza Bambas (:mayhemer) 2011-09-29 08:50:45 PDT
Created attachment 563426 [details] [diff] [review]
v2 - simplified

Jeff, Rob, this a preview of a very much simplified and optimized patch for this bug.  It is a new implementation, so there is no common part with the v1 patch.

It is lock-free code using interlocked windows functions and a simple background thread that syncs the skew qpc - gtc every 10 seconds.

When qpc deviate to much, we abandon it and use gtc directly.

Please take a look before the v1 patch, this might also be much simpler to review.

Thanks for all comments and help.
Comment 35 Honza Bambas (:mayhemer) 2011-09-29 11:03:55 PDT
Performance results of the patch on Win7/i7 machine, again 100'000'000 calls of the function in a loop:

My patch v2: 3528.377 ms
Locked GTC: 2805.280 ms - this is similar to current TimeStamp::Now() impl
Pure QPC: 1425.778 ms

The patch v2 is just 25% slower then the current implementation on Windows 7.

Now I have to migrate the Interlocked* functions to what is supported also on WinXP and test again.
Comment 36 Honza Bambas (:mayhemer) 2011-09-29 16:14:26 PDT
Created attachment 563586 [details] [diff] [review]
v2.WIP1

This is a WIP patch, major simplification of v1 of the patch, not polished.  It should include all read locks on all necessary places.  Also includes a code that measure the performance during shutdown and displays in the console, that code is self explanatory.

On Win7/i7 machine I get numbers with this patch:
My patch: 4688.671 ms
Locked GTC: 2779.506 ms
QPC pure: 1436.720 ms

I submit it just for feedback.  This is not a final version I would give for review.

Thanks.
Comment 37 Rob Arnold [:robarnold] 2011-09-29 20:33:52 PDT
Comment on attachment 563586 [details] [diff] [review]
v2.WIP1

Review of attachment 563586 [details] [diff] [review]:
-----------------------------------------------------------------

This is looking much more readable. I like the direction it's going in but we need to be sure that all the edge cases are handled.

::: xpcom/ds/TimeStamp_windows.cpp
@@ +96,5 @@
> +static const LONGLONG kNsPerSec   = 1000000000;
> +
> +namespace mozilla {
> +
> +static nsAutoPtr<Mutex> sMutex;

Can this just be a pointer? Trying to avoid static constructors :)

@@ +126,5 @@
> +    xor ecx,ecx
> +    mov esi,[Source]
> +    lock cmpxchg8b [esi]
> +  }
> +}

We should use either the Windows ones if possible, or the nspr/mfbt ones. Assembly is most certainly not the answer here. Lacking comments as well, it's hard to figure out what this is supposed to do.

@@ +141,5 @@
> +  } else {
> +    lastResult = MozInterlockedGet64(&sLastGTCResult);
> +  }
> +
> +  LONGLONG lastResultHiPart = lastResult & 0xFFFFFFFF00000000ULL;

There's probably a better expression for the mask that's more descriptive but it escapes me

@@ +260,5 @@
> +  while (!PR_AtomicSet(&sSyncThreadStop, 0)) {
> +    LONGLONG skew = Calibrate(true);
> +    MozInterlockedCompareExchange64(&sSkew, skew, sSkew);
> +
> +    ::WaitForSingleObject(sSyncEvent, kCalibrationInterval);

Waking up every kCalibrationInterval on mobile is going to be bad when kCalibrationInterval is 10 seconds. You might want to look into one of those fancy power-aware timers in Windows 7 and increasing the timeout to something longer.

@@ +303,5 @@
> +{
> +  LOG(("TimeStamp: falling back to GTC"));
> +  InterlockedCompareExchangePointer((PVOID*)&sHiResNowFunc, (PVOID)&TickCount64,
> +                                    (PVOID)&sHiResNowFunc);
> +  StopCalibrationThread();

This needs to be thread safe since it can be called on multiple threads (multiple threads decide that QPC is unstable). Also I think it needs a call to InitResolution as well?

@@ +376,5 @@
> +
> +  // 10 total trials is arbitrary: what we're trying to avoid by
> +  // looping is getting unlucky and being interrupted by a context
> +  // switch or signal, or being bitten by paging/cache effects
> +  for (int i = 0; minres != 0 && i < 9; ++i) {

This could be turned into a do-while loop to avoid a duplicated body (i.e. it's essentially an unrolled do-while loop)
Comment 38 Honza Bambas (:mayhemer) 2011-09-30 10:00:08 PDT
(In reply to Rob Arnold [:robarnold] from comment #37)
> Comment on attachment 563586 [details] [diff] [review] [diff] [details] [review]
> v2.WIP1
> 
> Review of attachment 563586 [details] [diff] [review] [diff] [details] [review]:

Thanks!

> We should use either the Windows ones if possible, or the nspr/mfbt ones.
> Assembly is most certainly not the answer here. Lacking comments as well,
> it's hard to figure out what this is supposed to do.

Windows ones are slower, simply linkable functions are only on WinVista and later, on XP there is only ExInterlockedCompareExchange64 that is part of DDK that Gecko doesn't link to.

I have seen also implementation of an atomic set via FPU, but there is no cmp/get-old feature.

Not sure what mfbt means.

NSPR supports only 32 bits atomic operations.  The implementation on winnt is also via __asm [1] using the lock prefix.

If __cpuid indicates there is a support for CMPXCHG8B then we are safe to use it.  The assembler code is no more then a working implementation of the on-XP-unreachable InterlockedCompareExchange64.  I'm not aware of another way to atomically work with 64 bits of memory in 32 bits code.

> > +  LONGLONG lastResultHiPart = lastResult & 0xFFFFFFFF00000000ULL;
> 
> There's probably a better expression for the mask that's more descriptive
> but it escapes me

Yes, but even a debug build code is very well optimized to grab just the high part of the memory into a 32bit register.  That is pretty good!

> Windows 7 and increasing the timeout to
> something longer.

Yes, I'm aware.  If you have any concrete tips for this, it might save me some work.  I am trying to find some docs on this, so far no luck.

> > +  StopCalibrationThread();
> 
> This needs to be thread safe since it can be called on multiple threads
> (multiple threads decide that QPC is unstable). Also I think it needs a call
> to InitResolution as well?

Very good points.

> This could be turned into a do-while loop 

Now I finally got it ;)

Remaining comments will be addressed.


[1] http://hg.mozilla.org/mozilla-central/annotate/b384cfab8b1d/nsprpub/pr/src/md/windows/ntmisc.c#l1060
Comment 39 Emanuel Hoogeveen [:ehoogeveen] 2011-10-02 07:32:52 PDT
I realize this bug is in a fairly advanced stage, but I wanted to mention that in testing with RDTSC in a timing sensitive application, I found that limiting the thread in question to one core to avoid serializing instructions resulted in the occasional large spike in latency. Apparently the Windows scheduler felt the need to use that core for something else, and so my application was left hanging until its critical timing window had passed.

I ended up using wrapping the RDTSC in some serializing instructions and forgetting about processor affinity - this regressed performance of the call by about 5-20x depending on the system, but that was acceptable for my application and got rid of the latency spikes. As noted above, QPC calls have a much greater spread in terms of call overhead, and have additional problems.

However, I should note that I did not explicitly test the serialized RDTSC call for monotonicity, as this was not as much of an issue for my application. As the application is quite taxing I also didn't have to worry about processor power states, and so could use a one-time calibration per application launch with QPC to get the timestamps per second.

On newer processors, "starting from Core i7 and starting from AMD Athlon 64 X2 CPUs with AM2 Socket (Windsor & Brisbane)" (from Wikipedia), there's also the RDTSCP instruction which does serialize the instruction stream. In addition, the MFENCE and SFENCE/LFENCE SSE2 instructions could be used for serialization instead of the CPUID instruction, say in x64 builds. In other builds you'd have to do runtime detection.

In any case, this bug is probably beyond that stage, and your requirements are different from mine. I just wanted to add some thoughts about my experiences as the whole high resolution timing situation is pretty messed up.
Comment 40 Honza Bambas (:mayhemer) 2011-10-03 15:10:25 PDT
(In reply to Emanuel Hoogeveen from comment #39)
> I realize this bug is in a fairly advanced stage, but I wanted to mention
> that in testing with RDTSC in a timing sensitive application, I found that
> limiting the thread in question to one core to avoid serializing
> instructions resulted in the occasional large spike in latency. Apparently
> the Windows scheduler felt the need to use that core for something else, and
> so my application was left hanging until its critical timing window had
> passed.
> 
> I ended up using wrapping the RDTSC in some serializing instructions and
> forgetting about processor affinity - this regressed performance of the call
> by about 5-20x depending on the system, but that was acceptable for my
> application and got rid of the latency spikes. As noted above, QPC calls
> have a much greater spread in terms of call overhead, and have additional
> problems.
> 
> However, I should note that I did not explicitly test the serialized RDTSC
> call for monotonicity, as this was not as much of an issue for my
> application. As the application is quite taxing I also didn't have to worry
> about processor power states, and so could use a one-time calibration per
> application launch with QPC to get the timestamps per second.

Thanks for your thoughts, Emanuel.  Anything helpful is very welcomed.

To note: I left the idea of locking a thread to a single core or processor.  The performance regression was too large.  I prefer (and continue work) on the simpler patch, that just calibrates in background and watches for larger deviations ; if detected, fall back to GetTickCount.

> 
> On newer processors, "starting from Core i7 and starting from AMD Athlon 64
> X2 CPUs with AM2 Socket (Windsor & Brisbane)" (from Wikipedia), there's also
> the RDTSCP instruction which does serialize the instruction stream. In
> addition, the MFENCE and SFENCE/LFENCE SSE2 instructions could be used for
> serialization instead of the CPUID instruction, say in x64 builds. In other
> builds you'd have to do runtime detection.

I tried the following code:

   __asm {
      ; Flush the pipeline
      XOR eax, eax
      CPUID
      ; Get RDTSC counter in edx:eax
      RDTSC
   }

and it didn't give me the expected result.  It is dependent on the actual processor frequency, that, also on my i7 machine, changes with speed step and turbo technology.  So it is unfortunately useless for referring time, but is useful for profiling.

Also I don't want (I can say anyone in Mozilla too) use so very low level assembly code.  This would end up trying to re-implement what MS already did.  QueryPerformanceCounter/Frequency is what I intend to use.


> 
> In any case, this bug is probably beyond that stage, and your requirements
> are different from mine. I just wanted to add some thoughts about my
> experiences as the whole high resolution timing situation is pretty messed
> up.

We would like to have some compromise between perfect resolution and simply and well performing code.  I believe what I have now is almost stable code to have a basic impl to test with by larger audience of Nightly testers.
Comment 41 Emanuel Hoogeveen [:ehoogeveen] 2011-10-03 16:04:01 PDT
Yeah, unfortunately there doesn't seem to be a 'current clock speed' instruction that you could use to compare it to. I believe ACPI timers (notably the HPET) do have such frequency information, but I have no idea if user mode code can even access them at all, nevermind what kind of overhead would be associated with it. Since I haven't found any immediately obvious way to query a processor's clock frequency, perhaps operating systems use ACPI timers for calibration purposes? I'm just guessing here, but I would like to know how it works.
Comment 42 Honza Bambas (:mayhemer) 2011-10-19 12:33:34 PDT
(In reply to Emanuel Hoogeveen from comment #41)
> I would like to know how it works.

Maybe http://support.microsoft.com/kb/895980 could give some info.
Comment 43 Honza Bambas (:mayhemer) 2011-10-31 09:46:11 PDT
Created attachment 570738 [details] [diff] [review]
v3

Rob, sorry for some time of silence on this.

But I have a surprise for you: this patch is a very much simplified compared to all that has gone thought this bug so far:

- we return only value of qpc, untempered 
- we check it doesn't deviate from gtc too much ; otherwise, we fallback to use gtc + the-last-good-skew
- we "recalibrate" as soon as we reach 4 seconds from the last "recalibration", done simply during read of the hi-res value
- qpc-gtc changes rapidly after system wake-up, mainly on XP ; I added an observer for the wakeup event that resets possible drop to gtc because of that

The patch is based on the experience I got after hours of work on implementing this, has been tested on all my 3 Win machines.

I don't think this is urgent to get reviewed for the next Aurora merge, since it is regression-risky low-level change, but I will talk to Josh about this.

Next steps:
- add telemetry to see how this works in the field (I would personally prefer to land this together w/ the added telemetry)
- remove the commented out code for measuring performance, it's there for reference only
- make the calibration a bit better, we should try to capture the difference of qpc and gtc ASAP after gtc has shifted to lower potential jitter that the current code carries
Comment 44 Rob Arnold [:robarnold] 2011-11-10 06:52:35 PST
Comment on attachment 570738 [details] [diff] [review]
v3

Review of attachment 570738 [details] [diff] [review]:
-----------------------------------------------------------------

I didn't get through all of it but I do have some questions.

::: xpcom/ds/TimeStamp_windows.cpp
@@ +274,5 @@
> +                          &timeIncrement,
> +                          &timeAdjustmentDisabled);
> +
> +  // How many 100ns to the next 1.0ms
> +  DWORD timeIncrementRoundUp = 10000 - (timeIncrement % 10000);

Isn't this code incorrect if timeIncrement = 0? Can it ever be 0? I don't quite understand what the purpose of calculating this is.

@@ +316,5 @@
> +  if (0 == minres) {
> +    minres = 1;
> +  }
> +
> +  ULONGLONG result = mt2ms(minres * kNsPerSec);

minres is in mt per the comment block for CalibratedPerformanceCounter - why is multiplied by kNsPerSec?

@@ +334,5 @@
> +
> +  sResolutionSigDigs = sigDigs;
> +}
> +
> +// Function protecting GetTickCount result from rolling over, result is in [ms]

from reading the code, it looks like the result is in ticks, not ms

@@ +461,5 @@
> +  ULONGLONG gtc = TickCount64();
> +
> +  // XXX This is using ObserverService, cannot instantiate in the static 
> +  // startup, really needs a better initation code here.
> +  StandbyObserver::Ensure();

Should you initialize before you query the performance counter since initialization may take a while?
Comment 45 Honza Bambas (:mayhemer) 2011-11-10 11:51:17 PST
(In reply to Rob Arnold [:robarnold] from comment #44)
> Comment on attachment 570738 [details] [diff] [review] [diff] [details] [review]
> v3
> 
> Review of attachment 570738 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------

Thanks for looking into this, Rob.

> 
> Isn't this code incorrect if timeIncrement = 0? Can it ever be 0? I don't
> quite understand what the purpose of calculating this is.

I never thought about it.  We may fallback to 15.6 ms on 0.

To explain: only thing I'm trying to do here is to round up to a whole number of milliseconds.  If timeIncrement is 15.6001 ms then I want 16.0 ms.  timeIncrementRoundUp = 10000 - (timeIncrement % 10000) is 16.0000 - 15.6001 = 0.3999.

I can change the code or comments to be more understandable.


> minres is in mt per the comment block for CalibratedPerformanceCounter - why
> is multiplied by kNsPerSec?

Looks like it should be result = mt2ms(minres) * kNsPerMillisec;

It may be result of an automatic replace when I was adding the mx2mx macros.


> > +// Function protecting GetTickCount result from rolling over, result is in [ms]
> 
> from reading the code, it looks like the result is in ticks, not ms

No, GetTickCount returns the value in ms, there is no conversion of the result, just adding of the high part.


> Should you initialize before you query the performance counter since
> initialization may take a while?

That might be true, we probably don't need to call this under the lock since the global pointer to StandbyObserver changes only on the main thread.
Comment 46 Rob Arnold [:robarnold] 2011-11-13 20:50:19 PST
Comment on attachment 570738 [details] [diff] [review]
v3

Review of attachment 570738 [details] [diff] [review]:
-----------------------------------------------------------------

How feasible do you think it would be to have some unit tests for this to cover the various fallback triggers?

::: xpcom/ds/TimeStamp_windows.cpp
@@ +154,5 @@
> +static const ULONGLONG kOverflowLimit = 100;
> +
> +// Common mutex (thanks the relative complexity of the logic, this is better
> +// then using CMPXCHG8B.)
> +CRITICAL_SECTION sTimeStampLock;

Document which variables this guards

@@ +423,5 @@
> +         sinceLastCalibration / 1000, 
> +         mt2ms_d(overflow),
> +         mt2ms_d(trend)));
> +
> +    if (trend > ms2mt(kOverflowLimit)) {

Should this be abs(trend)? overflow can be negative thus trend can also be negative

@@ +457,5 @@
> +  ULONGLONG qpc = PerformanceCounter();
> +
> +  AutoCriticalSection lock(&sTimeStampLock);
> +
> +  ULONGLONG gtc = TickCount64();

Does this need to be called under the lock as well? There is some mt-aware logic in the implementation but it seems like all calls to it are serialized.

@@ +553,5 @@
> +TimeStamp::Startup()
> +{
> +  // Decide which implementation to use for the high-performance timer.
> +
> +  InitializeCriticalSectionAndSpinCount(&sTimeStampLock, 4096);

Where does 4096 come from? Make this a constant.

@@ +608,5 @@
> +  }
> +  test3 = PerformanceCounter() - t1;
> +
> +  /** PERFORMANCE COMPARE TESTING CODE END */
> +#endif

This testing code should either become a proper test binary of some sort or disappear
Comment 47 Honza Bambas (:mayhemer) 2011-11-14 08:41:46 PST
(In reply to Rob Arnold [:robarnold] from comment #46)
> Comment on attachment 570738 [details] [diff] [review] [diff] [details] [review]
> v3
> 
> Review of attachment 570738 [details] [diff] [review] [diff] [details] [review]:
> -----------------------------------------------------------------
> 
> How feasible do you think it would be to have some unit tests for this to
> cover the various fallback triggers?

We already have a CPP test for time stamp [1], but it doesn't work with libxul.  There is no bug for that [2].

> 
> @@ +423,5 @@
> > +         sinceLastCalibration / 1000, 
> > +         mt2ms_d(overflow),
> > +         mt2ms_d(trend)));
> > +
> > +    if (trend > ms2mt(kOverflowLimit)) {
> 
> Should this be abs(trend)? overflow can be negative thus trend can also be
> negative

No, overflow value here is always just positive:
+  if (diff < sUnderrunThreshold) {
+    overflow = sUnderrunThreshold - diff;
+  }
+  else if (diff > sOverrunThreshold) {
+    overflow = diff - sOverrunThreshold;
+  }

And:
+  // I obserrved differences about 2 times of the GTC resolution.  GTC may
+  // jump by 32 ms in two steps, therefor include the remainder as well.
+  sUnderrunThreshold =
+    LONGLONG((-2) * (ticksPerGetTickCountResolution + ticksIncrementRoundUp));

i.e. is negative

+
+  // QPC should go no further then 2 * GTC resolution
+  sOverrunThreshold =
+    LONGLONG((+2) * ticksPerGetTickCountResolution);

i.e. is positive.

> 
> @@ +457,5 @@
> > +  ULONGLONG qpc = PerformanceCounter();
> > +
> > +  AutoCriticalSection lock(&sTimeStampLock);
> > +
> > +  ULONGLONG gtc = TickCount64();
> 
> Does this need to be called under the lock as well? There is some mt-aware
> logic in the implementation but it seems like all calls to it are serialized.

As you feel this could be a problem, I'll investigate more on it.

> 
> @@ +553,5 @@
> > +TimeStamp::Startup()
> > +{
> > +  // Decide which implementation to use for the high-performance timer.
> > +
> > +  InitializeCriticalSectionAndSpinCount(&sTimeStampLock, 4096);
> 
> Where does 4096 come from? Make this a constant.

Again, I've inspired with your patch.  I'll turn this to a kXXX.

> 
> @@ +608,5 @@
> > +  }
> > +  test3 = PerformanceCounter() - t1;
> > +
> > +  /** PERFORMANCE COMPARE TESTING CODE END */
> > +#endif
> 
> This testing code should either become a proper test binary of some sort or
> disappear

I plan to remove it, it is there for anyone interested in testing performance of this patch only.  It will be removed from the patch before commit or sooner.

Some variant of this code (since it takes ages to finish on XP) would be nice to have in some test, but question is, what for?  Just to catch regressions?  It is IMO better to have some test that just utilize TimeStamp::Now() indirectly through some code and is covered by some automated T regression test.

[1] http://mxr.mozilla.org/mozilla-central/source/xpcom/tests/TestTimeStamp.cpp
[2] http://hg.mozilla.org/mozilla-central/diff/8e846d7f21ea/xpcom/tests/Makefile.in
Comment 48 Honza Bambas (:mayhemer) 2011-11-15 14:15:41 PST
(In reply to Honza Bambas (:mayhemer) from comment #47)
> > @@ +457,5 @@
> > > +  ULONGLONG qpc = PerformanceCounter();
> > > +
> > > +  AutoCriticalSection lock(&sTimeStampLock);
> > > +
> > > +  ULONGLONG gtc = TickCount64();
> > 
> > Does this need to be called under the lock as well? There is some mt-aware
> > logic in the implementation but it seems like all calls to it are serialized.
> 
> As you feel this could be a problem, I'll investigate more on it.


I probably misunderstood first.

You want to call ::GetTickCount also not under the lock, right?  That is a good suggestion.  We don't need to call it under the lock, we just need to lock over the logic with the result.


Rob, in general, do you want me to update the patch now, or should I wait for more feedback from you first?
Comment 49 Rob Arnold [:robarnold] 2011-11-17 09:13:11 PST
For testing, I was proposing a separate test just for this which mocks QPC and GTC.

My apologies for not knowing where the 4096 came - let this be motivation to document magic constants!

Yes, I was suggesting that we call ::GetTickCount without being under a lock.

In general, please update the patch after my review comments - my reviews may not be thorough but more designed to let you start working sooner by fixing little things which will help me understand the final patch.
Comment 50 Honza Bambas (:mayhemer) 2011-11-22 11:28:26 PST
Created attachment 576212 [details] [diff] [review]
v3.1

- timeIncrement == 0 check
- result = mt2ms(minres * kNsPerSec) explained and fixed
- variables rearranged and more commented
- GetTickCount no longer called under the lock
- 4096 now a constant
- performance testing code removed (I have to invent a way to test this all...)
Comment 51 Honza Bambas (:mayhemer) 2011-11-22 11:50:13 PST
Comment on attachment 576212 [details] [diff] [review]
v3.1

For reference, try run of this patch (completely green):
https://tbpl.mozilla.org/?tree=Try&rev=d1a50e389c5d
Comment 52 Rob Arnold [:robarnold] 2011-11-26 18:35:16 PST
Comment on attachment 576212 [details] [diff] [review]
v3.1

Review of attachment 576212 [details] [diff] [review]:
-----------------------------------------------------------------

This feels much closer

::: xpcom/ds/TimeStamp_windows.cpp
@@ +85,5 @@
> +// ----------------------------------------------------------------------------
> +
> +// After this time we always recalibrate the skew
> +// Value is number of [ms].
> +static const ULONGLONG kCalibrationInteval = 4000;

It would be nice to know why/how you picked the values for the various constants so that 4 years later we still know why and whether or not to readjust them (I can now say this from experience :)

@@ +135,5 @@
> +// Global lock
> +// ----------------------------------------------------------------------------
> +
> +// Thread spin count before entering the full wait state for sTimeStampLock.
> +// Inspired by Rob Arnold's work on PRMJ_Now().

I chased down the reason I picked 4096 and I would say that this was picked rather arbitrarily :)

@@ +289,5 @@
> +  GetSystemTimeAdjustment(&timeAdjustment,
> +                          &timeIncrement,
> +                          &timeAdjustmentDisabled);
> +
> +  if (!timeIncrement)

Do you mind initializing this to 0 before calling GetSystemTimeAdjustment? I just want to make sure it's not initialized to garbage.

@@ +302,5 @@
> +  timeIncrementRoundUp /= 10000;
> +  // Round up, values will be: 16, 21
> +  timeIncrementRoundUp += 1;
> +  // Convert back to 100ns, values will be: 160000, 210000
> +  timeIncrementRoundUp *= 10000;

Seems like this does a ceiling instead of rounding?

timeIncrement = 151000 -> 150999 -> 15 -> 16 -> 160000

@@ +330,5 @@
> +  // 10 total trials is arbitrary: what we're trying to avoid by
> +  // looping is getting unlucky and being interrupted by a context
> +  // switch or signal, or being bitten by paging/cache effects
> +
> +  ULONGLONG minres = ~0ULL;

I'm not sure what the - is doing here.

@@ +343,5 @@
> +  } while (--loops && minres);
> +
> +  if (0 == minres) {
> +    minres = 1;
> +  }

It still seems to me that this code will be erroneous if the calls to CalibratedPerformanceCounter are so quick that the counter never advances (say, GetTickCount). In this case, isn't it likely that the minres is rather far from 1, right?

@@ +372,5 @@
> +// of the common mutex.
> +static inline ULONGLONG
> +TickCount64(DWORD now)
> +{
> +  ULONGLONG lastResultHiPart = sLastGTCResult & (~0ULL << 32);

I feel like -0ULL is relying on undefined behavior on behalf of the compiler. I do not remember if overflow on unsigned arithmetic is well-defined in C or not (signed overflow is not). Either way, it would probably be clearer to write the constant that you mean.
Comment 53 Emanuel Hoogeveen [:ehoogeveen] 2011-11-26 19:22:16 PST
Not to butt in on the review here, but I think you're confusing the bitwise NOT (~) with the minus symbol (-) here :) Personally I've come to think that writing e.g. ~(size_t)0 is a bit clearer / more explicit in general, but ~0ULL is fairly common.
Comment 54 Rob Arnold [:robarnold] 2011-11-26 19:44:19 PST
I thought I read it carefully but I guess I didn't :) That makes much more sense than -0ULL and seems fairly readable.
Comment 55 Honza Bambas (:mayhemer) 2011-11-28 11:48:54 PST
Thank you!

(In reply to Rob Arnold [:robarnold] from comment #52)
> Comment on attachment 576212 [details] [diff] [review] [diff] [details] [review]
> v3.1
> 
> Review of attachment 576212 [details] [diff] [review] [diff] [details] [review]:
> It would be nice to know why/how you picked the values for the various
> constants so that 4 years later we still know why and whether or not to
> readjust them (I can now say this from experience :)

True is that most of them are just arbitrarily chosen numbers, mostly coming from results of all tests and looking into the logs.  I will try to come with either good comments and better way of determining the thresholds.  But telemetry probing may tell us more then thinking about the numbers now.

> Seems like this does a ceiling instead of rounding?
> 
> timeIncrement = 151000 -> 150999 -> 15 -> 16 -> 160000

As in the private email, yes, that is the correct name for this operation :)

> @@ +343,5 @@
> > +  } while (--loops && minres);
> > +
> > +  if (0 == minres) {
> > +    minres = 1;
> > +  }
> 
> It still seems to me that this code will be erroneous if the calls to
> CalibratedPerformanceCounter are so quick that the counter never advances
> (say, GetTickCount). In this case, isn't it likely that the minres is rather
> far from 1, right?

I'm not sure this code works at all.  With high probability on fallback to GetTickCount minres will be 0 (no advance during 10 fast calls to GTC).  Also it is quit expensive for QPC on XP.  

We could determine the result based on the tickInterval or performance frequency and not probe the timer directly since it may not jump when called repeatedly and if, then only by one tick or so.  Also it is not suggested to do it with QPC - to call it in a loop.


I'll update the patch tomorrow with your suggestion.  Thanks so far.
Comment 56 Jeff Muizelaar [:jrmuizel] 2011-11-28 12:34:17 PST
We should be able to use something like the following to detect situations where QPC is going to work reliably:

#include <intrin.h>

static bool
has_stable_tsc()
{
        int regs[4];

        // detect if the Advanced Power Management feature is supported
        __cpuid(regs, 0x80000000);
        if (regs[0] < 0x80000007)
                return false;

        __cpuid(regs, 0x80000007);
        // if bit 8 is set than TSC will run at a constant rate
        // in all ACPI P-state, C-states and T-states
        return regs[4] & (1 << 8);
}

It would be nice to use this to avoid the overhead of having to do both QPC and GTC for ever TimeStamp::Now()
Comment 57 Honza Bambas (:mayhemer) 2011-11-29 05:24:09 PST
(In reply to Jeff Muizelaar [:jrmuizel] from comment #56)
> We should be able to use something like the following to detect situations
> where QPC is going to work reliably:

How this prevents potential problems with dis-synchronicity between processor cores?
Comment 58 Jeff Muizelaar [:jrmuizel] 2011-11-29 07:39:46 PST
(In reply to Honza Bambas (:mayhemer) from comment #57)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #56)
> > We should be able to use something like the following to detect situations
> > where QPC is going to work reliably:
> 
> How this prevents potential problems with dis-synchronicity between
> processor cores?

I believe it is spec'ed to have all cores in sync, but I'll try to find a reference.
Comment 59 Honza Bambas (:mayhemer) 2011-11-29 13:56:02 PST
Created attachment 577745 [details] [diff] [review]
v3.2

Updates to this patch:
- added more comments on how the numbers had been chosen
- timeIncrement init to 0 first
- roundup -> ceiling 

Summary of things to do as followup bugs:
- remove static init
- change the way we determine resolution
- add telemetry
- investigate comment 56 suggestion

Rob, time of Q4 is closing, if we land this patch, we have 1 month to test it before letting it merge to Aurora.  It would be good to land this ASAP since there is no other way to see if it works then to let it be widely tested.
Comment 60 Rob Arnold [:robarnold] 2011-12-05 10:25:44 PST
Comment on attachment 577745 [details] [diff] [review]
v3.2

Review of attachment 577745 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, r+ w/nits fixed and followup bugs filed. Might be worth investing in telemetry to help tune the constants as well.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #58)
> (In reply to Honza Bambas (:mayhemer) from comment #57)
> > (In reply to Jeff Muizelaar [:jrmuizel] from comment #56)
> > > We should be able to use something like the following to detect situations
> > > where QPC is going to work reliably:
> > 
> > How this prevents potential problems with dis-synchronicity between
> > processor cores?
> 
> I believe it is spec'ed to have all cores in sync, but I'll try to find a
> reference.

Any update on this? Seems like it could simplify this code if we can detect a bad QPC source at initialization time.

::: xpcom/ds/TimeStamp_windows.cpp
@@ +96,5 @@
> +// reasonable test case).  So, there is a lot of space to prolong it
> +// to say 20 seconds or even more, needs testing in the field, though.
> +//
> +// Value is number of [ms].
> +static const ULONGLONG kCalibrationInteval = 4000;

Ooh, just noticed a missing 'r' here - this should be kCalibrationInterval.

@@ +112,5 @@
> +// Then, check the skew difference overflow is in this adjusted limit:
> +//   ABS((QPC_now - GTC_now) - (QPC_calib - GTC_calib)) - THRESHOLD < LIMIT
> +//
> +// Thresholds are calculated dynamically, see sUnderrunThreshold and
> +// sOverrunThreshold bellow.

s/bellow/below
Comment 61 Emanuel Hoogeveen [:ehoogeveen] 2011-12-05 13:02:22 PST
(In reply to Rob Arnold [:robarnold] from comment #60)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #58)
> > (In reply to Honza Bambas (:mayhemer) from comment #57)
> > > (In reply to Jeff Muizelaar [:jrmuizel] from comment #56)
> > > > We should be able to use something like the following to detect situations
> > > > where QPC is going to work reliably:
> > > 
> > > How this prevents potential problems with dis-synchronicity between
> > > processor cores?
> > 
> > I believe it is spec'ed to have all cores in sync, but I'll try to find a
> > reference.
> 
> Any update on this? Seems like it could simplify this code if we can detect
> a bad QPC source at initialization time.
In the Intel Processor Identification and the CPUID Instruction [1] document it states in section 5.2.6 on page 44 that "In the Core i7 and future processor generations, the TSC will continue to run in the deepest C-states. Therefore, the TSC will run at a constant rate in all ACPI P-, C-. and T-states. Support for this feature is indicated by CPUID.(EAX=80000007h):EDX[8]. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource." Which operating systems make use of the constant rate TSC where available, I don't know.

[1] http://www.intel.com/content/www/us/en/processors/processor-identification-cpuid-instruction-note.html
Comment 62 Honza Bambas (:mayhemer) 2011-12-05 13:13:01 PST
(In reply to Emanuel Hoogeveen from comment #61)
> Which
> operating systems make use of the constant rate TSC where available, I don't
> know.

On Windows 7 and Intel Core i7 I have just checked that the system is using RDTSC instruction.  There is some conditional code around, hard to read in asm, but it may fallback dynamically to something different.

Said again, I don't think it is a good idea to again try to invent a wheel here.  I believe the Windows implementation is mostly optimal and we would just copy the Microsoft's work.
Comment 63 Honza Bambas (:mayhemer) 2011-12-06 03:23:18 PST
Rob, thanks a lot for all the comments and reviews!  I reported all the followups needed.  Priority is the telemetry.

Landed as:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a59028a3510
Comment 64 Ed Morley [:emorley] 2011-12-06 18:37:45 PST
Backed out of inbound on suspicion of causing Win PGO builds to fail (the only other candidate remaining after philor manually triggered a load of PGO builds is bug 703024):
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=9a59028a3510
https://tbpl.mozilla.org/php/getParsedLog.php?id=7774288&tree=Mozilla-Inbound

https://hg.mozilla.org/integration/mozilla-inbound/rev/242ee41d7e5f

Will trigger a PGO build on inbound tip now, to see if it was indeed the cause.
Comment 65 Ed Morley [:emorley] 2011-12-07 01:22:35 PST
Backing this out fixed win PGO builds:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=242ee41d7e5f
Comment 67 Ed Morley [:emorley] 2011-12-09 06:53:49 PST
https://hg.mozilla.org/mozilla-central/rev/8165a0e15699
Comment 68 Rong 2012-11-13 09:24:33 PST
Hello, I notice two lines of code in standbyobserver.Observe, which is in TimeStamp_windows.cpp:

   298   sFallBackToGTC = false;
   299   sForceRecalibrate = true;

I'm using an intel atom N270 netbook. Before the standby, windbg shows sFallBackToGTC  is true, but after standby, it becomes false due to the above code, and then the bug 765215 appears. Then I change the value using windbg, and firefox come back to normal.

Is there any specific reason to put "sFallBackToGTC = false" after resume from standby? If we already set it to be true previously, it means QPC is not a good option, so why bother to re-check after each standby and cause more trouble?

Thanks a lot!
Comment 69 Jeff Muizelaar [:jrmuizel] 2012-11-15 22:58:44 PST
(In reply to Rong from comment #68)
> Hello, I notice two lines of code in standbyobserver.Observe, which is in
> TimeStamp_windows.cpp:
> 
>    298   sFallBackToGTC = false;
>    299   sForceRecalibrate = true;
> 
> I'm using an intel atom N270 netbook. Before the standby, windbg shows
> sFallBackToGTC  is true, but after standby, it becomes false due to the
> above code, and then the bug 765215 appears. Then I change the value using
> windbg, and firefox come back to normal.
> 
> Is there any specific reason to put "sFallBackToGTC = false" after resume
> from standby? If we already set it to be true previously, it means QPC is
> not a good option, so why bother to re-check after each standby and cause
> more trouble?

I'm not sure why we recheck QPC, but it would be good to understand what's going wrong here. What are the values of gtc and qpc before and after going to standby?
Comment 70 Rong 2012-11-19 09:21:46 PST
> I'm not sure why we recheck QPC, but it would be good to understand what's
> going wrong here. What are the values of gtc and qpc before and after going
> to standby?

I'm not able to give very accurate numbers, as the I've to break/continue the firefox in windbg before going to sleep.

0x13d0f4`c8c347f8 qpc before standby
0x928cb0`7e4189ea  result before standby
0xb504710`012e5e75 qpc after standby
0x4636af2`94814788 result after standby

After went through the code, I got more and more confused. Why we call PerformanceCounter() whenever we need a timestamp, even sFallBackToGTC is true? Isn't PerformanceCounter() quite costly, especially in XP?

Can we have a toggle key in about:config, or some start up parameter to disable this High res timer? I can see it is kind of immature, and should not be shipped without the option to disable it.
Comment 71 Honza Bambas (:mayhemer) 2012-11-20 05:52:20 PST
Sorry for later reply guys, I was occupied.

(In reply to Rong from comment #68)
> Hello, I notice two lines of code in standbyobserver.Observe, which is in
> TimeStamp_windows.cpp:
> 
>    298   sFallBackToGTC = false;
>    299   sForceRecalibrate = true;
> 
> I'm using an intel atom N270 netbook. Before the standby, windbg shows
> sFallBackToGTC  is true, but after standby, it becomes false due to the
> above code, and then the bug 765215 appears. 

This is interesting info.  Are you sure execution of this code is responsible for the deadlock?

> Then I change the value using
> windbg, and firefox come back to normal.

What exactly are you doing?  Is that like:
- attach windbg to Firefox that is in a deadlocked state
- then change the value of sFallBackToGTC to false
- Firefox suddenly runs again

Could you just try without the second step?

What is the main thread callstack?

> 
> Is there any specific reason to put "sFallBackToGTC = false" after resume
> from standby? 

Yes there is and it is written in the code.

On Firefox startup we take both QPC and GTC converted to milliseconds (or just same units) and save the difference.  We check different of QPC and GTC on every read of TimeStamp::Now() to detect that QPC jitters.  GTC is considered stable.

However, after wake up, QPC and GTC may jump each a different direction so the difference (skew) changes past our acceptable threshold.

It means that after waking up we may *false-negative* skew jump detection, hence we fallback to GTC.

Later we get information about being woken from sleep, it, however, arrives to the main thread after some time the wake-up has happened (may be even seconds).

Hence, we give QPC a second chance when we detect wake up.


> If we already set it to be true previously, it means QPC is
> not a good option, so why bother to re-check after each standby and cause
> more trouble?

Explained above.

We can check how long it is we've detected an unacceptable jitter in QPC and GTC different (skew) and when it was really caused by wake-up, do recalibration.  Otherwise not.

> 
> Thanks a lot!


(In reply to Rong from comment #70)
> After went through the code, I got more and more confused. Why we call
> PerformanceCounter() whenever we need a timestamp, even sFallBackToGTC is
> true? 

Good point, might be an optimization.

> Isn't PerformanceCounter() quite costly, especially in XP?

It is.


> 
> Can we have a toggle key in about:config, or some start up parameter to
> disable this High res timer? I can see it is kind of immature, and should
> not be shipped without the option to disable it.

Not a bad idea either.  I'll think of it and introduce a bug.
Comment 72 Rong 2012-11-20 12:50:47 PST
> >    298   sFallBackToGTC = false;
> >    299   sForceRecalibrate = true;
> This is interesting info.  Are you sure execution of this code is
> responsible for the deadlock?
> What exactly are you doing?  Is that like:
> - attach windbg to Firefox that is in a deadlocked state
> - then change the value of sFallBackToGTC to false
> - Firefox suddenly runs again
No, this is what I did:
1. attach windbg to firefox before go to sleep
2. set a break point at StandbyObserver::Observe
3. Go to sleep and wait for a few minutes
4. Wake up, and firefox hits the breakpoint
5. Use step-over to go just after the two line: sFallBackToGTC = false, sForceRecalibrate = true
6. Edit the memory directly using windbg, change sFallBackToGTC to true, and sForceRecalibrate to false
so effectively I removed the impact of this two lines.
7. continue firefox

Originally, there is a more than 80% chance that ff hangs after standby, but I've test this method for more than 6 times, no hang at all.
But something is STRANGE: when the laptop wake up at the 2nd time, the StandbyObserver::Observe was hit twice. I mean it jumps from 7 to 5 in above flow, so I have to change the values for twice.
   at the third wake up, I changed them 4 times
   at the forth wake up, I changed them 7 or 8 times(can't remember exactly, but must be 7 or 8).
Is there multiple instance of StandbyObserver going on?


> What is the main thread callstack?
The call stack is pretty normal, same as without going to sleep. 


> Yes there is and it is written in the code.
> 
> On Firefox startup we take both QPC and GTC converted to milliseconds (or
> just same units) and save the difference.  We check different of QPC and GTC
> on every read of TimeStamp::Now() to detect that QPC jitters.  GTC is
> considered stable.
If my understanding is correct, StandbyObserver::Observe is called at the very beginning after wake up, it is before any Now() is called. Right?

> However, after wake up, QPC and GTC may jump each a different direction so
> the difference (skew) changes past our acceptable threshold.
> It means that after waking up we may *false-negative* skew jump detection,
> hence we fallback to GTC.
Maybe we can do a direct calibration inside this Observe(), conditional on QPC is useful in the past(hense sFallBackToGTC is true in the past). As Now() is never called before Observe(), there will be no *false-negative* skew jump detection.

If we don't calibrate inside Observe(), QPC will be used for at least once, and sSkew will become some funny number, (see http://hg.mozilla.org/releases/mozilla-release/file/e0c8343d2809/xpcom/ds/TimeStamp_windows.cpp#l463 line 463) 
And after I search the whole file, I didn't find any check on skew value itself. What do you mean by "> the difference (skew) changes past our acceptable threshold."?

Can you help to build a version will the two line removed, and timestamp log enabled? It will help to further confirm the problem. Thanks a lot!
Comment 73 Rong 2012-11-20 14:10:32 PST
> But something is STRANGE: when the laptop wake up at the 2nd time, the
> StandbyObserver::Observe was hit twice. I mean it jumps from 7 to 5 in above
> flow, so I have to change the values for twice.
>    at the third wake up, I changed them 4 times
>    at the forth wake up, I changed them 7 or 8 times(can't remember exactly,
> but must be 7 or 8).
> Is there multiple instance of StandbyObserver going on?
> 
Sorry, this is my fault. I changed a wider range of memory than that needed(I used 'ew', but shoud be 'eb'). After change the memory correctly, this come back to normal.

I just found a even simpler way to stop the hanging:
- attach windbg to Firefox that is in a hanging state, break ff
- then change the value of sSkew from some very negative number(like -45979215208828000 type int64) to 0, continue ff
- Firefox back to normal
Comment 74 Honza Bambas (:mayhemer) 2012-11-21 03:55:07 PST
(In reply to Rong from comment #73)
> I just found a even simpler way to stop the hanging:
> - attach windbg to Firefox that is in a hanging state, break ff
> - then change the value of sSkew from some very negative number(like
> -45979215208828000 type int64) to 0, continue ff
> - Firefox back to normal

Thanks!  This is the right kind of information I needed.  I'll try to figure out what is wrong here and fix it.
Comment 75 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-12-21 13:35:51 PST
(In reply to Honza Bambas (:mayhemer) from comment #74)
> Thanks!  This is the right kind of information I needed.  I'll try to figure
> out what is wrong here and fix it.

Sounds like we should get a separate bug on file to track this? Wouldn't want to lose track of it in a FIXED bug.
Comment 76 Honza Bambas (:mayhemer) 2012-12-22 06:54:57 PST
(In reply to :Gavin Sharp (use gavin@gavinsharp.com for email) from comment #75)
> Sounds like we should get a separate bug on file to track this? Wouldn't
> want to lose track of it in a FIXED bug.

It's bug 765215.

Note You need to log in before you can comment on or make changes to this bug.