Closed Bug 784859 Opened 12 years ago Closed 12 years ago

Make the Windows TimeStamp implementation faster

Categories

(Core :: XPCOM, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla18

People

(Reporter: ehsan.akhgari, Assigned: ehsan.akhgari)

Details

Attachments

(6 files, 9 obsolete files)

2.32 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
4.81 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
7.46 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
2.46 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
5.15 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
2.77 KB, patch
bbondy
: review+
Details | Diff | Splinter Review
I and Jeff spent some time today to improve our Windows TimeStamp implementation.  Here is the major changes that we have implemented:

* Use QueryPerformanceCounter directly if the machine has a stable TSC
* Use GetTickCount64 if the platform supports it, and have a fallback for Windows XP which simulates it by calling GetTickCount.
* Avoid locking in almost all calls to TimeStamp::Now for machines without a stable TSC by doing atomic operations.

I'll attach the patches here, and I'll ask Honza for review since he wrote the original code.
Assignee: nobody → ehsan
Status: NEW → ASSIGNED
Attachment #654420 - Flags: review?(honzab.moz)
Attachment #654423 - Flags: review?(honzab.moz)
Attachment #654425 - Flags: review?(honzab.moz)
Comment on attachment 654421 [details] [diff] [review]
Part 2: Avoid locking to store the computed result in the global variable in CalibratedPerformanceCounter

Review of attachment 654421 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +517,5 @@
> +  do {
> +    readValue = InterlockedAdd64(destination, 0);
> +    if (readValue > newValue)
> +      return readValue;
> +  } while (readValue != InterlockedCompareExchange64(destination, newValue, readValue));

This will need something like:

+inline LONGLONG _cdecl 
+MozInterlockedCompareExchange64(LONGLONG volatile *Destination,
+                                LONGLONG Exchange,
+                                LONGLONG Comparand)
+{
+  __asm {
+    lea esi,[Comparand]
+    lea edi,[Exchange]
+    mov eax,[esi]
+    mov edx,[esi+4]
+    mov ebx,[edi]
+    mov ecx,[edi+4]
+    mov esi,[Destination]
+    lock cmpxchg8b [esi]
+  }
+}
+
+inline LONGLONG _cdecl 
+MozInterlockedGet64(LONGLONG volatile *Source)
+{
+  __asm {
+    xor eax,eax
+    xor edx,edx
+    xor ebx,ebx
+    xor ecx,ecx
+    mov esi,[Source]
+    lock cmpxchg8b [esi]
+  }
+}
+

to work on XP which doesn't have InterlockedCompareExchange64.

InterlockedAdd64 is only supported on Itanium
(In reply to Jeff Muizelaar [:jrmuizel] from comment #7)
> Comment on attachment 654421 [details] [diff] [review]
> Part 2: Avoid locking to store the computed result in the global variable in
> CalibratedPerformanceCounter
> 
> Review of attachment 654421 [details] [diff] [review]:
> -----------------------------------------------------------------
> +
> 
> to work on XP which doesn't have InterlockedCompareExchange64.
> 
> InterlockedAdd64 is only supported on Itanium

We might also be able to use __InterlockedCompareExchange64
This version of the patches actually compiles!  And it's improved much.
Attachment #654420 - Attachment is obsolete: true
Attachment #654421 - Attachment is obsolete: true
Attachment #654422 - Attachment is obsolete: true
Attachment #654423 - Attachment is obsolete: true
Attachment #654424 - Attachment is obsolete: true
Attachment #654425 - Attachment is obsolete: true
Attachment #654420 - Flags: review?(honzab.moz)
Attachment #654421 - Flags: review?(honzab.moz)
Attachment #654422 - Flags: review?(honzab.moz)
Attachment #654423 - Flags: review?(honzab.moz)
Attachment #654424 - Flags: review?(honzab.moz)
Attachment #654425 - Flags: review?(honzab.moz)
Attachment #654857 - Flags: review?(honzab.moz)
Would this also fix bug 603872?
(In reply to Trev from comment #15)
> Would this also fix bug 603872?

No. Date() does not use TimeStamp::Now()
Comment on attachment 654857 [details] [diff] [review]
Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC

Brian, could you please take a stab at reviewing these patches?  The code in question is fairly localized, and relatively easy to understand.  :-)
Attachment #654857 - Flags: review?(honzab.moz) → review?(netzen)
Attachment #654858 - Flags: review?(honzab.moz) → review?(netzen)
Attachment #654859 - Flags: review?(honzab.moz) → review?(netzen)
Attachment #654860 - Flags: review?(honzab.moz) → review?(netzen)
Attachment #654861 - Flags: review?(honzab.moz) → review?(netzen)
Attachment #654862 - Flags: review?(honzab.moz) → review?(netzen)
Comment on attachment 654857 [details] [diff] [review]
Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC

Review of attachment 654857 [details] [diff] [review]:
-----------------------------------------------------------------

As far as I understand QueryPerformanceCounter is a safe API that uses the time stamp counter register only when it's safe. 
Otherwise it provides a safe and fast alternative.  I don't understand why it would be gated on the same conditions as the TSC.
Using the TSC register directly would be faster than using QueryPerformanceCounter if it was safe.

See here: http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx

I suspect I'm missing some known information though.
(In reply to Brian R. Bondy [:bbondy] from comment #18)
> Comment on attachment 654857 [details] [diff] [review]
> Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC
> 
> Review of attachment 654857 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> As far as I understand QueryPerformanceCounter is a safe API that uses the
> time stamp counter register only when it's safe. 
> Otherwise it provides a safe and fast alternative.  I don't understand why
> it would be gated on the same conditions as the TSC.
> Using the TSC register directly would be faster than using
> QueryPerformanceCounter if it was safe.
> 
> See here: http://msdn.microsoft.com/en-us/library/ee417693%28VS.85%29.aspx
> 
> I suspect I'm missing some known information though.

The problem is that QueryPerformanceCounter isn't always safe. So we need to know when it's actually safe to use.
> The problem is that QueryPerformanceCounter isn't always safe. 
> So we need to know when it's actually safe to use.

In comment 18 I'm really asking for a resource that can explain exactly why it's not safe.  I suspect there is some documentation we're going off of, I just don't know where to find it.

The first patch seems to assume that QueryPerformanceCounter and TSC is the same thing, when it's not. 
And if we know TSC is safe, why use QueryPerformanceCounter at all?

The only remarks about QueryPerformanceCounter on its MSDN page about a problem with it is:
> On a multiprocessor computer, it should not matter which processor is called. 
> However, you can get different results on different processors due to bugs in the 
> basic input/output system (BIOS) or the hardware abstraction layer (HAL)

Which may not be an actual be a show stopper for using it.

The other page I mentioned above in Comment 18 states problems with TSC and recommends to use QueryPerformanceCounter instead.
The reason I'm asking for the reason QueryPerformanceCounter (QPC) is broken by the way, is because I'm being asked to review a patch that may not be using the correct heuristic for checking if QPC works or not.
Comment on attachment 654857 [details] [diff] [review]
Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC

Review of attachment 654857 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +38,5 @@
> +
> +  __cpuid(regs, 0x80000007);
> +  // if bit 8 is set than TSC will run at a constant rate
> +  // in all ACPI P-state, C-states and T-states
> +  return regs[4] & (1 << 8);

regs[4] is out of bounds!
Attachment #654857 - Flags: review?(netzen) → review-
Comment on attachment 654858 [details] [diff] [review]
Part 2: Avoid locking to store the computed result in the global variable in CalibratedPerformanceCounter

Review of attachment 654858 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +39,5 @@
> +
> +  __cpuid(cpuInfo.regs, 0);
> +  // Only allow Intel CPUs for now
> +  if (memcmp(cpuInfo.cpuString, "GenuineIntel", sizeof(cpuInfo.cpuString)))
> +    return false;

This needs to be a case insensitive compare, my CPU returns genuineintel for these 8 bytes.
Attachment #654858 - Flags: review?(netzen) → review-
*12 bytes
Comment on attachment 654859 [details] [diff] [review]
Part 3: Refactor TickCount64 to make its signature similar to GetTickCount64

Review of attachment 654859 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +397,5 @@
>  // @param gtc
>  // Result of GetTickCount().  Passing it as an arg lets us call it out
>  // of the common mutex.
>  static inline ULONGLONG
> +TickCount64()

While we're here, maybe we should just use GetTickCount64 when it's available? (Vista and later)
Comment on attachment 654860 [details] [diff] [review]
Part 4: Use the native GetTickCount64 function where available

Review of attachment 654860 [details] [diff] [review]:
-----------------------------------------------------------------

Awesome! Ignore last comment.
Attachment #654860 - Flags: review?(netzen) → review+
Attachment #654859 - Flags: review?(netzen) → review+
Comment on attachment 654861 [details] [diff] [review]
Part 5: Change the implementation of GetTickCount64Fallback so that it never locks

Review of attachment 654861 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +413,5 @@
> +        newValue = (oldTop + (1ULL<<32)) | newBottom;
> +    } else {
> +        newValue = oldTop | newBottom;
> +    }
> +  } while (old != _InterlockedCompareExchange64(reinterpret_cast<volatile __int64*> (sLastGTCResult),

Use &sLastGTCResult here not sLastGTCResult, otherwise it looks good. r=me with that.
Attachment #654861 - Flags: review?(netzen)
Attachment #654862 - Flags: review?(netzen) → review+
(In reply to Brian R. Bondy [:bbondy] from comment #22)
> Comment on attachment 654857 [details] [diff] [review]
> Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC
> 
> Review of attachment 654857 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +38,5 @@
> > +
> > +  __cpuid(regs, 0x80000007);
> > +  // if bit 8 is set than TSC will run at a constant rate
> > +  // in all ACPI P-state, C-states and T-states
> > +  return regs[4] & (1 << 8);
> 
> regs[4] is out of bounds!

Oops :D
Attachment #654857 - Attachment is obsolete: true
Attachment #659358 - Flags: review?(netzen)
(In reply to Brian R. Bondy [:bbondy] from comment #23)
> Comment on attachment 654858 [details] [diff] [review]
> Part 2: Avoid locking to store the computed result in the global variable in
> CalibratedPerformanceCounter
> 
> Review of attachment 654858 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +39,5 @@
> > +
> > +  __cpuid(cpuInfo.regs, 0);
> > +  // Only allow Intel CPUs for now
> > +  if (memcmp(cpuInfo.cpuString, "GenuineIntel", sizeof(cpuInfo.cpuString)))
> > +    return false;
> 
> This needs to be a case insensitive compare, my CPU returns genuineintel for
> these 8 bytes.

Fixed.  Also, I had the ordering of registers wrong, and I fixed that as well.
Attachment #654858 - Attachment is obsolete: true
Attachment #659359 - Flags: review?(netzen)
Attachment #659358 - Flags: review?(netzen) → review+
Comment on attachment 659359 [details] [diff] [review]
Part 2: Avoid locking to store the computed result in the global variable in CalibratedPerformanceCounter

Review of attachment 659359 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +41,5 @@
> +  // Only allow Intel CPUs for now
> +  // The order of the registers is reg[1], reg[3], reg[2].  We just adjust the
> +  // string so that we can compare in one go.
> +  if (_strnicmp(cpuInfo.cpuString, "GenuntelineI", sizeof(cpuInfo.cpuString)))
> +    return false;

Lookeat!s Gr

:)
Attachment #659359 - Flags: review?(netzen) → review+
Attachment #659360 - Flags: review+
It would've been nice to get actual performance numbers for this before checking it in.
Comment on attachment 659358 [details] [diff] [review]
Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC

Review of attachment 659358 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +33,5 @@
> +
> +  // detect if the Advanced Power Management feature is supported
> +  __cpuid(regs, 0x80000000);
> +  if (regs[0] < 0x80000007)
> +          return false;

This sure is indented strangely.
Comment on attachment 659359 [details] [diff] [review]
Part 2: Avoid locking to store the computed result in the global variable in CalibratedPerformanceCounter

(In reply to :Ms2ger from comment #36)
> Comment on attachment 659358 [details] [diff] [review]
> Part 1: Use QueryPerformanceCounter directly if the machine has a stable TSC
> 
> Review of attachment 659358 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: xpcom/ds/TimeStamp_windows.cpp
> @@ +33,5 @@
> > +
> > +  // detect if the Advanced Power Management feature is supported
> > +  __cpuid(regs, 0x80000000);
> > +  if (regs[0] < 0x80000007)
> > +          return false;
> 
> This sure is indented strangely.


>   // detect if the Advanced Power Management feature is supported
>   __cpuid(regs, 0x80000000);
>   if (regs[0] < 0x80000007)
>-          return false;
>+    return false;


:-)
Comment on attachment 659359 [details] [diff] [review]
Part 2: Avoid locking to store the computed result in the global variable in CalibratedPerformanceCounter

Review of attachment 659359 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +598,2 @@
>  
> +  return AtomicStoreIfGreaterThan(&sLastResult, result);

For original QPC integration and calibration I had a code that was only using Interlock* functions.  It was horribly slow, since Interlock is slow the same way as critical section entering.

Now you have critical section AND interlock, it is imo slower now.

I know lock is later removed at all, but in cost of adding just more Interlocking*...
Comment on attachment 659360 [details] [diff] [review]
Part 5: Change the implementation of GetTickCount64Fallback so that it never locks

Review of attachment 659360 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +434,2 @@
>  
> +  return newValue;

I had a similar (maybe better ;)) code that was using Interlocks* but that was much much slower then when I had just a single lock.  So I abandoned this approach.

So, IMO this is regressing performance and may also bring some gtc/qpc distance issues that are critical to checking QPC jitter...
Comment on attachment 654862 [details] [diff] [review]
Part 6: Remove the need for locking in all calls to TimeStamp::Now

Review of attachment 654862 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/ds/TimeStamp_windows.cpp
@@ +199,5 @@
> +    bool fallBackToGTC;
> +    bool forceRecalibrate;
> +  } flags;
> +  uint32_t dwordValue;
> +} sCalibrationFlags;

Why then there aren't just two uint32_t global vars rather?

@@ +507,1 @@
>        RecordFlaw();

RecordFlaw does sCalibrationFlags.flags.fallBackToGTC = true.  That can be turned to 32 bit write that may be freely concurent since we only switch to |true|.

Hence I think you don't need lock here.

@@ +519,2 @@
>  
> +    sCalibrationFlags.flags.forceRecalibrate = false;

Why are you actually locking here?

@@ +574,1 @@
>    LONGLONG diff = qpc - ms2mt(gtc) - sSkew;

Read access to 64bit sSkew is atomic?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: