Closed Bug 1457882 Opened 6 years ago Closed 6 years ago

Mutex performance is poor on OSX when heavily contended

Categories: (Core :: mozglue, defect)
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla61
Tracking Status: firefox61 --- fixed
People: (Reporter: jonco, Assigned: jonco)
Attachments: (2 files, 1 obsolete file)

As mentioned in bug 1353787, I noticed that mutex performance under contention is really bad on OSX.  The benchmark I'm using is a synthetic JS parsing test which results in multiple threads spending most of their time trying to acquire a single mutex.

Results from an MBP with 8 hyperthreads (first column is the number of threads; "cofv" is the coefficient of variation):

    1 0.892 0.852 0.844 0.858 0.851 0.856 0.855 0.847 0.849 0.854 mean: 0.856 cofv: 1.49%
    2 3.33 3.38 3.35 3.33 3.22 3.20 3.23 3.25 3.21 3.22 mean: 3.27 cofv: 1.93%
    3 3.71 3.69 3.71 3.67 3.63 3.68 3.64 3.61 3.68 3.65 mean: 3.67 cofv: 0.916%
    4 4.10 4.10 4.08 4.05 4.07 4.05 4.05 4.04 4.05 4.07 mean: 4.06 cofv: 0.473%
    5 4.11 4.11 4.08 4.07 4.05 4.04 4.06 4.03 4.04 4.04 mean: 4.06 cofv: 0.629%
    6 4.12 4.11 4.09 4.07 4.05 4.03 4.05 4.05 4.06 4.06 mean: 4.07 cofv: 0.679%
    7 4.11 4.13 4.09 4.07 4.06 4.08 4.07 4.03 4.04 4.06 mean: 4.07 cofv: 0.704%
    8 4.09 4.09 4.06 4.08 4.05 4.05 4.04 4.05 4.03 4.05 mean: 4.06 cofv: 0.515%
    9 4.16 4.16 4.12 4.11 4.10 4.09 4.08 4.11 4.11 4.09 mean: 4.11 cofv: 0.606%
    10 4.12 4.09 4.05 4.07 4.06 4.02 4.04 4.04 4.02 4.06 mean: 4.06 cofv: 0.700%

For comparison, running the same benchmark on a Linux machine (also 8 hyperthreads):

    1 1.05 1.09 0.994 0.906 1.02 1.01 0.905 0.909 0.933 1.03 mean: 0.984 cofv: 6.41%
    2 0.738 0.858 0.841 0.815 0.802 0.819 0.790 0.777 0.674 0.805 mean: 0.792 cofv: 6.36%
    3 0.745 0.794 0.769 0.762 0.767 0.769 0.655 0.722 0.783 0.762 mean: 0.753 cofv: 4.98%
    4 0.743 0.764 0.730 0.733 0.732 0.775 0.734 0.754 0.755 0.735 mean: 0.746 cofv: 1.99%
    5 0.783 0.790 0.774 0.777 0.782 0.766 0.751 0.783 0.773 0.769 mean: 0.775 cofv: 1.36%
    6 0.803 0.816 0.790 0.785 0.796 0.781 0.772 0.781 0.802 0.783 mean: 0.791 cofv: 1.59%
    7 0.823 0.834 0.804 0.817 0.805 0.811 0.796 0.815 0.806 0.818 mean: 0.813 cofv: 1.27%
    8 0.849 0.865 0.849 0.854 0.858 0.849 0.845 0.855 0.863 0.862 mean: 0.855 cofv: 0.768%
    9 0.894 0.908 0.875 0.885 0.889 0.882 0.873 0.883 0.878 0.860 mean: 0.883 cofv: 1.39%
    10 0.913 0.934 0.909 0.909 0.921 0.915 0.914 0.917 0.928 0.919 mean: 0.918 cofv: 0.829%

On OSX the time taken to run the benchmark increases by as much as a factor of 4.8 when adding more threads, whereas adding threads on Linux always decreases the runtime, at worst by only 7%.
Attached patch benchmark — Splinter Review
The benchmark.  It can be run with js/src/benchmark.sh.
Attached patch bug1457882-macos-adaptive-lock (obsolete) — Splinter Review
I saw that on Linux (and FreeBSD) we use the PTHREAD_MUTEX_ADAPTIVE_NP option to get an adaptive mutex.  We can try to emulate that feature ourselves on OSX.
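
For context, this is roughly how the adaptive type is requested on Linux (a sketch only; the helper name is mine, and PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension that OSX doesn't provide):

    #include <pthread.h>

    // Ask glibc for an adaptive mutex: it spins briefly on contention
    // before sleeping in the kernel.
    static void InitAdaptiveMutex(pthread_mutex_t* aMutex)
    {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
    #ifdef PTHREAD_MUTEX_ADAPTIVE_NP
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    #endif
      pthread_mutex_init(aMutex, &attr);
      pthread_mutexattr_destroy(&attr);
    }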

There's a description of the rationale behind this and how it works here: https://stackoverflow.com/a/25168942

It's implemented here in glibc inside __pthread_mutex_lock: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;hb=HEAD#l118

This patch is a pretty straightforward reimplementation of the same algorithm using the same constants.
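
For reference, here's a minimal sketch of that algorithm (illustrative names, x86-only pause hint; not the patch itself): spin at most min(100, 2 * average + 10) times before blocking, then fold the observed spin count into a running average.

    #include <pthread.h>
    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    class AdaptiveMutex
    {
      pthread_mutex_t mMutex = PTHREAD_MUTEX_INITIALIZER;
      // Running average of spins needed to take the lock.  Relaxed is fine:
      // it's only a heuristic, so races cost performance, not correctness.
      std::atomic<int32_t> mAverageSpins{0};

    public:
      void lock()
      {
        if (pthread_mutex_trylock(&mMutex) == 0) {
          return;  // Uncontended fast path.
        }

        const int32_t kMaxSpinCount = 100;  // glibc's limit
        int32_t average = mAverageSpins.load(std::memory_order_relaxed);
        int32_t maxSpins = std::min(kMaxSpinCount, average * 2 + 10);
        int32_t count = 0;
        do {
          if (count >= maxSpins) {
            pthread_mutex_lock(&mMutex);  // Give up and block in the kernel.
            break;
          }
          asm volatile("rep; nop");  // PAUSE: hint that we're spin-waiting.
          count++;
        } while (pthread_mutex_trylock(&mMutex) != 0);

        // Move the average an eighth of the way towards the observed count.
        mAverageSpins.store(average + (count - average) / 8,
                            std::memory_order_relaxed);
      }

      void unlock() { pthread_mutex_unlock(&mMutex); }
    };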

Benchmark results with this change:

    1 0.851 0.850 0.844 0.848 0.834 0.831 0.835 0.838 0.851 0.842 mean: 0.842 cofv: 0.851%
    2 0.600 0.578 0.580 0.569 0.565 0.574 0.567 0.576 0.568 0.567 mean: 0.574 cofv: 1.71%
    3 0.525 0.496 0.487 0.489 0.484 0.487 0.487 0.486 0.487 0.491 mean: 0.492 cofv: 2.33%
    4 0.536 0.541 0.527 0.536 0.512 0.532 0.529 0.534 0.530 0.535 mean: 0.531 cofv: 1.41%
    5 0.672 0.623 0.604 0.604 0.638 0.636 0.601 0.628 0.619 0.596 mean: 0.622 cofv: 3.51%
    6 1.71 1.67 1.82 1.71 1.74 1.59 1.50 1.58 1.56 1.59 mean: 1.65 cofv: 5.66%
    7 3.40 3.62 3.56 3.35 3.63 3.50 3.36 3.40 3.51 3.45 mean: 3.48 cofv: 2.86%
    8 4.89 4.86 4.74 4.81 4.79 4.59 4.73 4.63 4.70 4.65 mean: 4.74 cofv: 1.97%
    9 5.09 5.08 5.04 5.03 5.02 5.03 5.04 5.05 5.06 5.05 mean: 5.05 cofv: 0.399%
    10 5.14 5.12 5.12 5.14 5.11 5.11 5.09 5.13 5.11 5.12 mean: 5.12 cofv: 0.302%

Performance is much better for up to 5 threads.  As the number of threads approaches the number of cores/hyperthreads, performance degrades to a little worse than before, but that is not the common case.

Talos testing suggests this might be a win on tp6_facebook (which does use parallel JS parsing), but only with medium confidence.  Try is otherwise green.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&newProject=try&newRevision=c492de7b26be14397f2fba9f2f821721f783fcd3&framework=1&filter=tp6&showOnlyComparable=1&selectedTimeRange=172800

This feels like quite a drastic thing to do, but it does look like it could be a decent performance win.
Comment on attachment 8972272 [details] [diff] [review]
bug1457882-macos-adaptive-lock

Review of attachment 8972272 [details] [diff] [review]:
-----------------------------------------------------------------

Kind of frightening, but if it improves OS X lock performance, I think that's a good thing.  Some comments below.

::: mozglue/misc/Mutex_posix.cpp
@@ +31,5 @@
>    }
>  
> +#ifdef XP_DARWIN
> +
> +static uint32_t sCPUCount = 0;

This should probably be Atomic<uint32_t, Relaxed>, since we'll be touching this from multiple threads, and the Relaxed-ness of this should be documented.
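
For instance (a sketch; the rationale wording is illustrative):

    #include "mozilla/Atomics.h"

    // Number of logical CPUs, used only to decide whether spinning is
    // worthwhile.  Relaxed ordering is fine: the value is written once at
    // initialization, and a stale read merely yields a slightly suboptimal
    // spin decision.
    static mozilla::Atomic<uint32_t, mozilla::MemoryOrdering::Relaxed> sCPUCount(0);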

@@ +138,5 @@
>  {
> +#ifndef XP_DARWIN
> +  mutexLock();
> +#else
> +  // Emulate glibc's PTHREAD_MUTEX_ADAPTIVE_NP on OSX.

This comment is helpful for explaining what's going on below, but it doesn't explain why we're doing this.  Can you add some explanatory text for that?

@@ +140,5 @@
> +  mutexLock();
> +#else
> +  // Emulate glibc's PTHREAD_MUTEX_ADAPTIVE_NP on OSX.
> +
> +  if (sCPUCount == 0) {

Is this condition ever going to be true, since sCPUCount is initialized in the constructor to something non-zero?

@@ +155,5 @@
> +      if (count >= maxSpins) {
> +        mutexLock();
> +        break;
> +      }
> +      asm("rep; nop"); // Hint that we're spinning.

I expected to see asm("pause") here, but looking at the Intel manuals suggests that the encoding of `pause` is identical to `rep; nop`.  I think it'd be a little clearer to use `pause`...but then I guess that could be read as a `sleep()` call or similar?  WDYT?
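
(For reference, both spellings assemble to the same bytes, F3 90, so either works; hypothetical helper names below:)

    // Identical machine code on x86: a spin-wait hint to the CPU.
    static inline void CpuRelax()    { asm volatile("pause"); }
    static inline void CpuRelaxAlt() { asm volatile("rep; nop"); }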

::: mozglue/misc/PlatformMutex.h
@@ +55,5 @@
>    static_assert(sizeof(pthread_mutex_t) / sizeof(void*) != 0 &&
>                  sizeof(pthread_mutex_t) % sizeof(void*) == 0,
>                  "pthread_mutex_t must have pointer alignment");
> +#ifdef XP_DARWIN
> +  mozilla::Atomic<int32_t, mozilla::MemoryOrdering::Relaxed> averageSpins;

Relaxed atomic variables should be documented with some sort of rationale as to why Relaxed-ness is OK.
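
One possible wording (a sketch, matching the member in the hunk above):

    #ifdef XP_DARWIN
      // Running average of the spin count needed to take the lock, used by
      // the adaptive-lock heuristic.  Relaxed ordering is OK because this
      // value only tunes spinning: a racy or stale read affects performance,
      // never correctness.
      mozilla::Atomic<int32_t, mozilla::MemoryOrdering::Relaxed> averageSpins;
    #endif
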
Summary: Mutex performance is poor on OSX when heavily contented → Mutex performance is poor on OSX when heavily contended
Thanks for the comments.  Here's an updated patch.

> > +  if (sCPUCount == 0) {
> 
> Is this condition ever going to be true

Nope, I meant to compare with 1.  Cheers!
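
The corrected guard looks roughly like this (using the helpers from the patch):

    if (sCPUCount == 1) {
      // On a single-CPU system the lock holder can't run while we spin,
      // so fall back to a blocking lock immediately.
      mutexLock();
      return;
    }
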
Attachment #8972272 - Attachment is obsolete: true
Attachment #8972367 - Flags: review?(nfroyd)
Comment on attachment 8972367 [details] [diff] [review]
bug1457882-macos-adaptive-lock v2

Review of attachment 8972367 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8972367 - Flags: review?(nfroyd) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef231f04c4ec
Emulate glibc adaptive mutexes on OSX r=nfroyd
https://hg.mozilla.org/mozilla-central/rev/ef231f04c4ec
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Noticed these perf improvements on OS X! \o/

== Change summary for alert #13046 (as of Fri, 04 May 2018 04:54:14 GMT) ==

Improvements:

 14%  tp6_facebook osx-10-10 opt e10s stylo     366.46 -> 316.62
  2%  tp6_google osx-10-10 opt e10s stylo       862.10 -> 843.79
  1%  tp6_amazon osx-10-10 opt e10s stylo       626.10 -> 618.79

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=13046