Closed Bug 1457882 Opened 6 years ago Closed 6 years ago

Mutex performance is poor on OSX when heavily contended

Categories: (Core :: mozglue, defect)
Severity: normal
Status: RESOLVED FIXED
Target Milestone: mozilla61
Tracking Status: firefox61 --- fixed
People: (Reporter: jonco, Assigned: jonco)
Attachments: (2 files, 1 obsolete file)

As mentioned in bug 1353787, I noticed that mutex performance under contention is really bad on OSX.  The benchmark I'm using is a synthetic JS parsing test which results in multiple threads spending most of their time trying to acquire a single mutex.

Results from an MBP with 8 hyperthreads (first column is the number of threads; "cofv" is the coefficient of variation):

    1 0.892 0.852 0.844 0.858 0.851 0.856 0.855 0.847 0.849 0.854 mean: 0.856 cofv: 1.49%
    2 3.33 3.38 3.35 3.33 3.22 3.20 3.23 3.25 3.21 3.22 mean: 3.27 cofv: 1.93%
    3 3.71 3.69 3.71 3.67 3.63 3.68 3.64 3.61 3.68 3.65 mean: 3.67 cofv: 0.916%
    4 4.10 4.10 4.08 4.05 4.07 4.05 4.05 4.04 4.05 4.07 mean: 4.06 cofv: 0.473%
    5 4.11 4.11 4.08 4.07 4.05 4.04 4.06 4.03 4.04 4.04 mean: 4.06 cofv: 0.629%
    6 4.12 4.11 4.09 4.07 4.05 4.03 4.05 4.05 4.06 4.06 mean: 4.07 cofv: 0.679%
    7 4.11 4.13 4.09 4.07 4.06 4.08 4.07 4.03 4.04 4.06 mean: 4.07 cofv: 0.704%
    8 4.09 4.09 4.06 4.08 4.05 4.05 4.04 4.05 4.03 4.05 mean: 4.06 cofv: 0.515%
    9 4.16 4.16 4.12 4.11 4.10 4.09 4.08 4.11 4.11 4.09 mean: 4.11 cofv: 0.606%
    10 4.12 4.09 4.05 4.07 4.06 4.02 4.04 4.04 4.02 4.06 mean: 4.06 cofv: 0.700%

For comparison, running the same benchmark on a Linux machine (also 8 hyperthreads):

    1 1.05 1.09 0.994 0.906 1.02 1.01 0.905 0.909 0.933 1.03 mean: 0.984 cofv: 6.41%
    2 0.738 0.858 0.841 0.815 0.802 0.819 0.790 0.777 0.674 0.805 mean: 0.792 cofv: 6.36%
    3 0.745 0.794 0.769 0.762 0.767 0.769 0.655 0.722 0.783 0.762 mean: 0.753 cofv: 4.98%
    4 0.743 0.764 0.730 0.733 0.732 0.775 0.734 0.754 0.755 0.735 mean: 0.746 cofv: 1.99%
    5 0.783 0.790 0.774 0.777 0.782 0.766 0.751 0.783 0.773 0.769 mean: 0.775 cofv: 1.36%
    6 0.803 0.816 0.790 0.785 0.796 0.781 0.772 0.781 0.802 0.783 mean: 0.791 cofv: 1.59%
    7 0.823 0.834 0.804 0.817 0.805 0.811 0.796 0.815 0.806 0.818 mean: 0.813 cofv: 1.27%
    8 0.849 0.865 0.849 0.854 0.858 0.849 0.845 0.855 0.863 0.862 mean: 0.855 cofv: 0.768%
    9 0.894 0.908 0.875 0.885 0.889 0.882 0.873 0.883 0.878 0.860 mean: 0.883 cofv: 1.39%
    10 0.913 0.934 0.909 0.909 0.921 0.915 0.914 0.917 0.928 0.919 mean: 0.918 cofv: 0.829%

On OSX the time taken to run the benchmark increases by as much as a factor of 4.8 when adding more threads, whereas adding threads on Linux always decreases the runtime, at worst by only 7%.
Attached patch benchmark — Splinter Review
The benchmark.  It can be run with js/src/benchmark.sh.
Attached patch bug1457882-macos-adaptive-lock (obsolete) — Splinter Review
I saw that on Linux (and FreeBSD) we use the PTHREAD_MUTEX_ADAPTIVE_NP option to get an adaptive mutex.  We can try to emulate that feature ourselves on OSX.
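
For context, this is roughly how the adaptive type is requested on Linux (a sketch only; the helper name is mine, and PTHREAD_MUTEX_ADAPTIVE_NP is a glibc extension that OSX doesn't provide):

    #include <pthread.h>

    // Ask glibc for an adaptive mutex: it spins briefly on contention
    // before sleeping in the kernel.
    static void InitAdaptiveMutex(pthread_mutex_t* aMutex)
    {
      pthread_mutexattr_t attr;
      pthread_mutexattr_init(&attr);
    #ifdef PTHREAD_MUTEX_ADAPTIVE_NP
      pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ADAPTIVE_NP);
    #endif
      pthread_mutex_init(aMutex, &attr);
      pthread_mutexattr_destroy(&attr);
    }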

There's a description of the rationale behind this and how it works here: https://stackoverflow.com/a/25168942

It's implemented here in glibc inside __pthread_mutex_lock: https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_mutex_lock.c;hb=HEAD#l118

This patch is a pretty straightforward reimplementation of the same algorithm using the same constants.
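
For reference, here's a minimal sketch of that algorithm (illustrative names, x86-only pause hint; not the patch itself): spin at most min(100, 2 * average + 10) times before blocking, then fold the observed spin count into a running average.

    #include <pthread.h>
    #include <algorithm>
    #include <atomic>
    #include <cstdint>

    class AdaptiveMutex
    {
      pthread_mutex_t mMutex = PTHREAD_MUTEX_INITIALIZER;
      // Running average of spins needed to take the lock.  Relaxed is fine:
      // it's only a heuristic, so races cost performance, not correctness.
      std::atomic<int32_t> mAverageSpins{0};

    public:
      void lock()
      {
        if (pthread_mutex_trylock(&mMutex) == 0) {
          return;  // Uncontended fast path.
        }

        const int32_t kMaxSpinCount = 100;  // glibc's limit
        int32_t average = mAverageSpins.load(std::memory_order_relaxed);
        int32_t maxSpins = std::min(kMaxSpinCount, average * 2 + 10);
        int32_t count = 0;
        do {
          if (count >= maxSpins) {
            pthread_mutex_lock(&mMutex);  // Give up and block in the kernel.
            break;
          }
          asm volatile("rep; nop");  // PAUSE: hint that we're spin-waiting.
          count++;
        } while (pthread_mutex_trylock(&mMutex) != 0);

        // Move the average an eighth of the way towards the observed count.
        mAverageSpins.store(average + (count - average) / 8,
                            std::memory_order_relaxed);
      }

      void unlock() { pthread_mutex_unlock(&mMutex); }
    };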

Benchmark results with this change:

    1 0.851 0.850 0.844 0.848 0.834 0.831 0.835 0.838 0.851 0.842 mean: 0.842 cofv: 0.851%
    2 0.600 0.578 0.580 0.569 0.565 0.574 0.567 0.576 0.568 0.567 mean: 0.574 cofv: 1.71%
    3 0.525 0.496 0.487 0.489 0.484 0.487 0.487 0.486 0.487 0.491 mean: 0.492 cofv: 2.33%
    4 0.536 0.541 0.527 0.536 0.512 0.532 0.529 0.534 0.530 0.535 mean: 0.531 cofv: 1.41%
    5 0.672 0.623 0.604 0.604 0.638 0.636 0.601 0.628 0.619 0.596 mean: 0.622 cofv: 3.51%
    6 1.71 1.67 1.82 1.71 1.74 1.59 1.50 1.58 1.56 1.59 mean: 1.65 cofv: 5.66%
    7 3.40 3.62 3.56 3.35 3.63 3.50 3.36 3.40 3.51 3.45 mean: 3.48 cofv: 2.86%
    8 4.89 4.86 4.74 4.81 4.79 4.59 4.73 4.63 4.70 4.65 mean: 4.74 cofv: 1.97%
    9 5.09 5.08 5.04 5.03 5.02 5.03 5.04 5.05 5.06 5.05 mean: 5.05 cofv: 0.399%
    10 5.14 5.12 5.12 5.14 5.11 5.11 5.09 5.13 5.11 5.12 mean: 5.12 cofv: 0.302%

Performance is much better for up to 5 threads.  As the number of threads approaches the number of cores/hyperthreads, performance degrades to a little worse than before, but that is not the common case.

Talos testing suggests this might be a win on tp6_facebook (which does use parallel JS parsing), but only with medium confidence.  Try is otherwise green.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&newProject=try&newRevision=c492de7b26be14397f2fba9f2f821721f783fcd3&framework=1&filter=tp6&showOnlyComparable=1&selectedTimeRange=172800

This feels like quite a drastic thing to do, but it does look like it could be a decent performance win.
Comment on attachment 8972272 [details] [diff] [review]
bug1457882-macos-adaptive-lock

Review of attachment 8972272 [details] [diff] [review]:
-----------------------------------------------------------------

Kind of frightening, but if it improves OS X lock performance, I think that's a good thing.  Some comments below.

::: mozglue/misc/Mutex_posix.cpp
@@ +31,5 @@
>    }
>  
> +#ifdef XP_DARWIN
> +
> +static uint32_t sCPUCount = 0;

This should probably be Atomic<uint32_t, Relaxed>, since we'll be touching this from multiple threads, and the Relaxed-ness of this should be documented.
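
For instance (a sketch; the rationale wording is illustrative):

    #include "mozilla/Atomics.h"

    // Number of logical CPUs, used only to decide whether spinning is
    // worthwhile.  Relaxed ordering is fine: the value is written once at
    // initialization, and a stale read merely yields a slightly suboptimal
    // spin decision.
    static mozilla::Atomic<uint32_t, mozilla::MemoryOrdering::Relaxed> sCPUCount(0);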

@@ +138,5 @@
>  {
> +#ifndef XP_DARWIN
> +  mutexLock();
> +#else
> +  // Emulate glibc's PTHREAD_MUTEX_ADAPTIVE_NP on OSX.

This comment is helpful for explaining what's going on below, but it doesn't explain why we're doing this.  Can you add some explanatory text for that?

@@ +140,5 @@
> +  mutexLock();
> +#else
> +  // Emulate glibc's PTHREAD_MUTEX_ADAPTIVE_NP on OSX.
> +
> +  if (sCPUCount == 0) {

Is this condition ever going to be true, since sCPUCount is initialized in the constructor to something non-zero?

@@ +155,5 @@
> +      if (count >= maxSpins) {
> +        mutexLock();
> +        break;
> +      }
> +      asm("rep; nop"); // Hint that we're spinning.

I expected to see asm("pause") here, but looking at the Intel manuals suggests that the encoding of `pause` is identical to `rep; nop`.  I think it'd be a little clearer to use `pause`...but then I guess that could be read as a `sleep()` call or similar?  WDYT?
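
(For reference, both spellings assemble to the same bytes, F3 90, so either works; hypothetical helper names below:)

    // Identical machine code on x86: a spin-wait hint to the CPU.
    static inline void CpuRelax()    { asm volatile("pause"); }
    static inline void CpuRelaxAlt() { asm volatile("rep; nop"); }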

::: mozglue/misc/PlatformMutex.h
@@ +55,5 @@
>    static_assert(sizeof(pthread_mutex_t) / sizeof(void*) != 0 &&
>                  sizeof(pthread_mutex_t) % sizeof(void*) == 0,
>                  "pthread_mutex_t must have pointer alignment");
> +#ifdef XP_DARWIN
> +  mozilla::Atomic<int32_t, mozilla::MemoryOrdering::Relaxed> averageSpins;

Relaxed atomic variables should be documented with some sort of rationale as to why Relaxed-ness is OK.
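
One possible wording (a sketch, matching the member in the hunk above):

    #ifdef XP_DARWIN
      // Running average of the spin count needed to take the lock, used by
      // the adaptive-lock heuristic.  Relaxed ordering is OK because this
      // value only tunes spinning: a racy or stale read affects performance,
      // never correctness.
      mozilla::Atomic<int32_t, mozilla::MemoryOrdering::Relaxed> averageSpins;
    #endif
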
Summary: Mutex performance is poor on OSX when heavily contented → Mutex performance is poor on OSX when heavily contended
Thanks for the comments.  Here's an updated patch.

> > +  if (sCPUCount == 0) {
> 
> Is this condition ever going to be true

Nope, I meant to compare with 1.  Cheers!
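
The corrected guard looks roughly like this (using the helpers from the patch):

    if (sCPUCount == 1) {
      // On a single-CPU system the lock holder can't run while we spin,
      // so fall back to a blocking lock immediately.
      mutexLock();
      return;
    }
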
Attachment #8972272 - Attachment is obsolete: true
Attachment #8972367 - Flags: review?(nfroyd)
Comment on attachment 8972367 [details] [diff] [review]
bug1457882-macos-adaptive-lock v2

Review of attachment 8972367 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8972367 - Flags: review?(nfroyd) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ef231f04c4ec
Emulate glibc adaptive mutexes on OSX r=nfroyd
https://hg.mozilla.org/mozilla-central/rev/ef231f04c4ec
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Noticed these perf improvements on OS X! \o/

== Change summary for alert #13046 (as of Fri, 04 May 2018 04:54:14 GMT) ==

Improvements:

 14%  tp6_facebook osx-10-10 opt e10s stylo     366.46 -> 316.62
  2%  tp6_google osx-10-10 opt e10s stylo       862.10 -> 843.79
  1%  tp6_amazon osx-10-10 opt e10s stylo       626.10 -> 618.79

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=13046