Closed Bug 1277709 Opened 4 years ago Closed 3 years ago

threadsafe refcounting uses stronger memory synchronization than is needed

Categories

(Core :: XPCOM, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox49 --- affected
firefox55 --- fixed

People

(Reporter: dbaron, Assigned: dbaron)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(1 file)

The threadsafe reference counting provided in XPCOM uses stronger memory synchronization than is needed.  In particular, it uses read-modify-write (RMW) operations with sequentially consistent memory synchronization for both incrementing and decrementing the reference count.  A comment indicates that we could switch to release/acquire semantics, but we should actually be able to go even further than that.

As explained in http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters (and I thought I had another link, but I can't find it), we should be able to use relaxed semantics when incrementing the reference count, release semantics when decrementing it to nonzero, and then acquire semantics when decrementing to zero.

This shouldn't make a difference on Intel processors, but it should make a difference on ARM.

We don't have a whole lot of performance benchmarks that run on ARM, but I pushed a patch to try to get some numbers, and did some retriggers of the autophone-s1s2 test, which does performance measurements of load time (throbber start to throbber end, I think) of cached copies of twitter, the new york times, and a blank page.  My current results are at:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=864cdd00360c&newProject=try&newRevision=f8126d991e02&framework=3&showOnlyImportant=0
which seem to be showing a 2.5%-4% performance improvement on those benchmarks on armv8 CPUs (although no significant change so far on ARMv7).

The patch I pushed, which seems to compile on try on platforms other than Mac, is:
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/6917fd1a20d0/threadsafe-reference-counting-min-synchronization
It would probably need a bit more work in order to be landable.

The other question is whether we consider this worth doing for that amount of performance gain; it has some risk that other code may be depending on synchronization that this patch will remove.
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) (vacation June 4-12) from comment #0)
> As explained in
> http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/
> usage_examples.html#boost_atomic.usage_examples.example_reference_counters
> (and I thought I had another link, but I can't find it), we should be able
> to use relaxed semantics when incrementing the reference count, release
> semantics when decrementing it to nonzero, and then acquire semantics when
> decrementing to zero.

One concern about this is that it relies on RMW operations always being used to provide the necessary synchronization, and we don't always use RMW operations, e.g.:

http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.h#346

I don't know how widely used those member functions are, though; maybe it's just a matter of deleting them.
In the patch, I made those be acquire and release operations, though I agree I'd want to investigate the uses further before landing.
NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.

No need for a response - I would just CC, but I'm guessing that might get lost in the PTO backlog.
Flags: needinfo?(n.nethercote)
(In reply to Bobby Holley (busy) from comment #3)
> NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.

How many devices actually have armv8 CPUs, though?  That benchmark only showed a win for those chips.

If the same patch showed a win on x86 windows, that's a much different story.
(In reply to Nathan Froyd [:froydnj] from comment #4)
> (In reply to Bobby Holley (busy) from comment #3)
> > NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.
> 
> How many devices actually have armv8 CPUs, though?  That benchmark only
> showed a win for those chips.
> 
> If the same patch showed a win on x86 windows, that's a much different story.

Good point - I don't know.
Though even if they're low market share, we do get disproportionate benefits from being fast on high-end devices, since that's often what the press and other influencers use.
Thanks for letting me know about this.
Flags: needinfo?(n.nethercote)
(In reply to Bobby Holley (busy) from comment #6)
> Though even if they're low market share, we do get disproportionate benefits
> from being fast on high-end devices, since that's often what the press and
> other influencers use.

This is a good point.
So since I last worked on this, a Mac OS X compiler upgrade means that it now compiles on all of our platforms:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f69d7878b2175c3de06a2fab791ba72f42e835f8&group_state=expanded
so I'm inclined to move forward on landing it.
(The last substantive changes to this code appear to have been in bug 884061.)
This uses std::atomic rather than mozilla::Atomic since mozilla::Atomic
does not support using different memory synchronization for different
atomic operations on the same variable.

The added comments could use careful review since, while they reflect my
understanding of the issue, I don't consider myself an expert on the
topic.

MozReview-Commit-ID: 7xByCXt17Dr
Attachment #8853775 - Flags: review?(nfroyd)
Comment on attachment 8853775 [details] [diff] [review]
Make threadsafe reference counting use the minimum memory sychronization needed

Review of attachment 8853775 [details] [diff] [review]:
-----------------------------------------------------------------

r=me; the comments below are just me thinking out loud as I reviewed secondary sources for help.

Can you file a followup bug to apply this optimization to nsStringBuffer, either manually repeating the code here, or making it use ThreadSafeAutoRefCnt (preferred)?  That should result in about as much of a win as this patch.

Also, does this count as multi-threaded code for the purposes of the sign above your desk? :)

::: xpcom/base/nsISupportsImpl.h
@@ +330,5 @@
> +      // We're going to destroy the object on this thread, so we need
> +      // acquire semantics to synchronize with the memory released by
> +      // the last release on other threads, that is, to ensure that
> +      // writes prior to that release are now visible on this thread.
> +      result = mValue.load(std::memory_order_acquire);

Boost suggests that you can get by with the equivalent of:

  atomic_thread_fence(std::memory_order_acquire);

here.  But I'm not sure it makes much difference in practice, given that your load is likely to hit in cache anyway.  Keeping fences out of the picture is probably good practice anyway.

Herb Sutter's "atomic<> Weapons" presentation suggests that you could do:

  nsrefcnt result = mValue.fetch_sub(1, std::memory_order_acq_rel);

which would be slightly less efficient, since you need the acquire ordering on every operation, rather than just the one that will result in the destruction of the object.
Attachment #8853775 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #12)
> Can you file a followup bug to apply this optimization to nsStringBuffer,
> either manually repeating the code here, or making it use
> ThreadSafeAutoRefCnt (preferred)?  That should result in about as much of a
> win as this patch.

I filed bug 1353181.  (It seemed faster to just repeat the code than to refactor ThreadSafeAutoRefCnt to be templatized over the underlying reference count type.  I was also a little worried about how to measure codesize since we stopped running our automated tests of it.)
https://hg.mozilla.org/integration/mozilla-inbound/rev/a52e75fdda073ca38e2a88b91ad7070c4138d702
Bug 1277709 - Make threadsafe reference counting use the minimum memory sychronization needed.  r=froydnj
Flags: needinfo?(dbaron)
https://hg.mozilla.org/integration/mozilla-inbound/rev/24f889bd554967c3008fd325c36903b5651188af
Bug 1277709 - Make threadsafe reference counting use the minimum memory sychronization needed.  r=froydnj
https://hg.mozilla.org/mozilla-central/rev/24f889bd5549
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1397052
You need to log in before you can comment on or make changes to this bug.