threadsafe refcounting uses stronger memory synchronization than is needed

RESOLVED FIXED in Firefox 55

Status

()

Core
XPCOM
RESOLVED FIXED
2 years ago
10 months ago

People

(Reporter: dbaron, Assigned: dbaron)

Tracking

(Blocks: 1 bug, {perf})

Trunk
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox49 affected, firefox55 fixed)

Details

Attachments

(1 attachment)

(Assignee)

Description

2 years ago
The threadsafe reference counting provided in XPCOM uses stronger memory synchronization than is needed.  In particular, it uses read-modify-write (RMW) operations with sequentially consistent memory synchronization for both incrementing and decrementing the reference count.  A comment indicates that we could switch to release/acquire semantics, but we should actually be able to go even further than that.

As explained in http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/usage_examples.html#boost_atomic.usage_examples.example_reference_counters (and I thought I had another link, but I can't find it), we should be able to use relaxed semantics when incrementing the reference count, release semantics when decrementing it to nonzero, and then acquire semantics when decrementing to zero.

This shouldn't make a difference on Intel processors, but it should make a difference on ARM.

We don't have a whole lot of performance benchmarks that run on ARM, but I pushed a patch to try to get some numbers, and did some retriggers of the autophone-s1s2 test, which does performance measurements of load time (throbber start to throbber end, I think) of cached copies of twitter, the new york times, and a blank page.  My current results are at:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=864cdd00360c&newProject=try&newRevision=f8126d991e02&framework=3&showOnlyImportant=0
which seem to be showing a 2.5%-4% performance improvement on those benchmarks on armv8 CPUs (although no significant change so far on ARMv7).

The patch I pushed, which seems to compile on try on platforms other than Mac, is:
https://hg.mozilla.org/users/dbaron_mozilla.com/patches/raw-file/6917fd1a20d0/threadsafe-reference-counting-min-synchronization
It would probably need a bit more work in order to be landable.

The other question is whether we consider this worth doing for that amount of performance gain; it has some risk that other code may be depending on synchronization that this patch will remove.
(Assignee)

Updated

2 years ago
Blocks: 1277711
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) (vacation June 4-12) from comment #0)
> As explained in
> http://www.chaoticmind.net/~hcb/projects/boost.atomic/doc/atomic/
> usage_examples.html#boost_atomic.usage_examples.example_reference_counters
> (and I thought I had another link, but I can't find it), we should be able
> to use relaxed semantics when incrementing the reference count, release
> semantics when decrementing it to nonzero, and then acquire semantics when
> decrementing to zero.

One concern about this is that it relies on RMW operations always being used to provide the necessary synchronization, and we don't always use RMW operations, e.g.:

http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.h#346

I don't know how widely used those member functions are, though; maybe it's just a matter of deleting them.
(Assignee)

Comment 2

2 years ago
In the patch, I made those be acquire and release operations, though I agree I'd want to investigate the uses further before landing.
NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.

No need for a response - I would just CC, but I'm guessing that might get lost in the PTO backlog.
Flags: needinfo?(n.nethercote)
(In reply to Bobby Holley (busy) from comment #3)
> NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.

How many devices actually have armv8 CPUs, though?  That benchmark only showed a win for those chips.

If the same patch showed a win on x86 windows, that's a much different story.
(In reply to Nathan Froyd [:froydnj] from comment #4)
> (In reply to Bobby Holley (busy) from comment #3)
> > NI njn to get it on the LLTT radar - 2.5%-4% pageload win is pretty big.
> 
> How many devices actually have armv8 CPUs, though?  That benchmark only
> showed a win for those chips.
> 
> If the same patch showed a win on x86 windows, that's a much different story.

Good point - I don't know.
Though even if they're low market share, we do get disproportionate benefits from being fast on high-end devices, since that's often what the press and other influencers use.
Thanks for letting me know about this.
Flags: needinfo?(n.nethercote)
(In reply to Bobby Holley (busy) from comment #6)
> Though even if they're low market share, we do get disproportionate benefits
> from being fast on high-end devices, since that's often what the press and
> other influencers use.

This is a good point.
(Assignee)

Comment 9

a year ago
So since I last worked on this, a Mac OS X compiler upgrade means that it now compiles on all of our platforms:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f69d7878b2175c3de06a2fab791ba72f42e835f8&group_state=expanded
so I'm inclined to move forward on landing it.
(Assignee)

Comment 10

a year ago
(The last substantive changes to this code appear to have been in bug 884061.)
(Assignee)

Comment 11

a year ago
Created attachment 8853775 [details] [diff] [review]
Make threadsafe reference counting use the minimum memory sychronization needed

This uses std::atomic rather than mozilla::Atomic since mozilla::Atomic
does not support using different memory synchronization for different
atomic operations on the same variable.

The added comments could use careful review since, while they reflect my
understanding of the issue, I don't consider myself an expert on the
topic.

MozReview-Commit-ID: 7xByCXt17Dr
Attachment #8853775 - Flags: review?(nfroyd)
Comment on attachment 8853775 [details] [diff] [review]
Make threadsafe reference counting use the minimum memory sychronization needed

Review of attachment 8853775 [details] [diff] [review]:
-----------------------------------------------------------------

r=me; the comments below are just me thinking out loud as I reviewed secondary sources for help.

Can you file a followup bug to apply this optimization to nsStringBuffer, either manually repeating the code here, or making it use ThreadSafeAutoRefCnt (preferred)?  That should result in about as much of a win as this patch.

Also, does this count as multi-threaded code for the purposes of the sign above your desk? :)

::: xpcom/base/nsISupportsImpl.h
@@ +330,5 @@
> +      // We're going to destroy the object on this thread, so we need
> +      // acquire semantics to synchronize with the memory released by
> +      // the last release on other threads, that is, to ensure that
> +      // writes prior to that release are now visible on this thread.
> +      result = mValue.load(std::memory_order_acquire);

Boost suggests that you can get by with the equivalent of:

  atomic_thread_fence(std::memory_order_acquire);

here.  But I'm not sure it makes much difference in practice, given that your load is likely to hit in cache anyway.  Keeping fences out of the picture is probably good practice anyway.

Herb Sutter's "atomic<> Weapons" presentation suggests that you could do:

  nsrefcnt result = mValue.fetch_sub(1, std::memory_order_acq_rel);

which would be slightly less efficient, since you need the acquire ordering on every operation, rather than just the one that will result in the destruction of the object.
Attachment #8853775 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 13

a year ago
(In reply to Nathan Froyd [:froydnj] from comment #12)
> Can you file a followup bug to apply this optimization to nsStringBuffer,
> either manually repeating the code here, or making it use
> ThreadSafeAutoRefCnt (preferred)?  That should result in about as much of a
> win as this patch.

I filed bug 1353181.  (It seemed faster to just repeat the code than to refactor ThreadSafeAutoRefCnt to be templatized over the underlying reference count type.  I was also a little worried about how to measure codesize since we stopped running our automated tests of it.)
(Assignee)

Comment 14

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/a52e75fdda073ca38e2a88b91ad7070c4138d702
Bug 1277709 - Make threadsafe reference counting use the minimum memory sychronization needed.  r=froydnj
backed this out in https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2ba7fac730edb476a51a732011872e95d34d82c8&filter-searchStr=OS+X+10.10+debug+XPCShell+(X) for failing xpcshell tests like https://treeherder.mozilla.org/logviewer.html#?job_id=88513551&repo=mozilla-inbound&lineNumber=7989 even after backout of the other patches.

to keep the tree clear backing out this changes
Flags: needinfo?(dbaron)
(Assignee)

Updated

a year ago
Flags: needinfo?(dbaron)
(Assignee)

Comment 17

a year ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/24f889bd554967c3008fd325c36903b5651188af
Bug 1277709 - Make threadsafe reference counting use the minimum memory sychronization needed.  r=froydnj

Comment 18

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/24f889bd5549
Status: NEW → RESOLVED
Last Resolved: a year ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Depends on: 1397052
You need to log in before you can comment on or make changes to this bug.