Closed Bug 1397052 Opened 7 years ago Closed 7 years ago

stylo: Long hang in GCAtomTableLocked

Categories

(Core :: CSS Parsing and Computation, enhancement, P2)


Tracking


RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: xidorn, Assigned: bholley)


Details

(Keywords: hang, Whiteboard: [stylo])

Attachments

(3 files)

I've been experiencing some significant long hangs in the content process recently, and I managed to capture a profile once, which showed that we spend several seconds inside GCAtomTableLocked while some Rust code was releasing an atom. Stylo has significantly increased the use of atoms (to avoid string conversion costs between Rust and C++), so the atom table is expected to contain far more atoms, as well as more short-lived atoms. This may contribute to a much longer atom table GC than before. Unfortunately I lost the profile I got (after I pressed the Share button and copied the URL, I closed the tab, and the URL doesn't seem to work at all). I will try to capture another one the next time I see this kind of hang.
Also note that I'm on Beta 56 with the Stylo experiment turned on. It doesn't seem to me that this is something we've changed between Beta and Trunk, though.
Priority: -- → P2
Eventually I got another profile, which shows a much shorter hang than the previous one: https://perfht.ml/2xMQg2d

GCAtomTableLocked isn't triggered from Rust code in this profile, but I still suspect that the wide use of atoms may be one cause of this issue. It is weird that we keep running into GCAtomTableLocked. Do we really release so many atoms sometimes? If so, we probably want to trigger an atom table GC only at particular points (e.g. after a GC) rather than having it triggered more or less randomly whenever we release an atom. If we should not be releasing that many atoms, we need to investigate why it happens.
Summary: stylo: Long hang in GCAtomTableLocked → Long hang in GCAtomTableLocked
Whiteboard: [stylo]
And it seems this really is an issue; the majority of these hangs have Servo code on the stack: https://squarewave.github.io/?durationSpec=2048_65536&payloadID=131360179ef146ab85e01052d16ebe89&search=GCAtomTable&thread=0
Summary: Long hang in GCAtomTableLocked → stylo: Long hang in GCAtomTableLocked
It doesn't seem to me that we should be releasing that many atoms. The profile in comment 2 shows that almost all of the time is spent just in GCAtomTable; nothing in it looks like it is releasing tons of atoms, and not much actual atom destruction shows up either. So it feels more likely that we are messing up gUnusedAtomCount somehow. Since it is an unsigned integer, there is a chance that we mess it up by wrapping it below zero (into a huge value) as an intermediate state in some cases, but that seems unlikely to happen when we have only been on the main thread :/
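As a minimal standalone illustration of that wrap-around (not Gecko code; std::atomic<uint32_t> here stands in for the Atomic<uint32_t, ReleaseAcquire> backing gUnusedAtomCount in nsAtomTable.cpp), a single unmatched decrement is enough to turn the counter into the huge value seen in the debugger:

#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
  // Stand-in for gUnusedAtomCount.
  std::atomic<uint32_t> unusedAtomCount{0};

  --unusedAtomCount;  // one decrement that was never matched by an increment

  // Prints 4294967295, i.e. what -1 looks like as an unsigned 32-bit value.
  std::cout << unusedAtomCount.load() << std::endl;
  return 0;
}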
Depends on: 1397130
And yeah, I think bug 1388632 could be related. Given that we are less parallel in Nightly, I would probably expect this to happen less there as well... We can see whether the hang reports stay low in the coming days.
(In reply to Xidorn Quan [:xidorn] UTC+10 from comment #6)
> And yeah, I think bug 1388632 could be related. Given that we are less
> parallel in Nightly, I would probably expect this to happen less there as
> well... We can see whether the hang reports stay low in the coming days.

Hm, "doing less parallelism means we hit the race condition less" doesn't make me sleep much easier. I really hope we can figure out why gUnusedAtomCount is getting messed up...
I successfully attached MSVC to the content process while it was hanging (it hung for long enough this time that I had time to attach). As expected, this is integer overflow: gUnusedAtomCount shows 4294967295, which is -1 if you treat the variable as signed, at the point where DynamicAtom::Release calls into GCAtomTable. (This is on Beta, so before the merge of StaticAtom and DynamicAtom.)
And mysteriously, no other thread is operating on atoms at all at that moment. So the fact that gUnusedAtomCount is still -1 after we execute ++ on it in DynamicAtom::Release means it must already have accumulated an incorrect decrement three times before this happened.
Attached patch: patch for test (Splinter Review)
And yes, the implementation of gUnusedAtomCount is not thread-safe. This is a test patch modified from the one I wrote for debugging bug 1386158. It addrefs/releases a single atom lots of times concurrently, and prints the value of gUnusedAtomCount after all threads finish. Local runs give me results like 4294967290 and 4294967295, which clearly show the issue. And apparently, this also triggers bug 1389826.
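For reference, here is a standalone model of what the test exercises. This is not the attached gtest patch; the names and threading are illustrative, and the simulated Release mirrors the real ThreadSafeAutoRefCnt decrement discussed in the next comments:

#include <atomic>
#include <cstdint>
#include <iostream>
#include <thread>
#include <vector>

// One shared "atom": its refcount and the global unused-atom counter.
std::atomic<uint32_t> gRefCnt{0};
std::atomic<uint32_t> gUnusedAtomCount{1};  // the single atom starts out unused

void AddRefAtom() {
  // Models the AddRef path: reviving an unused atom decrements the
  // unused-atom counter.
  if (gRefCnt.fetch_add(1, std::memory_order_relaxed) + 1 == 1) {
    --gUnusedAtomCount;
  }
}

void ReleaseAtom() {
  // Models the Release path on top of ThreadSafeAutoRefCnt's operator--:
  // a release fetch_sub followed by an acquire re-load whose value
  // overwrites |result|.
  uint32_t result = gRefCnt.fetch_sub(1, std::memory_order_release) - 1;
  if (result == 0) {
    result = gRefCnt.load(std::memory_order_acquire);
  }
  if (result == 0) {
    ++gUnusedAtomCount;  // the atom became unused again
  }
}

int main() {
  std::vector<std::thread> threads;
  for (int t = 0; t < 8; ++t) {
    threads.emplace_back([] {
      for (int i = 0; i < 100000; ++i) {
        AddRefAtom();
        ReleaseAtom();
      }
    });
  }
  for (auto& t : threads) {
    t.join();
  }
  // With the re-load in ReleaseAtom() this intermittently prints huge values
  // such as 4294967295 instead of 1; without it, it is always 1.
  std::cout << "gUnusedAtomCount = " << gUnusedAtomCount.load() << std::endl;
  return 0;
}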
So the following code is not balanced:

> if (++gRefCnt == 1) {
>   --gCount;
> }
> if (--gRefCnt == 0) {
>   ++gCount;
> }

with gRefCnt being a ThreadSafeAutoRefCnt and gCount being an Atomic<uint32_t, ReleaseAcquire>. Under concurrent access it can leave gCount with a non-zero result, but gRefCnt is correctly reset to zero at the end, so we are probably safe from leaks / use-after-free.
So, the problem here is actually in ThreadSafeAutoRefCnt. Its operator-- looks like this: [1]

> nsrefcnt result = mValue.fetch_sub(1, std::memory_order_release) - 1;
> if (result == 0) {
>   result = mValue.load(std::memory_order_acquire);
> }
> return result;

This is troublesome for atoms, because the reference count can change between the fetch_sub() and the load(), which means that --gRefCnt is not guaranteed to return 0 even when it really is releasing the last reference. I'm not familiar enough with atomic operations to see why it needs to be written this way, but I guess we could fix this by using Atomic<nsrefcnt, Relaxed> instead of ThreadSafeAutoRefCnt for Atom's mRefCnt. froydnj, what do you think?

Also, this behavior of ThreadSafeAutoRefCnt could be confusing. I'm not sure whether any other object does something similar to the atom table, where a cache holds objects unconditionally and can revive them after their refcount drops to zero. If a similar pattern exists elsewhere, we should probably audit those usages as well.

[1] https://searchfox.org/mozilla-central/rev/4d8e389498a08668cce9ebf6232cc96be178c3e4/xpcom/base/nsISupportsImpl.h#348-363
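Concretely, the interleaving that breaks the bookkeeping looks like this (a worked example: thread A holds the only reference, thread B re-acquires the atom from the table at just the wrong moment):

  A: Release(): fetch_sub() takes the refcount from 1 to 0, result = 0
  B: AddRef():  fetch_add() takes the refcount from 0 to 1; ++gRefCnt == 1,
                so B does --gUnusedAtomCount
  A: Release(): the acquire load() now reads 1 and overwrites result, so
                result != 0 and A skips ++gUnusedAtomCount

A's drop to zero is never matched by an increment of gUnusedAtomCount, while B's AddRef still decremented it, so every time this race is lost the counter drifts one further below its true value and eventually wraps.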
Blocks: 1277709
Flags: needinfo?(nfroyd)
Ahah! Thanks for tracking that down. The current Release code assumes that refcounted objects will never be brought back from the dead, which breaks for our Atom setup. We still need the barrier, but it seems to me that we shouldn't assign the value of the load into |result|. This matches what Rust's std::Arc does (it just uses a fence, and doesn't do a load). Nathan, does this sound right to you?
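A rough, simplified sketch of that change (standalone, with nsrefcnt and the class name as stand-ins for the real types): keep the acquire operation purely for its barrier effect, but stop assigning the loaded value into |result|.

#include <atomic>
#include <cstdint>

using nsrefcnt = uint64_t;  // stand-in for the real typedef

class ThreadSafeRefCnt {
 public:
  nsrefcnt operator--() {
    nsrefcnt result = mValue.fetch_sub(1, std::memory_order_release) - 1;
    if (result == 0) {
      // Acquire barrier only: still synchronize with other threads'
      // releases, but deliberately ignore the loaded value so that callers
      // (e.g. the atom table) see 0 from the release that actually took the
      // count to zero.
      mValue.load(std::memory_order_acquire);
    }
    return result;
  }

 private:
  std::atomic<nsrefcnt> mValue{1};
};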
MozReview-Commit-ID: 1Yp4kFXA5gV
Attachment #8905154 - Flags: review?(nfroyd)
Assignee: nobody → bobbyholley
xidorn, can you verify that this patch fixes your testcase?
Flags: needinfo?(nfroyd) → needinfo?(xidorn+moz)
Blocks: 1388632
Comment on attachment 8905154 [details] [diff] [review]
Don't update |result| after the acquire fence in nsThreadSafeRefCnt::operator--. v1

Review of attachment 8905154 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/base/nsISupportsImpl.h
@@ +362,5 @@
> +    // that the consumer uses a GC model and allows re-acquiring strong
> +    // references for objects whose reference count dropped to zero (e.g.
> +    // the Atom table). We leave |result| with its original value from
> +    // the fetch_sub so that consumers can use it to balance counters.
> +    mValue.load(std::memory_order_acquire);

Thanks for the explanation here. This is OK, but if you really wanted to micro-optimize, you'd use

  std::atomic_thread_fence(std::memory_order_acquire);

instead, and avoid the load entirely.
Attachment #8905154 - Flags: review?(nfroyd) → review+
(In reply to Nathan Froyd [:froydnj] from comment #16)
> Comment on attachment 8905154 [details] [diff] [review]
> Don't update |result| after the acquire fence in
> nsThreadSafeRefCnt::operator--. v1
>
> Review of attachment 8905154 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: xpcom/base/nsISupportsImpl.h
> @@ +362,5 @@
> > +    // that the consumer uses a GC model and allows re-acquiring strong
> > +    // references for objects whose reference count dropped to zero (e.g.
> > +    // the Atom table). We leave |result| with its original value from
> > +    // the fetch_sub so that consumers can use it to balance counters.
> > +    mValue.load(std::memory_order_acquire);
>
> Thanks for the explanation here. This is OK, but if you really wanted to
> micro-optimize, you'd use
>
>   std::atomic_thread_fence(std::memory_order_acquire);
>
> instead, and avoid the load entirely.

Good point. I guess that also lets me get rid of the comment.
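With that micro-optimization, the decrement sketched earlier would look roughly like this (same stand-in class as above; requires <atomic>):

  nsrefcnt operator--() {
    nsrefcnt result = mValue.fetch_sub(1, std::memory_order_release) - 1;
    if (result == 0) {
      // Same acquire semantics as the discarded load, without touching mValue.
      std::atomic_thread_fence(std::memory_order_acquire);
    }
    return result;
  }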
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/80e9cd4126b4
Don't update |result| after the acquire fence in nsThreadSafeRefCnt::operator--. r=froydnj
Steve, any idea why this 1-line change added a hazard? Should I just bump the threshold?
Flags: needinfo?(bobbyholley) → needinfo?(sphink)
Oh I see. Just the external function usage.
Flags: needinfo?(sphink)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #15)
> xidorn, can you verify that this patch fixes your testcase?

I've already tested that, as long as we don't assign the result again, everything works fine, so I believe it will fix it.
Flags: needinfo?(xidorn+moz)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=07f3cfe898eb4d9acd666c1392e77b73906d25b7

You should add __atomic_thread_fence, I guess.
Attachment 8905340 [details] contains two tests checking that concurrent access works as expected. They currently fail (unfortunately, only intermittently), and they are expected to pass once the patch from bholley is applied.
Pushed by bholley@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d65d668da45f
Don't update |result| after the acquire fence in nsThreadSafeRefCnt::operator--. r=froydnj
NI xidorn to make sure the gtest passes with the patch in this bug and land it.
Flags: needinfo?(xidorn+moz)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Comment on attachment 8905340 [details]
Bug 1397052 - Add gtest for concurrent accessing.

https://reviewboard.mozilla.org/r/177122/#review182372

Thank you!

::: xpcom/ds/nsAtomTable.cpp:813
(Diff revision 1)
> +uint32_t
> +NS_GetUnusedAtomCount(void)

Man, I wish there was a way to limit this to gtest-only code.
Attachment #8905340 - Flags: review?(nfroyd) → review+
Comment on attachment 8905340 [details]
Bug 1397052 - Add gtest for concurrent accessing.

https://reviewboard.mozilla.org/r/177122/#review182372

> Man, I wish there was a way to limit this to gtest-only code.

Yeah. The `NS_GetNumberOfAtoms` above is also used only in gtest.
Pushed by xquan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/44fb9477f42f
Add gtest for concurrent accessing. r=froydnj
Blocks: 1389826