Closed Bug 844769 Opened 13 years ago Closed 9 years ago

TSan: Thread data race in AutoTraceSession::AutoTraceSession(JSRuntime *rt, js::HeapState heapState) vs. JSRuntime::isHeapBusy()

Categories

(Core :: JavaScript: GC, defect)

Product:

Component:

Platform:

x86_64

Linux

Type:

defect

Priority:

Not set

Severity:

critical

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla51

Tracking Flags:

Tracking

Status

firefox51

---

fixed

People

(Reporter: posidron, Assigned: terrence)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tsan][tsan-test-blocker])

Attachments

(6 files, 3 obsolete files)

trace 13 years ago Christoph Diehl [:posidron] 15.46 KB, text/plain		Details
Shell trace 9 years ago Jan de Mooij [:jandem] 4.91 KB, text/plain		Details
use_perfect_forwarding_for_rekey-v0.diff 9 years ago Terrence Cole [:terrence] 6.95 KB, patch	jonco : feedback+	Details \| Diff \| Splinter Review
do_not_fire_read_barriers_omt-v0.diff 9 years ago Terrence Cole [:terrence] 19.21 KB, patch	jonco : review+	Details \| Diff \| Splinter Review
compartments-before.txt 9 years ago Terrence Cole [:terrence] 10.45 KB, text/plain		Details
compartments-after.txt 9 years ago Terrence Cole [:terrence] 11.28 KB, text/plain		Details
use_perfect_forwarding_for_rekey-v1.diff 9 years ago Terrence Cole [:terrence] 7.77 KB, patch		Details \| Diff \| Splinter Review
bug844769-forward-for-rekey v2 9 years ago Jon Coppeard (:jonco) 8.78 KB, patch	sfink : review+	Details \| Diff \| Splinter Review
bug844769-alternate 9 years ago Jon Coppeard (:jonco) 3.87 KB, patch	terrence : review+	Details \| Diff \| Splinter Review

Christoph Diehl [:posidron]

Reporter

Description

•

13 years ago

Attached file trace (obsolete) — Details

During Firefox start-up with ThreadSanitizer (LLVM version), we get a data race reported as described in the attached log. Trace was created on mozilla-central with changeset 122820:c233837cce08. According to the TSan devs, most of the reported traces should be real data races, even though they can be "benign". We need to determine if the race can/should be fixed, or put on the ignore list. Even for benign races, TSan devs suggest to fix them (second priority), as they can also cause problems [1]. [1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong

Jan de Mooij [:jandem]

Comment 1

•

9 years ago

This one still shows up on jit-tests. I'll attach a new trace.

Component: General → JavaScript: GC

Jan de Mooij [:jandem]

Comment 2

•

9 years ago

Attached file Shell trace — Details

The background parse thread is here: bool isHeapBusy() const { return heapState_ != JS::HeapState::Idle; } And the main thread writes to heapState_ in AutoTraceSession::AutoTraceSession.

Attachment #717812 - Attachment is obsolete: true

Flags: needinfo?(terrence)

Terrence Cole [:terrence]

Assignee

Comment 3

•

9 years ago

According to the GC analysis, the one path from defaultNewGroup to isHeapBusy is via: #739477 = js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #758165 = js::ObjectGroup* js::ObjectGroupCompartment::makeGroup(js::ExclusiveContext*, js::Class*, class JS::Handle<js::TaggedProto>, uint32) #753457 = js::ObjectGroup* js::Allocate(js::ExclusiveContext*) [with T = js::ObjectGroup; js::AllowGC allowGC = (js::AllowGC)1u] #752868 = uint8 js::gc::GCRuntime::checkAllocatorState(JSContext*, int32) [with js::AllowGC allowGC = (js::AllowGC)1u] #5293 = uint8 JS::shadow::Runtime::isHeapBusy() const

Flags: needinfo?(terrence)

Terrence Cole [:terrence]

Assignee

Comment 4

•

9 years ago

Note that that should not actually be possible: the background thread uses an ExclusiveContext, which should cause us to skip the checkAllocatorState [1]. This is enforced by the type system, so I'm not sure what's going on here. Maybe a debug build would give us a cleaner stack? 1- https://dxr.mozilla.org/mozilla-central/source/js/src/gc/Allocator.cpp#135-139

Jan de Mooij [:jandem]

Comment 5

•

9 years ago

(In reply to Terrence Cole [:terrence] from comment #4) > Maybe a debug build would give us a cleaner stack? Good point. It's in TenuredCell::readBarrier (did the analysis miss this path?): #0 JS::shadow::Runtime::isHeapBusy() const jspubtd.h:164 (js+0x000100d43722) #1 js::InternalBarrierMethods<js::ObjectGroup*>::readBarrier(js::ObjectGroup*) Heap.h:1279 (js+0x000100dd5e08) #2 js::ReadBarrieredBase<js::ObjectGroup*>::read() const Barrier.h:553 (js+0x000100dd5dbe) #3 js::ReadBarriered<js::ObjectGroup*>::get() const Barrier.h:608 (js+0x000100dd5d10) #4 js::ReadBarriered<js::ObjectGroup*>::operator js::ObjectGroup* const() const Barrier.h:620 (js+0x000100daeaf8) #5 js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class const*, js::TaggedProto, JSObject*) ObjectGroup.cpp:495 (js+0x000100dade09) #6 js::NewObjectWithClassProtoCommon(js::ExclusiveContext*, js::Class const*, JS::Handle<JSObject*>, js::gc::AllocKind, js::NewObjectKind) jsobj.cpp:795 (js+0x000100a4642e) #7 js::NewObjectWithClassProto(js::ExclusiveContext*, js::Class const*, JS::Handle<JSObject*>, js::gc::AllocKind, js::NewObjectKind) jsobjinlines.h:703 (js+0x000100017b52)

Flags: needinfo?(terrence)

Terrence Cole [:terrence]

Assignee

Comment 6

•

9 years ago

Does the traverse.py analysis only give us one path from A to B?

Flags: needinfo?(terrence) → needinfo?(sphink)

Terrence Cole [:terrence]

Assignee

Comment 7

•

9 years ago

I think the best way to address this is to finally make the on-main-thread assertion we've wanted to have forever. I think if we add a custom ReadBarriered::get that takes the ExclusiveContext and does the get() || unbarrieredGet() internally, this shouldn't be so cumbersome. We do need to go out of line for this, so hopefully it won't be a slowdown. If it is, we can always open code it.

Assignee: nobody → terrence

Terrence Cole [:terrence]

Assignee

Comment 8

•

9 years ago

Note that this is not a sec issue: we never fire the barrier from OMT so our correctness does not depend at all on the value we're reading.

Keywords: sec-want

Steve Fink [:sfink] [:s:]

Comment 9

•

9 years ago

(In reply to Terrence Cole [:terrence] from comment #6) > Does the traverse.py analysis only give us one path from A to B? If that's what you ask for. You can also use "routes from <X> to <Y>" and it'll give you multiple routes, making some attempt to make the routes look different. In this case, it still only displayed that one route, which means it is all it could find. As for *why* it didn't see any other paths, it looks like the callgraph.txt I had lying around is too old. Sorry about that. But if I update, I still don't get the above path, which seems like it's a bug in my multiple route processing, since if I trace it through one link at a time it now has the full path. (It isn't terribly useful to print out *all* paths, since for most cases there are an exponential number of distinct ones. So some cleverness is required. Apparently, more than I have implemented at the moment.) But I can fake it. (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup Path from #1125956 to #3050: avoiding [1180999] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1099478 = _ZN2js11ObjectGroup12setTypeDescrEPNS_9TypeDescrE$void js::ObjectGroup::setTypeDescr(js::TypeDescr*) #1099465 = _ZN2js11ObjectGroup11setAddendumENS0_12AddendumKindEPvb$void js::ObjectGroup::setAddendum(uint32, void*, uint8) #1186658 = _ZN2js34PreliminaryObjectArrayWithTemplate15writeBarrierPreEPS0_$void js::PreliminaryObjectArrayWithTemplate::writeBarrierPre(js::PreliminaryObjectArrayWithTemplate*) #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr Path from #1125956 to #3050: avoiding [1180999, 1099478] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1186716 = _ZN2js14MallocProviderINS_16ExclusiveContextEE4new_INS_22ObjectGroupCompartment8NewTableEIPN2JS4ZoneEEEEPT_DpOT0_$T* js::MallocProvider<Client>::new_(Args&& ...) [with T = js::ObjectGroupCompartment::NewTable; Args = {JS::Zone*}; Client = js::ExclusiveContext] #1116372 = _ZN2js14MallocProviderINS_16ExclusiveContextEE10pod_mallocIhEEPT_m$T* js::MallocProvider<Client>::pod_malloc(size_t) [with T = unsigned char; Client = js::ExclusiveContext; size_t = long unsigned int] #1101809 = _ZN2js16ExclusiveContext13onOutOfMemoryENS_13AllocFunctionEmPv$void* js::ExclusiveContext::onOutOfMemory(int32, uint64, void*) #1101811 = _ZN9JSRuntime13onOutOfMemoryEN2js13AllocFunctionEmPvP9JSContext$void* JSRuntime::onOutOfMemory(int32, uint64, void*, JSContext*) #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1186728 = _ZN2js13TypeNewScript4makeEP9JSContextPNS_11ObjectGroupEP10JSFunction$uint8 js::TypeNewScript::make(JSContext*, js::ObjectGroup*, JSFunction*) #1099464 = _ZN2js11ObjectGroup12setNewScriptEPNS_13TypeNewScriptE$void js::ObjectGroup::setNewScript(js::TypeNewScript*) #1099465 = _ZN2js11ObjectGroup11setAddendumENS0_12AddendumKindEPvb$void js::ObjectGroup::setAddendum(uint32, void*, uint8) #1186658 = _ZN2js34PreliminaryObjectArrayWithTemplate15writeBarrierPreEPS0_$void js::PreliminaryObjectArrayWithTemplate::writeBarrierPre(js::PreliminaryObjectArrayWithTemplate*) #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728 Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1186723 = _ZNK2js13ReadBarrieredIPNS_11ObjectGroupEEcvKS2_Ev$js::ReadBarriered<T>::operator const T() const [with T = js::ObjectGroup*] #1187231 = _ZNK2js13ReadBarrieredIPNS_11ObjectGroupEE3getEv$const T js::ReadBarriered<T>::get() const [with T = js::ObjectGroup*] #1187517 = _ZNK2js17ReadBarrieredBaseIPNS_11ObjectGroupEE4readEv$void js::ReadBarrieredBase<T>::read() const [with T = js::ObjectGroup*] #1187683 = _ZN2js22InternalBarrierMethodsIPNS_11ObjectGroupEE11readBarrierES2_$static void js::InternalBarrierMethods<T*>::readBarrier(T*) [with T = js::ObjectGroup] #1098443 = _ZN2js2gc11TenuredCell11readBarrierEPS1_$void js::gc::TenuredCell::readBarrier(js::gc::TenuredCell*) #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const Ooh, at last the one jandem mentioned above. Any more? (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728 and #1186723 Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728, 1186723] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1110236 = _ZN2js17AddTypePropertyIdEPNS_16ExclusiveContextEPNS_11ObjectGroupEP8JSObject4jsidNS_7TypeSet4TypeE$void js::AddTypePropertyId(js::ExclusiveContext*, js::ObjectGroup*, JSObject*, jsid, js::TypeSet::Type) #1110265 = _ZN2js17ConstraintTypeSet7addTypeEPNS_16ExclusiveContextENS_7TypeSet4TypeE$void js::ConstraintTypeSet::addType(js::ExclusiveContext*, js::TypeSet::Type) #1204170 = _ZN2js17ConstraintTypeSet16postWriteBarrierEPNS_16ExclusiveContextENS_7TypeSet4TypeE$void js::ConstraintTypeSet::postWriteBarrier(js::ExclusiveContext*, js::TypeSet::Type) #1204171 = _ZN2js2gc11StoreBuffer10putGenericI10TypeSetRefEEvRKT_$void js::gc::StoreBuffer::putGeneric(TypeSetRef*) [with T = TypeSetRef] #1205212 = _ZN2js2gc11StoreBuffer3putINS1_13GenericBufferE10TypeSetRefEEvRT_RKT0_$void js::gc::StoreBuffer::put(js::gc::StoreBuffer::GenericBuffer*, TypeSetRef*) [with Buffer = js::gc::StoreBuffer::GenericBuffer; Edge = TypeSetRef] #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const ...and I can keep going. You can get it through a pre barrier. Calling setDelegate. Here's a fun one: (Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728 and #1186723 and #1110236 and #1099867 and #1186719 Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728, 1186723, 1110236, 1099867, 1186719] #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*) #1099781 = _ZN2jsL8NameToIdEPNS_12PropertyNameE$String.h:jsid js::NameToId(js::PropertyName*) #13302 = _ZL24NON_INTEGER_ATOM_TO_JSIDP6JSAtom$jsfriendapi.h:jsid NON_INTEGER_ATOM_TO_JSID(JSAtom*) #13304 = _ZN2js6detail13IdMatchesAtomE4jsidP6JSAtom$uint8 js::detail::IdMatchesAtom(jsid, JSAtom*) #15290 = _Z23INTERNED_STRING_TO_JSIDP9JSContextP8JSString$jsid INTERNED_STRING_TO_JSID(JSContext*, JSString*) #902801 = _Z22JS_StringHasBeenPinnedP9JSContextP8JSString$uint8 JS_StringHasBeenPinned(JSContext*, JSString*) #1178664 = _ZN2js16AssertHeapIsIdleEP9JSContext$void js::AssertHeapIsIdle(JSContext*) #1184273 = _ZN2js16AssertHeapIsIdleEP9JSRuntime$void js::AssertHeapIsIdle(JSRuntime*) #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const There are a few more before it fails to find any more. But in short: yes, you can get to isHeapBusy from defaultNewGroup.

Flags: needinfo?(sphink)

Terrence Cole [:terrence]

Assignee

Comment 10

•

9 years ago

Attached patch use_perfect_forwarding_for_rekey-v0.diff (obsolete) — Details — Splinter Review

The reason that I allowed readBarrier from OMT originally was that I could not figure out how to make rekey use the move constructor instead of the copy constructor. The trick is that perfect forwarding requires a new template name for each and every level forwarded. By deleting the copy constructor in InitialShapeEntry, I was able to track down and fix all of the remaining places in HashTable that did not perfectly forward. With that, rekey can use a mozilla::Move'd value and we can successfully avoid triggering read barriers from OMT. The next patch adds the actual OMT assertion to readBarrier and fixes up all of the places that we are currently triggering it from the parse thread. (At least all that I was able to turn up with the shell test suites.)

Attachment #8764437 - Flags: review?(jcoppeard)

Terrence Cole [:terrence]

Assignee

Comment 11

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a23d46d11ab5

Terrence Cole [:terrence]

Assignee

Comment 12

•

9 years ago

Attached patch do_not_fire_read_barriers_omt-v0.diff — Details — Splinter Review

Attachment #8764439 - Flags: review?(jcoppeard)

Jon Coppeard (:jonco)

Comment 13

•

9 years ago

Comment on attachment 8764439 [details] [diff] [review] do_not_fire_read_barriers_omt-v0.diff Review of attachment 8764439 [details] [diff] [review]: ----------------------------------------------------------------- Passing the context to get() is a bit of a pain, but it's great that we can now assert that we don't trigger barriers off main thread. ::: js/src/gc/Barrier.h @@ +609,1 @@ > I don't see how the copy assignment operator (just before here in the file) triggers the read barrier: ReadBarriered& operator=(const ReadBarriered& v) { T prior = this->value; this->value = v.value; this->post(prior, v.value); return *this; } Doesn't this need to call get() or read() somewhere? ::: js/src/jsatominlines.h @@ +29,5 @@ > > inline JSAtom* > js::AtomStateEntry::asPtrUnbarriered() const > { > MOZ_ASSERT(bits != 0); This seems to only be used during GC. Can we add an assertion to that effect? ::: js/src/vm/Shape.h @@ +1107,5 @@ > + } > + > + InitialShapeEntry& operator=(InitialShapeEntry&& other) { > + shape = other.shape.unbarrieredGet(); > + proto = other.proto.unbarrieredGet(); I'm a bit confused by this, because I can't see how assigning the raw T to a ReadBarriered<T> will work. Whatever though, I think we want to give ReadBarriered a move assignment operator taking a T&& and do |x = mozilla::Move(other.x)| here for these.

Attachment #8764439 - Flags: review?(jcoppeard) → review+

Jon Coppeard (:jonco)

Comment 14

•

9 years ago

Comment on attachment 8764437 [details] [diff] [review] use_perfect_forwarding_for_rekey-v0.diff Review of attachment 8764437 [details] [diff] [review]: ----------------------------------------------------------------- I don't understand this (see below). ::: js/public/HashTable.h @@ +1800,5 @@ > MOZ_ASSERT(table); > mozilla::ReentrancyGuard g(*this); > MOZ_ASSERT(p.found()); > + using Entry = typename HashTableEntry<T>::NonConstT; > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); Why does this require a const_cast when it didn't before? @@ +1801,5 @@ > mozilla::ReentrancyGuard g(*this); > MOZ_ASSERT(p.found()); > + using Entry = typename HashTableEntry<T>::NonConstT; > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); > + HashPolicy::setKey(copy, const_cast<Key&>(k)); I don't understand what's going on here. We've forwarded the key all the way down to this method and then we just cast it to a reference. What am I missing?

Terrence Cole [:terrence]

Assignee

Comment 15

•

9 years ago

(In reply to Jon Coppeard (:jonco) from comment #13) > Comment on attachment 8764439 [details] [diff] [review] > do_not_fire_read_barriers_omt-v0.diff > > Review of attachment 8764439 [details] [diff] [review]: > ----------------------------------------------------------------- > > Passing the context to get() is a bit of a pain, but it's great that we can > now assert that we don't trigger barriers off main thread. > > ::: js/src/gc/Barrier.h > @@ +609,1 @@ > > > > I don't see how the copy assignment operator (just before here in the file) > triggers the read barrier: > > ReadBarriered& operator=(const ReadBarriered& v) { > T prior = this->value; > this->value = v.value; > this->post(prior, v.value); > return *this; > } > > Doesn't this need to call get() or read() somewhere? Yeah, this looks like a bug. I didn't see it because InitialShapeEntry does not have copy-assign, only copy-construct. The copy constructor that it uses on it's children does have the right barrier logic. I'll make this use v.get() and fixup anything else that breaks. > ::: js/src/jsatominlines.h > @@ +29,5 @@ > > > > inline JSAtom* > > js::AtomStateEntry::asPtrUnbarriered() const > > { > > MOZ_ASSERT(bits != 0); > > This seems to only be used during GC. Can we add an assertion to that > effect? That's not true anymore. I changed AtomHasher::match to use the unbarriered variant on the entries we're checking as we don't want incidental usage in the hashtable to expand the live set. I think this is a pre-existing bug. > ::: js/src/vm/Shape.h > @@ +1107,5 @@ > > + } > > + > > + InitialShapeEntry& operator=(InitialShapeEntry&& other) { > > + shape = other.shape.unbarrieredGet(); > > + proto = other.proto.unbarrieredGet(); > > I'm a bit confused by this, because I can't see how assigning the raw T to a > ReadBarriered<T> will work. Whatever though, I think we want to give > ReadBarriered a move assignment operator taking a T&& and do |x = > mozilla::Move(other.x)| here for these. It avoids the barrier by using |MOZ_IMPLICIT ReadBarriered(const T& v)|, which appears to be a tighter match than |ReadBarriered& operator=(const ReadBarriered& v)|. I cribbed this from rekey where we don't have the option of Move, but you're quite right: I've updated these to use the move constructor instead, which appears to work just as well.

Terrence Cole [:terrence]

Assignee

Comment 16

•

9 years ago

(In reply to Jon Coppeard (:jonco) from comment #14) > Comment on attachment 8764437 [details] [diff] [review] > use_perfect_forwarding_for_rekey-v0.diff > > Review of attachment 8764437 [details] [diff] [review]: > ----------------------------------------------------------------- > > I don't understand this (see below). > > ::: js/public/HashTable.h > @@ +1800,5 @@ > > MOZ_ASSERT(table); > > mozilla::ReentrancyGuard g(*this); > > MOZ_ASSERT(p.found()); > > + using Entry = typename HashTableEntry<T>::NonConstT; > > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); > > Why does this require a const_cast when it didn't before? Because we need to use the move constructor to avoid the barriers. The copy takes a const&, so can source from the const Entry here by taking a ref. The move constructor cannot take a const, however, so we have to cast it out here. This actually turns up an issue that I didn't catch before though: later in this function we remove(*p.entry_) after having moved it. From what I understand about rvalue references, this is actually allowed, if not particularly clear. Move isn't really "move", it's more like a new kind of reference/tag that every one agrees should move, but it doesn't have actual poisoning semantics itself. And indeed, it appears to work fine when used like this, as long as none of the rekeyable keys do poisoning on "move". > @@ +1801,5 @@ > > mozilla::ReentrancyGuard g(*this); > > MOZ_ASSERT(p.found()); > > + using Entry = typename HashTableEntry<T>::NonConstT; > > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); > > + HashPolicy::setKey(copy, const_cast<Key&>(k)); > > I don't understand what's going on here. We've forwarded the key all the > way down to this method and then we just cast it to a reference. What am I > missing? The key was only part of the issue. The majority of the new perfect forwarding is actually so that we can create |copy| using the move constructor instead of degrading to the copy constructor. The setKey here actually just dispatches to HashPolicy::rekey, so we lose the at some point anyway. I've just hacked InitialShapeSet's rekey to avoid the barriers in this particular case. Actually(!), thinking about it again, we don't have to degrade here if we continue perfectly forwarding all the way into rekey. I was thinking we'd have to change a ton of rekey implementations, but it should just degrade if they take const&, which I think most do. At least it seems to compile alright, I guess I'll send it to try after I iron out the setLive issue. This stuff is stupid subtle. At least we've got an assertion backstopping it.

Terrence Cole [:terrence]

Assignee

Comment 17

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=088aa93ad2b4

Jon Coppeard (:jonco)

Comment 18

•

9 years ago

(In reply to Terrence Cole [:terrence] from comment #16) > > > + using Entry = typename HashTableEntry<T>::NonConstT; > > > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); > > > > Why does this require a const_cast when it didn't before? > > Because we need to use the move constructor to avoid the barriers. The copy > takes a const&, so can source from the const Entry here by taking a ref. The > move constructor cannot take a const, however, so we have to cast it out > here. Oh OK, because T is |const KeyType| for HashSet. So I guess that Move didn't actually do anything before. > This actually turns up an issue that I didn't catch before though: later in > this function we remove(*p.entry_) after having moved it. From what I > understand about rvalue references, this is actually allowed, if not > particularly clear. Move isn't really "move", it's more like a new kind of > reference/tag that every one agrees should move, but it doesn't have actual > poisoning semantics itself. Well it's down to the type what happens when you move it. IIUC post-move the object must be destructible but not release any originally owned resources. > And indeed, it appears to work fine when used > like this, as long as none of the rekeyable keys do poisoning on "move". So this looks bad since we are accessing something after we move it, and even if that works it relies on the implementation of the move constructors not to poison/null the moved object's contents out after the move. However I then realised that this is not the case since we move the T stored in the HashTableEntry<T>, but we call remove() with a reference to the outer HashTableEntry<T>. I think this is fine regardless of how move is implemented, but we could change the code to extract a pointer to this first so it is more clear what's happening.

Terrence Cole [:terrence]

Assignee

Comment 19

•

9 years ago

Oh, good point! You're right: it doesn't even matter as long as we're not violating move semantics.

Terrence Cole [:terrence]

Assignee

Comment 20

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=59e403fe6306

Terrence Cole [:terrence]

Assignee

Comment 21

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a998eed38a04

Jon Coppeard (:jonco)

Comment 22

•

9 years ago

Comment on attachment 8764437 [details] [diff] [review] use_perfect_forwarding_for_rekey-v0.diff Review of attachment 8764437 [details] [diff] [review]: ----------------------------------------------------------------- Cancelling review while waiting for new patch.

Attachment #8764437 - Flags: review?(jcoppeard) → feedback+

Julian Seward [:jseward]

Comment 23

•

9 years ago

Ping .. is there any progess here?

Terrence Cole [:terrence]

Assignee

Comment 24

•

9 years ago

(In reply to Julian Seward [:jseward] from comment #23) > Ping .. is there any progess here? Sadly, no. The TreeHerder failures on jstestbrowser were not trivial to reproduce locally and I've been distracted tracking down similar failures in bug 1237058. I'm almost done there though, at which point I can get back to this. Sorry for the delay.

Steve Fink [:sfink] [:s:]

Updated

•

9 years ago

Blocks: tsan

Terrence Cole [:terrence]

Assignee

Comment 25

•

9 years ago

Attached file compartments-before.txt — Details

Terrence Cole [:terrence]

Assignee

Comment 26

•

9 years ago

Attached file compartments-after.txt — Details

Terrence Cole [:terrence]

Assignee

Comment 27

•

9 years ago

This is suffering from exactly the same symptoms as those in bug 1237058. Which is *very* odd as the mechanisms being poked are extremely different. The attachments I've just attached are are from before and after the patches here, running on top of the patch from bug 1285605, piped through grep to collect only the "Compartments Collected" lines, then run through grep -v to strip out the DOM_WORKER "2 of 2" collections. What's interesting is that even in the before run, our compartments pile up rapidly up to 6000+ strong. There are many, many GC's that are unable to collect these compartments until we happen to trigger one that does collect basically all of them in one swoop. We then build back up to 6000+ again before exiting. The after run is the same (down to having approximately the same number of GC's), except that there are no collections that free substantial numbers of compartments and we end up collecting over 11000+ compartments before the terminating GC, where we're killed because we take too long. Oh, the other difference is that the before run takes 4 minutes and the after run takes 11 minutes. Carrying those extra compartments *really* adds up. Exponentially fast even. The following is the GC that clears up all the compartments from the before run: GC(T+204.028s) ================================================================= Invocation Kind: Normal Reason: COMPONENT_UTILS Incremental: no - requested Zones Collected: 5 of 5 (-1) Compartments Collected: 5249 of 5249 (-5170) MinorGCs since last GC: 1 Store Buffer Overflows: 0 MMU 20ms:0.0%; 50ms:0.0% SCC Sweep Total (MaxPause): 114.209ms (114.209ms) HeapSize: 993.664 MiB Chunk Delta (magnitude): -731 (731) Arenas Relocated: 0.000 MiB ---- Totals ---- Total Time: 711.769ms Max Pause: 711.769ms Begin Callback: 5881.877ms Wait Background Thread: 363.486ms Relazify Functions: 34.669ms Purge: 0.187ms Mark: 33.689ms Other: 4.273ms Mark Roots: 3.778ms Other: 0.443ms Mark Cross Compartment Wrappers: 0.110ms Mark Rooters: 0.049ms Mark Runtime-wide Data: 1.702ms Mark Embedding: 0.145ms Mark Compartments: 1.329ms Unmark: 25.638ms Sweep: 278.274ms Other: 9.543ms Mark During Sweeping: 4.132ms Mark Incoming Black Pointers: 0.055ms Mark Weak: 0.138ms Mark Incoming Gray Pointers: 0.037ms Mark Gray: 3.726ms Mark Gray and Weak: 0.176ms Finalize Start Callbacks: 3.457ms Per-Slice Weak Callback: 1.056ms Per-Compartment Weak Callback: 2.399ms Sweep Atoms: 7.897ms Sweep Symbol Registry: 0.049ms Sweep Compartments: 65.518ms Sweep Discard Code: 4.265ms Sweep Inner Views: 0.260ms Sweep Cross Compartment Wrappers: 1.525ms Sweep Type Objects: 1.051ms Sweep Breakpoints: 22.585ms Sweep Regexps: 1.337ms Sweep Miscellaneous: 395.137ms Sweep type information: 30.236ms Other: 30.232ms Sweep type tables and compilations: 0.002ms Free type arena: 0.002ms Sweep Object: 78.921ms Sweep String: 0.850ms Sweep Script: 12.353ms Sweep Shape: 45.617ms Sweep JIT code: 0.048ms Finalize End Callback: 13.593ms Deallocate: 36.296ms End Callback: 0.101ms Minor GCs to Evict Nursery: 1.213ms Other: 0.882ms Mark Roots: 0.331ms Other: 0.318ms Mark Rooters: 0.013ms Barriers: 0.002ms Unmark gray: 0.001ms The Start Callback is 5 seconds and the total GC time is .7 seconds, so that's weird. The before case does a COMPONENT_UTILS collection 168 times and the after does 177 COMPONENT_UTILS collections, so it's not a matter of how we're triggering the collection. I still have no idea what's going wrong, but it appears to be something subtle.

Terrence Cole [:terrence]

Assignee

Updated

•

9 years ago

See Also: → 1237058

Terrence Cole [:terrence]

Assignee

Comment 28

•

9 years ago

Running |Cu.forceGC(); Cu.forceCC(); Cu.forceGC(); Cu.forceCC();| after ever test does not keep the ghost windows from piling up endlessly.

Terrence Cole [:terrence]

Assignee

Comment 29

•

9 years ago

I lied. I was doing the new GC's in the parent process by accident. Doing the GC's in the content process is keeping the window/compartment count low, as expected. It seems that forceCC() followed by forceGC() is sufficient to flush all the garbage. Talking to Bill this morning, it seems like the most likely reason that these windows are kept alive is wrapper remapping. We should have a mechanism to address this: ScheduledForDestruction. I need to figure out how that's supposed to work and see if I can figure out why it's not.

Terrence Cole [:terrence]

Assignee

Comment 30

•

9 years ago

COMPARTMENT_REVIVED collections are getting triggered as expected, but they are not releasing any compartments.

Terrence Cole [:terrence]

Assignee

Comment 31

•

9 years ago

(In reply to Terrence Cole [:terrence] from comment #30) > COMPARTMENT_REVIVED collections are getting triggered as expected, but they > are not releasing any compartments. This is because the compartments being held live are held because the [HTMLDocument] wrapper is live because the nsDocument is held in the wrapper cache. COMPARTMENT_REVIVIED collections are only effective at cleaning up compartments that were held live via barriers.

Terrence Cole [:terrence]

Assignee

Comment 32

•

9 years ago

This should all be resolved by bug 1296484.

Terrence Cole [:terrence]

Assignee

Comment 33

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=52e2d3ac700a

Terrence Cole [:terrence]

Assignee

Comment 34

•

9 years ago

Attached patch use_perfect_forwarding_for_rekey-v1.diff (obsolete) — Details — Splinter Review

It looks like bug 1292218 is disabling the collections from 1296484 by disrupting the COMPARTMENT_REVIVED heuristics. I think if we back that out with this landing we should be good. I'll do another try push. Otherwise the actual changes look good now. Sorry to ask you to review Steve, but Jon is on PTO until next week and I'd like to get this in ASAP.

Attachment #8764437 - Attachment is obsolete: true

Attachment #8786065 - Flags: review?(sphink)

Terrence Cole [:terrence]

Assignee

Comment 35

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=4a4876c657c7

Steve Fink [:sfink] [:s:]

Comment 36

•

9 years ago

Comment on attachment 8786065 [details] [diff] [review] use_perfect_forwarding_for_rekey-v1.diff Review of attachment 8786065 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/public/HashTable.h @@ +1867,5 @@ > MOZ_ASSERT(table); > mozilla::ReentrancyGuard g(*this); > MOZ_ASSERT(p.found()); > + using Entry = typename HashTableEntry<T>::NonConstT; > + Entry copy(mozilla::Move(const_cast<Entry&>(*p))); I do not like calling this 'copy', given that it's trying hard to *not* copy. 'entry' would work fine for me. @@ +1873,2 @@ > remove(*p.entry_); > + putNewInfallibleInternal(l, mozilla::Move(copy)); Ouch. Ok, this doesn't look right. l and k may be the same thing. k gets moved out via setKey, and then l is used. These are all references, so if eg Lookup and Key are Vector<U>, then the vector storage may have been stolen away and nulled out.

Attachment #8786065 - Flags: review?(sphink)

Jon Coppeard (:jonco)

Comment 38

•

9 years ago

Attached patch bug844769-forward-for-rekey v2 — Details — Splinter Review

Updated rekeying patch with Steve's review comments.

Attachment #8786065 - Attachment is obsolete: true

Jon Coppeard (:jonco)

Comment 39

•

9 years ago

Attached patch bug844769-alternate — Details — Splinter Review

Here's an alternative approach to removing the race. This just uses the incremental barrier flag which should always be off while we are collecting. I think this should work (but please check my logic), and it doesn't seem to provoke the reftest failures on try.

Attachment #8789853 - Flags: review?(terrence)

Terrence Cole [:terrence]

Assignee

Comment 40

•

9 years ago

Comment on attachment 8789853 [details] [diff] [review] bug844769-alternate Review of attachment 8789853 [details] [diff] [review]: ----------------------------------------------------------------- Ah, good observation! Let's try it.

Attachment #8789853 - Flags: review?(terrence) → review+

Comment 41

•

9 years ago

Pushed by jcoppeard@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/0fda02f7c6a0 Don't access heap state from off main thread r=terrence

Steve Fink [:sfink] [:s:]

Comment 42

•

9 years ago

Comment on attachment 8789775 [details] [diff] [review] bug844769-forward-for-rekey v2 Review of attachment 8789775 [details] [diff] [review]: ----------------------------------------------------------------- I know you didn't ask for review yet, but this beautifully resolves the only issue I had with the patch. Whether you want to land this or not is up to you, but r=me.

Attachment #8789775 - Flags: review+

Phil Ringnalda (:philor)

Comment 43

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0fda02f7c6a0

Status: NEW → RESOLVED

Closed: 9 years ago

status-firefox51: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

Steve Fink [:sfink] [:s:]

Comment 44

•

9 years ago

Since this is marked RESOLVED now and can easily get lost -- what's the state of the other patch? It would still be good to fix the rekeying barriers, if I'm understanding the state of things correctly. Is the patch landable, or should it be forked off to another bug? needinfo both since Jon is on PTO.

Flags: needinfo?(terrence)

Flags: needinfo?(jcoppeard)

Terrence Cole [:terrence]

Assignee

Comment 45

•

9 years ago

Filed bug 1302559 to follow up.

Flags: needinfo?(terrence)

Jon Coppeard (:jonco)

Updated

•

9 years ago

Flags: needinfo?(jcoppeard)

You need to log in before you can comment on or make changes to this bug.