Closed Bug 844769 Opened 7 years ago Closed 3 years ago

TSan: Thread data race in AutoTraceSession::AutoTraceSession(JSRuntime *rt, js::HeapState heapState) vs. JSRuntime::isHeapBusy()

Categories

(Core :: JavaScript: GC, defect, critical)

x86_64
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox51 --- fixed

People

(Reporter: posidron, Assigned: terrence)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tsan][tsan-test-blocker])

Attachments

(6 files, 3 obsolete files)

Attached file trace (obsolete) —
During Firefox start-up with ThreadSanitizer (LLVM version), we get a data race reported as described in the attached log. Trace was created on mozilla-central with changeset 122820:c233837cce08.

According to the TSan devs, most of the reported traces should be real data races, even though they can be "benign". We need to determine if the race can/should be fixed, or put on the ignore list. Even for benign races, TSan devs suggest to fix them (second priority), as they can also cause problems [1].

[1] http://software.intel.com/en-us/blogs/2013/01/06/benign-data-races-what-could-possibly-go-wrong
This one still shows up on jit-tests. I'll attach a new trace.
Component: General → JavaScript: GC
Attached file Shell trace
The background parse thread is here:

    bool isHeapBusy() const { return heapState_ != JS::HeapState::Idle; }

And the main thread writes to heapState_ in AutoTraceSession::AutoTraceSession.
Attachment #717812 - Attachment is obsolete: true
Flags: needinfo?(terrence)
According to the GC analysis, the one path from defaultNewGroup to isHeapBusy is via:
  #739477 = js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #758165 = js::ObjectGroup* js::ObjectGroupCompartment::makeGroup(js::ExclusiveContext*, js::Class*, class JS::Handle<js::TaggedProto>, uint32)
  #753457 = js::ObjectGroup* js::Allocate(js::ExclusiveContext*) [with T = js::ObjectGroup; js::AllowGC allowGC = (js::AllowGC)1u]
  #752868 = uint8 js::gc::GCRuntime::checkAllocatorState(JSContext*, int32) [with js::AllowGC allowGC = (js::AllowGC)1u]
  #5293 = uint8 JS::shadow::Runtime::isHeapBusy() const
Flags: needinfo?(terrence)
Note that that should not actually be possible: the background thread uses an ExclusiveContext, which should cause us to skip the checkAllocatorState [1]. This is enforced by the type system, so I'm not sure what's going on here. Maybe a debug build would give us a cleaner stack? 

1- https://dxr.mozilla.org/mozilla-central/source/js/src/gc/Allocator.cpp#135-139
(In reply to Terrence Cole [:terrence] from comment #4)
> Maybe a debug build would give us a cleaner stack? 

Good point. It's in TenuredCell::readBarrier (did the analysis miss this path?):

    #0 JS::shadow::Runtime::isHeapBusy() const jspubtd.h:164 (js+0x000100d43722)
    #1 js::InternalBarrierMethods<js::ObjectGroup*>::readBarrier(js::ObjectGroup*) Heap.h:1279 (js+0x000100dd5e08)
    #2 js::ReadBarrieredBase<js::ObjectGroup*>::read() const Barrier.h:553 (js+0x000100dd5dbe)
    #3 js::ReadBarriered<js::ObjectGroup*>::get() const Barrier.h:608 (js+0x000100dd5d10)
    #4 js::ReadBarriered<js::ObjectGroup*>::operator js::ObjectGroup* const() const Barrier.h:620 (js+0x000100daeaf8)
    #5 js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class const*, js::TaggedProto, JSObject*) ObjectGroup.cpp:495 (js+0x000100dade09)
    #6 js::NewObjectWithClassProtoCommon(js::ExclusiveContext*, js::Class const*, JS::Handle<JSObject*>, js::gc::AllocKind, js::NewObjectKind) jsobj.cpp:795 (js+0x000100a4642e)
    #7 js::NewObjectWithClassProto(js::ExclusiveContext*, js::Class const*, JS::Handle<JSObject*>, js::gc::AllocKind, js::NewObjectKind) jsobjinlines.h:703 (js+0x000100017b52)
Flags: needinfo?(terrence)
Does the traverse.py analysis only give us one path from A to B?
Flags: needinfo?(terrence) → needinfo?(sphink)
I think the best way to address this is to finally make the on-main-thread assertion we've wanted to have forever. I think if we add a custom ReadBarriered::get that takes the ExclusiveContext and does the get() || unbarrieredGet() internally, this shouldn't be so cumbersome. We do need to go out of line for this, so hopefully it won't be a slowdown. If it is, we can always open code it.
Assignee: nobody → terrence
Note that this is not a sec issue: we never fire the barrier from OMT so our correctness does not depend at all on the value we're reading.
Keywords: sec-want
(In reply to Terrence Cole [:terrence] from comment #6)
> Does the traverse.py analysis only give us one path from A to B?

If that's what you ask for. You can also use "routes from <X> to <Y>" and it'll give you multiple routes, making some attempt to make the routes look different. In this case, it still only displayed that one route, which means it is all it could find.

As for *why* it didn't see any other paths, it looks like the callgraph.txt I had lying around is too old. Sorry about that. But if I update, I still don't get the above path, which seems like it's a bug in my multiple route processing, since if I trace it through one link at a time it now has the full path. (It isn't terribly useful to print out *all* paths, since for most cases there are an exponential number of distinct ones. So some cleverness is required. Apparently, more than I have implemented at the moment.)

But I can fake it.

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup
Path from #1125956 to #3050: avoiding [1180999]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1099478 = _ZN2js11ObjectGroup12setTypeDescrEPNS_9TypeDescrE$void js::ObjectGroup::setTypeDescr(js::TypeDescr*)
  #1099465 = _ZN2js11ObjectGroup11setAddendumENS0_12AddendumKindEPvb$void js::ObjectGroup::setAddendum(uint32, void*, uint8)
  #1186658 = _ZN2js34PreliminaryObjectArrayWithTemplate15writeBarrierPreEPS0_$void js::PreliminaryObjectArrayWithTemplate::writeBarrierPre(js::PreliminaryObjectArrayWithTemplate*)
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr
Path from #1125956 to #3050: avoiding [1180999, 1099478]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1186716 = _ZN2js14MallocProviderINS_16ExclusiveContextEE4new_INS_22ObjectGroupCompartment8NewTableEIPN2JS4ZoneEEEEPT_DpOT0_$T* js::MallocProvider<Client>::new_(Args&& ...) [with T = js::ObjectGroupCompartment::NewTable; Args = {JS::Zone*}; Client = js::ExclusiveContext]
  #1116372 = _ZN2js14MallocProviderINS_16ExclusiveContextEE10pod_mallocIhEEPT_m$T* js::MallocProvider<Client>::pod_malloc(size_t) [with T = unsigned char; Client = js::ExclusiveContext; size_t = long unsigned int]
  #1101809 = _ZN2js16ExclusiveContext13onOutOfMemoryENS_13AllocFunctionEmPv$void* js::ExclusiveContext::onOutOfMemory(int32, uint64, void*)
  #1101811 = _ZN9JSRuntime13onOutOfMemoryEN2js13AllocFunctionEmPvP9JSContext$void* JSRuntime::onOutOfMemory(int32, uint64, void*, JSContext*)
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716
Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1186728 = _ZN2js13TypeNewScript4makeEP9JSContextPNS_11ObjectGroupEP10JSFunction$uint8 js::TypeNewScript::make(JSContext*, js::ObjectGroup*, JSFunction*)
  #1099464 = _ZN2js11ObjectGroup12setNewScriptEPNS_13TypeNewScriptE$void js::ObjectGroup::setNewScript(js::TypeNewScript*)
  #1099465 = _ZN2js11ObjectGroup11setAddendumENS0_12AddendumKindEPvb$void js::ObjectGroup::setAddendum(uint32, void*, uint8)
  #1186658 = _ZN2js34PreliminaryObjectArrayWithTemplate15writeBarrierPreEPS0_$void js::PreliminaryObjectArrayWithTemplate::writeBarrierPre(js::PreliminaryObjectArrayWithTemplate*)
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728
Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1186723 = _ZNK2js13ReadBarrieredIPNS_11ObjectGroupEEcvKS2_Ev$js::ReadBarriered<T>::operator const T() const [with T = js::ObjectGroup*]
  #1187231 = _ZNK2js13ReadBarrieredIPNS_11ObjectGroupEE3getEv$const T js::ReadBarriered<T>::get() const [with T = js::ObjectGroup*]
  #1187517 = _ZNK2js17ReadBarrieredBaseIPNS_11ObjectGroupEE4readEv$void js::ReadBarrieredBase<T>::read() const [with T = js::ObjectGroup*]
  #1187683 = _ZN2js22InternalBarrierMethodsIPNS_11ObjectGroupEE11readBarrierES2_$static void js::InternalBarrierMethods<T*>::readBarrier(T*) [with T = js::ObjectGroup]
  #1098443 = _ZN2js2gc11TenuredCell11readBarrierEPS1_$void js::gc::TenuredCell::readBarrier(js::gc::TenuredCell*)
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

Ooh, at last the one jandem mentioned above. Any more?

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728 and #1186723
Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728, 1186723]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1110236 = _ZN2js17AddTypePropertyIdEPNS_16ExclusiveContextEPNS_11ObjectGroupEP8JSObject4jsidNS_7TypeSet4TypeE$void js::AddTypePropertyId(js::ExclusiveContext*, js::ObjectGroup*, JSObject*, jsid, js::TypeSet::Type)
  #1110265 = _ZN2js17ConstraintTypeSet7addTypeEPNS_16ExclusiveContextENS_7TypeSet4TypeE$void js::ConstraintTypeSet::addType(js::ExclusiveContext*, js::TypeSet::Type)
  #1204170 = _ZN2js17ConstraintTypeSet16postWriteBarrierEPNS_16ExclusiveContextENS_7TypeSet4TypeE$void js::ConstraintTypeSet::postWriteBarrier(js::ExclusiveContext*, js::TypeSet::Type)
  #1204171 = _ZN2js2gc11StoreBuffer10putGenericI10TypeSetRefEEvRKT_$void js::gc::StoreBuffer::putGeneric(TypeSetRef*) [with T = TypeSetRef]
  #1205212 = _ZN2js2gc11StoreBuffer3putINS1_13GenericBufferE10TypeSetRefEEvRT_RKT0_$void js::gc::StoreBuffer::put(js::gc::StoreBuffer::GenericBuffer*, TypeSetRef*) [with Buffer = js::gc::StoreBuffer::GenericBuffer; Edge = TypeSetRef]
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

...and I can keep going. You can get it through a pre barrier. Calling setDelegate. Here's a fun one:

(Cmd) route from #1125956 to isHeapBusy avoiding ObjectGroupCompartment::makeGroup and setTypeDescr and #1186716 and #1186728 and #1186723 and #1110236 and #1099867 and #1186719
Path from #1125956 to #3050: avoiding [1180999, 1099478, 1186716, 1186728, 1186723, 1110236, 1099867, 1186719]
  #1125956 = _ZN2js11ObjectGroup15defaultNewGroupEPNS_16ExclusiveContextEPKNS_5ClassENS_11TaggedProtoEP8JSObject$js::ObjectGroup* js::ObjectGroup::defaultNewGroup(js::ExclusiveContext*, js::Class*, js::TaggedProto, JSObject*)
  #1099781 = _ZN2jsL8NameToIdEPNS_12PropertyNameE$String.h:jsid js::NameToId(js::PropertyName*)
  #13302 = _ZL24NON_INTEGER_ATOM_TO_JSIDP6JSAtom$jsfriendapi.h:jsid NON_INTEGER_ATOM_TO_JSID(JSAtom*)
  #13304 = _ZN2js6detail13IdMatchesAtomE4jsidP6JSAtom$uint8 js::detail::IdMatchesAtom(jsid, JSAtom*)
  #15290 = _Z23INTERNED_STRING_TO_JSIDP9JSContextP8JSString$jsid INTERNED_STRING_TO_JSID(JSContext*, JSString*)
  #902801 = _Z22JS_StringHasBeenPinnedP9JSContextP8JSString$uint8 JS_StringHasBeenPinned(JSContext*, JSString*)
  #1178664 = _ZN2js16AssertHeapIsIdleEP9JSContext$void js::AssertHeapIsIdle(JSContext*)
  #1184273 = _ZN2js16AssertHeapIsIdleEP9JSRuntime$void js::AssertHeapIsIdle(JSRuntime*)
  #3050 = _ZNK2JS6shadow7Runtime10isHeapBusyEv$uint8 JS::shadow::Runtime::isHeapBusy() const

There are a few more before it fails to find any more. But in short: yes, you can get to isHeapBusy from defaultNewGroup.
Flags: needinfo?(sphink)
The reason that I allowed readBarrier from OMT originally was that I could not figure out how to make rekey use the move constructor instead of the copy constructor. The trick is that perfect forwarding requires a new template name for each and every level forwarded. By deleting the copy constructor in InitialShapeEntry, I was able to track down and fix all of the remaining places in HashTable that did not perfectly forward. With that, rekey can use a mozilla::Move'd value and we can successfully avoid triggering read barriers from OMT.

The next patch adds the actual OMT assertion to readBarrier and fixes up all of the places that we are currently triggering it from the parse thread. (At least all that I was able to turn up with the shell test suites.)
Attachment #8764437 - Flags: review?(jcoppeard)
Attachment #8764439 - Flags: review?(jcoppeard)
Comment on attachment 8764439 [details] [diff] [review]
do_not_fire_read_barriers_omt-v0.diff

Review of attachment 8764439 [details] [diff] [review]:
-----------------------------------------------------------------

Passing the context to get() is a bit of a pain, but it's great that we can now assert that we don't trigger barriers off main thread.

::: js/src/gc/Barrier.h
@@ +609,1 @@
>  

I don't see how the copy assignment operator (just before here in the file) triggers the read barrier:

    ReadBarriered& operator=(const ReadBarriered& v) {
        T prior = this->value;
        this->value = v.value;
        this->post(prior, v.value);
        return *this;
    }

Doesn't this need to call get() or read() somewhere?

::: js/src/jsatominlines.h
@@ +29,5 @@
>  
>  inline JSAtom*
>  js::AtomStateEntry::asPtrUnbarriered() const
>  {
>      MOZ_ASSERT(bits != 0);

This seems to only be used during GC.  Can we add an assertion to that effect?

::: js/src/vm/Shape.h
@@ +1107,5 @@
> +    }
> +
> +    InitialShapeEntry& operator=(InitialShapeEntry&& other) {
> +        shape = other.shape.unbarrieredGet();
> +        proto = other.proto.unbarrieredGet();

I'm a bit confused by this, because I can't see how assigning the raw T to a ReadBarriered<T> will work.  Whatever though, I think we want to give ReadBarriered a move assignment operator taking a T&& and do |x = mozilla::Move(other.x)| here for these.
Attachment #8764439 - Flags: review?(jcoppeard) → review+
Comment on attachment 8764437 [details] [diff] [review]
use_perfect_forwarding_for_rekey-v0.diff

Review of attachment 8764437 [details] [diff] [review]:
-----------------------------------------------------------------

I don't understand this (see below).

::: js/public/HashTable.h
@@ +1800,5 @@
>          MOZ_ASSERT(table);
>          mozilla::ReentrancyGuard g(*this);
>          MOZ_ASSERT(p.found());
> +        using Entry = typename HashTableEntry<T>::NonConstT;
> +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));

Why does this require a const_cast when it didn't before?

@@ +1801,5 @@
>          mozilla::ReentrancyGuard g(*this);
>          MOZ_ASSERT(p.found());
> +        using Entry = typename HashTableEntry<T>::NonConstT;
> +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));
> +        HashPolicy::setKey(copy, const_cast<Key&>(k));

I don't understand what's going on here.  We've forwarded the key all the way down to this method and then we just cast it to a reference.  What am I missing?
(In reply to Jon Coppeard (:jonco) from comment #13)
> Comment on attachment 8764439 [details] [diff] [review]
> do_not_fire_read_barriers_omt-v0.diff
> 
> Review of attachment 8764439 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Passing the context to get() is a bit of a pain, but it's great that we can
> now assert that we don't trigger barriers off main thread.
> 
> ::: js/src/gc/Barrier.h
> @@ +609,1 @@
> >  
> 
> I don't see how the copy assignment operator (just before here in the file)
> triggers the read barrier:
> 
>     ReadBarriered& operator=(const ReadBarriered& v) {
>         T prior = this->value;
>         this->value = v.value;
>         this->post(prior, v.value);
>         return *this;
>     }
> 
> Doesn't this need to call get() or read() somewhere?

Yeah, this looks like a bug. I didn't see it because InitialShapeEntry does not have copy-assign, only copy-construct. The copy constructor that it uses on it's children does have the right barrier logic. I'll make this use v.get() and fixup anything else that breaks.

> ::: js/src/jsatominlines.h
> @@ +29,5 @@
> >  
> >  inline JSAtom*
> >  js::AtomStateEntry::asPtrUnbarriered() const
> >  {
> >      MOZ_ASSERT(bits != 0);
> 
> This seems to only be used during GC.  Can we add an assertion to that
> effect?

That's not true anymore. I changed AtomHasher::match to use the unbarriered variant on the entries we're checking as we don't want incidental usage in the hashtable to expand the live set. I think this is a pre-existing bug.

> ::: js/src/vm/Shape.h
> @@ +1107,5 @@
> > +    }
> > +
> > +    InitialShapeEntry& operator=(InitialShapeEntry&& other) {
> > +        shape = other.shape.unbarrieredGet();
> > +        proto = other.proto.unbarrieredGet();
> 
> I'm a bit confused by this, because I can't see how assigning the raw T to a
> ReadBarriered<T> will work.  Whatever though, I think we want to give
> ReadBarriered a move assignment operator taking a T&& and do |x =
> mozilla::Move(other.x)| here for these.

It avoids the barrier by using |MOZ_IMPLICIT ReadBarriered(const T& v)|, which appears to be a tighter match than |ReadBarriered& operator=(const ReadBarriered& v)|. I cribbed this from rekey where we don't have the option of Move, but you're quite right: I've updated these to use the move constructor instead, which appears to work just as well.
(In reply to Jon Coppeard (:jonco) from comment #14)
> Comment on attachment 8764437 [details] [diff] [review]
> use_perfect_forwarding_for_rekey-v0.diff
> 
> Review of attachment 8764437 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I don't understand this (see below).
> 
> ::: js/public/HashTable.h
> @@ +1800,5 @@
> >          MOZ_ASSERT(table);
> >          mozilla::ReentrancyGuard g(*this);
> >          MOZ_ASSERT(p.found());
> > +        using Entry = typename HashTableEntry<T>::NonConstT;
> > +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));
> 
> Why does this require a const_cast when it didn't before?

Because we need to use the move constructor to avoid the barriers. The copy takes a const&, so can source from the const Entry here by taking a ref. The move constructor cannot take a const, however, so we have to cast it out here.

This actually turns up an issue that I didn't catch before though: later in this function we remove(*p.entry_) after having moved it. From what I understand about rvalue references, this is actually allowed, if not particularly clear. Move isn't really "move", it's more like a new kind of reference/tag that every one agrees should move, but it doesn't have actual poisoning semantics itself. And indeed, it appears to work fine when used like this, as long as none of the rekeyable keys do poisoning on "move".

> @@ +1801,5 @@
> >          mozilla::ReentrancyGuard g(*this);
> >          MOZ_ASSERT(p.found());
> > +        using Entry = typename HashTableEntry<T>::NonConstT;
> > +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));
> > +        HashPolicy::setKey(copy, const_cast<Key&>(k));
> 
> I don't understand what's going on here.  We've forwarded the key all the
> way down to this method and then we just cast it to a reference.  What am I
> missing?

The key was only part of the issue. The majority of the new perfect forwarding is actually so that we can create |copy| using the move constructor instead of degrading to the copy constructor. The setKey here actually just dispatches to HashPolicy::rekey, so we lose the at some point anyway. I've just hacked InitialShapeSet's rekey to avoid the barriers in this particular case.

Actually(!), thinking about it again, we don't have to degrade here if we continue perfectly forwarding all the way into rekey. I was thinking we'd have to change a ton of rekey implementations, but it should just degrade if they take const&, which I think most do. At least it seems to compile alright, I guess I'll send it to try after I iron out the setLive issue.

This stuff is stupid subtle. At least we've got an assertion backstopping it.
(In reply to Terrence Cole [:terrence] from comment #16)
> > > +        using Entry = typename HashTableEntry<T>::NonConstT;
> > > +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));
> > 
> > Why does this require a const_cast when it didn't before?
> 
> Because we need to use the move constructor to avoid the barriers. The copy
> takes a const&, so can source from the const Entry here by taking a ref. The
> move constructor cannot take a const, however, so we have to cast it out
> here.

Oh OK, because T is |const KeyType| for HashSet.  So I guess that Move didn't actually do anything before.

> This actually turns up an issue that I didn't catch before though: later in
> this function we remove(*p.entry_) after having moved it. From what I
> understand about rvalue references, this is actually allowed, if not
> particularly clear. Move isn't really "move", it's more like a new kind of
> reference/tag that every one agrees should move, but it doesn't have actual
> poisoning semantics itself. 

Well it's down to the type what happens when you move it.  IIUC post-move the object must be destructible but not release any originally owned resources.

> And indeed, it appears to work fine when used
> like this, as long as none of the rekeyable keys do poisoning on "move".

So this looks bad since we are accessing something after we move it, and even if that works it relies on the implementation of the move constructors not to poison/null the moved object's contents out after the move.

However I then realised that this is not the case since we move the T stored in the HashTableEntry<T>, but we call remove() with a reference to the outer HashTableEntry<T>.

I think this is fine regardless of how move is implemented, but we could change the code to extract a pointer to this first so it is more clear what's happening.
Oh, good point! You're right: it doesn't even matter as long as we're not violating move semantics.
Comment on attachment 8764437 [details] [diff] [review]
use_perfect_forwarding_for_rekey-v0.diff

Review of attachment 8764437 [details] [diff] [review]:
-----------------------------------------------------------------

Cancelling review while waiting for new patch.
Attachment #8764437 - Flags: review?(jcoppeard) → feedback+
Ping .. is there any progess here?
(In reply to Julian Seward [:jseward] from comment #23)
> Ping .. is there any progess here?

Sadly, no. The TreeHerder failures on jstestbrowser were not trivial to reproduce locally and I've been distracted tracking down similar failures in bug 1237058. I'm almost done there though, at which point I can get back to this. Sorry for the delay.
Attached file compartments-after.txt
This is suffering from exactly the same symptoms as those in bug 1237058. Which is *very* odd as the mechanisms being poked are extremely different.

The attachments I've just attached are are from before and after the patches here, running on top of the patch from bug 1285605, piped through grep to collect only the "Compartments Collected" lines, then run through grep -v to strip out the DOM_WORKER "2 of 2" collections.

What's interesting is that even in the before run, our compartments pile up rapidly up to 6000+ strong. There are many, many GC's that are unable to collect these compartments until we happen to trigger one that does collect basically all of them in one swoop. We then build back up to 6000+ again before exiting.

The after run is the same (down to having approximately the same number of GC's), except that there are no collections that free substantial numbers of compartments and we end up collecting over 11000+ compartments before the terminating GC, where we're killed because we take too long.

Oh, the other difference is that the before run takes 4 minutes and the after run takes 11 minutes. Carrying those extra compartments *really* adds up. Exponentially fast even. The following is the GC that clears up all the compartments from the before run:

GC(T+204.028s) =================================================================
  Invocation Kind: Normal
  Reason: COMPONENT_UTILS
  Incremental: no - requested
  Zones Collected: 5 of 5 (-1)
  Compartments Collected: 5249 of 5249 (-5170)
  MinorGCs since last GC: 1
  Store Buffer Overflows: 0
  MMU 20ms:0.0%; 50ms:0.0%
  SCC Sweep Total (MaxPause): 114.209ms (114.209ms)
  HeapSize: 993.664 MiB
  Chunk Delta (magnitude): -731  (731)
  Arenas Relocated: 0.000 MiB
  ---- Totals ----
    Total Time: 711.769ms
    Max Pause: 711.769ms
      Begin Callback: 5881.877ms
      Wait Background Thread: 363.486ms
      Relazify Functions: 34.669ms
      Purge: 0.187ms
      Mark: 33.689ms
        Other: 4.273ms
        Mark Roots: 3.778ms
          Other: 0.443ms
          Mark Cross Compartment Wrappers: 0.110ms
          Mark Rooters: 0.049ms
          Mark Runtime-wide Data: 1.702ms
          Mark Embedding: 0.145ms
          Mark Compartments: 1.329ms
        Unmark: 25.638ms
      Sweep: 278.274ms
        Other: 9.543ms
        Mark During Sweeping: 4.132ms
          Mark Incoming Black Pointers: 0.055ms
          Mark Weak: 0.138ms
          Mark Incoming Gray Pointers: 0.037ms
          Mark Gray: 3.726ms
          Mark Gray and Weak: 0.176ms
        Finalize Start Callbacks: 3.457ms
          Per-Slice Weak Callback: 1.056ms
          Per-Compartment Weak Callback: 2.399ms
        Sweep Atoms: 7.897ms
        Sweep Symbol Registry: 0.049ms
        Sweep Compartments: 65.518ms
          Sweep Discard Code: 4.265ms
          Sweep Inner Views: 0.260ms
          Sweep Cross Compartment Wrappers: 1.525ms
          Sweep Type Objects: 1.051ms
          Sweep Breakpoints: 22.585ms
          Sweep Regexps: 1.337ms
          Sweep Miscellaneous: 395.137ms
          Sweep type information: 30.236ms
            Other: 30.232ms
            Sweep type tables and compilations: 0.002ms
            Free type arena: 0.002ms
        Sweep Object: 78.921ms
        Sweep String: 0.850ms
        Sweep Script: 12.353ms
        Sweep Shape: 45.617ms
        Sweep JIT code: 0.048ms
        Finalize End Callback: 13.593ms
        Deallocate: 36.296ms
      End Callback: 0.101ms
      Minor GCs to Evict Nursery: 1.213ms
        Other: 0.882ms
        Mark Roots: 0.331ms
          Other: 0.318ms
          Mark Rooters: 0.013ms
      Barriers: 0.002ms
        Unmark gray: 0.001ms

The Start Callback is 5 seconds and the total GC time is .7 seconds, so that's weird. The before case does a COMPONENT_UTILS collection 168 times and the after does 177 COMPONENT_UTILS collections, so it's not a matter of how we're triggering the collection.

I still have no idea what's going wrong, but it appears to be something subtle.
See Also: → 1237058
Running |Cu.forceGC(); Cu.forceCC(); Cu.forceGC(); Cu.forceCC();| after ever test does not keep the ghost windows from piling up endlessly.
I lied. I was doing the new GC's in the parent process by accident. Doing the GC's in the content process is keeping the window/compartment count low, as expected. It seems that forceCC() followed by forceGC() is sufficient to flush all the garbage.

Talking to Bill this morning, it seems like the most likely reason that these windows are kept alive is wrapper remapping. We should have a mechanism to address this: ScheduledForDestruction. I need to figure out how that's supposed to work and see if I can figure out why it's not.
COMPARTMENT_REVIVED collections are getting triggered as expected, but they are not releasing any compartments.
(In reply to Terrence Cole [:terrence] from comment #30)
> COMPARTMENT_REVIVED collections are getting triggered as expected, but they
> are not releasing any compartments.

This is because the compartments being held live are held because the [HTMLDocument] wrapper is live because the nsDocument is held in the wrapper cache. COMPARTMENT_REVIVIED collections are only effective at cleaning up compartments that were held live via barriers.
This should all be resolved by bug 1296484.
It looks like bug 1292218 is disabling the collections from 1296484 by disrupting the COMPARTMENT_REVIVED heuristics. I think if we back that out with this landing we should be good. I'll do another try push. Otherwise the actual changes look good now.

Sorry to ask you to review Steve, but Jon is on PTO until next week and I'd like to get this in ASAP.
Attachment #8764437 - Attachment is obsolete: true
Attachment #8786065 - Flags: review?(sphink)
Comment on attachment 8786065 [details] [diff] [review]
use_perfect_forwarding_for_rekey-v1.diff

Review of attachment 8786065 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/public/HashTable.h
@@ +1867,5 @@
>          MOZ_ASSERT(table);
>          mozilla::ReentrancyGuard g(*this);
>          MOZ_ASSERT(p.found());
> +        using Entry = typename HashTableEntry<T>::NonConstT;
> +        Entry copy(mozilla::Move(const_cast<Entry&>(*p)));

I do not like calling this 'copy', given that it's trying hard to *not* copy. 'entry' would work fine for me.

@@ +1873,2 @@
>          remove(*p.entry_);
> +        putNewInfallibleInternal(l, mozilla::Move(copy));

Ouch. Ok, this doesn't look right. l and k may be the same thing. k gets moved out via setKey, and then l is used. These are all references, so if eg Lookup and Key are Vector<U>, then the vector storage may have been stolen away and nulled out.
Attachment #8786065 - Flags: review?(sphink)
Duplicate of this bug: 1150245
Updated rekeying patch with Steve's review comments.
Attachment #8786065 - Attachment is obsolete: true
Here's an alternative approach to removing the race.  This just uses the incremental barrier flag which should always be off while we are collecting.  I think this should work (but please check my logic), and it doesn't seem to provoke the reftest failures on try.
Attachment #8789853 - Flags: review?(terrence)
Comment on attachment 8789853 [details] [diff] [review]
bug844769-alternate

Review of attachment 8789853 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, good observation! Let's try it.
Attachment #8789853 - Flags: review?(terrence) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fda02f7c6a0
Don't access heap state from off main thread r=terrence
Comment on attachment 8789775 [details] [diff] [review]
bug844769-forward-for-rekey v2

Review of attachment 8789775 [details] [diff] [review]:
-----------------------------------------------------------------

I know you didn't ask for review yet, but this beautifully resolves the only issue I had with the patch. Whether you want to land this or not is up to you, but r=me.
Attachment #8789775 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/0fda02f7c6a0
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Since this is marked RESOLVED now and can easily get lost -- what's the state of the other patch? It would still be good to fix the rekeying barriers, if I'm understanding the state of things correctly. Is the patch landable, or should it be forked off to another bug?

needinfo both since Jon is on PTO.
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
Filed bug 1302559 to follow up.
Flags: needinfo?(terrence)
Flags: needinfo?(jcoppeard)
You need to log in before you can comment on or make changes to this bug.