Closed Bug 830100 Opened 11 years ago Closed 11 years ago

Allow destruction of TransportFlows off the STS thread

Categories

(Core :: WebRTC: Networking, defect, P1)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla22
Tracking Status
firefox22 + fixed
firefox23 --- fixed

People

(Reporter: ekr, Assigned: ekr)

References

Details

(Whiteboard: [WebRTC][blocking-webrtc+][qa-])

Attachments

(2 files, 4 obsolete files)

Currently it is unsafe to Release() a TFlow or TLayer off the STS thread because if you are the last holder, you will destroy it and the dtor is not thread-safe. This needs some kind of fixing.
When a fix for this lands, we'll need to back out fixes like bug 827878
Blocks: 827878
Summary: All destruction of TransportFlows off the STS thread → Allow destruction of TransportFlows off the STS thread
Nice to clean up, but wouldn't block going to release
Assignee: nobody → ekr
Priority: -- → P3
Whiteboard: [webrtc][blocking-webrtc-]
Whiteboard: [webrtc][blocking-webrtc-] → [WebRTC][blocking-webrtc+]
Comment on attachment 715440 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 715440 [details] [diff] [review]:
-----------------------------------------------------------------

Tim, Adam,

This patch is a compromise between the right thing (complete thread safety) and the
current thing (no thread safety). It also adds a bunch of asserts to try to detect
API misuse.

Aside from the usual code review, I would be interested in your opinion of whether
this hits the right balance.
Attachment #715440 - Flags: review?(tterribe)
Attachment #715440 - Flags: review?(adam)
Comment on attachment 715440 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 715440 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me. Nits inline.

::: media/mtransport/test/transport_unittests.cpp
@@ +193,5 @@
>  
>      flow_->SignalPacketReceived.connect(this, &TransportTestPeer::PacketReceived);
>    }
>  
> +  void ConnectSocket(TransportTestPeer *peer) {

Now that this is just a thunk to call ConnectSocket_s on the proper thread, I think you want to change the "Dispatch" calls in TransportTest::ConnectSocket and TransportTest::ConnectSocketExpectFail to straight function calls.

@@ +323,5 @@
>    }
>  
> +  TransportLayer::State state() {
> +    TransportLayer::State tstate;
> +    

WS

@@ +327,5 @@
> +    
> +    RUN_ON_THREAD(test_utils->sts_target(),
> +                  WrapRunnableRet(flow_, &TransportFlow::state, &tstate),
> +                  NS_DISPATCH_SYNC);
> +    

WS

::: media/mtransport/transportflow.cpp
@@ +21,5 @@
> +  if (!CheckThreadInt()) {
> +    MOZ_ASSERT(SignalStateChange.is_empty());
> +    MOZ_ASSERT(SignalPacketReceived.is_empty());
> +  }
> +  if (!layers_.empty()) {

Perhaps add a comment that the use of layers_ here is unprotected, and why that's generally going to be okay (either that or be stronger in the header file comment about calling the destructor off the target thread not being safe except in the narrow exceptional case it describes).

@@ +46,5 @@
> +  // Enforce that if any of the layers have a thread binding,
> +  // they all have the same binding.
> +  if (target_) {
> +    const nsCOMPtr<nsIEventTarget>& lthread = layer->GetThread();
> +    

WS

::: media/mtransport/transportflow.h
@@ +22,5 @@
>  // Generally, one reads and writes to the top layer.
> +
> +// This code has a confusing hybrid threading model which
> +// probably needs some eventual refactoring.
> +//

Please add a bug # for the refactoring effort.

@@ +109,5 @@
>    void StateChange(TransportLayer *layer, TransportLayer::State state);
>    void PacketReceived(TransportLayer* layer, const unsigned char *data,
>        size_t len);
> +  static void DestroyLayers(std::deque<TransportLayer *> layers);
> +  

WS

::: media/mtransport/transportlayer.h
@@ +84,5 @@
>  
> +  // Check if we are on the right thread
> +  void CheckThread() {
> +    NS_ABORT_IF_FALSE(CheckThreadInt(), "Wrong thread");
> +  }

Why did this move to public? I can't find where it's used...

@@ +85,5 @@
> +  // Check if we are on the right thread
> +  void CheckThread() {
> +    NS_ABORT_IF_FALSE(CheckThreadInt(), "Wrong thread");
> +  }
> +  

WS

::: media/mtransport/transportlayerloopback.h
@@ -129,5 @@
>    nsresult QueuePacket(const unsigned char *data, size_t len);
>  
>    TransportLayerLoopback* peer_;
>    nsCOMPtr<nsITimer> timer_;
> -  nsCOMPtr<nsIEventTarget> target_;

I presume this is just simple cleanup, unrelated to the bug -- right?
Attachment #715440 - Flags: review?(adam) → review+
Right, so on the broader opinion that you solicited regarding the approach: I think this should be fine until we get around to a more comprehensive refactor, although (as I indicated in my review), I think I'd prefer to see a little more emphatic documentation around the compromise of allowing off-thread access to the layers deque in the flow destructor.
Comment on attachment 715440 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 715440 [details] [diff] [review]:
-----------------------------------------------------------------

This is not too terrible (I say without really understanding all the places a TransportFlow can be derefed), but it all seems a little more complicated than it needs to be. AFAIK the only thread that any flow can be bound to is the STS thread, and baking that assumption in would allow you to get rid of all the target_ tracking code.

::: media/mtransport/transportflow.cpp
@@ +23,5 @@
> +    MOZ_ASSERT(SignalPacketReceived.is_empty());
> +  }
> +  if (!layers_.empty()) {
> +    RUN_ON_THREAD(target_,
> +                  WrapRunnableNM(&TransportFlow::DestroyLayers,layers_),

So, this copies layers_ by value... multiple times... which is actually a fairly expensive operation. In libstdc++ this involves 8 0.5 kB allocations at a minimum, in addition to copying the whole list.

I guess I'm confused why you're using a deque in the first place... you only ever insert items on the front, and never remove them. The only other operation supported is searching for a layer with a specific id. You could simply provide a getter for TransportLayer::downward_ and remove the need for the deque entirely.

::: media/mtransport/transportflow.h
@@ +41,5 @@
> +// restriction by thread-locking the signals, but previous
> +// attempts have caused deadlocks.
> +//
> +// Most of these invariants are enforced by hard asserts
> +// (i.e., those which fire even in production builds).

AFAICT, only the MOZ_CRASH in PushLayer is such a "hard assert". NS_ABORT_IF_FALSE does not abort in debug builds, and neither does MOZ_ASSERT.

::: media/mtransport/transportlayer.h
@@ +102,5 @@
>    // Return the layer id for this layer
>    virtual const std::string id() = 0;
>  
>    // The id of the flow
> +  virtual const std::string& flow_id() {

This doesn't need to be virtual, does it? Presumably a sub-class would just set flow_id instead of overriding.
Attachment #715440 - Flags: review?(tterribe) → review+
Attachment #715440 - Attachment is obsolete: true
Comment on attachment 715440 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 715440 [details] [diff] [review]:
-----------------------------------------------------------------

derf:
Here is a revised patch. I kept the deque and the thread tracking because I am hoping
to use the thread tracking for futureproofing. there is a question below about
whether these fails should be hard or not....

::: media/mtransport/transportflow.cpp
@@ +23,5 @@
> +    MOZ_ASSERT(SignalPacketReceived.is_empty());
> +  }
> +  if (!layers_.empty()) {
> +    RUN_ON_THREAD(target_,
> +                  WrapRunnableNM(&TransportFlow::DestroyLayers,layers_),

I agree that would work. I guess this is a taste issue, since I think it's nice to have it all in one
place. Also, if/when I change these to be RefPtrs, it will make ownership clearer.

But let's do this: I will change it from being a deque being a ScopedDeletePtr. This
removes the copy.

::: media/mtransport/transportflow.h
@@ +41,5 @@
> +// restriction by thread-locking the signals, but previous
> +// attempts have caused deadlocks.
> +//
> +// Most of these invariants are enforced by hard asserts
> +// (i.e., those which fire even in production builds).

You mean production builds, right? 

Anyway, do you think we should fix the comment or make these hard crashes?

::: media/mtransport/transportlayer.h
@@ +102,5 @@
>    // Return the layer id for this layer
>    virtual const std::string id() = 0;
>  
>    // The id of the flow
> +  virtual const std::string& flow_id() {

Fair enough.

::: media/mtransport/transportlayerloopback.h
@@ -129,5 @@
>    nsresult QueuePacket(const unsigned char *data, size_t len);
>  
>    TransportLayerLoopback* peer_;
>    nsCOMPtr<nsITimer> timer_;
> -  nsCOMPtr<nsIEventTarget> target_;

actually, it was shadowing the same member in TLayer.
Attachment #717929 - Flags: review?(tterribe)
Comment on attachment 717929 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 717929 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/test/transport_unittests.cpp
@@ +149,5 @@
>  
> +  void DisconnectDestroyFlow() {
> +    loopback_->Disconnect();
> +    disconnect_all();  // Disconnect from the signals;
> +     flow_ = nullptr;

indent

@@ +151,5 @@
> +    loopback_->Disconnect();
> +    disconnect_all();  // Disconnect from the signals;
> +     flow_ = nullptr;
> +  }
> +    

WS

@@ +329,5 @@
> +    
> +    RUN_ON_THREAD(test_utils->sts_target(),
> +                  WrapRunnableRet(flow_, &TransportFlow::state, &tstate),
> +                  NS_DISPATCH_SYNC);
> +    

WS

::: media/mtransport/transportflow.h
@@ +43,5 @@
> +// restriction by thread-locking the signals, but previous
> +// attempts have caused deadlocks.
> +//
> +// Most of these invariants are enforced by hard asserts
> +// (i.e., those which fire even in production builds).

I think we should make critical assertions like this hard asserts in production code at least through Aurora and likely Beta, though some/many/most of them should be ones we can disable in the future if we want (in production, or after we've been in production and clean for a cycle or two).  See bug 843695 for an example; you can key off the same MOZ_WEBRTC_ASSERT_ALWAYS.

Ones not on main performance paths may be more reasonable to make MOZ_CRASH (or to back up a MOZ_ASSERT with failure returns).
We should get this one reviewed and landed, as it will fix real race conditions we have (see bug 835283)
Updating the priority based on what we now know (should have done this in February when I made this a blocker bug).
Priority: P3 → P1
Comment on attachment 717929 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 717929 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Eric Rescorla (:ekr) from comment #9)
> You mean production builds, right? 

Err, yes. That.

> Anyway, do you think we should fix the comment or make these hard crashes?

That depends on how confident you are we won't violate them in production. Given the level of review I've done, I would personally make them hard crashes instead of fixing the comment.

Anyway, this version of the patch leaks, and bug 844493 has bitrotted it. Please post a new version and ask for re-review.

::: media/mtransport/transportflow.cpp
@@ +32,5 @@
> +                WrapRunnableNM(&TransportFlow::DestroyLayers, layers_.forget()),
> +                NS_DISPATCH_NORMAL);
> +}
> +
> +void TransportFlow::DestroyLayers(std::deque<TransportLayer *>* layers) {

This doesn't appear to be taking a ScopedDeletePointer (or any other kind of smart pointer), so this will leak the deque.

@@ +46,5 @@
>    nsresult rv = layer->Init();
>    if (!NS_SUCCEEDED(rv))
>      return rv;
>  
> +  // Enforce that if any of the layers have a thread binding,

This has bitrotted w.r.t. bug 844493. This check now needs to be in PushLayers() as well.
Attachment #717929 - Flags: review?(tterribe) → review-
Attachment #717929 - Attachment is obsolete: true
Comment on attachment 731896 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 731896 [details] [diff] [review]:
-----------------------------------------------------------------

I believe this addresses's derf's comments, but I'm not operating at 100% this morning.
Attachment #731896 - Flags: review?(tterribe)
Comment on attachment 731896 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 731896 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with a few nits and one important question.

::: media/mtransport/test/transport_unittests.cpp
@@ +151,5 @@
> +    loopback_->Disconnect();
> +    disconnect_all();  // Disconnect from the signals;
> +     flow_ = nullptr;
> +  }
> +    

WS.

@@ +326,5 @@
>    }
>  
> +  TransportLayer::State state() {
> +    TransportLayer::State tstate;
> +    

WS.

@@ +330,5 @@
> +    
> +    RUN_ON_THREAD(test_utils->sts_target(),
> +                  WrapRunnableRet(flow_, &TransportFlow::state, &tstate),
> +                  NS_DISPATCH_SYNC);
> +    

WS.

::: media/mtransport/transportflow.cpp
@@ +58,5 @@
>    }
>  
>    nsresult rv = layer->Init();
>    if (!NS_SUCCEEDED(rv)) {
>      // Set ourselves to have failed.

I should have asked this in bug 844493, but unlike PushLayers(), we don't destroy all existing layers on error. Should we? Otherwise it seems like we run the risk of that layer changing state on its own and overriding the error state. Feel free to address this in a new bug.

@@ +132,5 @@
>      }
>  
>      // Now destroy the rest of the flow, because it's no longer
>      // in an acceptable state.
> +    while (!layers_->empty()) {

It doesn't save much code, but it seems like DestroyLayers() could be re-used here if it also cleared the queue.
Attachment #731896 - Flags: review?(tterribe) → review+
Attachment #731896 - Attachment is obsolete: true
Comment on attachment 734302 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 734302 [details] [diff] [review]:
-----------------------------------------------------------------

Tim,

I did a lot of cleanup here and added some more tests.
Attachment #734302 - Flags: review?(tterribe)
Whiteboard: [WebRTC][blocking-webrtc+] → [WebRTC][blocking-webrtc+][webrtc-uplift]
Comment on attachment 734302 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 734302 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, also thanks for adding more tests.

::: media/mtransport/transportflow.cpp
@@ +60,5 @@
>  }
>  
>  nsresult TransportFlow::PushLayer(TransportLayer *layer) {
> +  CheckThread();
> +  ScopedDeletePtr<TransportLayer> layer_tmp(layer);  // Destroy on failure.

Nice catch!
Attachment #734302 - Flags: review?(tterribe) → review+
Comment on attachment 734302 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

Review of attachment 734302 [details] [diff] [review]:
-----------------------------------------------------------------

Adam, I am having trouble committing from this machine.
Attachment #734302 - Flags: checkin?(adam)
Attachment #734302 - Flags: checkin?(adam) → checkin+
Comment on attachment 735840 [details] [diff] [review]
Avoid deadlocks by sending packets asynchronously

https://tbpl.mozilla.org/?tree=Try&rev=8f04baa38ba1

Could use the flow argument instead of flow_ in SendPacket_s() I assume, but really doesn't matter.
Attachment #735840 - Flags: review?(ekr)
Attachment #735840 - Attachment is obsolete: true
Attachment #735840 - Flags: review?(ekr)
Comment on attachment 735908 [details] [diff] [review]
Avoid deadlocks by sending packets asynchronously

Switch to static function per IRC
Attachment #735908 - Flags: review?(ekr)
Comment on attachment 735908 [details] [diff] [review]
Avoid deadlocks by sending packets asynchronously

Review of attachment 735908 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm with nit below.

::: media/mtransport/test/sctp_unittest.cpp
@@ +196,5 @@
>    int sent() const { return sent_; }
>    int received() const { return received_; }
>    bool connected() const { return connected_; }
>  
> +  static TransportResult SendPacket_s(const unsigned char* data, size_t len, TransportFlow *flow) {

I think this would be better as a RefPtr or a const RefPtr& so it's clear what the
semantics are.
Attachment #735908 - Flags: review?(ekr) → review+
https://hg.mozilla.org/mozilla-central/rev/18282c2c682d
https://hg.mozilla.org/mozilla-central/rev/63f2b10a44e2
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: mozilla22 → mozilla23
Whiteboard: [WebRTC][blocking-webrtc+][webrtc-uplift] → [WebRTC][blocking-webrtc+][webrtc-uplift][qa-]
Comment on attachment 734302 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

[Approval Request Comment]
Bug caused by (feature/regressing bug #): N/A

User impact if declined: Random failures and timing-related crashes.  This bug or some equivalent is needed on Aurora for stability.

Testing completed (on m-c, etc.): on m-c, and will be getting a lot of testing and fuzzing.

Risk to taking this patch (and alternatives if risky): Moderately low - it should only make major changes in the flow when a release would have happened off the STS thread - which would have been a major problem.

String or IDL/UUID changes made by this patch: none
Attachment #734302 - Flags: approval-mozilla-aurora?
Comment on attachment 735908 [details] [diff] [review]
Avoid deadlocks by sending packets asynchronously

[Approval Request Comment]

See other patch in this bug; this needs to be included with it (bustage fix to unit tests)
Attachment #735908 - Flags: approval-mozilla-aurora?
Comment on attachment 734302 [details] [diff] [review]
Refactor transport flow to allow destruction on any thread

We'll also track for release, given this is blocking-webrtc+
Attachment #734302 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #735908 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/73b137330c23
https://hg.mozilla.org/releases/mozilla-aurora/rev/b6d86fda0787
Whiteboard: [WebRTC][blocking-webrtc+][webrtc-uplift][qa-] → [WebRTC][blocking-webrtc+][qa-]
Target Milestone: mozilla23 → mozilla22
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: