Closed Bug 1049801 Opened 7 years ago Closed 7 years ago

Intermittent test_bug949946.html,test_bug1002702.html | application crashed [@ mozilla::ipc::MessageChannel::DispatchOnChannelConnected(int)]

Categories

(Core :: DOM: Workers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- unaffected
firefox33 --- fixed
firefox34 --- fixed
firefox-esr24 --- unaffected
firefox-esr31 --- unaffected
b2g-v1.4 --- unaffected
b2g-v2.0 --- unaffected
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: bkelly)

References

Details

(Keywords: crash, intermittent-failure)

Attachments

(2 files, 9 obsolete files)

1.21 KB, patch
khuey
: review+
Details | Diff | Splinter Review
6.20 KB, patch
bkelly
: review+
Details | Diff | Splinter Review
https://tbpl.mozilla.org/php/getParsedLog.php?id=45346430&tree=Mozilla-Central

b2g_emulator_vm mozilla-central opt test mochitest-8 on 2014-08-06 10:17:55 PDT for push f41a267983c1
slave: tst-linux64-spot-1111

10:25:11     INFO -  203 INFO TEST-START | /tests/dom/workers/test/test_bug1002702.html
10:25:12     INFO -  204 INFO TEST-OK | /tests/dom/workers/test/test_bug1002702.html | took 1695ms
10:25:14     INFO -  205 INFO [Parent 692] WARNING: pipe error (68): Connection reset by peer: file ../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 452
10:25:14     INFO -  206 INFO ############ ErrorPage.js
10:25:14     INFO -  207 INFO System JS : ERROR chrome://specialpowers/content/SpecialPowersObserver.js:96 - NS_ERROR_NOT_INITIALIZED: Component returned failure code: 0xc1f30001 (NS_ERROR_NOT_INITIALIZED) [nsIMessageSender.sendAsyncMessage]
10:25:14     INFO -  208 INFO ############################### browserElementPanning.js loaded
10:25:14     INFO -  209 INFO ######################## BrowserElementChildPreload.js loaded
10:30:48     INFO -  mozcrash INFO | Downloading symbols from: http://pvtbuilds.pvt.build.mozilla.org/pub/mozilla.org/b2g/tinderbox-builds/mozilla-central-emulator/20140806090025/b2g-34.0a1.en-US.android-arm.crashreporter-symbols.zip
10:31:09     INFO -  DeviceRunner TEST-UNEXPECTED-FAIL | /tests/dom/workers/test/test_bug1002702.html | application timed out after 330.0 seconds with no output
10:31:09  WARNING -  PROCESS-CRASH | /tests/dom/workers/test/test_bug1002702.html | application crashed [@ mozilla::ipc::MessageChannel::DispatchOnChannelConnected(int)]
10:31:09     INFO -  Crash dump filename: /tmp/tmp70P6mW/5e89afa6-cacf-6237-245e6783-65ddee11.dmp
10:31:09     INFO -  Operating system: Android
10:31:09     INFO -                    0.0.0 Linux 2.6.29-g41a03df #22 Thu Jun 26 10:59:09 CST 2014 armv7l Android/full/generic:4.0.4.0.4.0.4/OPENMASTER/eng.cltbld.20140806.122837:eng/test-keys
10:31:09     INFO -  CPU: arm
10:31:09     INFO -       0 CPUs
10:31:09     INFO -  Crash reason:  SIGSEGV
10:31:09     INFO -  Crash address: 0x5a5a5a5e
10:31:09     INFO -  Thread 24 (crashed)
10:31:09     INFO -   0  libxul.so!mozilla::ipc::MessageChannel::DispatchOnChannelConnected(int) [WeakPtr.h : 93 + 0x2]
10:31:09     INFO -       r4 = 0x4372f1a0    r5 = 0x43187be0    r6 = 0x460bfb60    r7 = 0x4372f1ac
10:31:09     INFO -       r8 = 0x460bfb58    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb28    lr = 0x40bfe5d1    pc = 0x40bfe3e4
10:31:09     INFO -      Found by: given as instruction pointer in context
10:31:09     INFO -   1  libxul.so!RunnableMethod<IPC::Channel, bool (IPC::Channel::*)(IPC::Message*), Tuple1<IPC::Message*> >::Run() [tuple.h:f41a267983c1 : 393 + 0x7]
10:31:09     INFO -       r4 = 0x4372f1a0    r5 = 0x43187be0    r6 = 0x460bfb60    r7 = 0x4372f1ac
10:31:09     INFO -       r8 = 0x460bfb58    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb30    pc = 0x40bfe5d1
10:31:09     INFO -      Found by: call frame info
10:31:09     INFO -   2  libxul.so!MessageLoop::RunTask(Task*) [message_loop.cc:f41a267983c1 : 357 + 0x5]
10:31:09     INFO -       r4 = 0x4372f1a0    r5 = 0x43187be0    r6 = 0x460bfb60    r7 = 0x4372f1ac
10:31:09     INFO -       r8 = 0x460bfb58    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb38    pc = 0x40bf3941
10:31:09     INFO -      Found by: call frame info
10:31:09     INFO -   3  libxul.so!MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [message_loop.cc:f41a267983c1 : 365 + 0x5]
10:31:09     INFO -       r4 = 0x00000001    r5 = 0x460bfb50    r6 = 0x460bfb60    r7 = 0x4372f1ac
10:31:09     INFO -       r8 = 0x460bfb58    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb48    pc = 0x40bf759f
10:31:09     INFO -      Found by: call frame info
10:31:09     INFO -   4  libxul.so!MessageLoop::DoWork() [message_loop.cc:f41a267983c1 : 443 + 0x7]
10:31:09     INFO -       r4 = 0x4372f1a0    r5 = 0x460bfb50    r6 = 0x460bfb60    r7 = 0x4372f1ac
10:31:09     INFO -       r8 = 0x460bfb58    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb50    pc = 0x40bf815d
10:31:09     INFO -      Found by: call frame info
10:31:09     INFO -   5  libxul.so!mozilla::ipc::DoWorkRunnable::Run() [MessagePump.cpp:f41a267983c1 : 233 + 0x7]
10:31:09     INFO -       r4 = 0x4372f1a0    r5 = 0x00000001    r6 = 0x43151740    r7 = 0xffffffff
10:31:09     INFO -       r8 = 0x460bfbdf    r9 = 0x00000000   r10 = 0x00000000    fp = 0x00000001
10:31:09     INFO -       sp = 0x460bfb80    pc = 0x40c00623
10:31:09     INFO -      Found by: call frame info
This appears to be MessageChannel::DispatchOnChannelConnected() firing asynchronously after the MessageChannel has been destructed.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Here's a stab at allowing the DispatchOnChannelConnected runnable task to be canceled when the MessageChannel is destroyed.

https://tbpl.mozilla.org/?tree=Try&rev=d8672b248575
*sad trombone*

Let's try again, this time with better nsRefPtr usage.

https://tbpl.mozilla.org/?tree=Try&rev=7eef6d41e6eb
Attachment #8469701 - Attachment is obsolete: true
This is a slightly better version of the patch that should avoid DispatchOnChannelConnected() racing with Clear() at the cost of a small amount of memory.

https://tbpl.mozilla.org/?tree=Try&rev=b45449e71216

I think this patch should be pretty solid at this point, so flagging for review.
Attachment #8469782 - Attachment is obsolete: true
Attachment #8470083 - Flags: review?(bent.mozilla)
It seems this patch could also fix bug 1040288 which might be the same problem, but with a different crash point.
Comment on attachment 8470083 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v2)

Review of attachment 8470083 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +1424,5 @@
>  MessageChannel::OnChannelConnected(int32_t peer_id)
>  {
> +    // OnChannelConnected() should only ever be called once.
> +    MOZ_ASSERT(!mOnChannelConnectedTask);
> +    mOnChannelConnectedTask = new RefCountedTask(NewRunnableMethod(

I don't think this is safe, you're setting something on this thread (should be the IO thread right? We should have an assertion here...) and then reading/modifying it on the other thread in Clear().
Attachment #8470083 - Flags: review?(bent.mozilla)
Here's a new version that uses locking to safely store and then cancel the mOnChannelConnectedTask.  Since we only get connected once per channel, I think the overhead should be fine.

Do follow-up on our IRC conversation, I believe we are guaranteed that MessageChannel exists when MessageChannel::OnChannelConnected() is called.  This method is called from ProcessLink::OnChannelConnected(), but MessageChannel deletes the link prior to destructing itself.  So the channel must exist at this point.

I also added a check in ProcessChannel::OnChannelConnected() to avoid forcing the channel back to the connected state if its already errored out of the opening state.  This just seemed wrong to me.

https://tbpl.mozilla.org/?tree=Try&rev=120f04be4d5d
Attachment #8470083 - Attachment is obsolete: true
Attachment #8470278 - Flags: review?(bent.mozilla)
Comment on attachment 8470278 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v3)

Well, this patch is no good as the monitor is not available in Clear() if the actor was never opened.
Attachment #8470278 - Flags: review?(bent.mozilla)
Comment on attachment 8470347 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v4)

Try looks quite good for this patch.
Attachment #8470347 - Flags: review?(bent.mozilla)
Hmm, the stack in comment 13 looks like a different problem. :-\
https://tbpl.mozilla.org/php/getParsedLog.php?id=45880696&tree=Mozilla-Inbound

Really hoping we can get a review here soon. This seems to be hitting with pretty high frequency these days.
Flags: needinfo?(bent.mozilla)
Summary: Intermittent test_bug1002702.html | application crashed [@ mozilla::ipc::MessageChannel::DispatchOnChannelConnected(int)] → Intermittent test_bug949946.html,test_bug1002702.html | application crashed [@ mozilla::ipc::MessageChannel::DispatchOnChannelConnected(int)]
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #17)
> Really hoping we can get a review here soon. This seems to be hitting with
> pretty high frequency these days.

The needinfo flag does not increase my review bandwidth.
Flags: needinfo?(bent.mozilla)
Rebased to latest m-c.  Also, here's a new try so we can see if it helps with the latest orange we are seeing:

  https://tbpl.mozilla.org/?tree=Try&rev=29a52763fd81

(Not sure why those extra commits showed up in the log on the try...)
Attachment #8470347 - Attachment is obsolete: true
Attachment #8470347 - Flags: review?(bent.mozilla)
Attachment #8473043 - Flags: review?(bent.mozilla)
Botched the rebase in the last version.

https://tbpl.mozilla.org/?tree=Try&rev=4af2a73c3369
Attachment #8473043 - Attachment is obsolete: true
Attachment #8473043 - Flags: review?(bent.mozilla)
Attachment #8473117 - Flags: review?(bent.mozilla)
Hmm, it seems this patch introduces a MessageChannel leak somehow.  I'll need more time to investigate.
Attachment #8473117 - Flags: review?(bent.mozilla)
My current theory is that GeckoChildProcessHost does not behave well if its OnChannelConnected() method is never called.  It sets its mProcessState member there which then influences various other paths.

Going to add some debug to a try build to investigate that.
Actually, comparing my orange try run to a previous green result on tbpl, I can see that there are no additional leaks.  The pre-existing code leaks 4964 bytes, while my new patch results in a leak of 5084 bytes.  The increase is solely due to the increased size of MessageChannel and does not reflect any new object leaks.

Kyle, is there some way I can increase the allowed number of leaked bytes here?
Flags: needinfo?(khuey)
Comment on attachment 8473117 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v6)

Reflagging for review as I don't see a real leak here.
Attachment #8473117 - Flags: review?(bent.mozilla)
Nathan pointed me towards the pref on IRC.  This one line patch increases the leak threshold to account for the already-leaked objects increasing slightly in size.
Attachment #8473842 - Flags: review?(khuey)
Flags: needinfo?(khuey)
The try build looks very good.  149 triggers without a failure on b2g emulator debug M11.
Comment on attachment 8473117 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v6)

Review of attachment 8473117 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +252,5 @@
> +    mOnChannelConnectedTask = new RefCountedTask(NewRunnableMethod(
> +        this,
> +        &MessageChannel::DispatchOnChannelConnected));
> +
> +

Nit: Only need one newline here.

@@ +1499,5 @@
>  void
>  MessageChannel::OnChannelConnected(int32_t peer_id)
>  {
> +    {
> +      MonitorAutoLock lock(*mMonitor);

I don't believe it's necessary to use a lock here or in DispatchOnChannelConnected since a) mPeerPid is only set once before the mOnChannelConnectedTask is posted and b) mPeerPid is only checked once when mOnChannelConnectedTask actually runs.

::: ipc/glue/MessageLink.cpp
@@ +348,5 @@
>          MonitorAutoLock lock(*mChan->mMonitor);
> +        // If the channel has errored out for some reason, do not force
> +        // it back into the connected state.
> +        if (NS_WARN_IF(mChan->mChannelState != ChannelOpening)) {
> +          return;

I'm not sure returning here is correct... Is the previous listener guaranteed to already have been notified? (mExistingListener)
Yes, I think we need to call mExistingListener->OnChannelConnected() there.  Here's an updated patch that does this.

  https://tbpl.mozilla.org/?tree=Try&rev=7e4595209f73

This may require another tweak to the allowed leak on b2g as it has changed slightly since P1 was last tested.  This is also not tested locally as I need to do a clobber build.
Attachment #8473117 - Attachment is obsolete: true
Attachment #8473117 - Flags: review?(bent.mozilla)
Attachment #8476068 - Flags: review?(bent.mozilla)
Comment on attachment 8476068 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v7)

Review of attachment 8476068 [details] [diff] [review]:
-----------------------------------------------------------------

::: ipc/glue/MessageChannel.cpp
@@ +1512,5 @@
> +    if (!mListener) {
> +      return;
> +    }
> +
> +    MOZ_ASSERT(mPeerPidSet);

Nit: Might as well always assert this, put it above the early-return.

::: ipc/glue/MessageLink.cpp
@@ +349,3 @@
>      {
>          MonitorAutoLock lock(*mChan->mMonitor);
> +        // Only update channel state if its still thinks its opening.  Don not

Nit: typo "Don"

@@ +349,5 @@
>      {
>          MonitorAutoLock lock(*mChan->mMonitor);
> +        // Only update channel state if its still thinks its opening.  Don not
> +        // force it into connected if it has errored out.
> +        if (mChan->mChannelState == ChannelOpening) {

Can you assert that mChannelState is ChannelError here if it's not ChannelOpening?
Attachment #8476068 - Flags: review?(bent.mozilla) → review+
> @@ +349,5 @@
> >      {
> >          MonitorAutoLock lock(*mChan->mMonitor);
> > +        // Only update channel state if its still thinks its opening.  Don not
> > +        // force it into connected if it has errored out.
> > +        if (mChan->mChannelState == ChannelOpening) {
> 
> Can you assert that mChannelState is ChannelError here if it's not
> ChannelOpening?

After looking at this more closely to see if ChannelError is safe to assert, I'm fairly convinced this can't happen.  We either SynchronouslyClose() before setting ChannelError or the error is coming up from the link.

So this patch just asserts we are in ChannelOpening state and removes the conditional logic I previously added.

I do want to see another try after this change, though, as its hard to tell all the corner cases.

https://tbpl.mozilla.org/?tree=Try&rev=5a32a9f04f21
Attachment #8476068 - Attachment is obsolete: true
Attachment #8476371 - Flags: review+
Blocks: 1013571
That try showed we clearly cannot assert ChannelOpening there.

I'm trying just dropping the changes to ProcessLink::OnChannelConnected() completely since they should not be strictly necessary for solving this bug:

  https://tbpl.mozilla.org/?tree=Try&rev=0097196f942f
Ok, that last try hit a duplicate call to OnChannelConnected() from the link.  Either we need to just add back the conditional or possibly calling the mExistingListener is wrong.

This adds back the conditional, but also always calls mExistingListener:

  https://tbpl.mozilla.org/?tree=Try&rev=7385a4beaa15
Attachment #8476068 - Attachment is obsolete: false
Attachment #8476371 - Attachment is obsolete: true
Just to clarify, the try in comment 47 is patch v7 originally r+'d in comment 43.
Update the patch to fix the review nits, but not attempt to assert ChannelOpening or ChannelError in ProcessLink::OnChannelConnected.  Instead, I expanded the comment to indicate we're not forcing the connected state if the channel has errored, has started closing, or otherwise changed state.

The last try is still running and so far looks good.  Getting a jump on this one:

  https://tbpl.mozilla.org/?tree=Try&rev=c1039023f525
Attachment #8476068 - Attachment is obsolete: true
Attachment #8476722 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/327482e1a672
https://hg.mozilla.org/mozilla-central/rev/909a529527a7
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
The failure was B2G-only, but the code is cross-platform? Should we get this on Aurora33 as well?
Comment on attachment 8476722 [details] [diff] [review]
Cancel the DispatchOnChannelConnected runnable when destructing the MessageChannel. (v9)

Approval Request Comment
[Feature/regressing bug #]: Bug 1013571
[User impact if declined]: Race condition in worker startup/teardown leading to use-after-free.  Only effects e10s mode, but I think it would be prudent to uplift to avoid potential crashes.
[Describe test coverage new/current, TBPL]:  This was caught by the pre-existing test dom/workers/test/test_bug1002702.html
[Risks and why]: I believe this is low risk.
[String/UUID change made/needed]: None
Attachment #8476722 - Flags: approval-mozilla-aurora?
Comment on attachment 8473842 [details] [diff] [review]
P1 Increase allowed leak threshold in tests to account for increased MessageChannel size. (v0)

Needed for P2.  See approval request for P2.
Attachment #8473842 - Flags: approval-mozilla-aurora?
Attachment #8476722 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8473842 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/597626afe5e1

No B2G leak testing on 33, so no need for that patch :)
Duplicate of this bug: 1040288
You need to log in before you can comment on or make changes to this bug.