Content process hang on HttpBaseChannel::ExplicitSetUploadStream

RESOLVED FIXED in Firefox 54

Status

()

defect
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: dx, Assigned: baku)

Tracking

52 Branch
mozilla54
Points:
---

Firefox Tracking Flags

(firefox54 fixed)

Details

(Whiteboard: [necko-active])

Attachments

(3 attachments, 1 obsolete attachment)

Version: 52.0a2 (20170101004005)

The main thread of the content process hung while attempting to POST to the twitter csp_report URL, apparently originated by the twitter push service worker. Shortly afterwards (switching tabs a few times noticing that the tab contents weren't rendered), the main process also hung. I had some tabs from twitter.com and one from mobile.twitter.com

Attaching a "thread apply all bt full" of the content process where the twitter stuff I mentioned is visible.

I was unable to submit a SIGABRT crash report or a proper backtrace of the main process because another firefox process got started while I was looking at the content process with gdb, which ran the auto-updater and replaced all the binaries. Sorry about that. The version I stated above is also my guess based on the update history (second to last entry)

The backtrace I got from the other process looks corrupt but I'm attaching it anyway as "main-process-corrupt-backtrace.txt" in case it's meaningful.

Backtrace of main thread:

Thread 1 "Web Content" received signal SIGINT, Interrupt.
0x00007f2fe9ff310f in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f2fe9ff310f in pthread_cond_wait@@GLIBC_2.3.2 () at /usr/lib/libpthread.so.0
#1  0x00007f2fe9dd16ad in PR_WaitCondVar (cvar=0x7f2de14fe0c0, timeout=4294967295) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/nsprpub/pr/src/pthreads/ptsynch.c:396
#2  0x00007f2fe67dd151 in mozilla::CondVar::Wait(unsigned int) (aInterval=4294967295, this=0x7f2ddb2b8c80) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/obj-firefox/dist/include/mozilla/CondVar.h:79
#3  0x00007f2fe67dd151 in mozilla::Monitor::Wait(unsigned int) (aInterval=4294967295, this=0x7f2ddb2b8c78) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/obj-firefox/dist/include/mozilla/Monitor.h:40
#4  0x00007f2fe67dd151 in mozilla::dom::(anonymous namespace)::RemoteInputStream::ReallyBlockAndWaitForStream() (this=0x7f2ddb2b8c50) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/ipc/Blob.cpp:1117
#5  0x00007f2fe67e1549 in mozilla::dom::(anonymous namespace)::RemoteInputStream::BlockAndWaitForStream() (this=this@entry=0x7f2ddb2b8c50) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/ipc/Blob.cpp:1099
#6  0x00007f2fe67e17b5 in mozilla::dom::(anonymous namespace)::RemoteInputStream::Available(uint64_t*) (this=0x7f2ddb2b8c50, aAvailable=0x7ffe95ee5858) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/ipc/Blob.cpp:1194
#7  0x00007f2fe59bc7ef in mozilla::net::HttpBaseChannel::ExplicitSetUploadStream(nsIInputStream*, nsACString_internal const&, long, nsACString_internal const&, bool) (this=0x7f2c76469828, aStream=0x7f2ddb2b8c50, aContentType=..., aContentLength=-1, aMethod=..., aStreamHasHeaders=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/netwerk/protocol/http/HttpBaseChannel.cpp:802
#8  0x00007f2fe64ef8c3 in mozilla::dom::FetchDriver::HttpFetch() (this=this@entry=0x7f2e02ff41c0) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/fetch/FetchDriver.cpp:355
#9  0x00007f2fe64efb4e in mozilla::dom::FetchDriver::Fetch(mozilla::dom::FetchDriverObserver*) (this=this@entry=0x7f2e02ff41c0, aObserver=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/fetch/FetchDriver.cpp:96
#10 0x00007f2fe64efcc9 in mozilla::dom::MainThreadFetchRunnable::Run() (this=0x7f2db8174540) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/dom/fetch/Fetch.cpp:174
#11 0x00007f2fe74a9008 in mozilla::ThrottledEventQueue::Inner::ExecuteRunnable() (this=0x7f2e06381b00) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/xpcom/threads/ThrottledEventQueue.cpp:161
#12 0x00007f2fe74a9008 in mozilla::ThrottledEventQueue::Inner::Executor::Run() (this=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/xpcom/threads/ThrottledEventQueue.cpp:74
#13 0x00007f2fe74aa02b in nsThread::ProcessNextEvent(bool, bool*) (this=0x7f2fd7677f20, aMayWait=<optimized out>, aResult=0x7ffe95ee5d30) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/xpcom/threads/nsThread.cpp:1216
#14 0x00007f2fe74bcb79 in NS_ProcessNextEvent(nsIThread*, bool) (aThread=aThread@entry=0x7f2fd7677f20, aMayWait=aMayWait@entry=true) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/xpcom/glue/nsThreadUtils.cpp:361
#15 0x00007f2fe751ac8c in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (this=0x7f2fd764cd80, aDelegate=0x7ffe95ee5f20) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/glue/MessagePump.cpp:124
#16 0x00007f2fe7e7e12e in MessageLoop::RunInternal() (this=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:232
#17 0x00007f2fe7e7e12e in MessageLoop::RunHandler() (this=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:225
#18 0x00007f2fe7e7e12e in MessageLoop::Run() (this=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:205
#19 0x00007f2fe69006d4 in nsBaseAppShell::Run() (this=0x7f2fcb2772e0) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/widget/nsBaseAppShell.cpp:156
#20 0x00007f2fe6df2ddd in XRE_RunAppShell() () at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/toolkit/xre/nsEmbedFunctions.cpp:869
#21 0x00007f2fe7e7e12e in MessageLoop::RunInternal() (this=0x7ffe95ee5f20) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:232
#22 0x00007f2fe7e7e12e in MessageLoop::RunHandler() (this=0x7ffe95ee5f20) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:225
#23 0x00007f2fe7e7e12e in MessageLoop::Run() (this=this@entry=0x7ffe95ee5f20) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/base/message_loop.cc:205
#24 0x00007f2fe6df353b in XRE_InitChildProcess(int, char**, XREChildData const*) (aArgc=<optimized out>, aArgv=<optimized out>, aChildData=<optimized out>) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/toolkit/xre/nsEmbedFunctions.cpp:701
#25 0x00000000004100d1 in content_process_main(int, char**) (argc=9, argv=0x7ffe95ee6218) at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/app/../contentproc/plugin-container.cpp:197
#26 0x00007f2fe3e27291 in __libc_start_main () at /usr/lib/libc.so.6
#27 0x000000000040f901 in _start ()
Component: Untriaged → Networking
Product: Firefox → Core
Ben could this be resolved by 1328761?
Flags: needinfo?(bkelly)
Forgot to mention in the description, bug 711883 looks relevant (it's about doing that RemoteInputStream::Available call in the main thread, talks about it like an optimization)

1328761 is not related IMO, but I'm not really experienced with this stuff. It talks about similar things but it doesn't look like a main thread hang, and I don't see anything in the patches stopping the RemoteInputStream::Available call from happening in the main thread. It also doesn't seem to get anywhere near https://dxr.mozilla.org/mozilla-aurora/rev/fd66a1e49411faca500b26ba9ca7dd5a311ff6e5/dom/fetch/FetchDriver.cpp#365 (the link is not the exact revision)
As dx said, I don't think they are related.

AFAICT this stack is:

1) A worker thread is doing `fetch(url, { method:'POST', body: blob })`
2) This bounces to the main thread to perform the actual network operation
3) We then try to get the size of the blob
4) In this case the blob is a `RemoteInputStream`
5) The `RemoteInputStream` in this case does have its mStream set yet and must wait for it to be set.

Note we don't hit this case because the owning thread is the worker thread:

https://dxr.mozilla.org/mozilla-central/source/dom/file/ipc/Blob.cpp#982

Arguably that check should take place where we are on the owning thread or not.

Also, we could theoretically pass in the length from the blob to SetExplicitUploadStream() from FetchDriver.  That would avoid calling Available() here.

What I don't understand, though, is why mStream never shows up.  I could understand some jank, but it seems it should not deadlock forever.

Andrea, what do you think about the various ideas above and possible other stuff going on here?
Flags: needinfo?(bkelly) → needinfo?(amarchesini)
What is happening here is:

0. we have remote blob, owned by a worker thread.
1. InputStreamChild::Recv__delete__() should be called by a normal runnable dispatched from the IPC thread, but it's not because:
2. the worker is busy doing a sync event loop.
3. This sync event loop exists because we are constructing a URL object (thread 65)
4. but the URL object cannot be constructed because the main-thread is blocked by ReallyBlockAndWaitForStream.

Basically, it's a dead-lock.

This is going away when the URL parser will be thread-safe (servo \o/). However, this race condition can be reproduced with any sync event loop. The real fix will be to get rid of ReallyBlockAndWaitForStream. This seems to be a big task.
Flags: needinfo?(amarchesini)
Andrea, can we reasonably get the length of the blob in FetchDriver so we can avoid needing the Available() here?  I'm not familiar when blob lengths are available, etc.
Flags: needinfo?(amarchesini)
Posted patch lock.patch (obsolete) — Splinter Review
This could be a temporary fix.
Flags: needinfo?(amarchesini)
Attachment #8828443 - Flags: review?(bkelly)
Comment on attachment 8828443 [details] [diff] [review]
lock.patch

Review of attachment 8828443 [details] [diff] [review]:
-----------------------------------------------------------------

I don't really like spinning the event loop here, but its better than potentially deadlocking the main thread.

Olli, are you ok with this?
Attachment #8828443 - Flags: review?(bkelly)
Attachment #8828443 - Flags: review+
Attachment #8828443 - Flags: feedback?(bugs)
Comment on attachment 8828443 [details] [diff] [review]
lock.patch

So are all the callers ok with spinning event loop? If so, then fine.
And I assume this doesn't expose spinning in any way to JS. Like JS code does not start to run at unexpected times.
Attachment #8828443 - Flags: feedback?(bugs)
Assignee: nobody → amarchesini
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/25c65ac95fc5
Spin the event loop instead blocking in RemoteInputStream::ReallyBlockAndWaitForStream, r=bkelly
sorry had to back this out for test faiilures like https://treeherder.mozilla.org/logviewer.html#?job_id=70585308&repo=mozilla-inbound
Flags: needinfo?(amarchesini)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8c38fdf34cef
Backed out changeset 25c65ac95fc5 for test_blob_worker_xhr_read.html | Test timed outs
Whiteboard: [necko-active]
Are we sure this patch is actually correct?  I really don't like to see a nested eq looping when not at the even queue root.  Is the code calling this method always reentrant ready?  I'd rather fix this correctly than doing this quick and dirty hack...
Posted patch blob3.patchSplinter Review
Here a completely different approach. But before I need to give a bit of introduction:

this crash shows the main-thread blocked and waiting for an PBlobStream actor. This actor is supposed to be created on a worker thread. So, definitely we are dealing with a PBlob owned by PBackground (PContent is main-thread only and we don't expose PBlob to any root IPC protocol).
The issue here is: how is possible that we don't have a PBlob actor on the main-thread, but we have it on the worker? What I suspect is this:
https://dxr.mozilla.org/mozilla-central/source/dom/base/StructuredCloneHolder.cpp#503
When we send a Blob with StructuredCloneAlgorithm we ensure that we have a PBackground-PBlob actor on the destination context (thread/process). But:
https://dxr.mozilla.org/mozilla-central/source/dom/base/StructuredCloneHolder.cpp#511 this can fail in case PBackground is not initialized yet on the current thread (main-thread basically).
If this happens, PBackground-PBlob doesn't exist on the main-thread, and for this reason we end up exposing the existing the original BlobImpl to the destination thread. On average this is fine. But probably here it is not.
This patch checks if we can create a PBlob actor for the current PBackground thread starting for the RemoteImpl. If we can, we don't need any lock.
The best would be to have PBackground active on the main-thread always, but this is a separate bug.
Attachment #8828443 - Attachment is obsolete: true
Flags: needinfo?(amarchesini)
Attachment #8838320 - Flags: review?(bkelly)
Attachment #8838320 - Flags: review?(bkelly) → review+
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/4becea6f05df
Create the PBlob actor on current thread if doesn't exist yet, r=bkelly
https://hg.mozilla.org/mozilla-central/rev/4becea6f05df
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
You need to log in before you can comment on or make changes to this bug.