Closed
Bug 1211706
Opened 10 years ago
Closed 10 years ago
deadlock of nsSocketTransportService::mLock trying to download large file from google drive
Categories
(Core :: Networking, defect)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla44
People
(Reporter: dbaron, Assigned: mcmanus)
References
Details
(Keywords: regression)
Attachments
(2 files)
29.29 KB,
text/plain
|
Details | |
1.33 KB,
patch
|
u408661
:
review+
|
Details | Diff | Splinter Review |
I was trying to download a large file from google drive, and hit a deadlock of nsSocketTransportService::mLock.
Both the main thread and the timer thread are trying to acquire the lock.
The socket thread has the lock:
#0 0x00007f98284de8dd in poll () at ../sysdeps/unix/syscall-template.S:81
#1 0x00007f98281c5f01 in pt_poll_now (__timeout=5000, __nfds=1, __fds=0x7f98153d16d0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46
#2 0x00007f98281c5f01 in pt_poll_now (op=op@entry=0x7f98153d1748) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:583
#3 0x00007f98281c6148 in pt_Continue (op=op@entry=0x7f98153d1748) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:706
#4 0x00007f98281c6257 in pt_Write (fd=0x25cc5a0, buf=0x7f98281ceca9 <magicChar>, amount=1) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:1341
#5 0x00007f98281b495d in PR_SetPollableEvent (event=<optimized out>) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/io/prpolevt.c:200
#6 0x00007f981995d0de in nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal*) (this=0x25cba00, thread=<optimized out>) at /home/dbaron/builds/ssd/mozilla-central/mozilla/netwerk/base/nsSocketTransportService2.cpp:775
khuey says something like "It's pretty clear that if you poll while holding the lock you're going to have a bad time".
Reporter | ||
Comment 1•10 years ago
|
||
Reporter | ||
Comment 2•10 years ago
|
||
(I was trying to download https://drive.google.com/a/mozilla.com/file/d/0B_0LdM1CVycINGJuS0Zuc0d1VjQ/view?usp=sharing by hitting the download button, which should be accessible to Mozilla employees.)
PR_SetPollableEvent tries to write to the underlying fd to signal it ... in dbaron's case that write fails (probably with EAGAIN) and then we go down this weird continuation path http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptio.c?force=1#1331.
Reporter | ||
Comment 4•10 years ago
|
||
In frame 5 on the socket thread, bytes is 0, amount is 1, and fd->secret->nonblocking is 0. errno is 11 (EAGAIN), although that might not be reliable.
Reporter | ||
Comment 5•10 years ago
|
||
And regarding the socket thread stack in comment 0: frame #1 repeatedly calls frame #0, which waits for 5 seconds (__timeout) and then returns, after which it just does the same again.
(Is the polling waiting for something to happen that requires one of the threads that's trying to acquire the lock?)
Reporter | ||
Comment 6•10 years ago
|
||
(In reply to David Baron [:dbaron] ⌚UTC-7 from comment #2)
> (I was trying to download
> https://drive.google.com/a/mozilla.com/file/d/0B_0LdM1CVycINGJuS0Zuc0d1VjQ/
> view?usp=sharing by hitting the download button, which should be accessible
> to Mozilla employees.)
Er, hit the Download button, and then hit the Download anyway button after the not-virus-scanned warning. I'm 2 for 2 on attempts to repro the bug at this point; it's possible it requires a debug build.
Reporter | ||
Comment 7•10 years ago
|
||
Now 3 for 3, and I have an rr recording of the third try.
Reporter | ||
Comment 8•10 years ago
|
||
(In reply to David Baron [:dbaron] ⌚UTC-7 from comment #7)
> Now 3 for 3, and I have an rr recording of the third try.
And the way to use this rr recording is to break on pt_Continue (since I had to Ctrl+C the recording to end it, which leaves nothing good to break on). It appears to be the only call to pt_Continue during the execution of the process.
Reporter | ||
Comment 9•10 years ago
|
||
I still have the deadlocked browser up-and-running, and jduell said you'd probably be interested, so if there's stuff you want me to look at in gdb, let me know in the morning. I can't leave it running forever. (I have the rr recording as well, but rr has been a pain to work with on this trace.)
Part of what may have been unusual was having very high network speeds; I think it was downloading a >1GB file in ~20 seconds. (That said, I ended up doing the download successfully using distro-packaged firefox (so not debug, and release rather than nightly) on a different machine.)
Flags: needinfo?(mcmanus)
Reporter | ||
Comment 10•10 years ago
|
||
Also, for what it's worth, I can still reproduce the deadlock reliably on my laptop, but I can't reproduce the bug on the desktop right next to it (although there the download fails after a few hundred megabytes... I think after progressing more slowly than on the laptop).
Reporter | ||
Comment 11•10 years ago
|
||
Now I can't repro on my laptop anymore, though I still have the one process still running, and one rr recording with a somewhat broken version of rr.
Comment 12•10 years ago
|
||
dbaron: can I get your rr recording, plus the build you made the recording on? I'd like to dig around and see if I can make any progress on figuring this out.
Assignee | ||
Comment 13•10 years ago
|
||
david - thanks for all the data.. fwiw i was able to repro using the same download link. (haven't even stared at the stacks yet. just clicked on the str :))
Flags: needinfo?(mcmanus)
Reporter | ||
Comment 14•10 years ago
|
||
(In reply to Nicholas Hurley [:nwgh][:hurley] from comment #12)
> dbaron: can I get your rr recording, plus the build you made the recording
> on? I'd like to dig around and see if I can make any progress on figuring
> this out.
rr recordings aren't really transferable between machines; they depend on system libraries and CPU characteristics as well.
Assignee | ||
Comment 15•10 years ago
|
||
2 problems here.
1] is that https://bugzilla.mozilla.org/show_bug.cgi?id=1204614 (landed on firefox 43) creates a billion (approximately) socket thread events and fills the pipe that is used to break the network polling loop in the presence of xpcom events. Deadlock ensues. That patch shouldn't be doing this.
2] The deadlock shouldn't be able to happen either -that pipe isn't meant to be a balanced semaphore (its basically a flag, using a pipe so it works with poll), so its probably fine to do a non-blocking write to it or just tolerate the failure. Hopefully that doesn't require surgery to nspr. This has been a problem since day 1, apparently. Its probably at the root of some occasional mysterious hangs of networking unrelated to the trigger of #1. so its good to know
tomorrow I'll patch up #1 in this bug and figure out what to do about #2.
Assignee: nobody → mcmanus
Comment 17•10 years ago
|
||
[Tracking Requested - why for this release]:
Possibly a regression from bug Bug 1211706
tracking-firefox43:
--- → ?
tracking-firefox44:
--- → ?
Assignee | ||
Comment 18•10 years ago
|
||
My interpretation of #1 in comment 15 might be a little off.
Upon further review I am seeing that the threadEvent pipe fills up with tens of thousands of signals. I had interpreted that to mean that 1204614 was dispatching thousands of things within one dispatch loop. (the loop would clear those signals.. or reading the nspr code 1000 of them anyhow)..
but it turns out that's not true.. there are just a handful (single digits) happening per event loop - totally reasonable. What appears to be happening is that the dispatch loop is not getting a poll() indication for the threadEvent pipe.. so it doesn't clear it. and over time that means it fills up to capacity and deadlocks (problem 2 - which I think is only a theoretical problem if the pipe were working). The reason things work well enough to get to the deadlock is that enough network data is flowing in this example that the threadEvent is superfulous to breaking the poll() in the dispatch loop (real network data is doing it anyhow).
Assignee | ||
Comment 20•10 years ago
|
||
so it turns out PR_Poll() does not necessarily call system poll() (or semantic equivalent) over the the whole file descriptor set if one or members of the set has events pending on the PRFileDesc _above_ the IO layer. For example, the SSL socket provider can make a FD show up as readable or writeable regardless of the state of the IO socket at the bottom of the stack.
When that happens, PR_Poll returns those SSL events without doing the poll() on the other sockets in the set - in this case the event thread pipe was being ignored and servicing of it was getting starved.
That can probably be worked around easily enough in nsSocketTransportService2.cpp but it raises a whole other path of investigation - why is that tight loop happening without the state of the ssl socket changing (and thus giving the threadevent a fighting chance).
Assignee | ||
Comment 21•10 years ago
|
||
Assignee | ||
Comment 22•10 years ago
|
||
Attachment #8671530 -
Flags: review?(hurley)
Assignee | ||
Comment 23•10 years ago
|
||
so the fix for this bug is pretty trivial -
* we need to buffer stream data in between the time the transaction is suspended and the time the h2 flow control for the stream kicks in. this is fine.
* when the transaction is resumed we need to feed that buffered data to the transaction via nsIEvents instead of network i/o. that is mostly fine.
the new bug was that when we were not able to feed the transaction any more data (full pipe, or out of buffered data) we continued to schedule the "push data at the transaction" event. Which creates the spin loop. easy to fix.
The spin loop showed two latent nspr bugs that I will file elsewhere and can be fixed independently
* a full pipe in a pollableEvent can deadlock
* pr_poll() will starve certain members of the fdset that is passed into it if other members always have events above the IO layer
Attachment #8671530 -
Flags: review?(hurley) → review+
Assignee | ||
Comment 24•10 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/b8b6944505bcdf215561bf9296a08ff5da4a29df
bug 1211706 - fix network io loop spin when clearing h2 flow control buffer r=hurley
Assignee | ||
Comment 25•10 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #23)
> The spin loop showed two latent nspr bugs that I will file elsewhere and can
> be fixed independently
> * a full pipe in a pollableEvent can deadlock
> * pr_poll() will starve certain members of the fdset that is passed into it
> if other members always have events above the IO layer
fyi bug 1213139 and bug 1213084
Comment 26•10 years ago
|
||
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla44
Assignee | ||
Comment 28•10 years ago
|
||
this bug was caused by bug 1204614 - which originally landed in 43, but was backed out to only be on 44. So this fix (comment 26 which is on 44) should be the only checkin required.
Comment 29•10 years ago
|
||
Fabulous, thank Patrick.
Untracking for 43 and setting the tracking flag for 44 since this was a recent regression.
You need to log in
before you can comment on or make changes to this bug.
Description
•