1211706 - deadlock of nsSocketTransportService::mLock trying to download large file from google drive

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Description

•

10 years ago

I was trying to download a large file from google drive, and hit a deadlock of nsSocketTransportService::mLock. Both the main thread and the timer thread are trying to acquire the lock. The socket thread has the lock: #0 0x00007f98284de8dd in poll () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007f98281c5f01 in pt_poll_now (__timeout=5000, __nfds=1, __fds=0x7f98153d16d0) at /usr/include/x86_64-linux-gnu/bits/poll2.h:46 #2 0x00007f98281c5f01 in pt_poll_now (op=op@entry=0x7f98153d1748) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:583 #3 0x00007f98281c6148 in pt_Continue (op=op@entry=0x7f98153d1748) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:706 #4 0x00007f98281c6257 in pt_Write (fd=0x25cc5a0, buf=0x7f98281ceca9 <magicChar>, amount=1) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/pthreads/ptio.c:1341 #5 0x00007f98281b495d in PR_SetPollableEvent (event=<optimized out>) at /home/dbaron/builds/ssd/mozilla-central/mozilla/nsprpub/pr/src/io/prpolevt.c:200 #6 0x00007f981995d0de in nsSocketTransportService::OnDispatchedEvent(nsIThreadInternal*) (this=0x25cba00, thread=<optimized out>) at /home/dbaron/builds/ssd/mozilla-central/mozilla/netwerk/base/nsSocketTransportService2.cpp:775 khuey says something like "It's pretty clear that if you poll while holding the lock you're going to have a bad time".

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 1

•

10 years ago

Attached file gdb session — Details

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 2

•

10 years ago

(I was trying to download https://drive.google.com/a/mozilla.com/file/d/0B_0LdM1CVycINGJuS0Zuc0d1VjQ/view?usp=sharing by hitting the download button, which should be accessible to Mozilla employees.)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 3

•

10 years ago

PR_SetPollableEvent tries to write to the underlying fd to signal it ... in dbaron's case that write fails (probably with EAGAIN) and then we go down this weird continuation path http://mxr.mozilla.org/mozilla-central/source/nsprpub/pr/src/pthreads/ptio.c?force=1#1331.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 4

•

10 years ago

In frame 5 on the socket thread, bytes is 0, amount is 1, and fd->secret->nonblocking is 0. errno is 11 (EAGAIN), although that might not be reliable.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 5

•

10 years ago

And regarding the socket thread stack in comment 0: frame #1 repeatedly calls frame #0, which waits for 5 seconds (__timeout) and then returns, after which it just does the same again. (Is the polling waiting for something to happen that requires one of the threads that's trying to acquire the lock?)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 6

•

10 years ago

(In reply to David Baron [:dbaron] ⌚UTC-7 from comment #2) > (I was trying to download > https://drive.google.com/a/mozilla.com/file/d/0B_0LdM1CVycINGJuS0Zuc0d1VjQ/ > view?usp=sharing by hitting the download button, which should be accessible > to Mozilla employees.) Er, hit the Download button, and then hit the Download anyway button after the not-virus-scanned warning. I'm 2 for 2 on attempts to repro the bug at this point; it's possible it requires a debug build.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 7

•

10 years ago

Now 3 for 3, and I have an rr recording of the third try.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 8

•

10 years ago

(In reply to David Baron [:dbaron] ⌚UTC-7 from comment #7) > Now 3 for 3, and I have an rr recording of the third try. And the way to use this rr recording is to break on pt_Continue (since I had to Ctrl+C the recording to end it, which leaves nothing good to break on). It appears to be the only call to pt_Continue during the execution of the process.

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 9

•

10 years ago

I still have the deadlocked browser up-and-running, and jduell said you'd probably be interested, so if there's stuff you want me to look at in gdb, let me know in the morning. I can't leave it running forever. (I have the rr recording as well, but rr has been a pain to work with on this trace.) Part of what may have been unusual was having very high network speeds; I think it was downloading a >1GB file in ~20 seconds. (That said, I ended up doing the download successfully using distro-packaged firefox (so not debug, and release rather than nightly) on a different machine.)

Flags: needinfo?(mcmanus)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 10

•

10 years ago

Also, for what it's worth, I can still reproduce the deadlock reliably on my laptop, but I can't reproduce the bug on the desktop right next to it (although there the download fails after a few hundred megabytes... I think after progressing more slowly than on the laptop).

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 11

•

10 years ago

Now I can't repro on my laptop anymore, though I still have the one process still running, and one rr recording with a somewhat broken version of rr.

u408661

Comment 12

•

10 years ago

dbaron: can I get your rr recording, plus the build you made the recording on? I'd like to dig around and see if I can make any progress on figuring this out.

Patrick McManus [:mcmanus]

Assignee

Comment 13

•

10 years ago

david - thanks for all the data.. fwiw i was able to repro using the same download link. (haven't even stared at the stacks yet. just clicked on the str :))

Flags: needinfo?(mcmanus)

David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla)

Reporter

Comment 14

•

10 years ago

(In reply to Nicholas Hurley [:nwgh][:hurley] from comment #12) > dbaron: can I get your rr recording, plus the build you made the recording > on? I'd like to dig around and see if I can make any progress on figuring > this out. rr recordings aren't really transferable between machines; they depend on system libraries and CPU characteristics as well.

Patrick McManus [:mcmanus]

Assignee

Comment 15

•

10 years ago

2 problems here. 1] is that https://bugzilla.mozilla.org/show_bug.cgi?id=1204614 (landed on firefox 43) creates a billion (approximately) socket thread events and fills the pipe that is used to break the network polling loop in the presence of xpcom events. Deadlock ensues. That patch shouldn't be doing this. 2] The deadlock shouldn't be able to happen either -that pipe isn't meant to be a balanced semaphore (its basically a flag, using a pipe so it works with poll), so its probably fine to do a non-blocking write to it or just tolerate the failure. Hopefully that doesn't require surgery to nspr. This has been a problem since day 1, apparently. Its probably at the root of some occasional mysterious hangs of networking unrelated to the trigger of #1. so its good to know tomorrow I'll patch up #1 in this bug and figure out what to do about #2.

Assignee: nobody → mcmanus

Patrick McManus [:mcmanus]

Assignee

Updated

•

10 years ago

Blocks: 1204614

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 17

•

10 years ago

[Tracking Requested - why for this release]: Possibly a regression from bug Bug 1211706

tracking-firefox43: --- → ?

tracking-firefox44: --- → ?

Patrick McManus [:mcmanus]

Assignee

Comment 18

•

10 years ago

My interpretation of #1 in comment 15 might be a little off. Upon further review I am seeing that the threadEvent pipe fills up with tens of thousands of signals. I had interpreted that to mean that 1204614 was dispatching thousands of things within one dispatch loop. (the loop would clear those signals.. or reading the nspr code 1000 of them anyhow).. but it turns out that's not true.. there are just a handful (single digits) happening per event loop - totally reasonable. What appears to be happening is that the dispatch loop is not getting a poll() indication for the threadEvent pipe.. so it doesn't clear it. and over time that means it fills up to capacity and deadlocks (problem 2 - which I think is only a theoretical problem if the pipe were working). The reason things work well enough to get to the deadlock is that enough network data is flowing in this example that the threadEvent is superfulous to breaking the poll() in the dispatch loop (real network data is doing it anyhow).

Patrick McManus [:mcmanus]

Assignee

Comment 20

•

10 years ago

so it turns out PR_Poll() does not necessarily call system poll() (or semantic equivalent) over the the whole file descriptor set if one or members of the set has events pending on the PRFileDesc _above_ the IO layer. For example, the SSL socket provider can make a FD show up as readable or writeable regardless of the state of the IO socket at the bottom of the stack. When that happens, PR_Poll returns those SSL events without doing the poll() on the other sockets in the set - in this case the event thread pipe was being ignored and servicing of it was getting starved. That can probably be worked around easily enough in nsSocketTransportService2.cpp but it raises a whole other path of investigation - why is that tight loop happening without the state of the ssl socket changing (and thus giving the threadevent a fighting chance).

Patrick McManus [:mcmanus]

Assignee

Comment 21

•

10 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=49655759a6b9

Patrick McManus [:mcmanus]

Assignee

Comment 22

•

10 years ago

Attached patch fix network io loop spin when clearing h2 flow control buffer — Details — Splinter Review

Attachment #8671530 - Flags: review?(hurley)

Patrick McManus [:mcmanus]

Assignee

Comment 23

•

10 years ago

so the fix for this bug is pretty trivial - * we need to buffer stream data in between the time the transaction is suspended and the time the h2 flow control for the stream kicks in. this is fine. * when the transaction is resumed we need to feed that buffered data to the transaction via nsIEvents instead of network i/o. that is mostly fine. the new bug was that when we were not able to feed the transaction any more data (full pipe, or out of buffered data) we continued to schedule the "push data at the transaction" event. Which creates the spin loop. easy to fix. The spin loop showed two latent nspr bugs that I will file elsewhere and can be fixed independently * a full pipe in a pollableEvent can deadlock * pr_poll() will starve certain members of the fdset that is passed into it if other members always have events above the IO layer

u408661

Updated

•

10 years ago

Attachment #8671530 - Flags: review?(hurley) → review+

Patrick McManus [:mcmanus]

Assignee

Comment 24

•

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/b8b6944505bcdf215561bf9296a08ff5da4a29df bug 1211706 - fix network io loop spin when clearing h2 flow control buffer r=hurley

Patrick McManus [:mcmanus]

Assignee

Comment 25

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #23) > The spin loop showed two latent nspr bugs that I will file elsewhere and can > be fixed independently > * a full pipe in a pollableEvent can deadlock > * pr_poll() will starve certain members of the fdset that is passed into it > if other members always have events above the IO layer fyi bug 1213139 and bug 1213084

Carsten Book [:Tomcat]

Comment 26

•

10 years ago

https://hg.mozilla.org/mozilla-central/rev/b8b6944505bc

Status: NEW → RESOLVED

Closed: 10 years ago

status-firefox44: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla44

Patrick McManus [:mcmanus]

Assignee

Comment 28

•

10 years ago

this bug was caused by bug 1204614 - which originally landed in 43, but was backed out to only be on 44. So this fix (comment 26 which is on 44) should be the only checkin required.

Liz Henry (:lizzard) (relman/hg->git project)

Comment 29

•

10 years ago

Fabulous, thank Patrick. Untracking for 43 and setting the tracking flag for 44 since this was a recent regression.

tracking-firefox43: ? → ---

tracking-firefox44: ? → +

Keywords: regression

gdb session 10 years ago David Baron :dbaron: (⌚️UTC-4, no longer working on Mozilla) 29.29 KB, text/plain		Details
fix network io loop spin when clearing h2 flow control buffer 10 years ago Patrick McManus [:mcmanus] 1.33 KB, patch	u408661 : review+	Details \| Diff \| Splinter Review