Closed Bug 1152048 Opened 10 years ago Closed 9 years ago

nsSocketTransportService::Dispatch may deadlock during shutdown

Categories

(Core :: Networking, defect)

All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox40 --- wontfix
firefox41 --- fixed
firefox42 --- fixed

People

(Reporter: mayhemer, Assigned: u408661)

References

Details

Attachments

(3 files, 1 obsolete file)

https://crash-stats.mozilla.com/report/index/27ae8618-8676-436d-8056-c6db72150406#allthreads, thread 5 https://crash-stats.mozilla.com/report/index/bb6d4bd5-75b5-475d-a926-4207d2150404#allthreads, thread 4 PR_Lock nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) [1] nsSocketTransportService::Run() [2] nsThread::ProcessNextEvent(bool, bool*) NS_ProcessNextEvent(nsIThread*, bool) mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) MessageLoop::RunHandler() MessageLoop::Run() nsThread::ThreadFunc(void*) _PR_NativeRunThread pr_root _callthreadstartex [1] http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l128 [2] http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l815
Blocks: 1158189
No longer blocks: 1124880
Our theory here is that we might be locking a lock twice (and getting it into some sort of inconsistent state as a result). We signed up Valentin to try to look into that, by adding some code that will assert if this particular lock gets locked reentrantly.
Assignee: nobody → valentin.gosu
Oh, actually this one is for Nick :)
Assignee: valentin.gosu → hurley
Honza, I believe you said you could test this to see where we're acquiring the lock twice on the same thread (assuming that's the issue). I've got a try run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf46564546e9 for windows builds with the results at https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/hurley@todesschaf.org-cf46564546e9 Can you give these a try and see if you either still reproduce the deadlock, or end up asserting when we acquire the lock in the wrong place?
Flags: needinfo?(honzab.moz)
I never said I was able to reproduce this exact deadlock, sorry. We have to let this go out and collect crash data.
Flags: needinfo?(honzab.moz)
Attached patch debugging patchSplinter Review
This patch should crash when we want it to, even on release builds. Let's land this and see what happens :) (We'll back out as soon as we get some data on the deadlock, and certainly before this goes out to a significant number of users.)
Attachment #8612467 - Flags: review?(honzab.moz)
Comment on attachment 8612467 [details] [diff] [review] debugging patch Review of attachment 8612467 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/base/nsSocketTransportService2.cpp @@ +66,5 @@ > + :mLock(&mutex) > +{ > + PRThread *currentThread = PR_GetCurrentThread(); > + MOZ_RELEASE_ASSERT(sDebugOwningThread != currentThread); > + sDebugOwningThread = currentThread; do this assignment after you acquired the lock!
Attachment #8612467 - Flags: review?(honzab.moz) → review+
> do this assignment after you acquired the lock! *facepalm* thanks! Fixed in the landed version :) remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/91a3fb327007
Keywords: leave-open
Backed out and re-landed DONTBUILD to fix the bug number in the commit message remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/71a80e43615c remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/85e4bd6b0baa
Should this be uplifted? What versions did/does it affect? I'm following up on this since I'm tracking the bug that it's blocking (for shutdown hangs)
Flags: needinfo?(hurley)
Flags: needinfo?(honzab.moz)
No, this patch should not be uplifted, as it is not a fix - it's merely a patch designed to cause crashes so we can see where, exactly, things are going wrong (which, so far, hasn't succeeded - I've seen no crash reports around my assertion). This patch will be backed out before hitting release, possibly (probably?) before beta.
Flags: needinfo?(hurley)
Flags: needinfo?(honzab.moz)
Looks like we (finally) got a single hit on the assertion: https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286-089bb2150704
Note: I haven't dug into it yet, this is just so I don't have to dig up that crash-stats URL again :)
(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #13) > Looks like we (finally) got a single hit on the assertion: > https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286- > 089bb2150704 Looks like we crash inside nsSocketTransportService::Dispatch (up the stack) what winds up in CallWindowProcCrashProtected that calls nsWindow::ProcessMessage that processes another message that enters the lock again. Really interesting :D
Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which just do try/catch. Event more interesting :D
The Bdprovider.dll called by send (an LSP) seems to belong to bitdefender.
(In reply to Honza Bambas (:mayhemer) from comment #16) > Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which > just do try/catch. Event more interesting :D Right, this looks like a suspend/resume state (note PostSleepWakeNotification in the crash stack). It seems like the machine goes to sleep while the lock is held (looks like on the main thread) by nsSocketTransportService::OnDispatchedEvent. Later, when the machine wakes up, the network link change bits take hold and try to do their cleanup. nsHttpConnectionMgr::PruneDeadConnections is called, which posts an event, causing nsSocketTransportService::Dispatch to be called, which tries to acquire the lock re-entrantly, which would cause our issues. I'm not convinced that bitdefender has anything to do with the crash, as this is a race that could happen at any time, with or without bitdefender - though I could certainly see bitdefender making the race more likely to be lost. Right now, I'm mostly at a loss for how to fix this, except to say that these accesses all occur on the main thread, which means any reads of nsSocketTransportService::mThread and nsSocketTransportService::mThreadEvent don't strictly need to be protected by the lock - we could just have these main-thread reads be unlocked (writes of course still would need to be locked); but down that path lies the madness of poor locking, so it's less of a real suggestion and more of a "I'm gonna throw my hands up for now and mull it over for a while" kind of thing.
That's quite a stack, nick :) (comment 13) the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is delievered synchronously after a windows message pre-empts execution. (the latter makes sense for sleep/wake, the synchronous part of the necko delivery surprised me). The IO Service code hooks that observer and changes it into the link_changed events as apropos. The lock assertion is helpful, without it we would just have corruption of the socket thread event queue (which was partially changed before going to sleep and is now being changed again as part of the observer handling). It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully re-entrant against this in ways that other observers are not. Looking at that stack I am concerned about the integrity of the timer code too. ni a couple folks that seem to have been involved in the wake-observer code. I think we can fix the necko issue by having the ioService handler just post a fully async event back to the main thread to initiate the work, rather than doing it synchronously.
Flags: needinfo?(netzen)
Flags: needinfo?(jmathies)
in general, this unexpectedly re-entrant condition could cause all manner of hard to identify corruption (maybe the write segments bug?) beyond the case caught here so we should prioritize a fix.
(In reply to Patrick McManus [:mcmanus] from comment #19) > That's quite a stack, nick :) (comment 13) > > the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is > delievered synchronously after a windows message pre-empts execution. (the > latter makes sense for sleep/wake, the synchronous part of the necko > delivery surprised me). The IO Service code hooks that observer and changes > it into the link_changed events as apropos. > > The lock assertion is helpful, without it we would just have corruption of > the socket thread event queue (which was partially changed before going to > sleep and is now being changed again as part of the observer handling). > > It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully > re-entrant against this in ways that other observers are not. > > Looking at that stack I am concerned about the integrity of the timer code > too. > > ni a couple folks that seem to have been involved in the wake-observer code. > > I think we can fix the necko issue by having the ioService handler just post > a fully async event back to the main thread to initiate the work, rather > than doing it synchronously. I'm confused by all this, what are you suggesting we do differently here?
(In reply to Jim Mathies [:jimm] from comment #21) > I'm confused by all this, what are you suggesting we do differently here? can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non re-entrant stack?
Sorry I'm not the right person to ask for this.
Flags: needinfo?(netzen)
(In reply to Patrick McManus [:mcmanus] from comment #22) > (In reply to Jim Mathies [:jimm] from comment #21) > > > I'm confused by all this, what are you suggesting we do differently here? > > can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non > re-entrant stack? Yeah I think it'd be safe to fire these off via async runnables independent from the windows notification. From what I remember we had some problems with timers not getting reset right after a long sleep. I don't see that regressing due to async notifications.
Flags: needinfo?(jmathies)
nick, given that bagder is on PTO for a while, would you be interested in making the IOService change (assuming you think it helps too)? We could get that backported easily enough I would think. We could separately make the async runnable change for generating NS_WIDGET_WAKE_OBSERVER_TOPIC too (comment 24) but I have less confidence in what that might impact, so it would be better to do it on just nightly imo.
(In reply to Patrick McManus [:mcmanus] from comment #25) > nick, given that bagder is on PTO for a while, would you be interested in > making the IOService change (assuming you think it helps too)? We could get > that backported easily enough I would think. I'm already working on it, believe it or not :)
Attached patch revert lock change (obsolete) — Splinter Review
This reverts the debugging changes I made (we will definitely want to uplift this one). A patch with the fix comes next.
Attachment #8636087 - Flags: review?(mcmanus)
And here's the one that should fix the actual issue.
Attachment #8636088 - Flags: review?(mcmanus)
Comment on attachment 8636087 [details] [diff] [review] revert lock change Review of attachment 8636087 [details] [diff] [review]: ----------------------------------------------------------------- if we revert we can't confirm the bug is fixed. how about MOZ_DIAGNOSTIC_ASSERT ?
Comment on attachment 8636088 [details] [diff] [review] Async wakeup notifications Review of attachment 8636088 [details] [diff] [review]: ----------------------------------------------------------------- sgtm; daniel is back this week right?
Attachment #8636088 - Flags: review?(mcmanus)
Attachment #8636088 - Flags: review?(daniel)
Attachment #8636088 - Flags: feedback+
and is there a bug filed for making a similar change to NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24?
Attachment #8636087 - Flags: review?(mcmanus)
BTW, have you checked how many shutdown deadlock have happened when this patch was in?
Flags: needinfo?(hurley)
(In reply to Honza Bambas (:mayhemer) from comment #33) > BTW, have you checked how many shutdown deadlock have happened when this > patch was in? There are two - both of which seem to have the same stack involved. So it looks like this is being hit maybe once a month (give or take).
Flags: needinfo?(hurley)
(In reply to Patrick McManus [:mcmanus] from comment #30) > Comment on attachment 8636087 [details] [diff] [review] > revert lock change > > Review of attachment 8636087 [details] [diff] [review]: > ----------------------------------------------------------------- > > if we revert we can't confirm the bug is fixed. how about > MOZ_DIAGNOSTIC_ASSERT ? That's doable. Though with the frequency (or lack thereof) of this deadlock/crash, confirmation is going to be hard to come by, anyway :)
(In reply to Patrick McManus [:mcmanus] from comment #32) > and is there a bug filed for making a similar change to > NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24? Just filed https://bugzilla.mozilla.org/show_bug.cgi?id=1186074
New lock-related patch making it a diagnostic assert instead of a release assert, so we can (hopefully) verify this as fixed at some point in the future.
Attachment #8636087 - Attachment is obsolete: true
Attachment #8636658 - Flags: review?(daniel)
(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #34) > (In reply to Honza Bambas (:mayhemer) from comment #33) > > BTW, have you checked how many shutdown deadlock have happened when this > > patch was in? > > There are two - both of which seem to have the same stack involved. So it > looks like this is being hit maybe once a month (give or take). this deadlock is infrequent, but I have high hopes that the patch fixes a lot of other weird behavior. the root cause here is re-entrancy and this deadlock is just one of a billion ways that could go poorly. (I'm wondering about the famous writesegments bug)
See Also: → 1186074
Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Review of attachment 8636658 [details] [diff] [review]: ----------------------------------------------------------------- I approve!
Attachment #8636658 - Flags: review?(daniel) → review+
Comment on attachment 8636088 [details] [diff] [review] Async wakeup notifications Review of attachment 8636088 [details] [diff] [review]: ----------------------------------------------------------------- LGTM
Attachment #8636088 - Flags: review?(daniel) → review+
url: https://hg.mozilla.org/integration/mozilla-inbound/rev/cd1d76c2f219987ad981e1f3e9ac76845a9962a2 changeset: cd1d76c2f219987ad981e1f3e9ac76845a9962a2 user: Nicholas Hurley <hurley@todesschaf.org> date: Fri Jul 31 13:33:48 2015 -0700 description: Bug 1152048 - Make the RELEASE_ASSERT on the sts lock a DIAGNOSTIC_ASSERT. r=bagder url: https://hg.mozilla.org/integration/mozilla-inbound/rev/a0582ac1d92e3484975e8dd6144cf42f8ad27d4b changeset: a0582ac1d92e3484975e8dd6144cf42f8ad27d4b user: Nicholas Hurley <hurley@todesschaf.org> date: Fri Jul 31 13:39:48 2015 -0700 description: Bug 1152048 - Send wakeup notification asynchronously to avoid reentrancy issues. r=bagder
Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Approval Request Comment [Feature/regressing bug #]: this bug [User impact if declined]: could cause crashes we don't want (instead of deadlocks) [Describe test coverage new/current, TreeHerder]: on m-c [Risks and why]: none (just changes assert type to not fire on release builds) [String/UUID change made/needed]: none
Attachment #8636658 - Flags: approval-mozilla-aurora?
Keywords: leave-open
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Depends on: 1190985
Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Seems safe for uplift to Aurora.
Attachment #8636658 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Needs rebasing for Aurora uplift.
Flags: needinfo?(hurley)
Oh nevermind, just the assert patch is supposed to be getting uplifted.
Flags: needinfo?(hurley)
Depends on: 1277943
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: