<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Assignee

Comment 3

•

10 years ago

Honza, I believe you said you could test this to see where we're acquiring the lock twice on the same thread (assuming that's the issue). I've got a try run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf46564546e9 for windows builds with the results at https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/hurley@todesschaf.org-cf46564546e9 Can you give these a try and see if you either still reproduce the deadlock, or end up asserting when we acquire the lock in the wrong place?

Flags: needinfo?(honzab.moz)

Reporter

Comment 4

•

10 years ago

I never said I was able to reproduce this exact deadlock, sorry. We have to let this go out and collect crash data.

Flags: needinfo?(honzab.moz)

Assignee

Comment 5

•

10 years ago

Attached patch debugging patch — Details — Splinter Review

This patch should crash when we want it to, even on release builds. Let's land this and see what happens :) (We'll back out as soon as we get some data on the deadlock, and certainly before this goes out to a significant number of users.)

Attachment #8612467 - Flags: review?(honzab.moz)

Reporter

Comment 6

•

10 years ago

Comment on attachment 8612467 [details] [diff] [review] debugging patch Review of attachment 8612467 [details] [diff] [review]: ----------------------------------------------------------------- ::: netwerk/base/nsSocketTransportService2.cpp @@ +66,5 @@ > + :mLock(&mutex) > +{ > + PRThread *currentThread = PR_GetCurrentThread(); > + MOZ_RELEASE_ASSERT(sDebugOwningThread != currentThread); > + sDebugOwningThread = currentThread; do this assignment after you acquired the lock!

Attachment #8612467 - Flags: review?(honzab.moz) → review+

Assignee

Comment 7

•

10 years ago

> do this assignment after you acquired the lock! *facepalm* thanks! Fixed in the landed version :) remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/91a3fb327007

https://hg.mozilla.org/integration/mozilla-inbound/rev/85e4bd6b0baa

Assignee

Updated

•

10 years ago

Keywords: leave-open

Pulsebot

Comment 8

•

10 years ago

Assignee

Comment 9

•

10 years ago

Backed out and re-landed DONTBUILD to fix the bug number in the commit message remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/71a80e43615c remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/85e4bd6b0baa

https://hg.mozilla.org/mozilla-central/rev/85e4bd6b0baa

Comment 10

•

10 years ago

Liz Henry (:lizzard) (relman/hg->git project)

Comment 11

•

10 years ago

Should this be uplifted? What versions did/does it affect? I'm following up on this since I'm tracking the bug that it's blocking (for shutdown hangs)

Flags: needinfo?(hurley)

Flags: needinfo?(honzab.moz)

Assignee

Comment 12

•

10 years ago

No, this patch should not be uplifted, as it is not a fix - it's merely a patch designed to cause crashes so we can see where, exactly, things are going wrong (which, so far, hasn't succeeded - I've seen no crash reports around my assertion). This patch will be backed out before hitting release, possibly (probably?) before beta.

Flags: needinfo?(hurley)

Flags: needinfo?(honzab.moz)

Assignee

Comment 13

•

10 years ago

Looks like we (finally) got a single hit on the assertion: https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286-089bb2150704

Assignee

Comment 14

•

10 years ago

Note: I haven't dug into it yet, this is just so I don't have to dig up that crash-stats URL again :)

Reporter

Comment 15

•

10 years ago

(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #13) > Looks like we (finally) got a single hit on the assertion: > https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286- > 089bb2150704 Looks like we crash inside nsSocketTransportService::Dispatch (up the stack) what winds up in CallWindowProcCrashProtected that calls nsWindow::ProcessMessage that processes another message that enters the lock again. Really interesting :D

Reporter

Comment 16

•

10 years ago

Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which just do try/catch. Event more interesting :D

Reporter

Comment 17

•

10 years ago

The Bdprovider.dll called by send (an LSP) seems to belong to bitdefender.

Assignee

Comment 18

•

10 years ago

(In reply to Honza Bambas (:mayhemer) from comment #16) > Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which > just do try/catch. Event more interesting :D Right, this looks like a suspend/resume state (note PostSleepWakeNotification in the crash stack). It seems like the machine goes to sleep while the lock is held (looks like on the main thread) by nsSocketTransportService::OnDispatchedEvent. Later, when the machine wakes up, the network link change bits take hold and try to do their cleanup. nsHttpConnectionMgr::PruneDeadConnections is called, which posts an event, causing nsSocketTransportService::Dispatch to be called, which tries to acquire the lock re-entrantly, which would cause our issues. I'm not convinced that bitdefender has anything to do with the crash, as this is a race that could happen at any time, with or without bitdefender - though I could certainly see bitdefender making the race more likely to be lost. Right now, I'm mostly at a loss for how to fix this, except to say that these accesses all occur on the main thread, which means any reads of nsSocketTransportService::mThread and nsSocketTransportService::mThreadEvent don't strictly need to be protected by the lock - we could just have these main-thread reads be unlocked (writes of course still would need to be locked); but down that path lies the madness of poor locking, so it's less of a real suggestion and more of a "I'm gonna throw my hands up for now and mull it over for a while" kind of thing.

Comment 19

•

10 years ago

That's quite a stack, nick :) (comment 13) the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is delievered synchronously after a windows message pre-empts execution. (the latter makes sense for sleep/wake, the synchronous part of the necko delivery surprised me). The IO Service code hooks that observer and changes it into the link_changed events as apropos. The lock assertion is helpful, without it we would just have corruption of the socket thread event queue (which was partially changed before going to sleep and is now being changed again as part of the observer handling). It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully re-entrant against this in ways that other observers are not. Looking at that stack I am concerned about the integrity of the timer code too. ni a couple folks that seem to have been involved in the wake-observer code. I think we can fix the necko issue by having the ioService handler just post a fully async event back to the main thread to initiate the work, rather than doing it synchronously.

Flags: needinfo?(netzen)

Flags: needinfo?(jmathies)

Comment 20

•

10 years ago

in general, this unexpectedly re-entrant condition could cause all manner of hard to identify corruption (maybe the write segments bug?) beyond the case caught here so we should prioritize a fix.

Jim Mathies [:jimm]

Comment 21

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #19) > That's quite a stack, nick :) (comment 13) > > the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is > delievered synchronously after a windows message pre-empts execution. (the > latter makes sense for sleep/wake, the synchronous part of the necko > delivery surprised me). The IO Service code hooks that observer and changes > it into the link_changed events as apropos. > > The lock assertion is helpful, without it we would just have corruption of > the socket thread event queue (which was partially changed before going to > sleep and is now being changed again as part of the observer handling). > > It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully > re-entrant against this in ways that other observers are not. > > Looking at that stack I am concerned about the integrity of the timer code > too. > > ni a couple folks that seem to have been involved in the wake-observer code. > > I think we can fix the necko issue by having the ioService handler just post > a fully async event back to the main thread to initiate the work, rather > than doing it synchronously. I'm confused by all this, what are you suggesting we do differently here?

Comment 22

•

10 years ago

(In reply to Jim Mathies [:jimm] from comment #21) > I'm confused by all this, what are you suggesting we do differently here? can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non re-entrant stack?

Brian R. Bondy [:bbondy]

Comment 23

•

10 years ago

Sorry I'm not the right person to ask for this.

Flags: needinfo?(netzen)

Jim Mathies [:jimm]

Comment 24

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #22) > (In reply to Jim Mathies [:jimm] from comment #21) > > > I'm confused by all this, what are you suggesting we do differently here? > > can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non > re-entrant stack? Yeah I think it'd be safe to fire these off via async runnables independent from the windows notification. From what I remember we had some problems with timers not getting reset right after a long sleep. I don't see that regressing due to async notifications.

Flags: needinfo?(jmathies)

Comment 25

•

10 years ago

nick, given that bagder is on PTO for a while, would you be interested in making the IOService change (assuming you think it helps too)? We could get that backported easily enough I would think. We could separately make the async runnable change for generating NS_WIDGET_WAKE_OBSERVER_TOPIC too (comment 24) but I have less confidence in what that might impact, so it would be better to do it on just nightly imo.

Assignee

Comment 26

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #25) > nick, given that bagder is on PTO for a while, would you be interested in > making the IOService change (assuming you think it helps too)? We could get > that backported easily enough I would think. I'm already working on it, believe it or not :)

Assignee

Comment 27

•

10 years ago

Attached patch revert lock change (obsolete) — Details — Splinter Review

This reverts the debugging changes I made (we will definitely want to uplift this one). A patch with the fix comes next.

Attachment #8636087 - Flags: review?(mcmanus)

Assignee

Comment 28

•

10 years ago

Attached patch Async wakeup notifications — Details — Splinter Review

And here's the one that should fix the actual issue.

Attachment #8636088 - Flags: review?(mcmanus)

Assignee

Comment 29

•

10 years ago

Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb82f45dda08

Comment 30

•

10 years ago

Comment on attachment 8636087 [details] [diff] [review] revert lock change Review of attachment 8636087 [details] [diff] [review]: ----------------------------------------------------------------- if we revert we can't confirm the bug is fixed. how about MOZ_DIAGNOSTIC_ASSERT ?

Comment 31

•

10 years ago

Comment on attachment 8636088 [details] [diff] [review] Async wakeup notifications Review of attachment 8636088 [details] [diff] [review]: ----------------------------------------------------------------- sgtm; daniel is back this week right?

Attachment #8636088 - Flags: review?(mcmanus)

Attachment #8636088 - Flags: review?(daniel)

Attachment #8636088 - Flags: feedback+

Comment 32

•

10 years ago

and is there a bug filed for making a similar change to NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24?

Updated

•

10 years ago

Attachment #8636087 - Flags: review?(mcmanus)

Reporter

Comment 33

•

10 years ago

BTW, have you checked how many shutdown deadlock have happened when this patch was in?

Flags: needinfo?(hurley)

Assignee

Comment 34

•

10 years ago

(In reply to Honza Bambas (:mayhemer) from comment #33) > BTW, have you checked how many shutdown deadlock have happened when this > patch was in? There are two - both of which seem to have the same stack involved. So it looks like this is being hit maybe once a month (give or take).

Flags: needinfo?(hurley)

Assignee

Comment 35

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #30) > Comment on attachment 8636087 [details] [diff] [review] > revert lock change > > Review of attachment 8636087 [details] [diff] [review]: > ----------------------------------------------------------------- > > if we revert we can't confirm the bug is fixed. how about > MOZ_DIAGNOSTIC_ASSERT ? That's doable. Though with the frequency (or lack thereof) of this deadlock/crash, confirmation is going to be hard to come by, anyway :)

Assignee

Comment 36

•

10 years ago

(In reply to Patrick McManus [:mcmanus] from comment #32) > and is there a bug filed for making a similar change to > NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24? Just filed https://bugzilla.mozilla.org/show_bug.cgi?id=1186074

Assignee

Comment 37

•

10 years ago

Attached patch diagnostic (not release) assert for the lock — Details — Splinter Review

New lock-related patch making it a diagnostic assert instead of a release assert, so we can (hopefully) verify this as fixed at some point in the future.

Attachment #8636087 - Attachment is obsolete: true

Attachment #8636658 - Flags: review?(daniel)

Comment 38

•

10 years ago

(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #34) > (In reply to Honza Bambas (:mayhemer) from comment #33) > > BTW, have you checked how many shutdown deadlock have happened when this > > patch was in? > > There are two - both of which seem to have the same stack involved. So it > looks like this is being hit maybe once a month (give or take). this deadlock is infrequent, but I have high hopes that the patch fixes a lot of other weird behavior. the root cause here is re-entrancy and this deadlock is just one of a billion ways that could go poorly. (I'm wondering about the famous writesegments bug)

Daniel Stenberg [:bagder]

Updated

•

10 years ago

Comment 39

•

10 years ago

Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Review of attachment 8636658 [details] [diff] [review]: ----------------------------------------------------------------- I approve!

Attachment #8636658 - Flags: review?(daniel) → review+

Comment 40

•

10 years ago

Comment on attachment 8636088 [details] [diff] [review] Async wakeup notifications Review of attachment 8636088 [details] [diff] [review]: ----------------------------------------------------------------- LGTM

Attachment #8636088 - Flags: review?(daniel) → review+

https://hg.mozilla.org/mozilla-central/rev/cd1d76c2f219 https://hg.mozilla.org/mozilla-central/rev/a0582ac1d92e

Assignee

Comment 41

•

10 years ago

url: https://hg.mozilla.org/integration/mozilla-inbound/rev/cd1d76c2f219987ad981e1f3e9ac76845a9962a2 changeset: cd1d76c2f219987ad981e1f3e9ac76845a9962a2 user: Nicholas Hurley <hurley@todesschaf.org> date: Fri Jul 31 13:33:48 2015 -0700 description: Bug 1152048 - Make the RELEASE_ASSERT on the sts lock a DIAGNOSTIC_ASSERT. r=bagder url: https://hg.mozilla.org/integration/mozilla-inbound/rev/a0582ac1d92e3484975e8dd6144cf42f8ad27d4b changeset: a0582ac1d92e3484975e8dd6144cf42f8ad27d4b user: Nicholas Hurley <hurley@todesschaf.org> date: Fri Jul 31 13:39:48 2015 -0700 description: Bug 1152048 - Send wakeup notification asynchronously to avoid reentrancy issues. r=bagder

Carsten Book [:Tomcat]

Comment 42

•

10 years ago

Assignee

Comment 43

•

10 years ago

Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Approval Request Comment [Feature/regressing bug #]: this bug [User impact if declined]: could cause crashes we don't want (instead of deadlocks) [Describe test coverage new/current, TreeHerder]: on m-c [Risks and why]: none (just changes assert type to not fire on release builds) [String/UUID change made/needed]: none

Attachment #8636658 - Flags: approval-mozilla-aurora?

Assignee

Updated

•

10 years ago

Keywords: leave-open

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Assignee

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

status-firefox41: --- → affected

status-firefox42: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla42

Nathan Froyd [:froydnj]

Updated

•

10 years ago

Depends on: 1190985

Comment 44

•

10 years ago

Comment on attachment 8636658 [details] [diff] [review] diagnostic (not release) assert for the lock Seems safe for uplift to Aurora.

Attachment #8636658 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Comment 45

•

10 years ago

Needs rebasing for Aurora uplift.

status-firefox40: affected → wontfix

Flags: needinfo?(hurley)

Comment 46

•

10 years ago

Oh nevermind, just the assert patch is supposed to be getting uplifted.

Flags: needinfo?(hurley)

https://hg.mozilla.org/releases/mozilla-aurora/rev/70d924f428cc

Comment 47

•

10 years ago

status-firefox41: affected → fixed