Closed Bug 1152047 Opened 5 years ago Closed 5 years ago
WSAdmission Manager::Get Session Count may deadlock during shutdown
NtWaitForSingleObject ntdll.dll@0x1e1d6 RtlpEnterCriticalSectionContended RtlpEnterCriticalSectionContended PR_Lock mozilla::net::nsWSAdmissionManager::GetSessionCount(int&) mozilla::net::WebSocketChannel::StopSession(nsresult) mozilla::net::WebSocketChannel::ReleaseSession() mozilla::net::WebSocketChannel::ProcessInput(unsigned char*, unsigned int) mozilla::net::WebSocketChannel::OnInputStreamReady(nsIAsyncInputStream*) nsInputStreamReadyEvent::Run() nsThread::ProcessNextEvent(bool, bool*) NS_ProcessNextEvent(nsIThread*, bool) nsSocketTransportService::Run() https://crash-stats.mozilla.com/report/index/5acfd3bf-631b-46fd-b67b-731e42150407#allthreads, thread 9 https://crash-stats.mozilla.com/report/index/73b21efe-1098-4285-b57d-771342150402#allthreads, thread 8
hm. any ideas on what would cause this honza? The stacks don't actually show anything else holding that lock and its only ever used as an AutoLock. StaticMutex.h does say this though * The same caveats that apply to StaticAutoPtr apply to StaticMutex. In * particular, do not use StaticMutex as a stack variable or a class instance * variable, because this class relies on the fact that global variablies are * initialized to 0 in order to initialize mMutex. It is only safe to use * StaticMutex as a global or static variable. sLock is defined as a static member of a new'd object (a singleton that is lazy allocated). I wonder if the compiler trips up on that? It would be easy enough to make sLock be genuinely global and static.
I have seen happening this when the lock is reentered during a single stack, not necessarily during the one where the deadlock occurred. When you reenter the lock, it gets into a bad state that later leads to a deadlock.
(In reply to Patrick McManus [:mcmanus] from comment #1) > sLock is defined as a static member of a new'd object (a singleton that is > lazy allocated). I wonder if the compiler trips up on that? This shouldn't be a problem, see http://stackoverflow.com/questions/1421671/when-are-static-c-class-members-initialized Honza's theory might be right. Methods nsWSAdmissionManager::ConditionallyConnect(), nsWSAdmissionManager::OnConnected() and nsWSAdmissionManager::OnStopSession() can end up calling WebSocketChannel::BeginOpen() while holding the lock and if WebSocketChannel::BeginOpen() fails the lock is reentered in nsWSAdmissionManager::GetSessionCount(). I.e. the potential callstack is: nsWSAdmissionManager::GetSessionCount << lock WebSocketChannel::StopSession WebSocketChannel::AbortSession WebSocketChannel::BeginOpen FailDelayManager::DelayOrBegin FailDelayManager::ConnectNext nsWSAdmissionManager::OnStopSession << lock
Attachment #8605261 - Flags: review?(mcmanus)
Comment on attachment 8605261 [details] [diff] [review] fix Review of attachment 8605261 [details] [diff] [review]: ----------------------------------------------------------------- nice analysis!
Attachment #8605261 - Flags: review?(mcmanus) → review+
Honza should this also get uplifted to aurora? Did it affect 39 at all? Thanks.
Same here. Dragana, can you request approvals?
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
It's too late for 39 at this point unless we want to do an RC2. I still don't know if this affects 39. The fix could still make it into 40.
I will needinfo Michal it is his bug
Flags: needinfo?(dd.mozilla) → needinfo?(michal.novotny)
(In reply to Liz Henry (:lizzard) from comment #9) > It's too late for 39 at this point unless we want to do an RC2. I still > don't know if this affects 39. > The fix could still make it into 40. Yes, 39 should be affected. But this fix introduced a new problem that was fixed in 1170645, so we would need to uplift that too. Anyway, I didn't see any crash report on the crash-stats for this bug, so I'm not sure how critical this problem really is.
Michael, can we have the uplift request to aurora? thanks
Comment on attachment 8605261 [details] [diff] [review] fix Approval Request Comment [Feature/regressing bug #]: don't know [User impact if declined]: potential deadlock [Describe test coverage new/current, TreeHerder]: there are websocket tests that should cover this code path [Risks and why]: low as long as this lands together with 1170645 which fixed another problem introduced by this patch [String/UUID change made/needed]: none
Attachment #8605261 - Flags: approval-mozilla-aurora?
Comment on attachment 8605261 [details] [diff] [review] fix Improve the stability, taking it. Taking also Bug 1170645
Attachment #8605261 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.