Closed Bug 1152047 Opened 5 years ago Closed 5 years ago

mozilla::net::nsWSAdmissionManager::GetSessionCount may deadlock during shutdown

Categories

(Core :: Networking: WebSockets, defect, critical)

All
Windows
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla41
Tracking Status
firefox39 --- ?
firefox40 + fixed
firefox41 + fixed

People

(Reporter: mayhemer, Assigned: michal)

References

Details

Attachments

(1 file)

NtWaitForSingleObject
ntdll.dll@0x1e1d6
RtlpEnterCriticalSectionContended
RtlpEnterCriticalSectionContended
PR_Lock
mozilla::net::nsWSAdmissionManager::GetSessionCount(int&)
mozilla::net::WebSocketChannel::StopSession(nsresult)
mozilla::net::WebSocketChannel::ReleaseSession()
mozilla::net::WebSocketChannel::ProcessInput(unsigned char*, unsigned int)
mozilla::net::WebSocketChannel::OnInputStreamReady(nsIAsyncInputStream*)
nsInputStreamReadyEvent::Run()
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
nsSocketTransportService::Run()

https://crash-stats.mozilla.com/report/index/5acfd3bf-631b-46fd-b67b-731e42150407#allthreads, thread 9 
https://crash-stats.mozilla.com/report/index/73b21efe-1098-4285-b57d-771342150402#allthreads, thread 8
hm. any ideas on what would cause this honza? The stacks don't actually show anything else holding that lock and its only ever used as an AutoLock.

StaticMutex.h does say this though

 * The same caveats that apply to StaticAutoPtr apply to StaticMutex.  In
 * particular, do not use StaticMutex as a stack variable or a class instance
 * variable, because this class relies on the fact that global variablies are
 * initialized to 0 in order to initialize mMutex.  It is only safe to use
 * StaticMutex as a global or static variable.

sLock is defined as a static member of a new'd object (a singleton that is lazy allocated). I wonder if the compiler trips up on that?

It would be easy enough to make sLock be genuinely global and static.
I have seen happening this when the lock is reentered during a single stack, not necessarily during the one where the deadlock occurred.  When you reenter the lock, it gets into a bad state that later leads to a deadlock.
Blocks: 1158189
No longer blocks: 1124880
Assignee: nobody → michal.novotny
Attached patch fixSplinter Review
(In reply to Patrick McManus [:mcmanus] from comment #1)
> sLock is defined as a static member of a new'd object (a singleton that is
> lazy allocated). I wonder if the compiler trips up on that?

This shouldn't be a problem, see http://stackoverflow.com/questions/1421671/when-are-static-c-class-members-initialized

Honza's theory might be right. Methods nsWSAdmissionManager::ConditionallyConnect(), nsWSAdmissionManager::OnConnected() and nsWSAdmissionManager::OnStopSession() can end up calling WebSocketChannel::BeginOpen() while holding the lock and if WebSocketChannel::BeginOpen() fails the lock is reentered in nsWSAdmissionManager::GetSessionCount(). I.e. the potential callstack is:

  nsWSAdmissionManager::GetSessionCount << lock
  WebSocketChannel::StopSession
  WebSocketChannel::AbortSession
  WebSocketChannel::BeginOpen
  FailDelayManager::DelayOrBegin
  FailDelayManager::ConnectNext
  nsWSAdmissionManager::OnStopSession  << lock
Attachment #8605261 - Flags: review?(mcmanus)
Comment on attachment 8605261 [details] [diff] [review]
fix

Review of attachment 8605261 [details] [diff] [review]:
-----------------------------------------------------------------

nice analysis!
Attachment #8605261 - Flags: review?(mcmanus) → review+
https://hg.mozilla.org/mozilla-central/rev/40155ce1f054
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla41
Honza should this also get uplifted to aurora? Did it affect 39 at all? Thanks.
Flags: needinfo?(honzab.moz)
Same here.  Dragana, can you request approvals?
Flags: needinfo?(honzab.moz) → needinfo?(dd.mozilla)
It's too late for 39 at this point unless we want to do an RC2. I still don't know if this affects 39. 
The fix could still make it into 40.
I will needinfo Michal it is his bug
Flags: needinfo?(dd.mozilla) → needinfo?(michal.novotny)
(In reply to Liz Henry (:lizzard) from comment #9)
> It's too late for 39 at this point unless we want to do an RC2. I still
> don't know if this affects 39. 
> The fix could still make it into 40.

Yes, 39 should be affected. But this fix introduced a new problem that was fixed in 1170645, so we would need to uplift that too. Anyway, I didn't see any crash report on the crash-stats for this bug, so I'm not sure how critical this problem really is.
Flags: needinfo?(michal.novotny)
Michael, can we have the uplift request to aurora? thanks
Flags: needinfo?(michal.novotny)
Comment on attachment 8605261 [details] [diff] [review]
fix

Approval Request Comment
[Feature/regressing bug #]: don't know
[User impact if declined]: potential deadlock
[Describe test coverage new/current, TreeHerder]: there are websocket tests that should cover this code path
[Risks and why]: low as long as this lands together with 1170645 which fixed another problem introduced by this patch
[String/UUID change made/needed]: none
Flags: needinfo?(michal.novotny)
Attachment #8605261 - Flags: approval-mozilla-aurora?
Comment on attachment 8605261 [details] [diff] [review]
fix

Improve the stability, taking it.
Taking also Bug 1170645
Attachment #8605261 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.