Closed Bug 1292181 Opened 8 years ago Closed 8 years ago

Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent - crash in nsHttpConnectionMgr

Categories

(Core :: Networking, defect)

49 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox49 blocking fixed
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: dragana, Assigned: dragana)

Details

(Keywords: topcrash, Whiteboard: [necko-active])

Crash Data

Attachments

(1 file, 2 obsolete files)

shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent

The change that we made was bug 698882. We tried to uplift the bug to 47 and 48 needed to backed it out from both when they went into beta - 1279163 and bug 698882 comment 143.
Summary: Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent → Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent on 49Beta
Whiteboard: [necko-active]
Bug 1158189 was originally fixed by switching to MSVC 2015. It is worth asking if there is a difference between beta and aurora regarding toolchain.
[Tracking Requested - why for this release]: Currently the top crash in beta. We should track this.
Fixing signature so it shows up in crash stats.
Crash Signature: shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent]
Attached patch bug_1292181.patch (obsolete) — Splinter Review
Assignee: nobody → dd.mozilla
Status: NEW → ASSIGNED
Attachment #8778280 - Flags: review?(mcmanus)
There is at least 10 different signatures under this one. The line that we can search is to short, it would be good if signature is at least 5 lines deeper into the stack.

:marcia do you know who we can ask about this? It is very hard to search.
Flags: needinfo?(mozillamarcia.knous)
I look through 100 out of 380 crashes from the last week for ff51, 50 and 49aurora and nightly (not beta)
there are 31 hangs in nsHttpConnectionMgr -> our old shutdown hang that hos just hidden :( :( :(

that is approx. 117 crashes on nightly and aurora in last week.
(In reply to Dragana Damjanovic [:dragana] from comment #5)
> There is at least 10 different signatures under this one. The line that we
> can search is to short, it would be good if signature is at least 5 lines
> deeper into the stack.
> 
> :marcia do you know who we can ask about this? It is very hard to search.

For Socorro searching, that would be someone in that team - #breakpad is a good place to start.
Flags: needinfo?(mozillamarcia.knous)
in 100 crashes there is 10 crashes on nightly and 21 on aurora, that means we have around 38 on nightly and 79 on aurora, this is slightly more than what we use to have.
(In reply to Marcia Knous [:marcia - use ni] from comment #7)
> (In reply to Dragana Damjanovic [:dragana] from comment #5)
> > There is at least 10 different signatures under this one. The line that we
> > can search is to short, it would be good if signature is at least 5 lines
> > deeper into the stack.
> > 
> > :marcia do you know who we can ask about this? It is very hard to search.
> 
> For Socorro searching, that would be someone in that team - #breakpad is a
> good place to start.

After a chat on #breakpad
I have submitted bug 1292608.
Comment on attachment 8778280 [details] [diff] [review]
bug_1292181.patch

Review of attachment 8778280 [details] [diff] [review]:
-----------------------------------------------------------------

my other concern here is PollableEvent does not write to the pipe when called on the socket thread as compared to the old system. That was actually a deadlock, but you could work around that with the awful 2048 check for the moment - do you think it should do so again?

I'm on PTO for the next week.. nick or hozna can review.

::: netwerk/base/PollableEvent.cpp
@@ +220,5 @@
>    if (PR_GetCurrentThread() == gSocketThread) {
>      SOCKET_LOG(("PollableEvent::Signal OnSocketThread nop\n"));
>      return true;
>    }
> +#ifdef XP_WIN

so because of the ifdef you've got two { and one }.. I hate that :(

can you rewrite the ifdef into a bool assignment if just put the bool into the conditional?

@@ +221,5 @@
>      SOCKET_LOG(("PollableEvent::Signal OnSocketThread nop\n"));
>      return true;
>    }
> +#ifdef XP_WIN
> +  if (mSignaled >= 2048) {

oh dear. write the comment :)

@@ +245,5 @@
>    // necessary because of the "dont signal on socket thread" optimization
>    MOZ_ASSERT(PR_GetCurrentThread() == gSocketThread);
>  
>    SOCKET_LOG(("PollableEvent::Clear\n"));
> +  mSignaled = 0;

I think this needs to be decremented by the number actually read.. which sadly can be less than all of them by API..

also there is a branch below that has an assert about not reading more than 1 that needs to be suppressed
Attachment #8778280 - Flags: review?(mcmanus) → review-
If this patch(some version of it) helps to decrease number of hangs, that means that writing more often to the socket help -> we should try to write to the socket more often during shutdown, something like a timer to send an event every 10ms(this is only for shutdown!)
49 top crash, tracking.
(In reply to Patrick McManus [:mcmanus] PTO until Aug 15 from comment #10)
> Comment on attachment 8778280 [details] [diff] [review]
> bug_1292181.patch
> 
> Review of attachment 8778280 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> my other concern here is PollableEvent does not write to the pipe when
> called on the socket thread as compared to the old system. That was actually
> a deadlock, but you could work around that with the awful 2048 check for the
> moment - do you think it should do so again?
> 

I have decided to be conservative because I want to uplift this to beta. I have changed back to the same behavior as before bug 698882, so we will signal also for events dispatched on SocketThread. And on nightly we can try to introduce some optimizations. The change will be only for Windows.

> I'm on PTO for the next week.. nick or hozna can review.
> 
ok
Summary: Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent on 49Beta → Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent - crash in nsHttpConnectionMgr
Attached patch bug_1292181.patch (obsolete) — Splinter Review
Attachment #8778280 - Attachment is obsolete: true
Attachment #8779638 - Flags: review?(hurley)
Comment on attachment 8779638 [details] [diff] [review]
bug_1292181.patch

Review of attachment 8779638 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/PollableEvent.cpp
@@ +231,5 @@
> +
> +#ifndef XP_WIN
> +  // To wake up the poll writing once is enough, but for Windows that can cause
> +  // hangs so we will write for every event.
> +  // For no Windows systems it is enough to write just once.

nit: non-Windows
Attachment #8779638 - Flags: review?(hurley) → review+
Attachment #8779638 - Attachment is obsolete: true
Attachment #8780122 - Flags: review+
Keywords: checkin-needed
Pushed by kwierso@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5ad3abb5627c
Let each dispatch event sends data to the socket. r=mcmanus yolo CLOSED TREE
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/5ad3abb5627c
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
so far there are no more crashes on nightly builds after the patch has landed: http://bit.ly/2bBBTHu :-)
can we uplift this to aurora and beta (where this is currently ~10% of all recorded browser crashes) now as well?
Dragana, can you request uplift, if you think it is not too risky? I'd like to get this into the beta 5 build on Thursday morning. This may fix our #2 top crash in beta. Thanks!
Flags: needinfo?(dd.mozilla)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #21)
> Dragana, can you request uplift, if you think it is not too risky? I'd like
> to get this into the beta 5 build on Thursday morning. This may fix our #2
> top crash in beta. Thanks!

I was waiting for couple of days to see if everything works fine. 2-3 days are enough because this is very much used code path. I will request an uplift.
Flags: needinfo?(dd.mozilla)
Comment on attachment 8780122 [details] [diff] [review]
bug_1292181.patch

Approval Request Comment
[Feature/regressing bug #]: Should fix #2 top crash in beta. The crash is very low volume on Nightly and Aurora so it is only possible to see an improvement on beta.
[User impact if declined]: shutdown hang on Windows
[Describe test coverage new/current, TreeHerder]: It heavily used code path.
[Risks and why]: low
[String/UUID change made/needed]: none
Attachment #8780122 - Flags: approval-mozilla-beta?
Attachment #8780122 - Flags: approval-mozilla-aurora?
Comment on attachment 8780122 [details] [diff] [review]
bug_1292181.patch

High volume crash on beta, let's take this for aurora and for beta 5. 
Thanks Dragana!
Attachment #8780122 - Flags: approval-mozilla-beta?
Attachment #8780122 - Flags: approval-mozilla-beta+
Attachment #8780122 - Flags: approval-mozilla-aurora?
Attachment #8780122 - Flags: approval-mozilla-aurora+
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | mozilla::CondVar::Wait | nsEventQueue::GetEvent] [@ shutdownhang | mozilla::CondVar::Wait | nsEventQueue::GetEvent]
some of these crashes still seem to be happening in 49.0b5: http://bit.ly/2b7Q6GB - but i think in reduced volume (it is still a bit early to tell for sure)
(In reply to [:philipp] from comment #27)
> some of these crashes still seem to be happening in 49.0b5:
> http://bit.ly/2b7Q6GB - but i think in reduced volume (it is still a bit
> early to tell for sure)

Thanks, Philipp!
This patch will not remove the crash completely, but just reduce it to the volume the signatures from bug 1158189 had on 48 and earlier.
Bug 1272911 should hopefully sort out splitting the signature.
Keywords: topcrash
(In reply to [:philipp] from comment #27)
> some of these crashes still seem to be happening in 49.0b5:
> http://bit.ly/2b7Q6GB - but i think in reduced volume (it is still a bit
> early to tell for sure)

Looking at 49b4, 49b5, 49b6 and 49b7:
49b4 has 3765 crashes
49b5 1527
49b6 1028
49b7 592

This is total number not per week number. This is reduced to do volume of bug 1158189 in beta48 and earlier. That is what I expected with this patch.
Thank you Dragana for the fix! I agree that it looks greatly improved for 49.
You need to log in before you can comment on or make changes to this bug.