Closed Bug 1279163 Opened 3 years ago Closed 3 years ago

crash in mozilla::net::nsHttpConnectionMgr::Shutdown on beta 48

Categories

(Core :: Networking: HTTP, defect, critical)

49 Branch
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox47 --- wontfix
firefox48 --- fixed
firefox49 --- unaffected

People

(Reporter: dragana, Assigned: mcmanus)

Details

(Keywords: crash, Whiteboard: [necko-active])

Crash Data

Attachments

(1 file)

I am opening a new bug to easier track this issue.

48 was behaving like 47 in aurora regarding issue with crash in mozilla::net::nsHttpConnectionMgr::Shutdown. 

Turning into beta the crash amount jumped.
Similar thing has happened when we tried to uplift bug 698882 to 47beta (bug 698882 comment 143 and bug 1158189 comment 78).
Whiteboard: [necko-active]
Crash Signature: [ @shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::net::nsHttpConnectionMgr::Shutdown]
Crash Signature: [ @shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::net::nsHttpConnectionMgr::Shutdown] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_WaitCondVar | nsThread::ProcessNextEvent | NS_ProcessNextEvent | mozilla::net::nsHttpConnectionMgr::Shutdown ]
Looking at the crashes it seems that is happening all the time for some users (you can even build a chain).

The up time is not long, but it is not a stat up hang.
It looks like firefox is slow or hanging or it starts to be slow after sometime (the socketThread is hanging, but not the main thread) and when a user try to shut it down it hangs and crash.

Maybe someone with the problem will contact us soon so we can debug it.
Assignee: nobody → dd.mozilla
Severity: normal → critical
Keywords: crash
The actual stacks associated with the increased crash rates are dups of bug 1158189 - that bug is wontfix for 48 as it depends on the msvc toolchain changes that are in 49.

bug 698882 (landed on 48) somehow makes 1158189 worse on 48 - this is a little odd in that I would expect a mild improvement as 698882 makes less use of the system network functions (poll, write) that are implicated in 1158189.

the good news is gecko 49 contains both the new tool chain and 698882 and 1158189 appears to remain fixed there.

The pragmatic thing would be to back 698882 (which unfortunately is fragmented in a few patches) out of 48 and expect that the 48 shutdown hang rate associated with 1158189 would move towards 47.

The downside is that there are other hangs that are fixed by 698882 (though lower in volume) and it contains a startup performance boost that will need to sacrificed for another release cycle. I think that's worth it.

I'll prepare the patch for 48.
Assignee: dd.mozilla → mcmanus
its academic at this point, but when I was discussing this with dragana I was talking about how I thought 698882 would make the underlying bug less likely rather than more likely becuase it does less IO. (the data suggests it makes it more likely).

dragana suggested that the extra wakeups that happen without 698882 might help resolve a stuck scheduler in poll. essentially - just keep kicking it rather than treating it like a state machine. Seems plausible at least.
going to try and base the backout from 48 on

https://hg.mozilla.org/releases/mozilla-beta/rev/25e9aa3f174f
which came from
https://bugzilla.mozilla.org/show_bug.cgi?id=698882#c144
which was the backout of the same thing from 47 when it was in beta

can't wait for 49 :)
Attached patch 1279163.1Splinter Review
Attachment #8764624 - Flags: review?(dd.mozilla)
should also double check that comment 0 is still true with more crash stats data now available.
(In reply to Patrick McManus [:mcmanus] from comment #6)
> should also double check that comment 0 is still true with more crash stats
> data now available.

When I looked at some crash stacks, I could see that for some people it is happening all the time. so maybe it is better to back it out. (I have not look at newer crash stats)
Attachment #8764624 - Flags: review?(dd.mozilla) → review+
Comment on attachment 8764624 [details] [diff] [review]
1279163.1

Approval Request Comment

please read comment 2

[Feature/regressing bug #]: 698882
[User impact if declined]: shutdown hang crash rate
[Describe test coverage new/current, TreeHerder]: treeherder
[Risks and why]:  low - backout we've done before
[String/UUID change made/needed]: none
Attachment #8764624 - Flags: approval-mozilla-beta?
Comment on attachment 8764624 [details] [diff] [review]
1279163.1

OK, let's do it in beta too.
Should be in 48 beta 4.
Attachment #8764624 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
It does look like there are fewer crashes in 48.0b4 and b5. So I think the backout worked. 
48.0b3: 3718 crashes in the last week
48.0b4: 881 crashes 
48.0b5: 933 crashes
This signature is still the #3 topcrash on 48 beta 6, though.
yes, this bug just covered the increase of this signature at the start of the 48.0b cycle (it was close to 10% of all crashers there). now we're back to the 2-3% that we were at during past beta cycles as well, which usually translates to ~1% in the general release population.

the investigations of the networking team showed that the underlying problem should be solved with the compiler switch to MSVC 2015, so hopefully firefox 49: https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c103
(In reply to [:philipp] from comment #13)
> yes, this bug just covered the increase of this signature at the start of
> the 48.0b cycle (it was close to 10% of all crashers there). now we're back
> to the 2-3% that we were at during past beta cycles as well, which usually
> translates to ~1% in the general release population.
> 
> the investigations of the networking team showed that the underlying problem
> should be solved with the compiler switch to MSVC 2015, so hopefully firefox
> 49: https://bugzilla.mozilla.org/show_bug.cgi?id=1158189#c103

This should be resolved with 49, switch to MSVC 2015 fixed the problem. As Philipp said this bug only removed the code change that has caused an increase in the amount of crashes comparing to other betas.
marking this bug as resolved as well then. thank you
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.