Crash in HttpChannelParentListener while stability testing

RESOLVED FIXED in 2.2 S13 (29may)

Status


defect
--
critical
RESOLVED FIXED
4 years ago
4 years ago

People

(Reporter: ggrisco, Assigned: jduell.mcbugs)

Tracking

({crash})

unspecified
2.2 S13 (29may)
ARM
Gonk (Firefox OS)
Points:
---

Firefox Tracking Flags

(blocking-b2g:2.2+, b2g-v2.2 fixed, b2g-master unaffected)

Details

(Whiteboard: [b2g-crash][caf-crash 638][caf priority: p1][CR 840028], crash signature)

Attachments

(9 attachments)

(Reporter)

Description

4 years ago
Saw this crash 3 times on AU 157 while running stability tests over many hours:

[@ mozilla::net::HttpChannelParentListener::OnDataAvailable | mozilla::net::nsHttpChannel::OnDataAvailable | nsInputStreamPump::OnStateTransfer | nsInputStreamPump::OnInputStreamReady ]

cafbot will upload minidump and logs soon.
Patrick, would you be able to take a look at this or redirect to someone appropriate?
Flags: needinfo?(mcmanus)
blocking-b2g: 2.2? → 2.2+
Assignee: nobody → jduell.mcbugs
Flags: needinfo?(mcmanus)
Whiteboard: [CR 840028] → [caf priority: p1][CR 840028]
Whiteboard: [caf priority: p1][CR 840028] → [b2g-crash][caf-crash 638][caf priority: p1][CR 840028]
Keywords: crash
(Assignee)

Comment 7

4 years ago
I'm not sure if we can trust the stack trace here.  HttpChannelParentListener::OnDataAvailable() only dereferences a single pointer (mNextListener), and it checks it for null first (and I don't see any places it could be modified on another thread), so AFAICT it's not logically possible that we're actually crashing in that function.  Perhaps it's the next ODA callee in the chain, though that doesn't look promising either.
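
For illustration, the null-checked forwarding pattern described above can be sketched as follows. This is a minimal model, not the necko source: `Listener`, `CountingListener`, and `ForwardingListener` are hypothetical stand-ins, and only the member name `mNextListener` comes from the comment. The point is that a forwarder which tests its single pointer before dereferencing it cannot itself fault on a null pointer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a stream listener interface; this is not the
// real nsIStreamListener.
struct Listener {
  virtual ~Listener() = default;
  // Returns true on success, false on failure.
  virtual bool OnDataAvailable(const std::string& data) = 0;
};

// Hypothetical downstream listener that counts the bytes it is handed.
struct CountingListener : Listener {
  std::size_t bytesSeen = 0;
  bool OnDataAvailable(const std::string& data) override {
    bytesSeen += data.size();
    return true;
  }
};

// Forwards to the next listener only after a null check, mirroring the
// pattern described for mNextListener.
struct ForwardingListener : Listener {
  Listener* mNextListener = nullptr;
  bool OnDataAvailable(const std::string& data) override {
    if (!mNextListener) {
      return false;  // nothing to forward to: report failure, do not crash
    }
    return mNextListener->OnDataAvailable(data);
  }
};
```

With no next listener set, `OnDataAvailable` returns false rather than faulting, so a crash attributed to this frame would have to come from somewhere else, such as an assertion or a callee.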

An HTTP log might help here:  Greg, is there any chance you could get one, attach it here, and needinfo me?

  https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging

(Sadly, that doc doesn't say how to set up logging on Android/B2G; :blassey or :mayhemer might know how. Another issue might be the size of the log: it'll get big if you're running for hours.)
Flags: needinfo?(ggrisco)
(Reporter)

Comment 8

4 years ago
Hi Jason, thanks for looking into this.  It would be better if you could provide a logging patch that we can apply and that takes the long running time into consideration.  It takes some time to land these patches internally and then build and send them to the test team, so the sooner we can have this the better, since we're up against a deadline.

Although the stack trace may not be exactly correct, we have seen this same crash signature multiple times and we aren't seeing any other spurious crashes.  All of the logs I've seen for this show browser activity near the time of the crash.

The only other crash we're seeing currently is bug 1162663 which doesn't look related.
Flags: needinfo?(ggrisco) → needinfo?(jduell.mcbugs)
Jason, I'm pretty confident this is the MOZ_RELEASE_ASSERT hitting, since MOZ_REALLY_CRASH (which is the underlying assertion mechanism) triggers a write to NULL: https://dxr.mozilla.org/mozilla-central/source/mfbt/Assertions.h#198
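
As a rough illustration of why a failed release assertion can look like a null dereference in the asserting function, here is a minimal sketch of an assertion that crashes by storing to address zero. It is not the mfbt macro itself: `SketchReallyCrash`, `SKETCH_RELEASE_ASSERT`, and `SketchCheckedDivide` are hypothetical names, and the only behavior taken from the comment above is that the crash is triggered by a write to NULL.

```cpp
#include <cstdlib>

// Hypothetical crash helper: store to address zero so the fault address is
// null, then abort as a fallback in case the store does not trap.
[[noreturn]] inline void SketchReallyCrash(int line) {
  *static_cast<volatile int*>(nullptr) = line;
  std::abort();
}

// Hypothetical release assertion: the check is always compiled in, and a
// failure crashes the process on the spot.
#define SKETCH_RELEASE_ASSERT(expr)  \
  do {                               \
    if (!(expr)) {                   \
      SketchReallyCrash(__LINE__);   \
    }                                \
  } while (0)

// Example caller: a failing assertion here would be reported as a crash in
// this function with a null fault address, although no pointer of the
// caller's own is dereferenced.
inline int SketchCheckedDivide(int a, int b) {
  SKETCH_RELEASE_ASSERT(b != 0);
  return a / b;
}
```

Under this assumption, a crash signature that ends in a function which only performs null-checked dereferences is still consistent with an assertion in that function firing.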
Jason, do you have an update you can provide on this issue?
(Assignee)

Comment 17

4 years ago
So I think jdm is right about the assertion.  And that means that we're somehow still delivering OnDataAvailable after we've diverted the channel.  Which is a bug in necko for sure, but it's not clear yet if it's due to 1) changes in the necko code, or 2) some new use case of DivertTo() that's exposing a long-standing bug, or 3) a bug that's always been there but is only exposed by the stability tests.
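
To make the failure mode concrete, here is a small model of a listener that must not receive data once it has been diverted. It is only a sketch under stated assumptions: `SketchListener` and `SketchPump` are hypothetical, an exception stands in for the release assertion, and the model does not describe the actual necko bug or its fix. It shows one generic way such a delivery can happen: data already queued in a pump still gets handed over if the pump is not suspended when the listener is diverted.

```cpp
#include <deque>
#include <stdexcept>
#include <string>

// Hypothetical listener: once diverted, any further data delivery is a bug.
class SketchListener {
 public:
  void SuspendForDiversion() { mSuspendedForDiversion = true; }

  void OnDataAvailable(const std::string& chunk) {
    if (mSuspendedForDiversion) {
      // A release assertion would crash here; the sketch throws instead so
      // the condition can be observed without killing the process.
      throw std::logic_error("OnDataAvailable after diversion");
    }
    mReceived += chunk;
  }

  const std::string& Received() const { return mReceived; }

 private:
  bool mSuspendedForDiversion = false;
  std::string mReceived;
};

// Hypothetical pump that hands queued chunks to the listener unless it has
// been suspended.
class SketchPump {
 public:
  explicit SketchPump(SketchListener& listener) : mListener(listener) {}

  void Queue(const std::string& chunk) { mPending.push_back(chunk); }
  void Suspend() { mSuspended = true; }

  // Delivers at most one queued chunk; returns true if one was delivered.
  bool DeliverOne() {
    if (mSuspended || mPending.empty()) {
      return false;
    }
    std::string chunk = mPending.front();
    mPending.pop_front();
    mListener.OnDataAvailable(chunk);
    return true;
  }

 private:
  SketchListener& mListener;
  std::deque<std::string> mPending;
  bool mSuspended = false;
};
```

In this model, diverting the listener while the pump keeps running trips the check on the next queued chunk, whereas suspending the pump at the same time means nothing further is delivered.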

> Saw this crash 3 times on AU 157 while running stability tests over many hours:

Greg: how much of a pain would it be to try to get a regression range for this?  It sounds like it might take a while if you only see this after many hours.  Also, do we know if we have any reports of this in the wild? (I forget how good our crash reporting is for Firefox OS, especially for parent crashes).
Flags: needinfo?(jduell.mcbugs) → needinfo?(ggrisco)
(Assignee)

Comment 18

4 years ago
Dragana:  sworkman tells me you may have run into crashes like these when you were working on some Divert-related patches. Does that ring a bell? Do you have cycles to look into this?
Flags: needinfo?(dd.mozilla)
Bug 1097878 (looks like similar stack trace) and bug 1106396 (needed an extra patch) are what I was referring to. Not sure how FxOS 2.2 is related to Fx36-38. Maybe the patches could be uplifted?
Steve, FxOS 2.2 is using gecko 37.
There is a patch in bug 1106396.

It is not in b2g 2.2; it should be uplifted.
Flags: needinfo?(dd.mozilla)
This is a rather small patch.
(Reporter)

Comment 23

4 years ago
(In reply to Jason Duell [:jduell] (needinfo? me) from comment #17)

> Greg: how much of pain would it be to try to get a regression range for
> this?  It sounds like it might take a while if you only see this after many
> hours.  Also, do we know if we have any reports of this in the wild? (I
> forget how good our crash reporting is for Firefox OS, especially for parent
> crashes).

We started seeing this in AU 157; there were no prior reports.  It is possible that the issue existed before that but wasn't seen because of other frequent crashes, so it's hard to say.  Regardless, here's the breakdown:

AU 157:  Seen 12 times
AU 159:  Seen 18 times
AU 162:  Seen 4 times so far (still testing)

cafbot has commented on each of these builds, so you should have gaia/gecko versions for each.
Flags: needinfo?(ggrisco)
Comment on attachment 8609647 [details] [diff] [review]
bug_1106396_fix_v2_suspend.patch

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Long existing bug
User impact if declined: crash
Testing completed: It has been in since 38
Risk to taking this patch (and alternatives if risky): Low risk, it has been running for a couple of months already.
String or UUID changes made by this patch: none
Attachment #8609647 - Flags: approval-mozilla-b2g37?
Attachment #8609647 - Flags: approval-mozilla-b2g37? → approval-mozilla-b2g37+
The other crash with the same signature, bug 1153256, is caused by an add-on, so this patch should fix it.
https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/3ca4af0fc6f3
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.2 S13 (29may)