Closed Bug 1044322 Opened 11 years ago Closed 11 years ago

b2g crashes during closing IPC channel

Categories

(Core :: IPC, defect, P1)

32 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
mozilla34
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: tkundu, Assigned: bent.mozilla)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 291][caf priority: p1][CR 696984])

Crash Data

Attachments

(2 files, 2 obsolete files)

Attached file stacktrace.txt
STR : 1) Run stability test for call,sms,video,music,camera,camcorder. 2) b2g crashes at some point. Logs https://drive.google.com/file/d/0B1cSMS8_GuAEUUtESW9fN0wxQ0k/edit?usp=sharing Video app (pid 26788) has closed IPC channel and we are seeing following logs in logcat just before crash: 07-23 23:46:51.544 I/Gecko (26788): ###!!! [Child][MessageChannel] Error: Channel closing: too late to send/recv, messages will be lost and b2g crashes at [1] while cleaning IPC request. [1] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#991 Crash timestamps is : 07-23 23:46:51.544 We are also seeing 49 FDs are opened with "/data/local/webapps/marketplace.firefox.com/application.zip" just before crash has happened (check lsof logs for crash timestamp)
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.0?
Whiteboard: [CR 696984] → [caf priority: p1][CR 696984]
Whiteboard: [caf priority: p1][CR 696984] → [b2g-crash][caf-crash 291][caf priority: p1][CR 696984]
Keywords: crash
Observed on: Device: msm8610 Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.040 Moz BuildID: 20140716000201 B2G Version: 2.0 Gecko Version: 32.0a2 Gaia: http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=5f8b1b8a2da9e3b531eee817a669f57fa4d9b9c6 Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e00f7e464333689fcf54edb4945ece94f97f930b
Component: Gaia::Video → IPC
Product: Firefox OS → Core
Version: unspecified → 32 Branch
This is very probably fixed by this part of the commit for bug 1013571: https://hg.mozilla.org/mozilla-central/diff/fc4d6b5ee8ce/ipc/glue/MessageLink.cpp The ProcessLink destructor was very racy and I had to fix it for bug 1013571. Unfortunately its not in a separate patch that can be uplifted with a simple merge. Tapas, can you try manually applying the changes from the linked diff to see if it helps?
Flags: needinfo?(tkundu)
Actually, here is the partial uplift patch. Not sure on the rules for reviewing this sort of thing, so flagging for review just to be on the safe side.
Attachment #8463389 - Flags: review?(bent.mozilla)
Comment on attachment 8463389 [details] [diff] [review] Partial uplift of bug 1013571 (fc4d6b5ee8ce) to fix ProcessLink destructor race. And.... I'm an idiot. I see now this patch is for ThreadLink, not ProcessLink. Sorry for the noise! *Goes to drink more coffee*
Attachment #8463389 - Attachment is obsolete: true
Attachment #8463389 - Flags: review?(bent.mozilla)
Flags: needinfo?(tkundu)
It would be interesting to know if we are hitting the delay path in OnNotifyMaybeChannelError: http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#1588 This seems a little scary to me as the delayed task doesn't check to see if the MessageChannel is still alive before running. It seems perhaps this delayed task could be queued up, the MessageChannel deleted, and then the queued task run on a dead object.
NI, can you confirm you the right assignee here, didn't want to get a wrong assignee ;) ?
blocking-b2g: 2.0? → 2.0+
Flags: needinfo?(bkelly)
Bhavana, I'll try to help, but I don't have the time available to take point on this. I'm currently booked to work another effort with an aggressive deadline. So I'd recommend finding a different person to assign this to. Sorry.
Flags: needinfo?(bkelly)
Anyone who works this, though, will need some way to reproduce it. Tapas, can you provide information on how to run the test referenced in comment 0 to get this crash to occur?
Flags: needinfo?(tkundu)
(In reply to Ben Kelly [:bkelly] from comment #9) > Anyone who works this, though, will need some way to reproduce it. Tapas, > can you provide information on how to run the test referenced in comment 0 > to get this crash to occur? I am sorry to tell you that it will be very difficult to reproduce this crash as we are running stability test for 24 hours. But if you can enable some logs from IPC layer (You can give us patch) then we can run stability test with that log and provide you logs.
Flags: needinfo?(tkundu)
Flags: needinfo?(bkelly)
reached out to overholt offline, he is trying to help get an assignee here given this is a critical p1 blocker for MTBF so passing this onto him until then.Thanks Andrew!
Assignee: nobody → overholt
Andrea spent some time on this today but is having a hard time getting started given the lack of a nice STR. He's going to sync with Ben Turner and figure something out.
Assignee: overholt → amarchesini
Clearing my NI as Andrea is now looking at this.
Flags: needinfo?(bkelly)
Ok, I know what's happening here. Patch in a bit.
Assignee: amarchesini → bent.mozilla
Tapas, can you give the patch in comment 16 a try?
Flags: needinfo?(tkundu)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #17) > Tapas, can you give the patch in comment 16 a try? Sure . We asked out internal team to verify it. Thanks a lot for your help
Flags: needinfo?(tkundu)
While we're waiting for verification let's get this reviewed too.
Attachment #8467867 - Flags: review?(benjamin)
Comment on attachment 8467867 [details] [diff] [review] Clear channel listener on correct thread, v1 I could use a little more documentation here: in what situations would the transport for a processlink have a previous listener at all? Can we document the listener behavior in a doccomment on Open()?
Attachment #8467867 - Flags: review?(benjamin) → review-
:bent this needs to get into CAF's build tonight before they kick off stability testing so can we get a patch up addressing comment #21 asap and a patch up that can be applied to 2.0 so CAF can pick it up?
Flags: needinfo?(bent.mozilla)
:bent/bhanvana -- if the changes in the patch are just documentation related and there are no functional update then we can continue to try the previous patch. Please let us know if that's the case.
I don't believe there are any issues here besides additional documentation. The patch should be fine for testing.
Flags: needinfo?(bent.mozilla)
Attachment #8469598 - Flags: review?(benjamin) → review+
Turns out bug 1035454 (landed at the same time) was causing the leaks. Re-landed. https://hg.mozilla.org/integration/mozilla-inbound/rev/5832624a0c03
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
Tapas, any updates here? Did this fix the problem?
Flags: needinfo?(tkundu)
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #31) > Tapas, any updates here? Did this fix the problem? I think that your patch has fixed issue as we are not able to reproduce this issue anymore in last 20 days. Thanks for your work :)
Flags: needinfo?(tkundu)
Flags: in-moztrap?(ychung)
No STR is present to create test case to address bug.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(ychung)
Flags: in-moztrap-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: