Closed Bug 1074419 Opened 10 years ago Closed 10 years ago

b2g crashes in nsAString_internal::Assign

Categories

(Firefox OS Graveyard :: Bluetooth, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.1+, firefox34 wontfix, firefox35 wontfix, firefox36 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
2.1 S7 (24Oct)
blocking-b2g 2.1+
Tracking Status
firefox34 --- wontfix
firefox35 --- wontfix
firefox36 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: tkundu, Assigned: tzimmermann)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 370][caf priority: p1][CR 729510])

Crash Data

Attachments

(3 files, 2 obsolete files)

Attached file stack trace
STR:

1) Run random stability testing for 24 hours

We are seeing this crash . I attached stack traces and logcat logs :
[Blocking Requested - why for this release]:
blocking-b2g: --- → 2.1?
Eric, looks like this comes from BT code.
Flags: needinfo?(echou)
Component: Stability → Bluetooth
Do we know why we are seeing this error in .extra file log ? 

09-27 19:16:20.454   202   706 I Gecko   : [Parent 202] WARNING: Message requires an excessive number of descriptors channel:a897e000 message-type:-1 header()->num_fds:10131 num_fds:0 fds_i:0: file ../../../../../../../../gecko/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 571

Please also look into bug 1074434 comment 4
Please take a look into Comment 4 and bug 1074434 comment 4. It seems to be something is going wrong during IPC. lsof command does not show 10131 fds.. we are seeing ~300fds for b2g process in lsof command.
Flags: needinfo?(benjamin)
Whiteboard: [CR 729510] → [caf priority: p1][CR 729510]
Whiteboard: [caf priority: p1][CR 729510] → [b2g-crash][caf-crash 370][caf priority: p1][CR 729510]
Keywords: crash
It there a better STR? Something we can do locally?
Hi Tapas,

we suspect bug 1063066 to cause this problem. Is it possible for you to backout the patch of bug 1063066 and test again? We could provide you with a backout patch, if necessary.
Flags: needinfo?(tkundu)
Attached patch backout_bug_1063066.patch (obsolete) — Splinter Review
Here's the backout patch for bug 1063066.
I am not the right person to look at this.
Flags: needinfo?(benjamin)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
> Created attachment 8497455 [details] [diff] [review]
> backout_bug_1063066.patch
> 
> Here's the backout patch for bug 1063066.

Could you please rebase this patch for v2.1 ?
Flags: needinfo?(tkundu) → needinfo?(tzimmermann)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #11)
> (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
> > Created attachment 8497455 [details] [diff] [review]
> > backout_bug_1063066.patch
> > 
> > Here's the backout patch for bug 1063066.
> 
> Could you please rebase this patch for v2.1 ?

I just found that our test build (which has reproduced this issue originally), does not have any fix from bug 1063066 . Our test team also is not using gallery app to reproduce this issue .

I would rather suspect this issue because of closing many fds wrongly (see comment 5) .

@bhavna: Could you please ask some IPC expert to take a look into it ?
Flags: needinfo?(bbajaj)
We need some logging patch which can confirm why we are hitting IPC error message in Comment 4 and Comment 5
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #12)
> (In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from
> comment #11)
> > (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
> > > Created attachment 8497455 [details] [diff] [review]
> > > backout_bug_1063066.patch
> > > 
> > > Here's the backout patch for bug 1063066.
> > 
> > Could you please rebase this patch for v2.1 ?
> 
> I just found that our test build (which has reproduced this issue
> originally), does not have any fix from bug 1063066 . Our test team also is
> not using gallery app to reproduce this issue .
> 
> I would rather suspect this issue because of closing many fds wrongly (see
> comment 5) .
> 
> @bhavna: Could you please ask some IPC expert to take a look into it ?

Ni :cyu, :khuey to see if they have any other ideas here or give you a patch to get more logs. Hopefully they can help or pass it to the right person.
blocking-b2g: 2.1? → 2.1+
Flags: needinfo?(khuey)
Flags: needinfo?(cyu)
Flags: needinfo?(bbajaj)
The stack is in bluetooth code, so the bluetooth team needs to look at this.
Flags: needinfo?(khuey)
Flags: needinfo?(cyu)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> The stack is in bluetooth code, so the bluetooth team needs to look at this.
Thomas and I are working on this bug.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> The stack is in bluetooth code, so the bluetooth team needs to look at this.

Right, but I don't know the available logging options for memory leaks. I remember you were working on this topic. Can you provide some information or pointers to documentation?
Flags: needinfo?(tzimmermann) → needinfo?(khuey)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #12)
> (In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from
> comment #11)
> > (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #9)
> > > Created attachment 8497455 [details] [diff] [review]
> > > backout_bug_1063066.patch
> > > 
> > > Here's the backout patch for bug 1063066.
> > 
> > Could you please rebase this patch for v2.1 ?
> 
> I just found that our test build (which has reproduced this issue
> originally), does not have any fix from bug 1063066 . Our test team also is
> not using gallery app to reproduce this issue .
> 
> I would rather suspect this issue because of closing many fds wrongly (see
> comment 5) .

In this case, could you test _with_ the v2.1 patch from bug 1063066? Bug 1063066 fixes a problem with file descriptors and is unrelated to the Gallery app.
Flags: needinfo?(tkundu)
I found a potential problem with stale file descriptors when the Socket setup gets interrupted. Could you check if the patch makes a difference, maybe together with bug 1063066?
Attachment #8498037 - Flags: feedback?(tkundu)
(In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #17)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> > The stack is in bluetooth code, so the bluetooth team needs to look at this.
> 
> Right, but I don't know the available logging options for memory leaks. I
> remember you were working on this topic. Can you provide some information or
> pointers to documentation?

We have tools to log the JS and parts of the C++ heap, and tools to find untracked memory.  But I don't see any evidence that this is caused by a memory leak.  The pointer we're crashing at is close to 0x5a5a5a5a which is the mark for freed memory, so this looks like a use-after-free.
Flags: needinfo?(khuey)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #20)
> (In reply to Thomas Zimmermann [:tzimmermann] [:tdz] from comment #17)
> > (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #15)
> > > The stack is in bluetooth code, so the bluetooth team needs to look at this.
> > 
> > Right, but I don't know the available logging options for memory leaks. I
> > remember you were working on this topic. Can you provide some information or
> > pointers to documentation?
> 
> We have tools to log the JS and parts of the C++ heap, and tools to find
> untracked memory.  But I don't see any evidence that this is caused by a
> memory leak.  The pointer we're crashing at is close to 0x5a5a5a5a which is
> the mark for freed memory, so this looks like a use-after-free.

We are seeing following output in b2g-info when crash happened. 

==- 00011713, 2014-09-29 04:46:55, 2014-09-28 23:16:55 UTC, up 56665 -==
                            |       megabytes      |
           NAME   PID  PPID  CPU(s) NICE   USS   PSS   RSS SWAP VSIZE OOM_ADJ USER     
            b2g 15283     1 10233.0    0 187.7 193.2 212.3  0.0 479.1       0 root     
(Preallocated a  2334 15294     0.4   18   5.8   7.8  19.2  0.0  60.9       1 u0_a2334 
         (Nuwa) 15294 15283    94.6    0   2.1   4.2  14.0  0.0  54.1     -16 root     
Built-in Keyboa 15492 15294    53.2   18   6.7   9.1  21.7  0.0  63.5      10 u0_a15492
     Homescreen 15666 15283   173.1   18  17.5  21.8  37.2  0.0  83.7       8 u0_a15666
 Communications 27803 15294     4.8   18   8.9  11.6  25.0  0.0  65.7      10 u0_a27803
       FM Radio 28061 15294     0.9    1   7.6  10.3  23.3  0.0  68.5       2 u0_a28061
plugin-containe 30892 15283     0.3    1   4.3   6.1  12.5  0.0  53.0       2 u0_a30892

System memory info:

            Total 847.5 MB
        SwapTotal   0.0 MB
     Used - cache 492.6 MB
  B2G procs (PSS) 264.0 MB
    Non-B2G procs 228.6 MB
     Free + cache 354.8 MB
             Free  62.0 MB
            Cache 292.9 MB
         SwapFree   0.0 MB

Low-memory killer parameters:

  notify_trigger 14336 KB

  oom_adj min_free
        0  4096 KB
       58  5120 KB
      117  6144 KB
      352  7168 KB
      470  8192 KB
      588 20480 KB


b2g process VSize 479 is very big but we still have enough memory in system when crash happened. I am also in touch with our internal team to collect gecko memory report when it happens again !
Flags: needinfo?(tkundu)
(In reply to Fabrice Desré [:fabrice] from comment #3)
> Eric, looks like this comes from BT code.

Thomas is digging into it.
Flags: needinfo?(echou)
(In reply to Eric Chou [:ericchou] [:echou] from comment #22)
> (In reply to Fabrice Desré [:fabrice] from comment #3)
> > Eric, looks like this comes from BT code.
> 
> Thomas is digging into it.

Well, I had a look at the BT socket code and found a possible cause for all these file descriptors mentioned in comment #4, but in general 'lsof' looks normal here. If this bug is instead a reference-counting problem, I don't see why. I'd need a better STR to reproduce it locally. And I'm waiting for Tapas to return with some feedback on the possible fixes in bug 1063066 and the attached patch.
Tapas, Thomas is waiting for your feedback on this bug
Flags: needinfo?(tkundu)
I asked our test team to provide feedback with these two patches but our test team is in holidays now. So please expect 3 days delay for response from me . Sorry for the delayed response.
Flags: needinfo?(tkundu)
Thomas is working on this. If I am wrong, please reassign it.
Assignee: nobody → tzimmermann
Well, I'm not actually _actively_ working on the bug. The current status is still as described in comments 23 and 25.
Hi,

Can you provide some more information on the actual test? What was it doing when it crashed? Does the problem only happen after hours of testing?
Flags: needinfo?(tkundu)
Hi Bhavana, I wonder if you could talk with Tapas to get his update. Thanks.
Flags: needinfo?(bbajaj)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #25)
> I asked our test team to provide feedback with these two patches but our
> test team is in holidays now. So please expect 3 days delay for response
> from me . Sorry for the delayed response.

I am still waiting for our test team update on this. Sorry for unexpected delays.
Touchbased with CAF, they do not have the latest report yet but are expected to have one this week.
Flags: needinfo?(bbajaj)
Comment on attachment 8498037 [details] [diff] [review]
Close received socket file descriptors on errors

Review of attachment 8498037 [details] [diff] [review]:
-----------------------------------------------------------------

We tested with fix from both Bug 1074419 attachment 8498037 [details] [diff] [review] & bug 1063066 and we are not seeing this issue anymore .

Thanks for your help :)
Attachment #8498037 - Flags: feedback?(tkundu) → feedback+
Flags: needinfo?(tkundu) → needinfo?(tzimmermann)
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #33)
> Comment on attachment 8498037 [details] [diff] [review]
> Close received socket file descriptors on errors
> 
> Review of attachment 8498037 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> We tested with fix from both Bug 1074419 attachment 8498037 [details] [diff] [review]
> [review] & bug 1063066 and we are not seeing this issue anymore .
> 
> Thanks for your help :)

Great, thank you! :) I'll cleanup the attached patch and land it in the next days.
Flags: needinfo?(tzimmermann)
No longer blocks: CAF-v2.1-FC-metabug
Status: NEW → ASSIGNED
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Crash 370 (issue 370) has not been seen for 21 days.  Closing
Sorry for wrong comment 35. Please cleanup existing patch and land a fix for this issue. 

Sorry for confusion.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Attachment #8497455 - Attachment is obsolete: true
Attachment #8498037 - Attachment is obsolete: true
Attachment #8507753 - Flags: review?(shuang)
Comment on attachment 8507753 [details] [diff] [review]
[01] Bug 1074419: Close received socket file descriptors on errors

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

The bug is probably caused by bug 1029387, patches [05][06][07]; where we refactored the Bluetooth socket code.

User impact if declined: 

Not cleaning up resources leads to DOS after running Bluetooth for a long time (days). The affected code is for error handling and usually only triggered by corner cases, such as pairing problems or closing Bluetooth before pairings have been established.

Testing completed: 

Tested by CAF testers.

Risk to taking this patch (and alternatives if risky): 

The patch adds some cleanup code for errors. It shouldn't affect normal operation.

String or UUID changes made by this patch:

None
Attachment #8507753 - Flags: approval-mozilla-b2g34?
Attachment #8507753 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
https://hg.mozilla.org/mozilla-central/rev/735159446aef
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S7 (24Oct)
Unable to verify, requires stability test for 24 hours.
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-] → [QAnalyst-Triage+][QAnalyst-verify-]
Flags: needinfo?(ktucker)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: