Closed Bug 1093278 Opened 10 years ago Closed 10 years ago

b2g has crashed in libxul.so!CanonicalizeXPCOMParticipant [nsCycleCollector.cpp : 934 + 0x6]

Categories

(Firefox OS Graveyard :: Stability, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:2.1+, firefox34 wontfix, firefox35 fixed, firefox36 fixed, firefox-esr31 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
2.1 S9 (21Nov)
blocking-b2g 2.1+
Tracking Status
firefox34 --- wontfix
firefox35 --- fixed
firefox36 --- fixed
firefox-esr31 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: tkundu, Assigned: mccr8)

References

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 394][caf priority: p1][CR 749784])

Crash Data

Attachments

(4 files, 2 obsolete files)

Attached file stack trace
STR: Random stability test for a long time.
[Blocking Requested - why for this release]: b2g has crashed


Please note that we are seeing following memleak at time of crash : 

==- 00051712, 1414999732, 2014-11-03 12:58:52, 2014-11-03 07:28:52 UTC, up 232334 -==
                           |       megabytes       |
           NAME   PID PPID   CPU(s) NICE   USS   PSS   RSS SWAP VSIZE OOM_ADJ USER     
            b2g  2751    1 146854.6    0 248.2 259.2 279.2  0.0 488.2       0 root     
         (Nuwa)  2757 2751    924.5    0   1.9   4.8  13.6  0.0  53.8     -16 root     
Built-in Keyboa  2953 2757    257.9   18   6.7  10.4  21.3  0.0  64.1      10 u0_a2953 
     Homescreen  3125 2751   4829.2   18  21.0  27.1  40.5  0.0  99.1       8 u0_a3125 
(Preallocated a 32701 2757      0.2   18   5.3   8.6  18.9  0.0  60.7       1 u0_a32701

System memory info:

            Total 847.5 MB
        SwapTotal   0.0 MB
     Used - cache 489.1 MB
  B2G procs (PSS) 310.2 MB
    Non-B2G procs 178.9 MB
     Free + cache 358.4 MB
             Free 132.2 MB
            Cache 226.1 MB
         SwapFree   0.0 MB


and /sys/class/kgsl/kgsl/page_alloc is 69472256 bytes = 69MB which clearly says kgsl memleak ny b2g process. unfortunately we don't have DMD report for now..

one more point is that 

1) Crash has happened inside nsCycleCollector.cpp and system has enough memory to continue although b2g is leaking kgsl memory. 



I will create a separate bug for memleak issue. 

For this crash, we need a logging patch to confirm what exactly is happening inside gecko.
blocking-b2g: --- → 2.1?
Flags: needinfo?(khuey)
Whiteboard: [CR 749784]
b2g memleak bug 1093281 for comment 2

And here is the full logcat logs :

https://drive.google.com/file/d/0B1cSMS8_GuAETGFQV0piQnI4eFE/view?usp=sharing

crash timestamp is : 
"11-03 12:58:54.887  2752  2789 E AudioFlinger: no wake lock to update!" in full logcat logs.
I'll direct this to Andrew for now since I'm swamped.
Flags: needinfo?(khuey) → needinfo?(continuation)
Flags: needinfo?(continuation)
Flags: needinfo?(continuation)
Are these non-DEBUG builds?  There are some checks we do in debug builds we could try in a non-debug build.
Flags: needinfo?(tkundu)
khuey in IRC said they are not
Flags: needinfo?(tkundu)
Whiteboard: [CR 749784] → [caf priority: p1][CR 749784]
Whiteboard: [caf priority: p1][CR 749784] → [b2g-crash][caf-crash 394][caf priority: p1][CR 749784]
Keywords: crash
Observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_2.0.01.04.00.114.127
Moz BuildID: 20141029001202
B2G Version: 2.1
Gecko Version: 34.0
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=eb0aab0f13c78c7ac378ad860e865c4b6eaf669f
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e16ae12c5ab365be13fb2c98d3811b93b25ded6d
Patches: bug 1070431, bug 1083449
Olli, what do you think this crash stack means?  Maybe that the argument to CanonicalizeXPCOMParticipant is null?
Assignee: nobody → continuation
Flags: needinfo?(continuation) → needinfo?(bugs)
blocking-b2g: 2.1? → 2.1+
My guess here is that there's some kind of bug in UnmarkRemainingPurple, because I think we rarely actually call it, and it sets the mObject pointer to null.  In fact, looking at the code now, if mCount > 0, in FreeBlocks(), then we null out pointers in mFirstBlock, but of course we don't actually delete it. Then if we trigger something that iterates over the purple buffer before the first block is filled, I think we'll hit null pointers.  So maybe that's the issue?  But it would probably be safer to just null check in VisitEntries(), if in fact it is always null like this.
Eh, it looks like UnmarkRemainingPurple() is okay, thanks to the InitBlocks() call after, so I'm not sure what is going on. We can try the null check I guess.
This should be pretty safe. I checked and everything that reads mObject from the purple buffer goes through VisitEntries so this shouldn't move the crash elsewhere.
Attachment #8516348 - Flags: review?(bugs)
Rebased on top of b2g34, though the patch looks identical, I think.
Attachment #8516348 - Attachment is obsolete: true
Attachment #8516348 - Flags: review?(bugs)
Attachment #8516354 - Flags: review?(bugs)
Olli pointed out that an assert would be good.
Attachment #8516354 - Attachment is obsolete: true
Attachment #8516354 - Flags: review?(bugs)
Attachment #8516365 - Flags: review?(bugs)
Tapas, you could try this patch.  It should fix the immediate cause of the crash, assuming they are all null like this.  I think we won't end up crashing elsewhere, but I can't tell for sure.
Flags: needinfo?(tkundu)
Attachment #8516365 - Flags: review?(bugs) → review+
I'm waiting on landing this until we think it actually solves the 2.1 issue.

The try run against trunk is green:
  https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=e42b6404cb16
(In reply to Andrew McCreight [:mccr8] from comment #15)
> Tapas, you could try this patch.  It should fix the immediate cause of the
> crash, assuming they are all null like this.  I think we won't end up
> crashing elsewhere, but I can't tell for sure.

We landed this patch internally and we are waiting for feedback from our test team. I will update here asap
(In reply to Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me) from comment #17)
> (In reply to Andrew McCreight [:mccr8] from comment #15)
> > Tapas, you could try this patch.  It should fix the immediate cause of the
> > crash, assuming they are all null like this.  I think we won't end up
> > crashing elsewhere, but I can't tell for sure.
> 
> We landed this patch internally and we are waiting for feedback from our
> test team. I will update here asap

We are not seeing this again with your patch. It seems to me that this bug is very difficult to reproduce (we have seen only once) and I am not sure this patch really fixed this bug or not. 

Please go ahead and land this patch if you are ok with this :)  Thanks for your help .
Flags: needinfo?(tkundu) → needinfo?(continuation)
Alright, thanks!
Flags: needinfo?(continuation)
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/1bff6f64a418
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.1 S9 (21Nov)
Comment on attachment 8516365 [details] [diff] [review]
Null check in nsPurpleBuffer::VisitEntries.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): unknown
User impact if declined: 2.1 blocker. I want to get this on non-B2G branches, too, because we run this code on desktop, and it will get us more testing.
Testing completed: TBPL
Risk to taking this patch (and alternatives if risky): very low, just a null check.
String or UUID changes made by this patch: none
Attachment #8516365 - Flags: approval-mozilla-beta?
Attachment #8516365 - Flags: approval-mozilla-b2g34?
Attachment #8516365 - Flags: approval-mozilla-aurora?
Attachment #8516365 - Flags: approval-mozilla-b2g34? → approval-mozilla-b2g34+
Comment on attachment 8516365 [details] [diff] [review]
Null check in nsPurpleBuffer::VisitEntries.

I spoke with mccr8 about this request. It's not worth taking on Beta but can provide some value on Aurora.

Beta-
Aurora+
Attachment #8516365 - Flags: approval-mozilla-beta?
Attachment #8516365 - Flags: approval-mozilla-beta-
Attachment #8516365 - Flags: approval-mozilla-aurora?
Attachment #8516365 - Flags: approval-mozilla-aurora+
We see that crash on ESR 31.3. so would be worth to backport it there too.
Martin, any chance you have STR for the esr crash?
Flags: needinfo?(stransky)
Comment on attachment 8516365 [details] [diff] [review]
Null check in nsPurpleBuffer::VisitEntries.

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: Crashes, see comment 25. I don't know how common these are.
User impact if declined: crashes
Fix Landed on Version: 37 maybe? recent
Risk to taking this patch (and alternatives if risky): very low
String or UUID changes made by this patch: none
Attachment #8516365 - Flags: approval-mozilla-esr31?
Attached file ESR 31.3 Linux STR
There's the stracktrace. Catched on Linux/RHEL5.11.
Flags: needinfo?(stransky)
Unfortunately stack trace isn't enough here, since that is too generic. We just have a null pointer there, but the question is why.
I see a crash where also aEntry->mParticipant in VisitEntries() is null.
If I add the aEntry->mParticipant on the same place the app crashes in  GCGraphBuilder::DescribeRefCountedNode():

 2051        if (refCount == UINT32_MAX)                                    ?
>2052            Fault("overflowing refcount", mCurrPi)
It is normal for mParticipant to be null in VisitEntries().  Please file a new bug for your issue if the patch here is insufficient.  Thanks.
Attachment #8516365 - Flags: approval-mozilla-esr31? → approval-mozilla-esr31+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: