Closed Bug 915869 Opened 11 years ago Closed 11 years ago

Firefox OS crash in mozalloc_abort(char const*) | NS_DebugBreak | mozilla::layers::PGrallocBufferParent::Write(mozilla::layers::PGrallocBufferParent*, IPC::Message*, bool)

Categories

(Core :: Graphics: Layers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla27
blocking-b2g koi+
Tracking Status
firefox25 --- wontfix
firefox26 --- fixed
firefox27 --- verified
firefox-esr24 --- unaffected
b2g-v1.2 --- verified

People

(Reporter: marcia, Assigned: bjacob)

References

Details

(4 keywords, Whiteboard: [burirun1][osrestartcrash][b2g-crash])

Crash Data

Attachments

(5 files, 1 obsolete file)

This bug was filed from the Socorro interface and is 
report bp-bdf38910-9c8f-4c34-b14d-b50e52130912.
=============================================================

Seen multiple times while testing Buri device today. Seen while changing wallpaper and operating in Gallery. More crashes here: https://crash-stats.mozilla.com/report/list?product=B2G&signature=mozalloc_abort%28char%20const*%29%20|%20NS_DebugBreak%20|%20mozilla::layers::PGrallocBufferParent::Write%28mozilla::layers::PGrallocBufferParent*,%20IPC::Message*,%20bool%29

Buri Device using: 

Gaia   9ffd2899eb91388f7fc1ce6f7a895a6f5f922c05
SourceStamp a98569f21abe
BuildID 20130912040201
Version 26.0a1

STR:
1. Long press on homescreen to change wallpaper.
2. Select a picture from Wallpaper
3. Device crashes.

This is not 100% reproducible but the crash has been seen during other operations as well.
Blocks: GFXB2G1.2
blocking-b2g: --- → koi?
Keywords: regression
All the crashes are occurring with 20130912040201 Build ID. Yesterday's build had graphics crashes as well, but not in this same stack. Might be hard to get a regression window unless we can reproduce reliably, which right now I cannot.
Whiteboard: burirun1
Whiteboard: burirun1 → [burirun1][osrestartcrash]
I can consistently reproduce this by visiting https://bugzilla.mozilla.org/show_bug.cgi?id=901214 in the browser, navigating around, and reloading.
Keywords: reproducible
FWIW, I can reproduce this on 20130913100534.   Calling this out in burirun1 announcement to ensure visibility.
Does anybody testing this has the ability to change a single pref and re-run?  If so, could you tell us if you can still reproduce after changing layers.force-tiles to false?  This preference default is being changed today, so tonight's build will have that value anyway, but wanted to get a head start.
Assignee: nobody → milan
blocking-b2g: koi? → koi+
Not reproducible in emulator, with either value of the layers.force-tiles preference. Hwc related?
Still can't reproduce in emulator (after much usage).

But the stack looks like the _converse_ of bug 912725.

In bug 912725 we had a TextureHost that was failing to notify the GrallocBufferActor when it goes away. So the GrallocBufferActor was keeping a dangling pointer to the dead TextureHost.

Here we have the opposite. Some GrallocBufferActor failed to inform a TextureHost that it went away. So the TextureHost crashes as it kept a dangling pointer to the dead GrallocBufferActor.
Depends on: 879681
Looking at code, there are two ways that we could get into this situation:

 1) (Unlikely) The GrallocBufferActor was destroyed but ActorDestroy() was never called on it.

 2) (More likely) Some TextureHost had its mBuffer pointer pointing to that GrallocBufferActor, but that was manually set, not set by SetBuffer() which does the right thing to inform the GrallocBufferActor of the TextureHost referencing it.
Grepping through code, it appears that the patch in bug 912725 fixed the last occurrence of issue 2) in comment 8, so the cause of the bug must be different.

Here is a scenario that might conceivably be what's happening (hey, I still can't reproduce locally, so I have to make guesses).

Suppose we had a TextureHost, let's call it TextureHost1, and a GrallocBufferActor normally referencing each other:

    TextureHost1  <------>  GrallocBufferActor

Now suppose that another TextureHost, let's call it TextureHost2, is being passed that same GrallocBufferActor (could conceivably happen with a DeprecatedImageHostBuffered's way of doing double buffering). So we now have TextureHost2 and GrallocBufferActor referencing each other, but the TextureHost1 was never notified about it, so the graph of references now is

    TextureHost1   ------>  GrallocBufferActor   <-------> TextureHost2

So when the GrallocBufferActor goes away, as it only remembers about TextureHost2 and not about TextureHost1, it only notifies the TextureHost2 to drop its reference to it:

    TextureHost1   ------>  DEAD GrallocBufferActor        TextureHost2

So we are left with a TextureHost1 referencing a DEAD GrallocBufferActor. When TextureHost1 is destroyed, it will send a __delete__ message to the dead GrallocBufferActor, giving a crash with the same stack as we're observing here.
Attached patch SetDeprecatedTextureHost (obsolete) — — Splinter Review
Yup, this looks like an omission on my part when I wrote that very hacky code trying to work around the tragedy that is GrallocBufferActor.

See comment 9 for the theory that this is implementing.
Attachment #804596 - Flags: review?(jmuizelaar)
...there was a bug in my previous patch (which shows how tricky this all it).

In case SetDeprecatedTextureHost is passed the same DeprecatedTextureHost as we already have, we don't want to tell it to forget about us!
Attachment #804596 - Attachment is obsolete: true
Attachment #804596 - Flags: review?(jmuizelaar)
Attachment #804598 - Flags: review?(jmuizelaar)
Comment on attachment 804598 [details] [diff] [review]
Let the GrallocBufferActor inform its old TextureHost when it's switching to another TextureHost

Cancelling reviews for now, because this crashes. The problem is that ForgetBuffer tells the TextureHost to delete mBuffer. But as (that is the whole point of this bug and patch) multiple TextureHosts can be dealing with the same buffer, and mBuffer is a pointer so that currently two TextureHosts can be referencing the same SurfaceDescriptor... so if one deletes it, the other crashes.

This additional level of indirection is apparently useless anyway, so let's remove it.
Attachment #804598 - Flags: review?(jmuizelaar)
QA Contact: sarsenyev
This patch changes the mBuffer field of DeprecatedTextureHost to be a SurfaceDescriptor stored by value, instead of a SurfaceDescriptor* pointer.

That third level of indirection (a SurfaceDescriptor is already a GrallocBufferActor*, and a GrallocBufferActor points to a android::GraphicBuffer...) seems useless, right?

I need to do this, because the problem with my other patch on this bug is that two DeprecatedTextureHosts may be pointing to the same SurfaceDescriptor, so that if one gets notified when the GrallocBufferActor goes away, and ForgetBuffer() is called, doing | delete mBuffer; |, the other TextureHost still has its mBuffer pointing to that now-dead SurfaceDescriptor. So it makes my life a lot easier if DeprecatedTextureHosts each store their own SurfaceDescriptor by value.

Is that OK?
Attachment #804805 - Flags: review?(nical.bugzilla)
Comment on attachment 804598 [details] [diff] [review]
Let the GrallocBufferActor inform its old TextureHost when it's switching to another TextureHost

Review of attachment 804598 [details] [diff] [review]:
-----------------------------------------------------------------

Alright, time to re-request review. With the other patch here, ensuring that no two DeprecatedTextureHosts point refer to the same descriptor, this patch now seems to work fine.

I still couldn't reproduce the bug originally reported here (not even now that I have a hamachi) so I can't be certain that it will be fixed by these patches, but regardless, these patches are needed IMO.
Attachment #804598 - Flags: review?(jmuizelaar)
Assignee: milan → bjacob
Attachment #804598 - Attachment description: SetDeprecatedTextureHost → Let the GrallocBufferActor inform its old TextureHost when it's switching to another TextureHost
Attachment #804805 - Attachment description: mbuffer-not-pointer → DeprecatedTextureHost should store its SurfaceDescriptor by value, and not share it with anything else
Do you need a regression window here? Looks like you've got patches for a fix, so I'm wondering if we still need to try to get a regression window.
Good question. I think that at this point, what I would like the most to know is if you can still reproduce this crash with a build using today's mozilla-central (so, any build labeled 20130914 or later), as I've been unable to reproduce (despite identifying various issues in the source code that would explain these crashes, whence the patches).
Attachment #804598 - Flags: review?(jmuizelaar) → review+
Comment on attachment 804805 [details] [diff] [review]
DeprecatedTextureHost should store its SurfaceDescriptor by value, and not share it with anything else

Review of attachment 804805 [details] [diff] [review]:
-----------------------------------------------------------------

Using a pointer for mBuffer was meant to let us reduce header inclusion but if it caused confusion the header thing is not that important here (especially now that we want to kill DeprecatedTextureHost anyway).
Attachment #804805 - Flags: review?(nical.bugzilla) → review+
(In reply to Benoit Jacob [:bjacob] from comment #17)
> Good question. I think that at this point, what I would like the most to
> know is if you can still reproduce this crash with a build using today's
> mozilla-central (so, any build labeled 20130914 or later), as I've been
> unable to reproduce (despite identifying various issues in the source code
> that would explain these crashes, whence the patches).

I just crashed in this stack using the following Buri build:

Gaia   a0079597d510ce8ea0b9cbb02c506030510b9eeb
SourceStamp c4bcef90cef9
BuildID 20130916040205
Version 26.0a1

I had just done something in camera when the crash occurred.
Rebased, new try push, hopefully the oranges were another bug that's no longer here...
https://tbpl.mozilla.org/?tree=Try&rev=38f99001e298
Meh, tested locally, got about:memory reports from reftests running in emulator, my patch is leaking gralloc memory like a sieve... back to drawing board.
Whiteboard: [burirun1][osrestartcrash] → [burirun1][osrestartcrash][b2g-crash]
My attempts so far were working with the assumption/hope that no two TextureHosts would want to reference the same GrallocBufferActor at the same time, but that was very tricky in presence of double buffering using 2 texturehosts.

This is a different approach and I believe it is simpler and doesn't make such assumptions. It passes reftest locally and everything seems to work fine in the emulator (browsing and various gaia apps) and I looked in about:memory, it doesn't seem to leak.

https://tbpl.mozilla.org/?tree=Try&rev=297baa2d8464
Attachment #805997 - Flags: review?(nical.bugzilla)
Note the nsTAutoArray with built-in size 2... the assumption here is that a GBA will rarely be referenced by more than 2 THosts, is that an OK assumption? If there are more THosts than expected, the array will become heap-allocated (slower).
Comment on attachment 805997 [details] [diff] [review]
Let the GrallocBufferActor keep an _array_ of backrefs to TextureHosts referencing it

Review of attachment 805997 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/opengl/TextureHostOGL.cpp
@@ +1209,5 @@
>  {
>    MOZ_ASSERT(!mBuffer, "Will leak the old mBuffer");
> +
> +  if (aBuffer != mBuffer) {
> +    // only done for hacky fix in gecko 23 for bug 862324.

lame nit: you can also remove "gecko 23" here
Attachment #805997 - Flags: review?(nical.bugzilla) → review+
(In reply to Benoit Jacob [:bjacob] from comment #26)
> Note the nsTAutoArray with built-in size 2... the assumption here is that a
> GBA will rarely be referenced by more than 2 THosts, is that an OK
> assumption? If there are more THosts than expected, the array will become
> heap-allocated (slower).

It looks like a fair assumption, and I am pretty sure it will not make a difference if the assumption turns out to be wrong.
I think I understand how to reproduce the crash on master hamachi.
-[1] Open 3-4 tabs on the browser app.
-[2] open relatively complex site in each tabs.
-[3] refresh each tab's content one by one.

hamachi do not have enough memory to load all tabs. low memory killer kills a background tab. The tab icon was changed to the crashed state icon. Continue the STR until b2g crash.
Attached file logcat log of the crash —
Before the crash happens, child side print no error. From this, I suspected low memory killer.
Attached file dmesg of the crash —
low memory killer killed a lot of processes.
Attachment #806021 - Attachment is patch: false
By applying the patches, the crash did not happen in the STR in Comment 29 until now.
(In reply to Sotaro Ikeda [:sotaro] from comment #32)
> By applying the patches, the crash did not happen in the STR in Comment 29
> until now.

Many thanks for testing!
Bug reproduces on the latest build, but was not able to reproduce it on the previous builds with steps in comment 29
Cannot get the regression window since it's doesn't reproduce frequently
(In reply to sarsenyev from comment #34)
> Bug reproduces on the latest build, but was not able to reproduce it on the
> previous builds with steps in comment 29
> Cannot get the regression window since it's doesn't reproduce frequently

I think we've got a path forward here now, so let's disregard the regression window at this point.
(In reply to sarsenyev from comment #34)
> Bug reproduces on the latest build, but was not able to reproduce it on the
> previous builds with steps in comment 29
> Cannot get the regression window since it's doesn't reproduce frequently

From last Friday to Monday morning, tiled rendering was enabled. On tiled rendering enabled ROM, this crash is very difficult to generate.
(In reply to Sotaro Ikeda [:sotaro] from comment #36)
> (In reply to sarsenyev from comment #34)
> > Bug reproduces on the latest build, but was not able to reproduce it on the
> > previous builds with steps in comment 29
> > Cannot get the regression window since it's doesn't reproduce frequently
> 
> From last Friday to Monday morning, tiled rendering was enabled. On tiled
> rendering enabled ROM, this crash is very difficult to generate.

Hello Sotaro: Can you please clarify if you need QA's help in reproducing this bug? In yesterday's build my feeling was the crash was happening pretty randomly.
Landed the first and third patch. The second one isn't needed, won't land.

remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/f71237f92e82
remote:   https://hg.mozilla.org/integration/b2g-inbound/rev/60a5068b18b8
Target Milestone: --- → mozilla27
(In reply to Marcia Knous [:marcia] from comment #37)
> 
> Hello Sotaro: Can you please clarify if you need QA's help in reproducing
> this bug? In yesterday's build my feeling was the crash was happening pretty
> randomly.

When I testing, I used following URL. I open 3 tabs using the following URL. And then choose a background tab to foreground. And continue forever until the crash happens.
http://www.theregister.co.uk/2013/09/17/yahoo_groups_complaints_continue/

On my hamachi, I can consistently regenerate the crash until now.
(In reply to Sotaro Ikeda [:sotaro] from comment #39)
> (In reply to Marcia Knous [:marcia] from comment #37)
> When I testing, I used following URL. I open 3 tabs using the following URL.
> And then choose a background tab to foreground. And continue forever until
> the crash happens.
> http://www.theregister.co.uk/2013/09/17/yahoo_groups_complaints_continue/

When a content of the URL is shown, I always select another tab that is already dead icon.
Attempts to reproduce STR from comment 0 were not successful and appears resolved. Tried various wallpapers (10) and could not reproduce intermittent behavior.

In followup to comment 39 - tested the URL from comment 39 http://www.theregister.co.uk/2013/09/17/yahoo_groups_complaints_continue/ in 3 separate tabs in Browser. 

When each tab was opened, web page displayed properly. However, when user goes into tab view within Browser, the icon for two of the tabs appear as a frowned folder with the website description to the right of image. 


Environmental Variables
Build ID: 20130917180029
Gecko: http://hg.mozilla.org/mozilla-central/rev/53a2011a01b2
Gaia: 678c1d23150b0b9b037de9b2122fcccd75bdb6f5
Platform Version: 27.0a1
Buri Base 20130917
Attempts to reproduce STR from comment 0 were not successful. Tried various wallpapers (10) and could not reproduce intermittent behavior.

In followup to comment 39 - tested the URL from comment 39 http://www.theregister.co.uk/2013/09/17/yahoo_groups_complaints_continue/ in 3 separate tabs in Browser. 

When each tab was opened, web page displayed properly. However, when user goes into tab view within Browser, the icon for two of the tabs appear as a frowned folder with the website description to the right of image. 


Environmental Variables
Build ID: 20130917180029
Gecko: http://hg.mozilla.org/mozilla-central/rev/53a2011a01b2
Gaia: 678c1d23150b0b9b037de9b2122fcccd75bdb6f5
Platform Version: 27.0a1
Buri Base 20130917
bjacob, looks like this is still happening. Whats the plan here?
Flags: needinfo?(bjacob)
Comment 45 is on Mozilla-Aurora. That tree doesn't have that patch. So the plan to fix that is to request aurora approval. I sync'd with release management on Tuesday, see "Where are we on merge day?" thread on release@, and my understanding was that it had landed early enough to be picked up for B2G v1.2.
Flags: needinfo?(bjacob)
Comment on attachment 804598 [details] [diff] [review]
Let the GrallocBufferActor inform its old TextureHost when it's switching to another TextureHost

There are two patches to be backported here. This one, and the 3rd one. This one is not by itself the 'right' fix, but it is what the other one applies on top of.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): regressed between gecko 18 and gecko 26
User impact if declined: B2G crashes often
Testing completed (on m-c, etc.): m-c for 1.5 days
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #804598 - Flags: approval-mozilla-aurora?
Comment on attachment 805997 [details] [diff] [review]
Let the GrallocBufferActor keep an _array_ of backrefs to TextureHosts referencing it

This one is the 'right' fix. It fixes the crashes according to QA (see above comments) and doesn't leak contrary to the earlier approaches tried on this bug (it's green on TBPL while the other approaches leaked too much to complete reftest runs successfully). So it is IMO beneficial and low-risk enough for Aurora approval.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): some time between gecko 18 and 26
User impact if declined: B2G crashes often (or, if one only takes the first patch and not this one, B2G leaks)
Testing completed (on m-c, etc.): m-c for 1.5 days
Risk to taking this patch (and alternatives if risky): low
String or IDL/UUID changes made by this patch: none
Attachment #805997 - Flags: approval-mozilla-aurora?
I saw this crash last evening and filed Bug 918178, I can dupe it back to this bug since you are going to put the fix on that branch,
Attachment #804598 - Flags: approval-mozilla-aurora?
Attachment #805997 - Flags: approval-mozilla-aurora?
Blocks: 918178
I am still seeing this affect aurora; moz-central has an issue where the wallpaper is not seen and the graphic is converted to text, with no image seen (bug 919452), history is not seen (bug 918970)

Should I open a new bug?  or reopen this bug?
Can you post a stack or crash link for the crash on aurora?

The moz-central symptoms you're describing don't sound related to this bug, unless you know something that I don't :-)
I think I made a mistake when I thought I flashed my build on top.  I might have not done so as the crash reports in socorro doesn't reflect the 9/22 build I used.  

I'm reinvestigating and I don't see the Wallpaper option in aurora.  If I come across this crash I'll report it as a new bug.
Verified fixed on v1.2 and current master,

 - the crash doesn't happen when longpress homescreen and select a wallpaper (comment 0)
 - the crash doesn't happen when open 3-4 browser tabs and keep refreshing (comment 39) 

Build ID: 20130924004002
Gecko: http://hg.mozilla.org/releases/mozilla-aurora/rev/b34384409be6
Gaia: a13c76f8d3c617ee57c302c103da04ed1a6298d1
Platform Version: 26.0a2

Build ID: 20130924040201
Gecko: http://hg.mozilla.org/mozilla-central/rev/1fda74e33e06
Gaia: a22ba4a3a9efd99f94adf9ece8a2b391d4df295b
Platform Version: 27.0a1
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: