Closed Bug 694964 Opened 13 years ago Closed 12 years ago

crash [@ gfxSharedImageSurface::Open] when a layer is painted twice in the same transaction

Categories

(Core :: Graphics, defect)

10 Branch
ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla14
Tracking Status
firefox10 --- affected
firefox11 --- affected
firefox12 --- fixed
firefox13 --- fixed
firefox-esr10 12+ fixed
fennec - ---

People

(Reporter: kairo, Assigned: ajuma)

References

Details

(4 keywords, Whiteboard: [mobile-crash][native-crash][STR in comment 13, 15, 58, 60, 65, 67, and 68])

Crash Data

Attachments

(2 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-e1096e16-7bc6-458c-9ef3-a527b2111017 .
============================================================= 

Top frames:

0 	libxul.so 	gfxSharedImageSurface::Open 	BaseSize.h:55
1 	libxul.so 	mozilla::layers::ShadowLayerForwarder::OpenDescriptor 	gfx/layers/ipc/ShadowLayers.cpp:464
2 	libxul.so 	mozilla::layers::BasicShadowThebesLayer::Swap 	gfx/layers/basic/BasicLayers.cpp:2792
3 	libxul.so 	mozilla::layers::ShadowLayersParent::RecvUpdate 	PLayers.h:1199
4 	libxul.so 	mozilla::layers::PLayersParent::OnMessageReceived 	obj-firefox/ipc/ipdl/PLayersParent.cpp:220
5 	libxul.so 	mozilla::dom::PContentParent::OnMessageReceived 	obj-firefox/ipc/ipdl/PContentParent.cpp:1464
6 	libxul.so 	mozilla::ipc::SyncChannel::OnDispatchMessage 	ipc/glue/SyncChannel.cpp:175
7 	libxul.so 	mozilla::ipc::RPCChannel::OnMaybeDequeueOne 	ipc/glue/RPCChannel.cpp:431
8 	libxul.so 	RunnableMethod<mozilla::ipc::RPCChannel, bool , Tuple0>::Run 	ipc/chromium/src/base/task.h:308
9 	libxul.so 	mozilla::ipc::RPCChannel::DequeueTask::Run 	RPCChannel.h:487
10 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:319


This crash started popping up on October 12 with Nightly Fennec builds of that day, from looking at https://crash-stats.mozilla.com/report/list?signature=gfxSharedImageSurface%3A%3AOpen it seems to happen on different Android versions (at least that's what I derive from seeing different kernel versions, Naoki or someone might correct me here) and all Nightly builds since then - but it only happens on Fennec and only on Nightly.

This also seems to be the #1 crash on Fennec Trunk right now - at least in my own stats it's the only one of which we have 10 or more crashes every day.
This crash is happening in code that was recently added by Bug 690469.
Summary: crash gfxSharedImageSurface::Open → crash [@ gfxSharedImageSurface::Open]
Is there are some way to reproduce this crash? could it be related to the same problem we've workaround-ed for OGL port (order of ThebesLayer Front/Back sync)...
BasicShadowableThebesLayer::SetBackBufferAndAttrs
Not sure should we just call SyncFrontBufferToBackBuffer from that place for all backends?
but from other side, that problem could not endup with crash... and descriptor open could fail only if Shmem was give to Chrome process and destroyed at the same time on Child process side.
(In reply to Oleg Romashin (:romaxa) from comment #2)
> Is there are some way to reproduce this crash?

No idea, I only saw crash reports coming in frequently enough that it's worth filing this.
I was hitting this crash once in current trunk build on the EEE Transformer, while playing around with the file input and the camera. Fennec crashed a while after I closed that page:
https://crash-stats.mozilla.com/report/index/bp-da19582c-6b42-4d88-abcc-0a0e92111019
I got the crash whl3 loading www.leparisien.fr . The graphics on the pagw started to flicker a lot and then fennec crashed.
Ludovic, I couldn't reproduce on a Galaxy S II Epic 4G on today's (10/23) nightly.  What phone were you using?

Can anyone else repro?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #7)
> Ludovic, I couldn't reproduce on a Galaxy S II Epic 4G on today's (10/23)
> nightly.  What phone were you using?

Er, phone *and* build.
Galaxy-S one - build was from the day I posted on bugzilla. When I visited the page again it didn't crash. I haven't see this crash since.
The crash volume is significant considering how small user base for fennec nightly. I think we should back out bug 690469 until it's fixed.
Sounds like yes we should do that.
One possible solution to make change smaller, is to make this condition
http://mxr.mozilla.org/mozilla-central/source/gfx/layers/basic/BasicLayers.cpp#2262
true always, so we will switch back to previous Swap/Sync/Create order and keep API change.
we had similar problem in bug 694140, but it was not leading to crash, and was causing some rendering problems... sounds like for software mode we have unreproducible crashes...
Whiteboard: [mobile-crash]
Apparently bug 701617 contains rock-steady STR.
I am still able to reproduce this crash also by surfing to my battery crash page:
people.mozilla.com/~jhammink/webapi_test_pages/BatteryCrashv2.html

Actual crash report is here:
https://crash-stats.mozilla.com/report/index/bp-4e92cbbf-54fc-4874-af32-034262111122
OS: Linux → Android
It's #2 top crasher in 10.0a2 (only about 700 ADU).
It's #1 top crasher in 10.0b2 with about 12,000 ADU.
Keywords: topcrash
It's #1 top crasher in 10.0b3 (about 11,000 ADU) with 16% of all crashes.
There have been no crashes in 11.0a2 and 12.0a1 (about 500 ADU).
Blocks: 690469
Whiteboard: [mobile-crash] → [mobile-crash], [STR in comment 13 and 15]
Version: Trunk → 10 Branch
tracking-fennec: --- → ?
Perhaps George can take a look at this when he has a moment.
If it is easy to reproduce, then it would be nice to get it crashed on debug build with NSPR_LOG_MODULES=Layers:5 exported... so we can see order of layers transactions/ctor/dtor calls.
We have here mFrontBuffer which is coming from child process, stick in UI process, and somehow get's destroyed in child process in such way that UI process does not know about that...
I see one way how it might happen:
BasicShadowableThebesLayer::PaintBuffer pushing mBackBuffer into transaction queue, but don't move that buffer into read-only state... while transaction in pending state time between ::PaintBuffer and BasicShadowableThebesLayer::SetBackBufferAndAttrs, we do receive another BasicShadowableThebesLayer::CreateBuffer call with different size... after that we have mBackBuffer  destroyed without notifying UI process.

I guess the right proposal here is to set mBackBuffer = NULL right after PaintedThebesBuffer call, or move it to mROFrontBuffer... so there is no chance that mBuffer will be destroyed after that.
But in that case we should handle situation where we can have 3 buffers, and destroy 3-rd buffer correctly
Kind of blind fix, because I'm not able to reproduce this crash, will create try build probably someone could run manual test round and check if it still reproducible or not, or we can just push it and check if crash has disappear
Attachment #587937 - Flags: review?(joe)
tracking-fennec: ? → -
Comment on attachment 587937 [details] [diff] [review]
Crash fix on assumption from comment 21

This looks perfectly reasonable to me, but I don't know whether this is a thing that should actually happen in the wild. I'm going to cede review to Chris and/or Ali, who have more recent experience with cross-process things.
Attachment #587937 - Flags: review?(jones.chris.g)
Attachment #587937 - Flags: review?(joe)
Attachment #587937 - Flags: review?(ajuma)
Comment on attachment 587937 [details] [diff] [review]
Crash fix on assumption from comment 21


>+  } else {
>+    BasicManager()->ShadowLayerForwarder::DestroySharedSurface(&mBackBuffer);

Oh, here we should destroy aBuffer.

>   }
Attachment #587937 - Flags: review?(jones.chris.g) → review-
Attachment #587937 - Attachment is obsolete: true
Attachment #587937 - Flags: review?(ajuma)
Attachment #588206 - Flags: review?(ajuma)
Attachment #588206 - Flags: review?(ajuma) → review+
Pushed into:
https://hg.mozilla.org/mozilla-central/rev/964b118ac852

hope it will help.
if not then we should reopen and investigate again
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
(In reply to Oleg Romashin (:romaxa) from comment #27)
> hope it will help.
As it doesn't happen in Nightly or Aurora (small sample group of about 800 ADU), we don't know if it's fixed using crash stats.
It's still #1 top crasher in Fennec 10.0b4 (about 10,000 ADU).
but, aurora does not have that fix... any crashes from nightly?
(In reply to Oleg Romashin (:romaxa) from comment #29)
> but, aurora does not have that fix... any crashes from nightly?
It has never crashed in Fennec Native Nightly and Aurora.
It crashed in Fennec Aurora 11.0a2/20120111.
People using Nightly and Aurora are not statistically representative.

It seems risky to land this patch in the latest Beta as it also affects the desktop version.
Ok, let's wait when 12 will be representative
(In reply to Scoobidiver from comment #30) 
> It seems risky to land this patch in the latest Beta as it also affects the
> desktop version.

Note that this patch only affects shadow layers, which aren't used by default anywhere on desktop right now. (Shadow layers are currently only used for e10s and for off-main-thread compositing.)

I'd advocate at least taking this on Aurora, so that the next Beta for tablets (which, as I understand it, will still be XUL Fennec rather than native Fennec) has this.

Landing this on Beta right away is indeed risky (given Comment 22), but might be worth doing if there's enough time to land it, measure the effect it's having, and back out if it's making things worse.
I can consistently reproduce this crash on Nightly prior to this landing, but not on the latest Nightly. So I feel reasonably confident that the patch that landed fixed the issue.

Steps to reproduce:
1) Visit http://people.mozilla.com/~ajuma/fennec/test.html
2) Click on "one"
3) Click on OK
Target Milestone: --- → mozilla12
> Steps to reproduce:
> 1) Visit http://people.mozilla.com/~ajuma/fennec/test.html
> 2) Click on "one"
> 3) Click on OK
oh, great, we have test for that crash.
then we definitely should land it to 11, and ajuma is right, this affect only shadowLayers which are currently used only in e10s fennec
(In reply to Ali Juma [:ajuma] from comment #32)
> might be worth doing if there's enough time to land it, measure the effect
> it's having, and back out if it's making things worse.
As it's reproducible, impacts only the mobile version and there will be a Beta 5 (this night) and 6 (in one week), it's worth asking a Beta approval ASAP.
Comment on attachment 588206 [details] [diff] [review]
Updated patch, drop obsolete buffer carefully

[Approval Request Comment]
Regression caused by (bug #): 690469

User impact if declined: This is the top crash on XUL Fennec, both on Beta and on Aurora.

Testing completed (on m-c, etc.): This has been on m-c since Jan. 12, but this is of limited relevance given the small number of XUL Fennec users we have on Nightly -- the number of users is small enough that we weren't even seeing this crash before the patch landed. In my own testing on a Galaxy Tab, the patch does fix the crash (see Comment 33).

Risk to taking this patch (and alternatives if risky): The patch is sufficiently non-trivial that there is some risk here. The risk is only to XUL Fennec -- this code isn't used by default anywhere else right now. We should definitely take this on Aurora. Given that this is the top crash, we should also consider taking it on Beta despite the risk (ideally though, we should only do this if there's still time to back it out if it causes regressions).
Attachment #588206 - Flags: approval-mozilla-beta?
Attachment #588206 - Flags: approval-mozilla-aurora?
Comment on attachment 588206 [details] [diff] [review]
Updated patch, drop obsolete buffer carefully

[Triage Comment]
Our understanding of this issue is that although this is in shared code, XUL Fennec is the only client of this change. Approving for Aurora/Beta.
Attachment #588206 - Flags: approval-mozilla-beta?
Attachment #588206 - Flags: approval-mozilla-beta+
Attachment #588206 - Flags: approval-mozilla-aurora?
Attachment #588206 - Flags: approval-mozilla-aurora+
Depends on: 720353
This looks to be the most likely cause of Bug 720353, which is causing more crashes on beta (0.52 crashes/ADU) than the number of crashes fixed by this bug (0.24 crashes/ADU).
Since the landing of this bug on mozilla-beta is the most likely cause of Bug 720353, we should back out this bug from beta.

[Approval Request Comment]
Regression caused by (bug #): The landing of this bug (Attachment 588206 [details] [diff]) on beta.
User impact if declined: Attachment 588206 [details] [diff] is suspected of causing more crashes on beta (0.52 crashes/ADU) than it resolved (0.24 crashes/ADU), so declining to back it out would mean a net increase of 0.28 crashes/ADU.
Testing completed (on m-c, etc.): Bug 720353 has only been seen on beta, likely due to the small number of XUL Fennec users on Aurora and Nightly, so landing on m-c wouldn't give us any useful information.
Risk to taking this patch (and alternatives if risky): Low-risk. This is just a backout of a patch than landed on beta last week.
Attachment #590713 - Flags: approval-mozilla-beta?
Attachment #590713 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
It's #2 top crasher in 10.0.
(In reply to Scoobidiver from comment #43)
> It's #2 top crasher in 10.0.

I think this should be re-opened since it was backed out of Beta 10 and Beta 11. Joe - can you find somebody to take a look?
Assignee: nobody → joe
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It's conceivable that this crash was really just the problem described in Bug 718150 (or at least closely related). Boris/Chris/Oleg, do you think that patch there might have fixed this crash?
In thebes layers case we do not have internal mSize or mBounds variables, and do check of FrontSurface size using actual buffers structures.
I've noticed one possible problem is that BasicShadowThebesLayer::Swap check only surface size, but not ContentType of buffer.
And another possible problem described in bug 720353#c6
would be nice to find testcase which reproduce that problem
Bug 718150 should not have led to crashes if I understand the code correctly...
Oleg, can you proceed with your idea in comment 46, and maybe we can resolve the crash in bug 720353?

Alternately, someone able to reproduce that crash would be helpful...
Assignee: joe → romaxa
(In reply to Joe Drew (:JOEDREW!) from comment #48)
> Alternately, someone able to reproduce that crash would be helpful...
There are STR in comment 13 and comment 15.

Is it reproducible in Native Fennec?
(In reply to Scoobidiver from comment #49)
> (In reply to Joe Drew (:JOEDREW!) from comment #48)
> > Alternately, someone able to reproduce that crash would be helpful...
> There are STR in comment 13 and comment 15.
> 
> Is it reproducible in Native Fennec?

To be clear, there are two crashes being discussed here -- the crash for this bug, and the crash in Bug 720353.

The crash for this bug has STR in comments 13, 15, and 33. It's not reproducible in the current Native Fennec, since the current Native Fennec doesn't use shadow layers. Native Fennec with OMTC does use shadow layers, but the crash for this bug happens in Basic shadow layers while Native Fennec with OMTC uses GL shadow layers. There may turn out to be an analogous crash though, if the root cause is on the content side (since both XUL Fennec and Native Fennec with OMTC use Basic layers on the content side).

The crash in Bug 720353 doesn't have STR. It's also a shadow layers crash, but on the content side, so it may well be possible to reproduce in Native Fennec with OMTC.
I got a different Crash in 33 on Galaxy Nexus doing a 1 off:
> Steps to reproduce:
> 1) Visit http://people.mozilla.com/~ajuma/fennec/test.html
> 2) Click on "one"
> 3) Click on OK
4) Rotate device

It looks like an OOM crash. (03-19 15:06:23.419: E/GeckoApp(3020): low memory); cannot current duplicate the bug based on the repro steps.  Removing reproducible for now.
Keywords: reproducible
This signature is still high in XUL releases (10 ESR). Should we land this on the ESR branch for the upcoming XUL releases?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #53)
> This signature is still high in XUL releases (10 ESR). Should we land this
> on the ESR branch for the upcoming XUL releases?
Fixing this bug with the current patch causes bug 720353 to appear.
https://bugzilla.mozilla.org/show_bug.cgi?id=694964#c52 is with nightly build.
Does not crash with nightly.

Doesn't seem to crash for Galaxy Nexus with the ESR release using the steps to repro given here: https://bugzilla.mozilla.org/show_bug.cgi?id=694964#c33
We're going to minus this for ESR10 because this patch caused a worse regression, and there's no patch that fixes both.
I've been able to reproduce this on Native Fennec on an HTC Desire, by rotating the phone back and forth while a page is loading (the crash doesn't always happen while doing this, but happens eventually when I keep repeating this).

Here's what seems to be the problem:

The crash happens when a Thebes layer gets painted twice within a single transaction. This happens when a transaction is set as incomplete (via BasicLayerManager::SetTransactionIncomplete). The result of painting the Thebes layer twice is that we have two TOpPaintThebesBuffer Edits for the same layer within the same transaction. Both of these have the same newFrontBuffer. This means that the ShadowThebesLayer's Swap function gets called twice with the same value of aNewFront. After the first such call, the ShadowThebesLayer's mFrontBufferDescriptor is aNewFront.buffer(). During the second such call, aNewBack's buffer is set to mFrontBufferDescriptor, but mFrontBufferDescriptor's value doesn't change (since it gets set again to the same aNewFront.buffer()). The result is that after this call and the corresponding SetBackBufferAndAttrs on the shadowable layer, the ShadowableThebesLayer's mBackBuffer is the same as ShadowThebesLayer's mFrontBufferDescriptor. The crash then happens when the Shadowable layer decides to destroy mBackBuffer (during a call to CreateBuffer), and the Shadow layer subsequently tries to open mFrontBufferDescriptor.

Another thing that can go wrong when we have two paints of the same layer in the same transaction is mBackBuffer getting destroyed during the second paint. In this case, after the first Swap, mFrontBufferDescriptor will be an already destroyed surface.
Whiteboard: [mobile-crash], [STR in comment 13 and 15] → [mobile-crash][native-crash][STR in comment 13 and 15]
Summary: crash [@ gfxSharedImageSurface::Open] → crash [@ gfxSharedImageSurface::Open] when a layer is painted twice in the same transaction
Assignee: romaxa → ajuma
I reproduced this crash on today's Nightly build by going to Awesomebar and back into main view repeatedly:

https://crash-stats.mozilla.com/report/index/bp-e89ea478-0b13-40ec-8665-874a22120327

--
Firefox 14.0a1 (2012-03-27)
Device: Samsung Galaxy S
OS: Android 2.2
Version: 10 Branch → Trunk
It's a regression from Fx 10.
Version: Trunk → 10 Branch
I was able to reproduce this in Fennec Native on Galaxy Nexus by going to http://www.kevs3d.co.uk/dev/ and rotating from portrait to landscape.
Attachment #588206 - Attachment is obsolete: true
The problem is actually that we're generating a TOpPaintThebesBuffer edit when we shouldn't be. Here's how this happens. When BasicShadowableThebesLayer::PaintBuffer gets called, it in turn calls BasicThebesLayer::PaintBuffer with the same arguments. When argument aCallback is null, BasicThebesLayer::PaintBuffer sets the transaction as incomplete, and returns immediately without actually painting anything. But BasicShadowableThebesLayer::PaintBuffer nevertheless goes ahead and generates a spurious TOpPaintThebesBuffer edit (by calling PaintedThebesBuffer). This patch fixes that by making BasicShadowableThebesLayer::PaintBuffer check if the transaction got set as incomplete.
Attachment #610636 - Flags: review?(bgirard)
Comment on attachment 610636 [details] [diff] [review]
Don't generate a Thebes Paint edit in an incomplete transaction.

ajuma++!
Attachment #610636 - Flags: review?(bgirard) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/d5a7628eed87
Target Milestone: mozilla12 → mozilla14
I got this crash twice on the HTC Desire HD in current trunk build, while having these 2 tabs open:
http://people.mozilla.com/~mwargers/tests/positonabsolute_redraw_crash.htm
http://planet.mozilla.org/
And while I was performing some panning/zooming actions in the first tab.
Ok, I think I can reproduce this crash (sort of) with only http://people.mozilla.com/~mwargers/tests/positonabsolute_redraw_crash.htm
While being on that page, I first try to pinch zoom out as far as possible, then I pan to the right like crazy to scroll to the left as to make some scroll overflow area visible at the left (this is difficult, because the page has javascript in there that scrolls to the right)
Another STR:
1. Set Fennec -> Settings -> Plugins -> Enabled
2. restart Fennec
3. go to http://people.mozilla.org/~mwargers/tests/flashcrash_ics/buttonopencrash1.htm
4. Select Always Show for popups
5. During the time that it's doing the test, go to menu-> settings

Crash should occur.
I seem to reproducibly crash on the Samsung Galaxy Nexus on this url: http://people.mozilla.org/~mwargers/tests/plugins/flash/scroll.html
Whiteboard: [mobile-crash][native-crash][STR in comment 13 and 15] → [mobile-crash][native-crash][STR in comment 13, 15, 58, 60, 65, 67, and 68]
(In reply to Benoit Girard (:BenWa) from comment #62)
> Comment on attachment 610636 [details] [diff] [review]
> Don't generate a Thebes Paint edit in an incomplete transaction.
> 
> ajuma++!

https://hg.mozilla.org/mozilla-central/rev/d5a7628eed87
Status: REOPENED → RESOLVED
Closed: 13 years ago12 years ago
Resolution: --- → FIXED
This is still one of the major topcrashes in the current Fennec "releases" which are from ESR10, should the backout or the fix be ported there?
Comment on attachment 610636 [details] [diff] [review]
Don't generate a Thebes Paint edit in an incomplete transaction.

[Approval Request Comment]
Regression caused by (bug #): Bug 690469
User impact if declined: This is a Fennec top crasher (see Comment 70).
Testing completed (on m-c, etc.): This has been on m-c since March 31 and has fixed the crash there.
Risk to taking this patch (and alternatives if risky): Low risk to Fennec, no risk to desktop (since ShadowLayers aren't used on desktop).
String changes made by this patch: None.
Attachment #610636 - Flags: approval-mozilla-esr10?
Comment on attachment 610636 [details] [diff] [review]
Don't generate a Thebes Paint edit in an incomplete transaction.

(In reply to Ali Juma [:ajuma] from comment #71)
> Comment on attachment 610636 [details] [diff] [review]
> Don't generate a Thebes Paint edit in an incomplete transaction.
> 
> [Approval Request Comment]
> Regression caused by (bug #): Bug 690469
> User impact if declined: This is a Fennec top crasher (see Comment 70).
> Testing completed (on m-c, etc.): This has been on m-c since March 31 and
> has fixed the crash there.
> Risk to taking this patch (and alternatives if risky): Low risk to Fennec,
> no risk to desktop (since ShadowLayers aren't used on desktop).
> String changes made by this patch: None.

Approved for the ESR branch. Does bug 720353 need to come along for the ride?
Attachment #610636 - Flags: approval-mozilla-esr10? → approval-mozilla-esr10+
(In reply to Alex Keybl [:akeybl] from comment #72)
 
> Approved for the ESR branch. Does bug 720353 need to come along for the ride?

No, Bug 720353 backs out on 14 a patch that had already been backed out on 10 (see Comment 41 above).
(In reply to Ali Juma [:ajuma] from comment #73)
> (In reply to Alex Keybl [:akeybl] from comment #72)
>  
> > Approved for the ESR branch. Does bug 720353 need to come along for the ride?
> 
> No, Bug 720353 backs out on 14 a patch that had already been backed out on
> 10 (see Comment 41 above).

Thanks Ali. We spent a few minutes trying to untangle the status flags and backouts without success during triage, so we appreciate your confirmation.
Landed on esr10:
https://hg.mozilla.org/releases/mozilla-esr10/rev/52183a4ceb7c

along with a follow-up bustage fix (BasicLayerManager::IsTransactionIncomplete wasn't already defined on esr10):
https://hg.mozilla.org/releases/mozilla-esr10/rev/2329d4a9d3d1
There are crashes at a lower volume in XUL Fennec 10.0.4esr (#31 top crasher), 10.0.5esr (#37) and 14.0b6 (#42) that contain the fix: https://crash-stats.mozilla.com/report/list?signature=gfxSharedImageSurface%3A%3AOpen
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: