Closed Bug 756253 Opened 12 years ago Closed 12 years ago

Beta crashes @ _cairo_user_data_array_fini on kijiji.ca (desktop site) while zooming or panning

Categories

(Firefox for Android Graveyard :: General, defect)

14 Branch
ARM
Android
defect
Not set
critical

Tracking

(firefox14 verified, firefox15- verified, firefox16 verified, fennec15+)

VERIFIED FIXED
Firefox 16
Tracking Status
firefox14 --- verified
firefox15 - verified
firefox16 --- verified
fennec 15+ ---

People

(Reporter: matej, Assigned: BenWa)

References

()

Details

(Keywords: crash, productwanted, reproducible, Whiteboard: [Engagement][native-crash][gfx])

Crash Data

Attachments

(2 files)

Web page or screen you were on when you saw the issue: kijiji.ca

Device: Droid Razr running Gingerbread

Crash report ID: bp-56e54338-bbd5-44f6-88dc-a53922120517

The site looks fine when it loads, but the top right section (Daily Deals) takes a long time to show up and the horizontal area under "post a classified ad" and "all new ads" usually doesn't show up at all (though it did a couple of times). 

Any panning or zooming causes the browser to crash almost immediately. Sometimes clicking a link will load the new page, sometimes it will cause the browser to crash.
Also, on first visit, the site serves a location detection message in a yellow box at the top of the screen ("Welcome to Kijiji. We have matched your location to..."). On Fennec, the font is too big, the lines overlap and the message bleeds off the right side of the screen.
Crash looks like a dupe of bug 754056, which supposedly didn't affect aurora/beta.

The font being too big is probably a font inflation issue.
There's both crashes from incorrect management of reusable surfaces. But the cause must be different.
Assignee: nobody → bgirard
Whiteboard: [Engagement] → [Engagement][crash]
It's bug 681114 with STR.
Severity: normal → critical
blocking-fennec1.0: --- → ?
Crash Signature: [@ _cairo_user_data_array_fini]
Keywords: crash, reproducible
Hardware: All → ARM
Summary: Beta crashes on kijiji.ca (desktop site) → Beta crashes @ _cairo_user_data_array_fini on kijiji.ca (desktop site) while zooming or panning
Whiteboard: [Engagement][crash] → [Engagement][native-crash]
Version: Trunk → Firefox 14
Blocks: 681114
I couldn't reproduce, but we know this crash is still happening. If you can confirm that it still happens and give more detailed steps that would be wonderful. In the meantime we want to increase the runtime checking and crash easier when we enter this bad state to find who's at fault. Going to post that patch in bug 681114.
Seems to only happen for me if I activate the plugin that displays the top side-scrolling promo area. Otherwise it appears fine.
No longer blocks: 756556
No longer blocks: 681114
Blocks: 681114
Matej, can you provide the new crash ID with the try build in bug 681114 comment 17?
Matej I reproduced this and added the info Scoobidiver asked for.
Crashes a current nightly as well on the Droid Razr.
Keywords: qawanted
This seems like it's probably going to be fixed by the fix to bug 681114.
blocking-fennec1.0: ? → +
Whiteboard: [Engagement][native-crash] → [Engagement][native-crash][gfx]
Blocks: 759676
The problem if-and-only-if flash is loaded. It seems to crash most of the time for me and when it doesn't crash the horizontal flash ad is significantly out of sync when doing async zooming.

While reproducing this I was able to reproduce 'JS_DHashTableEnumerate' another top crasher.
This crash seems to be if-and-only-if flash + Android 2.3.x or LOWER.
I've narrowed down the code to flash drawing. I'm still unsure if it's our support code or not yet. I hope to have this narrowed down significantly monday.
Right now I have to narrowed down further to PluginLayer.performUpdates(). Still progressing.
We tried a few theories unsuccessfully:
- Click to play
- Skia
- Plugin Placeholder surface

Ali has some Valgrind traces showing some serious heap issues he will post once the trace is completed.
Attached file Valgrind log
This is the complete Valgrind log of a run of Fennec that included visiting kijiji.ca, clicking to play the Flash content, zooming/panning, and then crashing.

The most concerning/suspicious parts of the log are the invalid writes (that is, writing to memory that has already been freed).
(In reply to Ali Juma [:ajuma] from comment #16)
> Created attachment 630727 [details]
> The most concerning/suspicious parts of the log are the invalid writes (that
> is, writing to memory that has already been freed).

Well, basically Fennec is hosed as soon as the first invalid read occurs,
at android::Surface::getBufferIndex.  By this point, the underlying bad
event (refcounting problem?) that screws this up has already occurred.
The invalid write and invalid free later on are merely follow-on effects.

>  at 0xAC7121E6: android::Surface::getBufferIndex(android::sp<android::GraphicBuffer> const&) const (Surface.cpp:1025)
>  by 0xAC71393B: android::Surface::unlockAndPost() (Surface.cpp:1011)
>  by 0xC279EC1: ??? (in /dev/ashmem)

Do you need more frames than this?  I have a crappy stack-scanning hack
that might get you 3 or 4 frames after the ??? -- or it might not, no
promises.  You're welcome to try it if you want.
(In reply to Julian Seward from comment #17)
> Do you need more frames than this?  I have a crappy stack-scanning hack
> that might get you 3 or 4 frames after the ??? -- or it might not, no
> promises.  You're welcome to try it if you want.

That would certainly help. I've been looking into the android source code but if we can find if libxul is calling this (or flash) that would help.
Something relevant in the log:
W/SurfaceComposerClient( 2762): Destroying surface while a transaction is open. Client 0x219358: destroying surface 142, mTransactionOpen=1
(In reply to Benoit Girard (:BenWa) from comment #19)
> Something relevant in the log:
> W/SurfaceComposerClient( 2762): Destroying surface while a transaction is
> open. Client 0x219358: destroying surface 142, mTransactionOpen=1

It sounds like the right thing to do here might be to just disable placeholders. This seemed to fix the problem in our testing.
Attached patch patchSplinter Review
Attachment #631032 - Flags: review?(snorp)
Attachment #631032 - Flags: review?(snorp) → review+
Comment on attachment 631032 [details] [diff] [review]
patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 727116
User impact if declined: Crashes on 2.3 and lower when running Flash and zooming
Testing completed (on m-c, etc.): Tested locally, will be on m-c soon.
Risk to taking this patch (and alternatives if risky): mobile only, very low. Disabling a feature via a pref.
String or UUID changes made by this patch: none.
Attachment #631032 - Flags: approval-mozilla-beta?
Attachment #631032 - Flags: approval-mozilla-aurora?
Target Milestone: --- → Firefox 16
Are there any performance implications disabling place holders?
Let's confirm that this fixes the crash from crash-stats.

Product and QA to evaluate the trade off of crashiness vs. the flash zooming experience.
We should keep digging for a better solution as well, because the tradeoff here could be pretty ugly.
https://hg.mozilla.org/mozilla-central/rev/a9455a51d543

(Merged by Ed Morley)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
reopening because we're not sure this fixes the crash (should have had a leave open in the whiteboard)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [Engagement][native-crash][gfx] → [Engagement][native-crash][gfx][leave open]
Attachment #631032 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I've tried a few things to try to allow us to turn placeholders back on:

1) Leave the Surface lock acquired in AndroidJNI::getSurfaceBits(). This would prevent the plugin from locking the surface before we removed it, with the goal that it would not then attempt to unlock it after the surface was destroyed. This didn't work, but I can't really explain why. Might warrant more investigation.

2) Instead of removing the SurfaceView, which destroys the underlying Surface (causing our problem), I thought it might be nice if we could just hide the Surface. Conveniently, Surface has methods for doing this. Sadly, they all require a transaction to work, and only privileged processes are allowed to open a transaction.

At this point I think we should go ahead with the current fix. I will continue to investigate options for getting this working again, but I don't think we'll be able to get anything in before the next build.
qawanted: take a look at stability report today
(In reply to Tony Chung [:tchung] from comment #31)
> qawanted: take a look at stability report today
The latest crash in Nightly happened in 15.0a1/20120525 long before the patch landed.
The latest crash in Aurora happened in 14.0a2/20120604 before the patch landed.
Comment on attachment 631032 [details] [diff] [review]
patch

Let's back this out of mozilla-central and mozilla-aurora
Attachment #631032 - Flags: approval-mozilla-beta?
Attachment #631032 - Flags: approval-mozilla-beta-
Attachment #631032 - Flags: approval-mozilla-aurora-
Attachment #631032 - Flags: approval-mozilla-aurora+
tracking-fennec: --- → 15+
blocking-fennec1.0: + → soft
(In reply to Brad Lassey [:blassey] from comment #33)
> Comment on attachment 631032 [details] [diff] [review]
> patch
> 
> Let's back this out of mozilla-central and mozilla-aurora

Shouldn't we fix the problems first?
(In reply to Brad Lassey [:blassey] from comment #33)
> Comment on attachment 631032 [details] [diff] [review]
> patch
> 
> Let's back this out of mozilla-central and mozilla-aurora

I don't think turning placeholders back on is a good idea. Keep in mind that the crash we are seeing is a heap corruption which:
1) This is not the only signature caused by the bug. I've seen dozens of unique signature while debugging this problem. Judging how frequent this crash is based on this signature is wrong and misleading. This bug accounts for other crashes!
2) Heap corruption does not necessary lead to crash but can cause unpredictable behavior as well as corrupting data.
(In reply to Benoit Girard (:BenWa) from comment #35)
> 1) This is not the only signature caused by the bug. I've seen dozens of
> unique signature while debugging this problem. Judging how frequent this
> crash is based on this signature is wrong and misleading. This bug accounts
> for other crashes!

It's worth re-emphasizing that this may be responsible for crashes with other signatures and on other sites. For example, I just tried panning and zooming a flash video on cnn.com; I did this three times, and (rather easily) got the following three crashes:
bp-79c61338-6e65-4358-bf2c-f08722120612
bp-33617e48-e514-4847-91a6-323b92120612
bp-2ec951d2-5cbf-426c-bbe9-d954d2120612

Importantly, note that each of these crashes has a different signature.
I agree that this could be really bad, fwiw. I hope to have a real fix some time today, however.
Re-nominating as a *hard* blocker because of concerns in Comment 35 and 37.

Here's some video of the current behavior:
Try 1: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.55.42.mp4
Try 2: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.55.21.mp4
Try 3: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.54.32.mp4

Note that when placeholders run long enough to not crash right away the visual behavior is equivalent to not having placeholders. I was not able to demonstrate this in all my attempts because the heap corruptions is to severe and brings down the browser in seconds.

- All hard evidence points to this crash being our single biggest source of crashes
- It causes heap corruption which is a security problem and makes it impossible to accurately gather crash stats.

I think shipping with placeholder as-is is out of the question! Disabling placeholders at this point is a no-brainer.

The only question in my mind is what additional steps can be take to lower the impact. My suggestion is to either live with the current zoom/panning behavior or hiding flash during panning/zooming.
blocking-fennec1.0: soft → ?
Attachment #631032 - Flags: approval-mozilla-beta-
Attachment #631032 - Flags: approval-mozilla-beta+
Attachment #631032 - Flags: approval-mozilla-aurora-
Attachment #631032 - Flags: approval-mozilla-aurora+
The latest crash on Nightly happened in 15.0a1/20120525 and on Aurora in 14.0a2/20120604.
There are no crashes in 14.0b7.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Whiteboard: [Engagement][native-crash][gfx][leave open] → [Engagement][native-crash][gfx]
No longer blocks: 681114
Blocks: 759675
Unable to reproduce the issue on Nightly 16.0a1 2012-06-17, Aurora 15.0a2 2012-06-17 or Firefox Mobile 14 beta 7 build 3 on HTC Desire running Android 2.2.2.

Verifying the issue as fixed and removing qawaned.
Status: RESOLVED → VERIFIED
Keywords: qawanted
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.