Last Comment Bug 756253 - Beta crashes @ _cairo_user_data_array_fini on kijiji.ca (desktop site) while zooming or panning
: Beta crashes @ _cairo_user_data_array_fini on kijiji.ca (desktop site) while ...
Status: VERIFIED FIXED
[Engagement][native-crash][gfx]
: crash, productwanted, reproducible
Product: Firefox for Android
Classification: Client Software
Component: General (show other bugs)
: 14 Branch
: ARM Android
: -- critical (vote)
: Firefox 16
Assigned To: Benoit Girard (:BenWa)
:
Mentors:
http://kijiji.ca/
: 681114 759680 (view as bug list)
Depends on:
Blocks: 759676 759675
  Show dependency treegraph
 
Reported: 2012-05-17 13:41 PDT by Matej Novak [:matej]
Modified: 2012-09-27 06:54 PDT (History)
18 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
verified
-
verified
verified
15+


Attachments
Valgrind log (65.59 KB, text/plain)
2012-06-06 14:48 PDT, Ali Juma [:ajuma]
no flags Details
patch (1.05 KB, patch)
2012-06-07 10:28 PDT, Benoit Girard (:BenWa)
snorp: review+
blassey.bugs: approval‑mozilla‑aurora+
blassey.bugs: approval‑mozilla‑beta+
bgirard: checkin+
Details | Diff | Splinter Review

Description Matej Novak [:matej] 2012-05-17 13:41:06 PDT
Web page or screen you were on when you saw the issue: kijiji.ca

Device: Droid Razr running Gingerbread

Crash report ID: bp-56e54338-bbd5-44f6-88dc-a53922120517

The site looks fine when it loads, but the top right section (Daily Deals) takes a long time to show up and the horizontal area under "post a classified ad" and "all new ads" usually doesn't show up at all (though it did a couple of times). 

Any panning or zooming causes the browser to crash almost immediately. Sometimes clicking a link will load the new page, sometimes it will cause the browser to crash.
Comment 1 Matej Novak [:matej] 2012-05-17 13:50:38 PDT
Also, on first visit, the site serves a location detection message in a yellow box at the top of the screen ("Welcome to Kijiji. We have matched your location to..."). On Fennec, the font is too big, the lines overlap and the message bleeds off the right side of the screen.
Comment 2 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-17 14:14:09 PDT
Crash looks like a dupe of bug 754056, which supposedly didn't affect aurora/beta.

The font being too big is probably a font inflation issue.
Comment 3 Benoit Girard (:BenWa) 2012-05-17 14:44:29 PDT
There's both crashes from incorrect management of reusable surfaces. But the cause must be different.
Comment 4 Scoobidiver (away) 2012-05-18 02:02:24 PDT
It's bug 681114 with STR.
Comment 5 Benoit Girard (:BenWa) 2012-05-18 10:21:48 PDT
I couldn't reproduce, but we know this crash is still happening. If you can confirm that it still happens and give more detailed steps that would be wonderful. In the meantime we want to increase the runtime checking and crash easier when we enter this bad state to find who's at fault. Going to post that patch in bug 681114.
Comment 6 Matej Novak [:matej] 2012-05-18 11:01:13 PDT
Seems to only happen for me if I activate the plugin that displays the top side-scrolling promo area. Otherwise it appears fine.
Comment 7 Scoobidiver (away) 2012-05-21 14:25:59 PDT
Matej, can you provide the new crash ID with the try build in bug 681114 comment 17?
Comment 8 Kevin Brosnan [:kbrosnan] 2012-05-21 14:43:56 PDT
Matej I reproduced this and added the info Scoobidiver asked for.
Comment 9 Kevin Brosnan [:kbrosnan] 2012-05-21 14:44:52 PDT
Crashes a current nightly as well on the Droid Razr.
Comment 10 Joe Drew (not getting mail) 2012-05-22 11:53:03 PDT
This seems like it's probably going to be fixed by the fix to bug 681114.
Comment 11 Benoit Girard (:BenWa) 2012-05-30 13:47:12 PDT
The problem if-and-only-if flash is loaded. It seems to crash most of the time for me and when it doesn't crash the horizontal flash ad is significantly out of sync when doing async zooming.

While reproducing this I was able to reproduce 'JS_DHashTableEnumerate' another top crasher.
Comment 12 Benoit Girard (:BenWa) 2012-05-30 14:22:21 PDT
This crash seems to be if-and-only-if flash + Android 2.3.x or LOWER.
Comment 13 Benoit Girard (:BenWa) 2012-06-01 17:06:30 PDT
I've narrowed down the code to flash drawing. I'm still unsure if it's our support code or not yet. I hope to have this narrowed down significantly monday.
Comment 14 Benoit Girard (:BenWa) 2012-06-04 13:50:40 PDT
Right now I have to narrowed down further to PluginLayer.performUpdates(). Still progressing.
Comment 15 Benoit Girard (:BenWa) 2012-06-06 14:30:57 PDT
We tried a few theories unsuccessfully:
- Click to play
- Skia
- Plugin Placeholder surface

Ali has some Valgrind traces showing some serious heap issues he will post once the trace is completed.
Comment 16 Ali Juma [:ajuma] 2012-06-06 14:48:14 PDT
Created attachment 630727 [details]
Valgrind log

This is the complete Valgrind log of a run of Fennec that included visiting kijiji.ca, clicking to play the Flash content, zooming/panning, and then crashing.

The most concerning/suspicious parts of the log are the invalid writes (that is, writing to memory that has already been freed).
Comment 17 Julian Seward [:jseward] 2012-06-06 15:50:16 PDT
(In reply to Ali Juma [:ajuma] from comment #16)
> Created attachment 630727 [details]
> The most concerning/suspicious parts of the log are the invalid writes (that
> is, writing to memory that has already been freed).

Well, basically Fennec is hosed as soon as the first invalid read occurs,
at android::Surface::getBufferIndex.  By this point, the underlying bad
event (refcounting problem?) that screws this up has already occurred.
The invalid write and invalid free later on are merely follow-on effects.

>  at 0xAC7121E6: android::Surface::getBufferIndex(android::sp<android::GraphicBuffer> const&) const (Surface.cpp:1025)
>  by 0xAC71393B: android::Surface::unlockAndPost() (Surface.cpp:1011)
>  by 0xC279EC1: ??? (in /dev/ashmem)

Do you need more frames than this?  I have a crappy stack-scanning hack
that might get you 3 or 4 frames after the ??? -- or it might not, no
promises.  You're welcome to try it if you want.
Comment 18 Benoit Girard (:BenWa) 2012-06-06 18:34:33 PDT
(In reply to Julian Seward from comment #17)
> Do you need more frames than this?  I have a crappy stack-scanning hack
> that might get you 3 or 4 frames after the ??? -- or it might not, no
> promises.  You're welcome to try it if you want.

That would certainly help. I've been looking into the android source code but if we can find if libxul is calling this (or flash) that would help.
Comment 19 Benoit Girard (:BenWa) 2012-06-07 07:46:56 PDT
Something relevant in the log:
W/SurfaceComposerClient( 2762): Destroying surface while a transaction is open. Client 0x219358: destroying surface 142, mTransactionOpen=1
Comment 20 Jeff Muizelaar [:jrmuizel] 2012-06-07 10:00:50 PDT
(In reply to Benoit Girard (:BenWa) from comment #19)
> Something relevant in the log:
> W/SurfaceComposerClient( 2762): Destroying surface while a transaction is
> open. Client 0x219358: destroying surface 142, mTransactionOpen=1

It sounds like the right thing to do here might be to just disable placeholders. This seemed to fix the problem in our testing.
Comment 21 Benoit Girard (:BenWa) 2012-06-07 10:28:06 PDT
Created attachment 631032 [details] [diff] [review]
patch
Comment 22 Benoit Girard (:BenWa) 2012-06-07 11:05:08 PDT
Comment on attachment 631032 [details] [diff] [review]
patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 727116
User impact if declined: Crashes on 2.3 and lower when running Flash and zooming
Testing completed (on m-c, etc.): Tested locally, will be on m-c soon.
Risk to taking this patch (and alternatives if risky): mobile only, very low. Disabling a feature via a pref.
String or UUID changes made by this patch: none.
Comment 24 JP Rosevear [:jpr] 2012-06-07 11:25:12 PDT
Are there any performance implications disabling place holders?
Comment 25 Brad Lassey [:blassey] (use needinfo?) 2012-06-07 11:59:39 PDT
Let's confirm that this fixes the crash from crash-stats.

Product and QA to evaluate the trade off of crashiness vs. the flash zooming experience.
Comment 26 JP Rosevear [:jpr] 2012-06-07 13:57:39 PDT
We should keep digging for a better solution as well, because the tradeoff here could be pretty ugly.
Comment 27 Graeme McCutcheon [:graememcc] 2012-06-08 04:20:42 PDT
https://hg.mozilla.org/mozilla-central/rev/a9455a51d543

(Merged by Ed Morley)
Comment 28 Brad Lassey [:blassey] (use needinfo?) 2012-06-08 07:14:02 PDT
reopening because we're not sure this fixes the crash (should have had a leave open in the whiteboard)
Comment 30 James Willcox (:snorp) (jwillcox@mozilla.com) 2012-06-08 14:41:10 PDT
I've tried a few things to try to allow us to turn placeholders back on:

1) Leave the Surface lock acquired in AndroidJNI::getSurfaceBits(). This would prevent the plugin from locking the surface before we removed it, with the goal that it would not then attempt to unlock it after the surface was destroyed. This didn't work, but I can't really explain why. Might warrant more investigation.

2) Instead of removing the SurfaceView, which destroys the underlying Surface (causing our problem), I thought it might be nice if we could just hide the Surface. Conveniently, Surface has methods for doing this. Sadly, they all require a transaction to work, and only privileged processes are allowed to open a transaction.

At this point I think we should go ahead with the current fix. I will continue to investigate options for getting this working again, but I don't think we'll be able to get anything in before the next build.
Comment 31 Tony Chung [:tchung] 2012-06-11 09:16:41 PDT
qawanted: take a look at stability report today
Comment 32 Scoobidiver (away) 2012-06-11 09:56:18 PDT
(In reply to Tony Chung [:tchung] from comment #31)
> qawanted: take a look at stability report today
The latest crash in Nightly happened in 15.0a1/20120525 long before the patch landed.
The latest crash in Aurora happened in 14.0a2/20120604 before the patch landed.
Comment 33 Brad Lassey [:blassey] (use needinfo?) 2012-06-11 12:35:25 PDT
Comment on attachment 631032 [details] [diff] [review]
patch

Let's back this out of mozilla-central and mozilla-aurora
Comment 34 Jeff Muizelaar [:jrmuizel] 2012-06-11 12:36:42 PDT
(In reply to Brad Lassey [:blassey] from comment #33)
> Comment on attachment 631032 [details] [diff] [review]
> patch
> 
> Let's back this out of mozilla-central and mozilla-aurora

Shouldn't we fix the problems first?
Comment 35 Benoit Girard (:BenWa) 2012-06-11 12:55:12 PDT
(In reply to Brad Lassey [:blassey] from comment #33)
> Comment on attachment 631032 [details] [diff] [review]
> patch
> 
> Let's back this out of mozilla-central and mozilla-aurora

I don't think turning placeholders back on is a good idea. Keep in mind that the crash we are seeing is a heap corruption which:
1) This is not the only signature caused by the bug. I've seen dozens of unique signature while debugging this problem. Judging how frequent this crash is based on this signature is wrong and misleading. This bug accounts for other crashes!
2) Heap corruption does not necessary lead to crash but can cause unpredictable behavior as well as corrupting data.
Comment 36 James Willcox (:snorp) (jwillcox@mozilla.com) 2012-06-12 06:39:47 PDT
Backed out of mozilla-central and mozilla-aurora

https://hg.mozilla.org/mozilla-central/rev/e9bf05c14376
https://hg.mozilla.org/releases/mozilla-aurora/rev/94e4994199c0
Comment 37 Ali Juma [:ajuma] 2012-06-12 06:55:03 PDT
(In reply to Benoit Girard (:BenWa) from comment #35)
> 1) This is not the only signature caused by the bug. I've seen dozens of
> unique signature while debugging this problem. Judging how frequent this
> crash is based on this signature is wrong and misleading. This bug accounts
> for other crashes!

It's worth re-emphasizing that this may be responsible for crashes with other signatures and on other sites. For example, I just tried panning and zooming a flash video on cnn.com; I did this three times, and (rather easily) got the following three crashes:
bp-79c61338-6e65-4358-bf2c-f08722120612
bp-33617e48-e514-4847-91a6-323b92120612
bp-2ec951d2-5cbf-426c-bbe9-d954d2120612

Importantly, note that each of these crashes has a different signature.
Comment 38 James Willcox (:snorp) (jwillcox@mozilla.com) 2012-06-12 07:01:04 PDT
I agree that this could be really bad, fwiw. I hope to have a real fix some time today, however.
Comment 39 Benoit Girard (:BenWa) 2012-06-12 11:15:34 PDT
Re-nominating as a *hard* blocker because of concerns in Comment 35 and 37.

Here's some video of the current behavior:
Try 1: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.55.42.mp4
Try 2: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.55.21.mp4
Try 3: https://dl.dropbox.com/u/10523664/bug_756253/2012-06-12%2013.54.32.mp4

Note that when placeholders run long enough to not crash right away the visual behavior is equivalent to not having placeholders. I was not able to demonstrate this in all my attempts because the heap corruptions is to severe and brings down the browser in seconds.

- All hard evidence points to this crash being our single biggest source of crashes
- It causes heap corruption which is a security problem and makes it impossible to accurately gather crash stats.

I think shipping with placeholder as-is is out of the question! Disabling placeholders at this point is a no-brainer.

The only question in my mind is what additional steps can be take to lower the impact. My suggestion is to either live with the current zoom/panning behavior or hiding flash during panning/zooming.
Comment 40 Kevin Brosnan [:kbrosnan] 2012-06-12 12:48:52 PDT
Placeholders off
http://www.youtube.com/watch?v=PJDBQMZ_iDk
Placeholders on
http://www.youtube.com/watch?v=-nWODB94358
Comment 41 James Willcox (:snorp) (jwillcox@mozilla.com) 2012-06-12 13:12:28 PDT
I relanded Benoit's original patch on m-c and aurora, and also applied to beta per blassey

https://hg.mozilla.org/mozilla-central/rev/d39501066aa4
https://hg.mozilla.org/releases/mozilla-aurora/rev/19a31aa93143
https://hg.mozilla.org/releases/mozilla-beta/rev/03b65c7a3470
Comment 42 Scoobidiver (away) 2012-06-17 00:29:53 PDT
The latest crash on Nightly happened in 15.0a1/20120525 and on Aurora in 14.0a2/20120604.
There are no crashes in 14.0b7.
Comment 43 Scoobidiver (away) 2012-06-17 00:31:26 PDT
*** Bug 681114 has been marked as a duplicate of this bug. ***
Comment 44 Adrian Tamas (:AdrianT) 2012-06-18 02:14:36 PDT
Unable to reproduce the issue on Nightly 16.0a1 2012-06-17, Aurora 15.0a2 2012-06-17 or Firefox Mobile 14 beta 7 build 3 on HTC Desire running Android 2.2.2.

Verifying the issue as fixed and removing qawaned.
Comment 45 JP Rosevear [:jpr] 2012-06-20 08:41:47 PDT
*** Bug 759680 has been marked as a duplicate of this bug. ***
Comment 46 Adrian Tamas (:AdrianT) 2012-09-27 06:54:25 PDT
Per comment 44 changing flags to verified

Note You need to log in before you can comment on or make changes to this bug.