Closed Bug 982128 Opened 6 years ago Closed 6 years ago

crash in mozilla::layers::ClientTiledLayerBuffer::ReadLock()

Categories

(Core :: Graphics: Layers, defect, critical)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla30
blocking-b2g 1.4+
Tracking Status
b2g-v1.4 --- fixed

People

(Reporter: RobertC, Assigned: bas.schouten)

References

Details

(4 keywords, Whiteboard: [fromAutomation][b2g-crash])

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-6b3bf01b-2895-4c46-917a-c0e172140311.
=============================================================

This crash occurs intermittently on local device, but seems to only be triggered by the automation tests.
It is caused when setting a new alarm in the clock app and the test tries to tap on the repeat element.

Build info:
Gaia      a351fe62c11737c722ad33aaff438f6ccd00bd4a
Gecko     https://hg.mozilla.org/mozilla-central/rev/41d962d23e81
BuildID   20140311040203
Version   30.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013

The test that causes the failure is found at:
gaiatest/tests/functional/clock/test_clock_set_alarm.py
The test ran on hamachi.
I've debugged this a bit more and it seems related to automation using the select menu so I've changed the component to Gaia. I can also replicate this crash in other tests that use the select menu picker.

Basically all this test does is:
1. Open Clock app, creates a new alarm
2. Iterates through the repeat option menu turning on and off then click 'OK' to confirm the selection.

It is much more likely to crash when the automation needs to scroll through options outisde the viewport, then select the item. The automation does it much faster than you can manually so it seems to be a race condition perhaps.

It is replicable about 9/10 times.

Here's another report:
https://crash-stats.mozilla.com/report/index/b809d171-fd36-4415-839a-9bc962140311
Easy to repro in automation, harder (but still possible) to hit on device.
blocking-b2g: --- → 1.4?
Component: Gaia::Clock → Graphics: Layers
Keywords: regression
Product: Firefox OS → Core
Whiteboard: [fromAutomation] → [fromAutomation][b2g-crash]
Version: unspecified → Trunk
Discussed with vlad & bas in IRC - this is a regression from enabling of tiling.
Blocks: b2g-tiling
Assignee: nobody → nical.bugzilla
I got the same crash when navigating in the browser. It's reproduced intermittently with manual testing, but 100% from automation.

STR:
1. go to http://mozqa.com/data/firefox/layout/mozilla.html
2. tap on the "more" link from the community section

Expected result:
Page should load all elements

Actual result:
Browser crashes

Here is another report:
https://crash-stats.mozilla.com/report/index/5f93b84a-2781-44d2-b454-fa4782140312

Tested on hamachi with build:
Gaia      3005269d4dcabcc7d27eaf72bda44a969873af8c
Gecko     https://hg.mozilla.org/mozilla-central/rev/23005b395ae8
BuildID   20140312040203
Version   30.0a1
ro.build.version.incremental=eng.tclxa.20131223.163538
ro.build.date=Mon Dec 23 16:36:04 CST 2013
We crash when trying to increment the lock count on a tile before sending it. This tells us that most likely we either don't have a lock or the lock's share memory wasn't allocated properly. Looking at the code, the second possibility is the most likely because we don't check the result of the shmem allocation in gfxShmSharedLemoryLock's constructor and (which we definitely should), and I couldn't find a code path where we set the tile's buffer without setting the lock.
This patch adds checks and assertions that will make it much easier to validate the above theory.
It doesn't fix the problem, though.
If we failed to allocate the shmem it's most likely because we hit the limit of open file descriptors (yeah, 1 tile = lock = 1FD...). In that case the best thing we can do is to stop using 1 shmem per tile, and allocate a shmem per TiledContentClient or per ClientTiledLayerBuffer. It's probably not a patch I can write by tomorrow and I'll be absent tomorrow the entire day.
Let's make sure the problem actually is the FD limit and decide where to go from there.
Attachment #8389856 - Flags: review?(chrislord.net)
Comment on attachment 8389856 [details] [diff] [review]
Add assertions and checks that the lock exists and was properly allocated

Review of attachment 8389856 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/client/TiledContentClient.cpp
@@ +523,5 @@
>      } else {
>        mBackLock = new gfxShmSharedReadLock(mManager->AsShadowForwarder());
>      }
> +
> +    MOZ_ASSERT(mBackLock->IsValid());

I wonder if we could just return a placeholder tile when this fails, rather than asserting? I imagine this will be gone as soon as Bas finishes up the multiple-locks-in-one-shmem-segment bug though, so it's no big deal.
Attachment #8389856 - Flags: review?(chrislord.net) → review+
(In reply to Chris Lord [:cwiiis] from comment #7)
> I wonder if we could just return a placeholder tile when this fails, rather
> than asserting? I imagine this will be gone as soon as Bas finishes up the
> multiple-locks-in-one-shmem-segment bug though, so it's no big deal.

I thought about doing this but we'd replace a crash by a correctness issue which is harder to find and debug, so I'd rather resolve to this at the last moment if all else fails.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6ceb09baa53
Whiteboard: [fromAutomation][b2g-crash] → [fromAutomation][b2g-crash][leave-open]
We hit this crash consistently in Browser (repro ~75%) and intermittently in other apps (Gallery) 

STR (same as comment 5)
1. Open Browser
2. Navigate to any URL
3. Select any links
==> crash occurs

It fails our manual smoke test browser cases. 

BuildID: 20140312040203
Gaia: 3005269d4dcabcc7d27eaf72bda44a969873af8c
Gecko: 23005b395ae8
Version: 30.0a1
Keywords: smoketest
Yeah, no placeholder tiles please. This will need a complex fix. Even per thebes buffer might not be enough.
blocking-b2g: 1.4? → 1.4+
Keywords: reproducible
I won't be available tomorrow, so I'll let Bas track this (or reassign). I think that this bug is the same as bug 981315, although i am not 100% sure.
Assignee: nical.bugzilla → bas
Duplicate of this bug: 982790
Able to reproduce this somewhat consistently within the Clock App while using the Timer. Please see Bug 982789 for repro steps that were used to experience the crash four times this morning.
Duplicate of this bug: 982916
My work in bug 981315 should make this -a lot- less likely regardless of what else is going on if this is related to the allocation failing.
Depends on: 981315
(In reply to Bas Schouten (:bas.schouten) from comment #18)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/193df828b5a2

This landed for bug 981315. When it's on central, let's see if it fixes this too.
Duplicate of this bug: 983065
Duplicate of this bug: 983063
This crash no longer reproduces during manual testing using the latest master

BuildID: 20140313131233
Gaia: ea9e23abea5933656555d849b922c8da7530c90b
Gecko: fe40387eba1a
Version: 30.0a1
v1.2-device.cfg
Sounds like the pieces landed here have fixed the problem, so closing this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [fromAutomation][b2g-crash][leave-open] → [fromAutomation][b2g-crash]
Target Milestone: --- → mozilla30
You need to log in before you can comment on or make changes to this bug.