Closed Bug 1016538 Opened 10 years ago Closed 10 years ago

Intermittent B2G test_bug346659.html,test_bug365932.html | application crashed [@ mozilla::layers::ISurfaceAllocator::~ISurfaceAllocator] (due to: "Assertion failure: mUsedShmems.empty(), at ../../../gecko/gfx/layers/ipc/ISurfaceAllocator.cpp:53")

Categories

(Core :: Graphics: Layers, defect)

ARM
Gonk (Firefox OS)
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla34
Tracking Status
firefox32 --- wontfix
firefox33 --- fixed
firefox34 --- fixed
firefox-esr24 --- unaffected
firefox-esr31 --- unaffected
b2g-v1.4 --- unaffected
b2g-v2.0 --- fixed
b2g-v2.0M --- fixed
b2g-v2.1 --- fixed

People

(Reporter: ahal, Assigned: nical)

References

Details

(Keywords: assertion, intermittent-failure)

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=40473271&tree=Try Full crash stack: https://pastebin.mozilla.org/5262079 This happens intermittently when setting MOZ_CRASHREPORTER_SHUTDOWN to true (which aborts Gecko when a child process crashes) and calling appStartup.quit()
See Also: 1006273
Severity: normal → critical
Keywords: assertion
Summary: Intermittent emulator mochitest shutdown crash | application crashed [@ mozilla::layers::ISurfaceAllocator::~ISurfaceAllocator] | before MOZ_Assert: Assertion failure: mUsedShmems.empty(), at ../../../gecko/gfx/layers/ipc/ISurfaceAllocator.cpp:53 → Intermittent B2G test_bug365932.html | application crashed [@ mozilla::layers::ISurfaceAllocator::~ISurfaceAllocator] (due to: "Assertion failure: mUsedShmems.empty(), at ../../../gecko/gfx/layers/ipc/ISurfaceAllocator.cpp:53")
Note a spike in this will just be because we're now catching cases that previously went unnoticed, prior to bug 969146.
Turns out this is extraordinarily frequent. Nicolas, can you take a look at this by chance or suggest someone else who could?
Flags: needinfo?(nical.bugzilla)
Summary: Intermittent B2G test_bug365932.html | application crashed [@ mozilla::layers::ISurfaceAllocator::~ISurfaceAllocator] (due to: "Assertion failure: mUsedShmems.empty(), at ../../../gecko/gfx/layers/ipc/ISurfaceAllocator.cpp:53") → Intermittent B2G test_bug346659.html,test_bug365932.html | application crashed [@ mozilla::layers::ISurfaceAllocator::~ISurfaceAllocator] (due to: "Assertion failure: mUsedShmems.empty(), at ../../../gecko/gfx/layers/ipc/ISurfaceAllocator.cpp:53")
Jeff, as module owner, could you find someone to look at this top orange on B2G? (escalating per https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy). (~70 failures on trunk in the last week alone). Thanks :-)
Flags: needinfo?(jmuizelaar)
Sorry for the wait, I was away for 2 weeks. What's happening here: ISurfaceAllocator has a basic allocator which only purpose at the moment is to store the atomic copy-on-write counters of tiles (see gfxShmSharedReadLock). The way this allocator is written, counters can be deallocated on either the client or host side, be the shared memory itself underneath can only be deallocated on the client side. This means that all counters must be deallocated on both sides before the child side of ISurfaceAllocatro shuts down, otherwise there will be nobody to deallocate the shared memory. This assertion means "I am shutting down the client side, but there are still some counters alive". I think that this means that the host is still asynchronously destroying tiles and some of the counters are still alive. It's not entirely clear to me whether the best solution is to make sure that all of the tiles are destroyed on both sides before ISurfaceAllocator's dtor (imposing this kind of constraint on shutdown has proven to be tedious many times), or if we need to make it possible for the host side to destroy the shmems as well (maybe not easy). We are possibly late enough in the shutdown that we don't really have the luxury to send more messages to let the host side deallocate the shmems.
Flags: needinfo?(nical.bugzilla)
Actually, I can reproduce this bug even when all of the gfxShmSharedReadLocks are destroyed, so it's also possible that we are just messing up the Lock/Unlock logic. Still digging...
So what was actually happening was that when the copy-on-write path triggers in GetBackBuffer, we forget to ReadUnlock the tile, and replace the gfxSharedReadLock, loosing the possibility to balance the read count. in the same-process case it's not a big deal since the thing is wrapped in a refcounted object used by both sides, but in the cross-process case we effectively leak shmem sections. This patch also adds an assertion in the same-process path so that we can catch this problem more easily if it happens again.
Assignee: nobody → nical.bugzilla
Attachment #8479091 - Flags: review?(bas)
Oops, the last try push was build-only: https://tbpl.mozilla.org/?tree=Try&rev=bf2821d7bd31
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #102) > FYI, you can trigger tests on those pushes now. > https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs Thanks!
Flags: needinfo?(jmuizelaar)
Attachment #8479091 - Flags: review?(bas) → review+
This bug was first reported when Gecko 32 was on m-c, but the next star wasn't until weeks later when we were up to 33, so I'm not sure if v2.0 is affected by this or not.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #105) > This bug was first reported when Gecko 32 was on m-c, but the next star > wasn't until weeks later when we were up to 33, so I'm not sure if v2.0 is > affected by this or not.
Flags: needinfo?(nical.bugzilla)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #105) > This bug was first reported when Gecko 32 was on m-c, but the next star > wasn't until weeks later when we were up to 33, so I'm not sure if v2.0 is > affected by this or not. Yes, I checked the code and the issue is in 2.0 as well.
Flags: needinfo?(nical.bugzilla)
Comment on attachment 8479091 [details] [diff] [review] Fix the locking logic in tiling Approval Request Comment [Feature/regressing bug #]: [User impact if declined]: memory leak in b2g and e10s (with tiling enabled) [Describe test coverage new/current, TBPL]: [Risks and why]: not too risky, easy to back out (small change) [String/UUID change made/needed]:
Attachment #8479091 - Flags: approval-mozilla-beta?
Attachment #8479091 - Flags: approval-mozilla-b2g32?
Attachment #8479091 - Flags: approval-mozilla-aurora?
Comment on attachment 8479091 [details] [diff] [review] Fix the locking logic in tiling Too late for 32. Sorry.
Attachment #8479091 - Flags: approval-mozilla-beta?
Attachment #8479091 - Flags: approval-mozilla-beta-
Attachment #8479091 - Flags: approval-mozilla-aurora?
Attachment #8479091 - Flags: approval-mozilla-aurora+
Nervous to take any non-blocking changes on 2.0 especially in areas of gfx or areas that are high risk, given our stability situation. :milan, can you give your feedback here on landing this on 2.0 given b2g32 is not taking it ?
Flags: needinfo?(milan)
We can get away with not taking this for desktop - it only effects E10S, and you can memory leak more on desktop before you notice it. Same thing on Android, because we're all in the same process, I don't think we get the memory leak at all. On B2G we do, and we're tight for memory, so if we don't take this for 2.0, we're leaving an intermittent memory leak in place. It may even show up in the MTBF numbers. The change is as safe as a change this late can be.
Flags: needinfo?(milan)
Comment on attachment 8479091 [details] [diff] [review] Fix the locking logic in tiling Approving based on :milan's feedback.
Attachment #8479091 - Flags: approval-mozilla-b2g32? → approval-mozilla-b2g32+
Unable to verify since it's a back end issue.
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?][QAnalyst-verify-] → [QAnalyst-Triage+][QAnalyst-verify-]
Flags: needinfo?(ktucker)
Blocks: Woodduck
See Also: → 1214863
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: