Closed Bug 1068268 Opened 6 years ago Closed 6 years ago

B2G M9 shutdown leaks involving APZC

Categories

(Core :: Panning and Zooming, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35
Tracking Status
firefox33 --- unaffected
firefox34 --- fixed
firefox35 --- fixed
b2g-v2.0 --- unaffected
b2g-v2.0M --- unaffected
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: mccr8, Assigned: kats)

References

Details

(Whiteboard: [MemShrink:P2])

Attachments

(3 files)

For some reason, we leak more in M9 than in other tests.  This is a regression in the last month or something, because we silently stopped reporting shutdown leaks.  The additional leaks are something like 6 AsyncPanZoomController, 2 CompositorParent, 6 CondVar, 2 CrossProcessMutex, 1 GeckoContentController, 2 GestureEventListener, 6 Mutex, 1 PCompositorParent, 3 ReentrantMonitor, 1 RefCountedMonitor, 2 RefCountedTask, 4 SharedMemory, 1 WeakReference<MessageListener>, 4 ZoomConstraints, 1 ipc::MessageChannel, 14 nsTArray_base.  About 11kb in total, which isn't huge, but this gets in the way of letting us catch failures to create leak logs and other things.
No longer blocks: 962871
I'm doing a local emulator build to see if I can reproduce this locally.
See Also: → 1067958
First attempt at running the M-9 locally failed. I'll have to dig through a bunch of harness code to figure out what's going on.
You'll need the patches in bug 962871 for it to produce a leak log at all.
Whiteboard: [MemShrink] → [MemShrink:P2]
After another attempt I was able to successfully run the M9 chunk on b2g-emulator. I think the port it was trying to use was just busy before. I'll apply the patches from bug 962871 and see if I can repro the leak. In the run I just did there was "no output line for total leaks!"
Depends on: 1068872
Ok, so I updated gecko to git sha eff47d926 which is just after the first landing of part 3 in bug 962871 and it's still not producing a leak log. The only leak-related stuff output is this:

TEST-INFO | leakcheck | threshold set at 5116 bytes

WARNING | leakcheck | missing output line for total leaks!

Any ideas?
Hmm.  Well, my current theory is that the bug just made us stop timing out somehow, so if your machine is running things more slowly then you'll hit the timeout anyways.  You could try running smaller chunks and see if that gets you a leak log.  So, instead of running chunk 9 of 16, run 18 and 19 of 32, or whatever.
I went all the way down to 120 chunks and still nothing.
It still had "WARNING | leakcheck | missing output line for total leaks!"?  Odd.
Do you have any ideas, Kyle?
Flags: needinfo?(khuey)
(In reply to Andrew McCreight [:mccr8] from comment #8)
> It still had "WARNING | leakcheck | missing output line for total leaks!"? 
> Odd.

Yeah. I added some more logging in harness, it seems to call processSingleLeakFile on a file in /tmp but I can't find that file afterwards. I'm not really sure what the race condition you're referring to, but I'm assuming that file is still being written to when the harness runs? And then it gets deleted before I get around to looking for it?
Even if I change [1] to pass an argument of delete=False so the file doesn't get deleted, it remains empty throughout. Do you know where the leakcheck output comes from? Can I just have it print to logcat instead of a file?

[1] http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtestsb2g.py?rev=1db4be77a409#201
If the log is always empty, maybe the process is dying before the log gets written to?  If you still want to try to get it to log somewhere else, the code lives in nsTraceRefcnt::DumpStatistics().
Is there anything I need in a mozconfig or userconfig to enable this? I tried making it log to logcat but I'm not seeing anything at all.
Yeah, that suggests we're just not getting to the place that does the logging.

I've never run B2G emulator mochitests myself, but it should just set up all of the environment variables for leak checking, assuming you have a debug build.
Oh another thing, are you using emulator or emulator-jb?  The former is what is leaking on TBPL.  I have no idea what the difference is, and I'm not sure why I thought you were using the latter, but maybe that's something else to try.
I had to modify the gonk-misc/default-gecko-config to add --enable-logrefcnt. Not sure how that happens on buildbot but it doesn't happen by default in emulator builds. Now I'm getting leakcheck output, will try to isolate the right leaks.
For the record, I'm running ./mach mochitest-remote --this-chunk 9 --total-chunks 15 because those are the chunk numbers it indicates in the inbound logs I saw. However the set of tests that actually gets run for me is much larger than the tests that get run on buildbot.

For example at [1] the first test that gets run is tests/dom/devicestorage/test/test_fs_remove.html and the last one is tests/dom/indexedDB/test/test_invalid_version.html. In my run, the first test is tests/dom/events/test/test_bug593959.html and the last is tests/dom/imptests/html/html/dom/documents/dta/test_nameditem-06.html. I also see a much larger leak (542519 bytes) compared to that inbound run (16244 bytes). I do still see 6 APZC objects getting leaked though, so maybe that's good enough and I can try to figure out why they're getting leaked.

[1] https://tbpl.mozilla.org/php/getParsedLog.php?id=48219877&tree=Mozilla-Inbound&full=1
Flags: needinfo?(khuey)
In the end I think trying to reproduce this problem locally was a waste of time. By reasoning through the log above and some code inspection I think I found the problem.
My trychooser syntax was wrong (or rather, try didn't respect my syntax) but it doesn't matter because my theory was wrong. My local emulator run still shows the leak even with my patch. I narrowed down the leak to the dom/events/test/test_moz_mouse_pixel_scroll_event.html test.
Second attempt; this one actually fixes the problem for me locally. https://tbpl.mozilla.org/?tree=Try&rev=5f94797d2e73
Attached patch Part 1 - The fixSplinter Review
The problem occurs when we have two layers that have the same APZC (call it "A"), but those two layers have subtrees with different APZCs. When we recurse into the second subtree, we set the parent pointer to A and "nextsibling" pointer to null, even though A already has children from the first subtree.

As a result, those children A already had get clobbered and leaked. It also results in an incorrect APZC tree. To fix this the critical line is to set |next| to apzc->GetFirstChild() which makes sure the two subtrees of A get bound into a single tree. The rest is just cleanup and tests.
Assignee: nobody → bugmail.mozilla
Attachment #8492347 - Flags: review?(botond)
This is redundant for refcounted things. Also it results in the leakchecker reporting twice as many objects created/leaked.
Attachment #8492348 - Flags: review?(botond)
Random thing I noticed while poking around.
Attachment #8492350 - Flags: review?(botond)
This bug was exposed from the layer tree changes in bug 967844 and will need uplifting to 34 as well.
Comment on attachment 8492348 [details] [diff] [review]
Part 2 - Remove redundant ctor-counting

Review of attachment 8492348 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/apz/src/AsyncPanZoomController.cpp
@@ -885,5 @@
>       mAPZCId(sAsyncPanZoomControllerCount++),
>       mSharedLock(nullptr),
>       mAsyncTransformAppliedToContent(false)
>  {
> -  MOZ_COUNT_CTOR(AsyncPanZoomController);

You should use MOZ_COUNT_CTOR_INHERITED, if I'm understanding your bug description correctly.

@@ -893,5 @@
>    }
>  }
>  
>  AsyncPanZoomController::~AsyncPanZoomController() {
> -  MOZ_COUNT_DTOR(AsyncPanZoomController);

...and resp. MOZ_COUNT_DTOR_INHERITED.
(Clarified on IRC with :froydnj that removing these macros is correct based on the comment at http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsISupportsImpl.h?rev=b276ce875275#168 and the fact that there is no class hierarchy around AsyncPanZoomController that would necessitate using the _INHERITED versions.)
Comment on attachment 8492347 [details] [diff] [review]
Part 1 - The fix

Review of attachment 8492347 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/apz/src/APZCTreeManager.cpp
@@ +439,5 @@
> +  // have our siblings as their siblings, and our parent as their parent.
> +  AsyncPanZoomController* next = aNextSibling;
> +  if (apzc) {
> +    // Otherwise, use this APZC as the parent going downwards, and start off
> +    // with it's first child as the next sibling

s/it's/its
Attachment #8492347 - Flags: review?(botond) → review+
Attachment #8492348 - Flags: review?(botond) → review+
Attachment #8492350 - Flags: review?(botond) → review+
Build-only try run to make sure the gtest passes everywhere: https://tbpl.mozilla.org/?tree=Try&rev=e34b89a551aa
Comment on attachment 8492347 [details] [diff] [review]
Part 1 - The fix

Approval Request Comment
[Feature/regressing bug #]: bug 967844
[User impact if declined]: in some situations the apz code will leak objects. These cases will also have incorrect scrolling behavior because the apz code does not build the right data structure to track scrollable frames.
[Describe test coverage new/current, TBPL]: added gtest, also tested locally
[Risks and why]: low risk, change is relatively straightforward
[String/UUID change made/needed]: none
Attachment #8492347 - Flags: approval-mozilla-aurora?
No longer depends on: 1068872
Comment on attachment 8492347 [details] [diff] [review]
Part 1 - The fix

Aurora+
Attachment #8492347 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Is this bug B2G specific? Are desktop or mobile affected?
Flags: needinfo?(bugmail.mozilla)
This only affects platforms with APZ enabled, so B2G and Metro at this point.
Flags: needinfo?(bugmail.mozilla)
You need to log in before you can comment on or make changes to this bug.