Massive memory leak in Screenshot web extension when Firefox gets shutdown too quickly after startup

RESOLVED FIXED in Firefox 55

Status

()

defect
--
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: whimboo, Assigned: ianbicking)

Tracking

({crash, regression})

unspecified
Firefox 56
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox54 unaffected, firefox55 fixed, firefox56 fixed)

Details

Attachments

(4 attachments)

Until I know in which component this regression has been introduced recently, I will file it as a general core bug.

The assertion and crash is always happening when running the attached Marionette test 


Assertion failure: isEmpty() (failing this assertion means this LinkedList's creator is buggy: it should have removed all this list's elements before the list's destruction), at /home/worker/workspace/build/src/obj-firefox/dist/include/mozilla/LinkedList.h:455
#01: __cxa_finalize_ranges[/usr/lib/system/libsystem_c.dylib +0x60178]
#02: exit[/usr/lib/system/libsystem_c.dylib +0x604b2]
Here is the pushlog for the regression range:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=8726e36dafe1&tochange=1b3359583493

So it points to bug 1322235, bug 1368102, or bug 1368152.

Please note that there is also a huge leak involved in that assertion.

While I continue to narrow it down, Kris maybe you have an idea already.
Flags: needinfo?(kmaglione+bmo)
Turns out to be all bug 1368152 related. While changeset ef5b869561f4 is working fine, it's broken for 1b3359583493.

Did we miss something to remove for artifact builds, and which caused this bustage?

I will continue to check the remaining commits, so that it will be total clear which of those caused this regression.
Blocks: 1368152
Also setting ni? to Chris given that he knows a lot about artifact builds, and I already spoke to him yesterday about this crasher.
Flags: needinfo?(cmanchester)
Henrik, I can't reproduce the crash. Can you include the exact command you're using to invoke the tests, and also a full log of the build that produces the error?
Flags: needinfo?(cmanchester)
For postery here my mozconfig file:

mk_add_options AUTOCLOBBER=1
mk_add_options MOZ_OBJDIR=@TOPSRCDIR@/obj/debug/

ac_add_options --enable-artifact-builds
ac_add_options --enable-debug

# Set environment variables
export BUILD_SYSTEM_TELEMETRY=1

Attached you can find a full log output from this Marionette test.
Is this a shutdown crash? If so, the isEmpty() assertion is generally just a symptom of a shutdown leak, not the cause.
Flags: needinfo?(kmaglione+bmo)
I was finally able to reproduce this, but it turns out this is the difference between a packaged build and a build run from the objdir. I can reproduce with a full build run locally, but I can't with a packaged artifact build from automation.
(In reply to Kris Maglione [:kmag] (busy; behind on reviews) from comment #8)
> Is this a shutdown crash? If so, the isEmpty() assertion is generally just a
> symptom of a shutdown leak, not the cause.

Yes, this is a massive leak during shutdown.

Also not sure how those assertions right before the leak is getting reported fall into this issue:

[Parent 92864] WARNING: '!mMainThread', file /home/worker/workspace/build/src/xpcom/threads/nsThreadManager.cpp, line 330
[Parent 92864] WARNING: Fonts still alive while shutting down gfxFontCache: 'mFonts.Count() == 0', file /home/worker/workspace/build/src/gfx/thebes/gfxFont.cpp, line 216
Assertion failed at /home/worker/workspace/build/src/gfx/cairo/cairo/src/cairo-hash.c:196: hash_table->live_entries == 0
WARNING: YOU ARE LEAKING THE WORLD (at least one JSRuntime and everything alive inside it, that is) AT JS_ShutDown TIME.  FIX THIS!
[Parent 92864] ###!!! ASSERTION: Component Manager being held past XPCOM shutdown.: 'cnt == 0', file /home/worker/workspace/build/src/xpcom/build/XPCOMInit.cpp, line 1042
(In reply to Kris Maglione [:kmag] (busy; behind on reviews) from comment #8)
> Is this a shutdown crash? If so, the isEmpty() assertion is generally just a
> symptom of a shutdown leak, not the cause.

So leaks we only analyze for debug builds, right? Which means it would also affect other builds including opt.

(In reply to Chris Manchester (:chmanchester) from comment #9)
> difference between a packaged build and a build run from the objdir. I can
> reproduce with a full build run locally, but I can't with a packaged
> artifact build from automation.

Interesting.

Kris, so can this be related to your code changes?
Flags: needinfo?(kmaglione+bmo)
(In reply to Henrik Skupin (:whimboo) from comment #11)
> Kris, so can this be related to your code changes?

It may be related, yes, but it's almost certainly caused by a race in the Screenshots extension's startup and shutdown code that just happened to show up after those changes.
Flags: needinfo?(kmaglione+bmo)
Would you mind telling me which changeset you mean? I frankly cannot find anything screenshot related on inbound:

https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?startdate=2017-06-04&enddate=2017-06-06

Also I did several test runs with your changeset https://hg.mozilla.org/integration/mozilla-inbound/rev/1b3359583493  included and not. And the leak only happens with it included, but not without it.
Flags: needinfo?(kmaglione+bmo)
(In reply to Henrik Skupin (:whimboo) from comment #13)
> Also I did several test runs with your changeset
> https://hg.mozilla.org/integration/mozilla-inbound/rev/1b3359583493 
> included and not. And the leak only happens with it included, but not
> without it.

I agree that this change is triggering the problem, but it's not the cause. If I had to guess, I'd say that prior to this change, Screenshots startup was failing because of a fatal error in the StartupCache code that became non-fatal after that change.

But this shutdown leak is a known issue, due to the way that Screenshots delays its initialization code so that it winds up racing with its shutdown code in very short sessions. It's something that needs to be fixed, but it's not currently a problem in real-world usage.
Flags: needinfo?(kmaglione+bmo)
(In reply to Kris Maglione [:kmag] from comment #14)
> But this shutdown leak is a known issue, due to the way that Screenshots
> delays its initialization code so that it winds up racing with its shutdown
> code in very short sessions. It's something that needs to be fixed, but it's
> not currently a problem in real-world usage.

Thanks Kris. So this actually is a problem for our Marionette tests which use restarts or quits. Would there be a change to get this fixed? Mahendra, mind having a look at this? If there is no way to get this fixed soon, we may have to do the workaround in disabling Screenshots when Marionette tests are running. As it looks like it can be done via "extensions.screenshots.disabled".

Is there already an existent bug which covers this leak?
Flags: needinfo?(mpotharaju)
I've created https://github.com/mozilla-services/screenshots/issues/3057 for what Screenshots has to do.  I think it might be a really simple fix (as mentioned in the bug), but I'm not 100% clear on the problem.
Flags: needinfo?(mpotharaju)
Attachment #8880493 - Flags: review?(hskupin)
I don't know how to run the right tests to see if this fixes it, but the patch attached here makes the suggested change and should wait for the embedded WebExtension to start up before trying to shut it down.
Thank you Ian for the patch. I actually cannot review this patch, but I will have a look early tomorrow morning and check if that fixes the Marioentte test.

If you want to test it yourself, just download the minimal test and run it with "mach marionette-test path_to_file -vv --gecko-log -".
Flags: needinfo?(hskupin)
Component: General → Screenshots
Flags: needinfo?(hskupin)
Product: Core → Cloud Services
Version: 55 Branch → unspecified
Assignee: nobody → ianb
Status: NEW → ASSIGNED
Flags: needinfo?(hskupin)
So I had a look at the patch right now. After applying it and re-running "mach build" for my artifact build on OS X, i have to say that this patch alone doesn't fix it. It still leaks. I can suppress it with a sleep(.1) in the Python code. So it seems to be a very racy condition.
Flags: needinfo?(hskupin)
Summary: Assertion and crash for debug artifact builds on OS X: isEmpty() (failing this assertion means this LinkedList's creator is buggy: it should have removed all this list's elements before the list's destruction) → Massive memory leak in Screenshot web extension when Firefox gets shutdown too quickly after startup
Attachment #8880493 - Flags: review?(kmaglione+bmo)
I sat down with Henrik and we got the test working with this new patch.
Comment on attachment 8880493 [details]
Bug 1373614 - shut down Screenshots WebExtension unconditionally on browser shutdown

https://reviewboard.mozilla.org/r/151830/#review158944
Attachment #8880493 - Flags: review?(kmaglione+bmo) → review+
Keywords: checkin-needed
Pushed by hskupin@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5d527ebb2371
shut down Screenshots WebExtension unconditionally on browser shutdown r=kmag
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/5d527ebb2371
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Ian, is it worth trying to get this fix in 55?  If so please request uplift to beta when you get a chance.
Flags: needinfo?(ianb)
I think this only affects specific testing situations, I'll defer to Henrik if he thinks there's value in uplift
Flags: needinfo?(ianb) → needinfo?(hskupin)
Ian, it would be great to see it uplifted. It's even a very tiny patch, and I assume it should apply directly.
Flags: needinfo?(hskupin) → needinfo?(ianb)
Comment on attachment 8880493 [details]
Bug 1373614 - shut down Screenshots WebExtension unconditionally on browser shutdown

Approval Request Comment
[Feature/Bug causing the regression]:
Screenshots

[User impact if declined]:
Causes memory leak detection to be triggered during testing scenarios when the browser is started and stopped very quickly

[Is this code covered by automated tests?]:
No (except that we have automated tests that see the problem this patch resolves)

[Has the fix been verified in Nightly?]:
Yes

[Needs manual test from QE? If yes, steps to reproduce]: 
No

[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
No

[Why is the change risky/not risky?]:
The change only affects shutdown scenarios

[String changes made/needed]:
None
Flags: needinfo?(ianb)
Attachment #8880493 - Flags: approval-mozilla-beta?
Comment on attachment 8880493 [details]
Bug 1373614 - shut down Screenshots WebExtension unconditionally on browser shutdown

screenshots shutdown fix, beta55+
Attachment #8880493 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
(In reply to Ian Bicking (:ianb) from comment #30)
> [Is this code covered by automated tests?]:
> No (except that we have automated tests that see the problem this patch
> resolves)
> 
> [Has the fix been verified in Nightly?]:
> Yes
> 
> [Needs manual test from QE? If yes, steps to reproduce]: 
> No

Setting qe-verify- based on Ian's assessment on manual testing needs.
Flags: qe-verify-
Target Milestone: --- → mozilla56
Target Milestone: mozilla56 → Firefox 56
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.