Closed Bug 1242084 Opened 4 years ago Closed 4 years ago

nsTerminator::UpdateTelemetry() leaks 8 byte nsStringBuffer when app.update.channel gets set to Beta

Categories

(Toolkit :: Telemetry, defect, P1)

defect
Points:
3

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox44 --- unaffected
firefox45 --- wontfix
firefox46 --- fixed
firefox47 --- fixed
firefox48 --- fixed

People

(Reporter: RyanVM, Assigned: Dexter)

References

(Blocks 1 open bug)

Details

(Keywords: memory-leak, regression, Whiteboard: [measurement:client])

Attachments

(5 files, 2 obsolete files)

See https://treeherder.mozilla.org/#/jobs?repo=try&revision=158f20c5f57f&group_state=expanded&selectedJob=15819028 which sets app.update.channel to beta on top of mozilla-central.

Nathan, didn't you have patches to print the stack of leaked objects? How would I go about doing that?
Flags: needinfo?(nfroyd)
The patches landed, so it looks like you just need to set some configuration information. Bug 1196430 comment 35 has an example.
That's so cool.

1171 |nsStringBuffer                        |        8        8|  129075        1|
nsTraceRefcnt::DumpStatistics: 1334 entries
Serial Numbers of Leaked Objects:
128108 @13D43EE0 (1 references)
allocation stack:
#00: nsStringBuffer::Realloc(nsStringBuffer *,unsigned int) [xpcom/string/nsSubstring.cpp:248]
#01: nsACString_internal::MutatePrep(unsigned int,char * *,unsigned int *) [xpcom/string/nsTSubstring.cpp:106]
#02: nsACString_internal::ReplacePrepInternal(unsigned int,unsigned int,unsigned int,unsigned int) [xpcom/string/nsTSubstring.cpp:195]
#03: nsACString_internal::ReplacePrep(unsigned int,unsigned int,unsigned int) [xpcom/string/nsTSubstring.cpp:186]
#04: nsACString_internal::Replace(unsigned int,unsigned int,char const *,unsigned int,mozilla::fallible_t const &) [xpcom/string/nsTSubstring.cpp:557]
#05: nsACString_internal::Replace(unsigned int,unsigned int,char const *,unsigned int) [xpcom/string/nsTSubstring.cpp:530]
#06: mozilla::nsTerminator::UpdateTelemetry() [toolkit/components/terminator/nsTerminator.cpp:504]
#07: mozilla::nsTerminator::Observe(nsISupports *,char const *,wchar_t const *) [toolkit/components/terminator/nsTerminator.cpp:443]
#08: nsObserverList::NotifyObservers(nsISupports *,char const *,wchar_t const *) [xpcom/ds/nsObserverList.cpp:113]
#09: nsObserverService::NotifyObservers(nsISupports *,char const *,wchar_t const *) [xpcom/ds/nsObserverService.cpp:311]
#10: mozilla::ShutdownXPCOM(nsIServiceManager *) [xpcom/build/XPCOMInit.cpp:849]
#11: ScopedXPCOMStartup::~ScopedXPCOMStartup() [toolkit/xre/nsAppRunner.cpp:1480]
#12: mozilla::DefaultDelete<ScopedXPCOMStartup>::operator()(ScopedXPCOMStartup *) [mfbt/UniquePtr.h:482]
#13: XREMain::XRE_main(int,char * * const,nsXREAppData const *) [toolkit/xre/nsAppRunner.cpp:4454]
#14: XRE_main [toolkit/xre/nsAppRunner.cpp:4526]
#15: do_main [browser/app/nsBrowserApp.cpp:212]
#16: NS_internal_main(int,char * *) [browser/app/nsBrowserApp.cpp:352]
#17: wmain [toolkit/xre/nsWindowsWMain.cpp:138]
#18: __tmainCRTStartup [f:/dd/vctools/crt/crtw32/startup/crt0.c:255]
#19: kernel32 + 0x17067
TEST-INFO | leakcheck | default process: leaked 1 nsStringBuffer
TEST-UNEXPECTED-FAIL | leakcheck | default process: 8 bytes leaked (nsStringBuffer)
Flags: needinfo?(nfroyd)
I'm wondering if it's related to the crash reporter, looking at the places in nsAppRunner.cpp where MOZ_UPDATE_CHANNEL is being used. Unfortunately, I can't test that on Try AFAICT because the test jobs burn when they can't find a crashreporter symbols package.
Attached file prefs.js that leaks
Here's a prefs.js that leaks (without channel changes). Disabling telemetry by changing the last line to |false| makes it not leak.
Attached file leak info
Component: XPCOM → Telemetry
Flags: needinfo?(dteller)
Keywords: regression
Product: Core → Toolkit
I'm not sure why the leak is related to Beta, but the entire nsTerminator is pretty much designed to leak, given that it's triggered by shutdown and never stopped. The only mystery to me is why 8 bytes and not more.
Flags: needinfo?(dteller)
Assignee: nobody → continuation
Summary: 8 byte nsStringBuffer leak when app.update.channel gets set to Beta → nsTerminator::UpdateTelemetry() leaks 8 byte nsStringBuffer when app.update.channel gets set to Beta
Just disabling e10s and then setting the two telemetry prefs from Jesse's prefs file is enough to trigger the leak.
Well, as mentioned, I don't care about a 8 byte leak-by-design during shutdown, so let's just whitelist it.
(In reply to David Rajchenbach-Teller [:Yoric] (please use "needinfo") from comment #8)
> Well, as mentioned, I don't care about a 8 byte leak-by-design during
> shutdown, so let's just whitelist it.

I'd like to get a better understanding of why we are leaking this particular nsStringBuffer. It could just be that it is being held alive by some string that we white listed already or something, which wouldn't be a big deal.
Locally, I also get an additional nsStringBuffer leak under a different stack:

#05: nsXULAppInfo::GetVersion(nsACString_internal&) (nsAppRunner.cpp:818, in XUL)
#06: mozilla::widget::GfxInfoBase::GetApplicationVersion() (nsCOMPtr.h:403, in XUL)
#07: BlacklistEntryToDriverInfo(nsIDOMNode*, mozilla::widget::GfxDriverInfo&) (nsTString.h:117, in XUL)
This chunk of code does a raw cast on leaking objects to nsStringBuffer and prints out what they contain. (you also need my patch in bug 1245673)

When you run with the stack tracing option for nsStringBuffer you get in the output:

BUFFER CONTENTS {"profile-change-teardown": 0, "profile-before-change": 0, "xpcom-will-shutdown": 0}
BUFFER CONTENTS 47.0a1

So, about what you'd expect.
This particular "leak" looks like a regression from bug 1225682. Before that, we used an nsAutoCString for this static variable.

That doesn't really explain nsTerminator, as I think that did not change in that patch. I also don't know why I see this GfxInfoBase thing locally but it doesn't show up on automated testing. Maybe it is disabled somehow.
So I'm guessing that adding a suppression for this isn't quite as straightforward as I'd hoped because our current suppression code only cares about tab process leaks (which makes sense!).
I don't want to start whitelisting main process leaks. It is very confusing for developers.

I'm going to try to figure out why we're enabling telemetry in all beta debug tests when we aren't otherwise. At least I'm guessing that is the issue.
nsTerminator is leaking because telemetry is enabled. Telemetry is enabled because of this code in Preferences.cpp:

    if (Preferences::GetDefaultCString(kChannelPref).EqualsLiteral("beta")) {
      prerelease = true;
    }

That channel pref is set by configure.in, by --enable-update-channel. If none is specified, then the channel pref is "default".

It looks to me that --enable-update-channel is only ever set in mozconfigs for opt builds that we never do leak checking on:
  http://mxr.mozilla.org/mozilla-central/search?string=--enable-update-channel

If that's right, then this leak will not actually show up when we merge to beta, because our beta debug builds will not set the channel to beta, like Ryan's try push did.

Looking at Linux32 builds for current beta, I see --enable-update-channel= in the build log for an opt build, but not for a debug build (though it doesn't show the value for the former but whatever), which seems to confirm my theory.

Ryan, am I wrong here? Do beta debug builds actually set the update channel somehow?
Flags: needinfo?(ryanvm)
Release Promotion will be changing CI builds to use the same update channel as release builds.
Flags: needinfo?(ryanvm)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #16)
> Release Promotion will be changing CI builds to use the same update channel
> as release builds.

Even for debug builds? This is going to result in odd behavior for telemetry, where it is enabled in Beta debug builds but not Nightly or Aurora debug builds. Georg, do you have some thoughts on how to handle this? Maybe the "Preferences::GetDefaultCString(kChannelPref).EqualsLiteral("beta")" guard could be only included in non-DEBUG builds or something. The conditions under which telemetry are enabled or not seem kind of complicated and I'm not sure I entirely understand them.
Flags: needinfo?(gfritzsche)
Disabling Telemetry on debug builds seems reasonable. I can't imagine we'd be getting any useful data from those builds anyway.
The Telemetry conditions & build flag interactions are definitely too confusing.
From my perspective the important thing is that we don't have Telemetry *send* data from debug builds.

To not create more confusing rules i would propose:
* Telemetry defaults are the same between debug & release
* we don't send Telemetry from debug builds (to not skew our analysis etc.)
Flags: needinfo?(gfritzsche)
Does that sounds sensible?
What is the timeline for the CI builds changing the update channels?
Does that have a meta bug so i can file a blocker for adopting Telemetry accordingly?
Flags: needinfo?(ryanvm)
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> * Telemetry defaults are the same between debug & release
> * we don't send Telemetry from debug builds (to not skew our analysis etc.)

That sounds reasonable to me, but I don't think it will fix the leak detector failure which is caused by nsTerminator recording telemetry-related information late in shutdown.
(In reply to Georg Fritzsche [:gfritzsche] from comment #20)
> Does that sounds sensible?
I'll defer to mccr8.
> What is the timeline for the CI builds changing the update channels?
Rail would know better than I what the current timeframe is looking like.
> Does that have a meta bug so i can file a blocker for adopting Telemetry
> accordingly?
Ditto.
Flags: needinfo?(ryanvm) → needinfo?(rail)
(In reply to Andrew McCreight [:mccr8] from comment #21)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> > * Telemetry defaults are the same between debug & release
> > * we don't send Telemetry from debug builds (to not skew our analysis etc.)
> 
> That sounds reasonable to me, but I don't think it will fix the leak
> detector failure which is caused by nsTerminator recording telemetry-related
> information late in shutdown.

Ok, i think i don't understand that leak exactly.
Do we know what exactly is leaked and why?
I assume from comment 4 et al that this is not specific to beta and just happens in all debug builds with Telemetry enabled?
Flags: needinfo?(continuation)
(In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> Do we know what exactly is leaked and why?

nsTerminator::UpdateTelemetry() creates some heap-allocated nsCStrings very late in shutdown, and does not free them until after shutdown leak detection has finished.

Some possible work-arounds would be:
1. Use nsAutoCString instead, which would not use heap allocations. This is a little tricky, because strings are being passed between two different threads.
2. Don't run this specific code in debug builds.
3. Create a new way to ignore specific leaked objects in the XPCOM shutdown leak detector. This is what the MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT() calls do for LeakSanitizer already. (LSan builds are non-debug, and thus have already had to deal with this issue.)

> I assume from comment 4 et al that this is not specific to beta and just
> happens in all debug builds with Telemetry enabled?

Yes. I can reproduce on m-c with telemetry enabled.
Flags: needinfo?(continuation)
Sorry for the tardy reply...

(In reply to Ryan VanderMeulen [:RyanVM] from comment #22)
> > What is the timeline for the CI builds changing the update channels?
> Rail would know better than I what the current timeframe is looking like.

We haven't decided yet. Firefox 46 or 47 are our current candidates. This will be applicable to beta only for only.

> > Does that have a meta bug so i can file a blocker for adopting Telemetry
> > accordingly?
> Ditto.

It's bug 1118794
Flags: needinfo?(rail)
I think the next step here is for a telemetry person to figure out the right fix.
Assignee: continuation → nobody
Priority: -- → P2
Whiteboard: [measurement:client]
Priority: P2 → P1
Points: --- → 3
Assignee: nobody → alessio.placitelli
See Also: → 1254550
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> The Telemetry conditions & build flag interactions are definitely too
> confusing.
> From my perspective the important thing is that we don't have Telemetry
> *send* data from debug builds.
> 
> To not create more confusing rules i would propose:
> * Telemetry defaults are the same between debug & release
> * we don't send Telemetry from debug builds (to not skew our analysis etc.)

I moved this discussion to bug 1254550.
(In reply to Andrew McCreight [:mccr8] from comment #24)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #23)
> > Do we know what exactly is leaked and why?
> 
> nsTerminator::UpdateTelemetry() creates some heap-allocated nsCStrings very
> late in shutdown, and does not free them until after shutdown leak detection
> has finished.
> 
> Some possible work-arounds would be:
> 1. Use nsAutoCString instead, which would not use heap allocations. This is
> a little tricky, because strings are being passed between two different
> threads.
> 2. Don't run this specific code in debug builds.
> 3. Create a new way to ignore specific leaked objects in the XPCOM shutdown
> leak detector. This is what the MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT() calls
> do for LeakSanitizer already. (LSan builds are non-debug, and thus have
> already had to deal with this issue.)

We should avoid (2) if possible for code coverage.
(3) sounds rather involved, so we probably want to go with a version of (1), depending how tricky that will be.
(In reply to Georg Fritzsche [:gfritzsche] from comment #28)
> > Some possible work-arounds would be:
> > 1. Use nsAutoCString instead, which would not use heap allocations. This is
> > a little tricky, because strings are being passed between two different
> > threads.
> > 2. Don't run this specific code in debug builds.
> > 3. Create a new way to ignore specific leaked objects in the XPCOM shutdown
> > leak detector. This is what the MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT() calls
> > do for LeakSanitizer already. (LSan builds are non-debug, and thus have
> > already had to deal with this issue.)

From the current investigations there seem to be no easy workarounds for this leak notification.
The options now seem to be:
(1) don't run this code in debug builds
(2) implement a new way to ignore specific leaked objects
(3) whitelist the leak until we have (2)


I think we should avoid (1) (other debug coverage etc.).
(2) sounds a bit involved and i think we don't have the time for this now.
Andrew, any issues with us going with (3) until we have (2)?
Happy to file a bug about this.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Created attachment 8715577 [details] [diff] [review]
> Fix GfxInfoBase nsStringBuffer leak.
> 
> This particular "leak" looks like a regression from bug 1225682. Before
> that, we used an nsAutoCString for this static variable.
> 
> That doesn't really explain nsTerminator, as I think that did not change in
> that patch. I also don't know why I see this GfxInfoBase thing locally but
> it doesn't show up on automated testing. Maybe it is disabled somehow.

Should we take this to a different bug so it can get landed?
(In reply to Georg Fritzsche [:gfritzsche] from comment #30)
> Should we take this to a different bug so it can get landed?

I don't know how much it really matters, but sure I could look into landing it.
Flags: needinfo?(continuation)
(In reply to Georg Fritzsche [:gfritzsche] from comment #29)
> I think we should avoid (1) (other debug coverage etc.).
> (2) sounds a bit involved and i think we don't have the time for this now.
> Andrew, any issues with us going with (3) until we have (2)?

It is not okay to add a main process leak suppression. We currently have zero, and as I've seen from the content process leak suppressions, it is very confusing to developers, who often think that the leaks are their fault.

We haven't ever run this code in debug builds, going back since it landed, so I don't think it is too much of a hardship to continue not running it in debug builds until we can come up with a solution that will not negatively impact all Gecko developers.
(In reply to Andrew McCreight [:mccr8] from comment #31)
> I don't know how much it really matters, but sure I could look into landing it.

Filed bug 1255469.
(In reply to Andrew McCreight [:mccr8] from comment #32)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #29)
> > I think we should avoid (1) (other debug coverage etc.).
> > (2) sounds a bit involved and i think we don't have the time for this now.
> > Andrew, any issues with us going with (3) until we have (2)?

> It is not okay to add a main process leak suppression. We currently have
> zero, and as I've seen from the content process leak suppressions, it is
> very confusing to developers, who often think that the leaks are their fault.

Ok, the developer confusion is a good point - otherwise the result of suppression and marking at as intentional seems basically the same.

> We haven't ever run this code in debug builds, going back since it landed,
> so I don't think it is too much of a hardship to continue not running it in
> debug builds until we can come up with a solution that will not negatively
> impact all Gecko developers.

Is there a bug on a mechanism to mark intentional leaks for this?
Flags: needinfo?(continuation)
(In reply to Georg Fritzsche [:gfritzsche] from comment #34)
> Is there a bug on a mechanism to mark intentional leaks for this?

I filed bug 1255478.
Flags: needinfo?(continuation)
Attached patch bug1242084.patch (obsolete) — Splinter Review
This patch prevents the telemetry-related sections of nsTerminator from running on debug builds (StartWriter/UpdateTelemetry). The crash-reported relevant sections of this class are still enabled on debug builds.

This patch also skips the xpcshell "test_terminator_record.js" test for debug builds.
Attachment #8729532 - Flags: review?(gfritzsche)
Comment on attachment 8729532 [details] [diff] [review]
bug1242084.patch

Review of attachment 8729532 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/terminator/nsTerminator.cpp
@@ +357,5 @@
>  nsTerminator::Start()
>  {
>    MOZ_ASSERT(!mInitialized);
>    StartWatchdog();
> +#if !defined(DEBUG)

We already have the |Telemetry::CanRecordExtended()| checks in StartWriter()/UpdateTelemetry() - let's just place the debug checks next to them.

@@ +358,5 @@
>  {
>    MOZ_ASSERT(!mInitialized);
>    StartWatchdog();
> +#if !defined(DEBUG)
> +  // Only allow nsTerminator to write on non-debug builds so we don't leak on shutdown.

Nit: "... so we don't get leak warnings on shutdown for intentional leaks."
We should reference bug 1242084 for context on why we do this.

@@ +451,5 @@
>    }
>  
>    UpdateHeartbeat(aTopic);
> +#if !defined(DEBUG)
> +  // Only allow nsTerminator to write on non-debug builds so we don't leak on shutdown.

Ditto.
Attachment #8729532 - Flags: review?(gfritzsche) → feedback+
Comment on attachment 8729532 [details] [diff] [review]
bug1242084.patch

Review of attachment 8729532 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/terminator/tests/xpcshell/xpcshell.ini
@@ +2,5 @@
>  head=
>  tail=
>  
>  [test_terminator_record.js]
> +skip-if = debug

Let's also comment with a bug number here (e.g. "# bug xxx, bug yyy will re-enable").
Attached patch bug1242084.patch (obsolete) — Splinter Review
(In reply to Georg Fritzsche [:gfritzsche] from comment #37)
> Comment on attachment 8729532 [details] [diff] [review]
> bug1242084.patch
> 
> Review of attachment 8729532 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/terminator/nsTerminator.cpp
> @@ +357,5 @@
> >  nsTerminator::Start()
> >  {
> >    MOZ_ASSERT(!mInitialized);
> >    StartWatchdog();
> > +#if !defined(DEBUG)
> 
> We already have the |Telemetry::CanRecordExtended()| checks in
> StartWriter()/UpdateTelemetry() - let's just place the debug checks next to
> them.
> 

I should have mentioned it earlier (sorry :-( ) that I tried that before. By checking near |Telemetry::CanRecordExtended()| we were breaking DEBUG builds, as we were triggering an unreachable code warning. And warnings are treated as errors on debug builds.
Attachment #8729532 - Attachment is obsolete: true
Attachment #8730155 - Flags: review?(gfritzsche)
Attachment #8730155 - Flags: review?(gfritzsche) → review+
Attached patch bug1242084.patchSplinter Review
Removed a bogus whitespaces.
Attachment #8730155 - Attachment is obsolete: true
Attachment #8730174 - Flags: review?(gfritzsche)
Attachment #8730174 - Flags: review?(gfritzsche) → review+
https://hg.mozilla.org/integration/fx-team/rev/2602e9da8592ce52e68a22203881ab238506cc94
Bug 1242084 - nsTerminator::UpdateTelemetry() leaks 8 byte nsStringBuffer when app.update.channel gets set to Beta. r=gfritzsche
https://hg.mozilla.org/mozilla-central/rev/2602e9da8592
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Comment on attachment 8730174 [details] [diff] [review]
bug1242084.patch

Approval Request Comment
[Feature/regressing bug #]:
[User impact if declined]: None. On the other hand, "release promotion" builds will leak on treeherder.
[Describe test coverage new/current, TreeHerder]: No new test coverage.
[Risks and why]: Low risk, this changes the behaviour on debug builds for a section of code that runs on shutdown.
[String/UUID change made/needed]: None
Attachment #8730174 - Flags: approval-mozilla-beta?
Attachment #8730174 - Flags: approval-mozilla-aurora?
(comment 45 and comment 46 are from bug 1255469, as I forgot to update the bug number.)
Comment on attachment 8730174 [details] [diff] [review]
bug1242084.patch

Needed for release build promotion, Aurora47+, Beta46+
Attachment #8730174 - Flags: approval-mozilla-beta?
Attachment #8730174 - Flags: approval-mozilla-beta+
Attachment #8730174 - Flags: approval-mozilla-aurora?
Attachment #8730174 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.