Closed Bug 1193038 Opened 4 years ago Closed 4 years ago

UAF in Telemetry::Accumulate(); code appears to not be thread-safe

Categories

(Toolkit :: Telemetry, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla43
Tracking Status
firefox39 --- wontfix
firefox40 --- wontfix
firefox41 + wontfix
firefox42 + fixed
firefox43 + fixed
firefox-esr38 --- fixed

People

(Reporter: jesup, Assigned: jesup)

References

Details

(4 keywords, Whiteboard: [post-critsmash-triage][adv-main42+][adv-esr38.4+])

Attachments

(3 files, 3 obsolete files)

Repeated crashes in the field with 0x5a signatures.  
Called from
Telemetry::Accumulate(Telemetry::DNS_FAILED_LOOKUP_TIME, millis);

https://crash-stats.mozilla.com/report/index/aabb2399-64d7-4ec7-a888-027b02150805

There is no obvious mention anywhere in the Telemetry code/includes about the thread-safety aspects; thus one should presume it's callable from any thread (and in this case is, from a DNS resolver thread, STS thread, etc).  However, I see no locking around access to shared telemetry structures, which presumably led to this set of crashes.  I would imagine it's also prone to TSAN issues.

The only good thing is that few of the crashes are on Release since most telemetry is disabled on Release (IIRC).  But not all....

Others include  Telemetry::Accumulate(Telemetry::STS_NUMBER_OF_ONSOCKETREADY_CALLS, numberOfOnSocketReadyCalls)
(https://crash-stats.mozilla.com/report/index/c77f5b7c-cc39-44d3-9664-030ff2150807)

More crashes, all ugly:
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29
Component: Untriaged → Telemetry
Product: Core → Toolkit
Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.

That sounds like we're calling into Telemetry *really* late, though; histograms aren't freed until very very late in the shutdown process...or at least they didn't used to be...
(In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
> 
> That sounds like we're calling into Telemetry *really* late, though;
> histograms aren't freed until very very late in the shutdown process...or at
> least they didn't used to be...


The first one looks to be *early*, perhaps profilemanager?  thread 0 is in XRE_main() /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/NtCreateUserProcess().

The second one I link to is crashing in Add() to the array (and has >70 active threads).

Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the first - in LaunchChild(), with a smallish number of threads active, and short uptimes (6, 24, etc).  There are some others like the second (which had uptime = 15519)

Hmmmm I wonder if it *is* the switch to the real profile
Also, bug 1142079 smells like wallpapering over a race like the Add() race (second example above).
(In reply to Randell Jesup [:jesup] from comment #0)
> The only good thing is that few of the crashes are on Release since most
> telemetry is disabled on Release (IIRC).  But not all....

Telemetry will default to enabled on release soon (41 or 42), but we will only record from a limited set of probes.
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> (In reply to Randell Jesup [:jesup] from comment #0)
> > The only good thing is that few of the crashes are on Release since most
> > telemetry is disabled on Release (IIRC).  But not all....
> 
> Telemetry will default to enabled on release soon (41 or 42), but we will
> only record from a limited set of probes.

Meaning that default-on on release should be not - or much less - affected.

The breakdown for the uptime range here puts about half of the crashes in the first minute:
https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29

Note that in absolute numbers per release, it is not ranking particularly high.
(In reply to Randell Jesup [:jesup] from comment #2)
> (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
> > 
> > That sounds like we're calling into Telemetry *really* late, though;
> > histograms aren't freed until very very late in the shutdown process...or at
> > least they didn't used to be...
> 
> 
> The first one looks to be *early*, perhaps profilemanager?  thread 0 is in
> XRE_main()
> /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/
> NtCreateUserProcess().
> 
> The second one I link to is crashing in Add() to the array (and has >70
> active threads).
> 
> Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the
> first - in LaunchChild(), with a smallish number of threads active, and
> short uptimes (6, 24, etc).  There are some others like the second (which
> had uptime = 15519)
> 
> Hmmmm I wonder if it *is* the switch to the real profile

The call sequence seen in at least half these (and most of the 0x5a's) is LaunchChild() called within "if (appInitiatedRestart) {"

So, this would correlate strongly with installing updates and restarting.

This gets set if XRE_mainRun() returns a "please restart" result, so at this point most or all threads should be shut down (and certainly all nsThreads).  Clearly, however, this is not the case.  We're launching a new child, but haven't exited yet, and in the meantime other threads may still be doing stuff.  And clearly a bunch of threads are still running.  Some are OS things (darn, can't see the thread names).  Some are not:

thread 11 was XPCJSRuntime.cpp in WatchdogMain() - and interestingly, it's in a Sleep() (really a CondVar wait) inside an if (!self->ShuttingDown).

Also the sandbox broker service, the toolkit nsTerminator (which may make sense).  There are ~19 threads though; most unclear what they are.

So at a minimum we'd want to either not clear out the telemetry array ever or block usage of it after it's freed (which likely would involve a lock/etc).  We also need to figure out why the DNS Host Resolver is still running in shutdown.  And the other crashes strongly imply the lack of locking here is causing Add() to fail due to (presumed) races and crash with randomish addresses (not 0x5a, but not null ptrs, and a number are EXEC crashes)

cc: mcmanus & jduell because the HostResolver is part of the problem here.

(In reply to Georg Fritzsche [:gfritzsche] from comment #5)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> > (In reply to Randell Jesup [:jesup] from comment #0)
> > > The only good thing is that few of the crashes are on Release since most
> > > telemetry is disabled on Release (IIRC).  But not all....
> > 
> > Telemetry will default to enabled on release soon (41 or 42), but we will
> > only record from a limited set of probes.
> 
> Meaning that default-on on release should be not - or much less - affected.

Sure, but it'd be hard to be sure none of them can trigger the UAFs.  Also, as mentioned, some of these are EXEC crashes not-at-shutdown, which can be especially dangerous.

> The breakdown for the uptime range here puts about half of the crashes in
> the first minute:
> https://crash-stats.mozilla.com/report/
> list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATe
> lemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29
> 
> Note that in absolute numbers per release, it is not ranking particularly
> high.

Sure, but if made exploitable that doesn't matter (much).  The restart one would likely be very hard to exploit, though.  The other 1/2 are more worrying.
Who owns this code and can dig down (especially on the non-restart crashers)?  sec-crit that's going to be active in 41 or maybe 42...
Flags: needinfo?(gfritzsche)
(In reply to Randell Jesup [:jesup] from comment #2)
> (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
> > 
> > That sounds like we're calling into Telemetry *really* late, though;
> > histograms aren't freed until very very late in the shutdown process...or at
> > least they didn't used to be...
> 
> 
> The first one looks to be *early*, perhaps profilemanager?  thread 0 is in
> XRE_main()
> /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/
> NtCreateUserProcess().
> 
> The second one I link to is crashing in Add() to the array (and has >70
> active threads).

Can we narrow down where exactly? Somewhere in Histogram::Add()?

This would kill the StatisticsRecorder (which holds histograms) before the |if (appInitiatedRestart)| handling:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.cc#1200

If we are not running into any leak detection issues, we should be fine with just leaving the StatisticsRecorder around (at least until we find out why other threads are still writing into it wildly).

Alternatively we could skip histogram writes after shutdown - we are not using the data anymore anyway.
Flags: needinfo?(gfritzsche)
The StatisticsRecorder allocation doesn't explain the very early crash though, this location should be early enough and the StatisticsRecorder seems to handle calls before initialization ok:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> The StatisticsRecorder allocation doesn't explain the very early crash
> though, this location should be early enough and the StatisticsRecorder
> seems to handle calls before initialization ok:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350

Later analysis showed that the "early" crashes really are "restart" crashes, while trying to start the new process, and after shutting down/exiting XRE_main.  At least some of those are tied to threads still trying to log Telemetry *after* XRE_main has exited (i.e. thread leaks) - but it shows that the telemetry framework is vulnerable to stuff getting freed and used later.

The other group (non-restart) are more concerning, since the restart ones are likely *very* hard to attack.  But never say never...
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> This would kill the StatisticsRecorder (which holds histograms) before the
> |if (appInitiatedRestart)| handling:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.
> cc#1200

So those seem pretty clear.

I don't really see what the exact issue is with the other ones, e.g. comment 11 - dmajor, do you have any tips/input on the issue here?
Flags: needinfo?(dmajor)
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > This would kill the StatisticsRecorder (which holds histograms) before the
> > |if (appInitiatedRestart)| handling:
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.
> > cc#1200
> 
> So those seem pretty clear.
> 
> I don't really see what the exact issue is with the other ones, e.g. comment
> 11 - dmajor, do you have any tips/input on the issue here?

E.g. i'm wondering:
* whether we can pinpoint more exactly what we are crashing on and
* whether we have something to reason that the others are during shutdown too
> https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads

There's nothing wrong with that one. (It was the other process that failed, and we just happened to be in Telemetry::Accumulate at the time that we snapshotted both processes.)

> https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads

This one has a dead Histogram object. The crash happens when trying to access the vtable for the BucketIndex call inside Histogram::Add. 

The main thread is in LaunchChild. Perhaps the comment "// Restart the app after XPCOM has been shut down cleanly." is relevant? Do we shut down telemetry as part of that? http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/xre/nsAppRunner.cpp#l4208
Flags: needinfo?(dmajor)
Tracking for 41+;leaving 40 for Sylvestre to decide.
(In reply to David Major [:dmajor] from comment #14)
> > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
> 
> There's nothing wrong with that one. (It was the other process that failed,
> and we just happened to be in Telemetry::Accumulate at the time that we
> snapshotted both processes.)

Is there a way to separate those out?  I'll note we have a bunch of crashes in this vsync code (perhaps it's hitting that code *a lot*.

> > https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads
> 
> This one has a dead Histogram object. The crash happens when trying to
> access the vtable for the BucketIndex call inside Histogram::Add. 
> 
> The main thread is in LaunchChild. Perhaps the comment "// Restart the app
> after XPCOM has been shut down cleanly." is relevant? Do we shut down
> telemetry as part of that?
> http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/
> xre/nsAppRunner.cpp#l4208

Yes; we've clearly found a case where update-and-restart can UAF (these are ~50% of the telemetry crashes).  This would be the StatisticsRecorder stuff mentioned above; however it also highlights we have some cases where threads (HostResolver for example) are escaping getting shut down properly (perhaps due to the 1-second sleep timers).

Others to check for validity:
https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9-0d72e2150815#allthreads
(https://crash-stats.mozilla.com/report/index/8723432f-f927-4e7d-9f4d-5a4d52150811)

https://crash-stats.mozilla.com/search/?signature=~Telemetry%3A%3A&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
is a search for all that crashed in Telemetry::

/me wishes for a "search crashstats for X in the crashing thread's backtrace"
Flags: needinfo?(dmajor)
Try push to see if just leaking the StatisticsRecorder triggers anything on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c
(In reply to Randell Jesup [:jesup] from comment #16)
> Yes; we've clearly found a case where update-and-restart can UAF (these are
> ~50% of the telemetry crashes).  This would be the StatisticsRecorder stuff
> mentioned above; however it also highlights we have some cases where threads
> (HostResolver for example) are escaping getting shut down properly (perhaps
> due to the 1-second sleep timers).

Do we actually have a defined time where those should be shutdown?
Should any of those still be active after XRE_mainRun() [0]?
If not we will just make those Telemetry crashes go away and they might still run into undefined behavior elsewhere.

0: https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4389
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> (In reply to Randell Jesup [:jesup] from comment #16)
> > Yes; we've clearly found a case where update-and-restart can UAF (these are
> > ~50% of the telemetry crashes).  This would be the StatisticsRecorder stuff
> > mentioned above; however it also highlights we have some cases where threads
> > (HostResolver for example) are escaping getting shut down properly (perhaps
> > due to the 1-second sleep timers).
> 
> Do we actually have a defined time where those should be shutdown?
> Should any of those still be active after XRE_mainRun() [0]?
> If not we will just make those Telemetry crashes go away and they might
> still run into undefined behavior elsewhere.

nsThreads should be all gone by end of xpcom-shutdown-threads, certainly before this.  I have a patch (unlanded) that asserts on leaked nsThreads I need to un-bitrot; bsmedberg wasn't ready to land it a while back, but we used it to flag a bunch of leaks and kill them.  Bug 988464; I have an un-birotted version (not on the bug right now) I'm going to push through another Try run.

pthreads can leak past there, but need to make sure that they shouldn't do anything...  especially when we return from main().  Best not to leak them either...  In the past there were webrtc pthread leaks that did periodic processing that would occasionally hit a null vtable entry during post-main atexit processing (just because a timer woke them up).
(In reply to Randell Jesup [:jesup] from comment #16)
> (In reply to David Major [:dmajor] from comment #14)
> > > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
> > 
> > There's nothing wrong with that one. (It was the other process that failed,
> > and we just happened to be in Telemetry::Accumulate at the time that we
> > snapshotted both processes.)
> 
> Is there a way to separate those out?  I'll note we have a bunch of crashes
> in this vsync code (perhaps it's hitting that code *a lot*.

I guess you can ignore anything with an ipc_channel_error annotation. Bug 1186092 may make things better eventually.
> Others to check for validity:
> https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9-
> 0d72e2150815#allthreads
That one I don't understand; the registers in the minidump don't agree with the reported exception address.

I don't think looking at this across a bunch of different variations is going to give you anything actionable. It's entirely possible that there are random unrelated crashes mixed in. I'd say find a couple stacks that are most prominent and go after those.
Flags: needinfo?(dmajor)
This one has no ipc_channel_error annotation, it's an illegal instruction, and no indication it's in shutdown:

https://crash-stats.mozilla.com/report/index/58ace115-56f3-4ebf-b8f2-3a4bb2150813#allthreads

I do think some sort of fix for the statisticsrecorder issue is a big part of the answer here.  I have a try running for bug 988464, which may help with other aspects.  We should work to land some sort of fix for the primary crash here, and also probably address the HostResolver running-that-late issue (which can spin to a separate bug).

I do suspect from the set of crashes we see (such as the one above) that there are some more bugs here (as dmajor suggests), but if not, great.  I *do* want us to carefully document and vet the thread-safety aspects of this code!  If we don't (even if it's "all safe" today), we've left a grenade under the furniture that the next person to rearrange it will uncover.
So, the try push is fine, except some leaks that we can annotate/suppress.

However, the StatisticsRecorder had issues with crashes late in shutdown too per bug 1100501.
We need to avoid what bug 1100501, comment 135 describes.
To temporarily fix this issue (until the run-away threads are fixed), we'd need to find a point where those runaway threads should be done, but before jemalloc enters a shutdown state.

Alternatively public Telemetry accumulation functions should all be protected by checks on sTelemetry [0], but nulling that still leaves race potential.

0: https://dxr.mozilla.org/mozilla-central/rev/d5cf4e7900df6b2351bf3677b49fb70bedf68b99/toolkit/components/telemetry/Telemetry.cpp#744
(In reply to Randell Jesup [:jesup] from comment #22)
> and also probably address the HostResolver running-that-late
> issue (which can spin to a separate bug).

I filed bug 1196237 for the HostResolver.
Sorry guys, I'm late to this bug, my mail filters were hiding bugmail form this bug.
I agree we need to document and verify Telemetry threading behaviour. My team is really busy right now but I could document threading next week. Someone else will need to head up fixing the issues found so far, I'll help as much as I can.
Lowering severity slightly because races at startup or shutdown--the majority of these--would be very hard to turn into a useful exploit. With e10s and workers and telemetry leaking into Release it's still quite plausible there's an exploit lurking.
Keywords: sec-criticalsec-high
(In reply to Randell Jesup [:jesup] from comment #6)
> Sure, but it'd be hard to be sure none of them can trigger the UAFs.  Also,
> as mentioned, some of these are EXEC crashes not-at-shutdown, which can be
> especially dangerous.

These appear to be the same crash. EXCEPTION_ACCESS_VIOLATION_READ at 0x5a5a5a8e is the UAF while the memory has been poisoned and not yet re-used, EXCEPTION_ACCESS_VIOLATION_EXEC is when the UAF points at memory that _has_ been reused and we're interpreting part of it as a function pointer. From my POV they're the same (that is, potentially exploitable). This kind of poisoning is good for us because it forces more crashes so we can find and fix the UAFs, but it's not protecting the user.
While likely a separate issue from the shutdown crash which this is mostly about, some of the signatures pulled up by this are definitely non-shutdown crashes with writes to non-nullptr addresses in ::Accumulate():

cec7adac-303a-4e6f-9e78-fce142150820
(https://crash-stats.mozilla.com/report/index/cec7adac-303a-4e6f-9e78-fce142150820#allthreads)
and 
c05545b5-cc4f-47f7-bb89-129782150817
(https://crash-stats.mozilla.com/report/index/c05545b5-cc4f-47f7-bb89-129782150817#allthreads)

and there are more.  These do not appear to be IPC-shutdown-invoked crashes (no indication in the metadata about ipc failure).

A close look at how those crashed *might* give clues to what's clearly another hole in the Telemetry code here, and one that's hit (though not often) during normal operation (not shutdown).  Likely those should spin off to a separate bug and leave this one on the shutdown issue, though I suspect both are fallout from running this multithreaded code lock-free (and Atomic-free)
See Also: → 1197616
Benjamin, can you help find an owner for this bug (and its potential spin-off)? Thanks.
Flags: needinfo?(benjamin)
gfritzsche - You seem to have a patch for the base portion of this.  Can you take it?
Flags: needinfo?(gfritzsche)
Group: core-security → toolkit-core-security
(In reply to Randell Jesup [:jesup] from comment #30)
> gfritzsche - You seem to have a patch for the base portion of this.  Can you
> take it?

The patch doesn't take it all the way, otherwise i'd be happy to get this going.
See comment 23 for the issue against this patch.

The other problem is that - for the shutdown path - it only moves the whole issue to a later time.
If we still have runaway threads then they might trigger undefined behavior elsewhere too.

So i'm torn on this bug here. Anyone have a good/clear idea on this?
Flags: needinfo?(gfritzsche)
IIRC Patrick fixed the nsHostResolver thread problem that caused most of the crashes when we freed this, but there are other leaked threads (I'm slowly nailing them in another set of bugs, so I can land a patch to make nsThread leakage assert).

Note that the biggest crash cause is restart, and we crash while sitting launching the new process.  That shouldn't run afoul of jemalloc shutdown.

I'd say let's go ahead with this.  It certainly won't hurt.
(In reply to Randell Jesup [:jesup] from comment #32)
> IIRC Patrick fixed the nsHostResolver thread problem that caused most of the
> crashes when we freed this, but there are other leaked threads (I'm slowly
> nailing them in another set of bugs, so I can land a patch to make nsThread
> leakage assert).

Great, that sounds good.

> Note that the biggest crash cause is restart, and we crash while sitting
> launching the new process.  That shouldn't run afoul of jemalloc shutdown.

Ah, you're thinking of keeping that storage alive past that launch point?
Sure, that's no problem.
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #33)
> (In reply to Randell Jesup [:jesup] from comment #32)
> > Note that the biggest crash cause is restart, and we crash while sitting
> > launching the new process.  That shouldn't run afoul of jemalloc shutdown.
> 
> Ah, you're thinking of keeping that storage alive past that launch point?
> Sure, that's no problem.

So moving it to after the relaunch wasn't exactly my suggestion (though it will certainly help); I was responding to your comment that we need to deal with comment 23, which links to a comment about use after jemalloc-shutdown.  I was suggesting we do what you had mentioned earlier - leak the object and add it to suppressions.  I'd add a "and add some debug MOZ_ASSERTS if possible to flag use of Accumulate/etc in late shutdown (check a global if need be)." to the wish-list
Comment on attachment 8656664 [details] [diff] [review]
Move StatisticsRecorder cleanup after app relaunch

Review of attachment 8656664 [details] [diff] [review]:
-----------------------------------------------------------------

r+ because it's an improvement.

::: toolkit/xre/nsAppRunner.cpp
@@ +4449,5 @@
>      return rv == NS_ERROR_LAUNCHED_CHILD_PROCESS ? 0 : 1;
>    }
>  
> +  mStatisticsRecorder = nullptr;
> +

better than what we have, for sure.  But still we should look at just removing this and suppressing the leak.
Attachment #8656664 - Flags: review?(rjesup) → review+
(In reply to Randell Jesup [:jesup] from comment #36)
> (In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment
> #33)
> > (In reply to Randell Jesup [:jesup] from comment #32)
> > > Note that the biggest crash cause is restart, and we crash while sitting
> > > launching the new process.  That shouldn't run afoul of jemalloc shutdown.
> > 
> > Ah, you're thinking of keeping that storage alive past that launch point?
> > Sure, that's no problem.
> 
> So moving it to after the relaunch wasn't exactly my suggestion (though it
> will certainly help); I was responding to your comment that we need to deal
> with comment 23, which links to a comment about use after jemalloc-shutdown.
> I was suggesting we do what you had mentioned earlier - leak the object and
> add it to suppressions.  I'd add a "and add some debug MOZ_ASSERTS if
> possible to flag use of Accumulate/etc in late shutdown (check a global if
> need be)." to the wish-list

Ok, i guess the straight leaking would work. Sorry, i don't have time for figuring out the suppressions anymore before i leave, could look into that when i'm back.

What about landing the reviewed patch here, should we be stealthy about landing it?
FWIW, a trivial leaking version without supressions was in this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c
Seems too late to take a fix in 41 for this. If a patch is ready for uplift in the next 24 hrs, please let me know and we can reconsider.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da814562fe6 looks good, with the leak suppressed (a 1-byte leak...)
gfritzche's patch, r=jesup.  I added the suppression
Attachment #8659078 - Flags: review+
Comment on attachment 8659078 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Does this suppress everything allocated by XRE_main only, or stuff allocated by anything called from it?  (Hopefully the first -- and I presume this suppresses all leaks directly allocated in that function, so in theory it could suppress other real leaks, not that I care in this case.)
Attachment #8659078 - Flags: review?(continuation)
Comment on attachment 8659078 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Review of attachment 8659078 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/xre/nsAppRunner.cpp
@@ +3090,5 @@
>    nsCOMPtr<nsIRemoteService> mRemoteService;
>  #endif
>  
>    UniquePtr<ScopedXPCOMStartup> mScopedXPCOM;
> +  base::StatisticsRecorder* mStatisticsRecorder;

Considering nothing is actually reading mStatisticsRecorder, you could make it a local variable in XRE_main (*not* XREMain::XRE_main), would would allow you to leak it and limit the effects of the lsan suppression, since it would then apply to XRE_main, not XREMain::XRE_main.
Comment on attachment 8659095 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Review of attachment 8659095 [details] [diff] [review]:
-----------------------------------------------------------------

copy/paste from irc:
> that's what I was suggesting, yes
> although, I just had a new nasty idea, which may of may not be worth if we have more of this "intentionally leak" scheme
> my nasty idea being something like UniquePtr, but that would be LeakedPtr. There would be a MakeLeaked<> template like there is MakeUnique, and then we could possily generically lsan-suppress all things leaked from MakeLeaked<*>
Attachment #8659095 - Flags: review?(mh+mozilla) → review+
Attachment #8659095 - Flags: review?(continuation) → review-
Attachment #8659078 - Flags: review?(continuation) → review-
That will match any stack that contains XRE_main in any frame. What you want is to call MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT on the object you are intentionally leaking, which is what is described in comment 46.
Assignee: gfritzsche → rjesup
Attachment #8659078 - Attachment is obsolete: true
Attachment #8659095 - Attachment is obsolete: true
Attachment #8659304 - Flags: review?(continuation)
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report

Review of attachment 8659304 [details] [diff] [review]:
-----------------------------------------------------------------

r=me for the LEAK_OBJECT stuff.

::: toolkit/xre/nsAppRunner.cpp
@@ +4471,5 @@
> +  // A initializer to initialize histogram collection, a chromium
> +  // thing used by Telemetry (and effectively a global; it's all static).
> +  // Note: purposely leaked
> +  base::StatisticsRecorder* statistics_recorder = new base::StatisticsRecorder();
> +  MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);

You should #include mozilla/MemoryChecking.h explicitly in this file, if it isn't there already.
Attachment #8659304 - Flags: review?(continuation) → review+
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: longer exposure (though low risk of that being exploited)

[Describe test coverage new/current, TreeHerder]: Shutdown is tested.

[Risks and why]: very low risk

[String/UUID change made/needed]: none

[Security approval request comment]
How easily could an exploit be constructed based on the patch?  Highly unlikely

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?  Somewhat - clear we're leaking, so one could look for why that's important.

Which older supported branches are affected by this flaw? all

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?  Trivial backporting.

How likely is this patch to cause regressions; how much testing does it need?  No likelihood of regressions.  No significant testing needed.  At most, do some update-and-restarts.
Attachment #8659304 - Flags: sec-approval?
Attachment #8659304 - Flags: approval-mozilla-aurora?
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report

a=dveditz for sec-approval and aurora landing.
Attachment #8659304 - Flags: sec-approval?
Attachment #8659304 - Flags: sec-approval+
Attachment #8659304 - Flags: approval-mozilla-aurora?
Attachment #8659304 - Flags: approval-mozilla-aurora+
Note: this code is still a mess thread-safety-wise, at least documenting it and TSAN.
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/56f6468c7e60 because suppressing the very generic leak in Valgrind made us nervous.
Tested with this patch (https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf91734da97f), and with a patch to introduce a leak in something called from a deeper function to ensure reporting is still working (https://treeherder.mozilla.org/#/jobs?repo=try&revision=7cac87e190aa)
Attachment #8660398 - Flags: review?(n.nethercote)
Comment on attachment 8660398 [details] [diff] [review]
Suppress leak in valgrind as well

Review of attachment 8660398 [details] [diff] [review]:
-----------------------------------------------------------------

::: build/valgrind/cross-architecture.sup
@@ +39,5 @@
> +   Memcheck:Leak
> +   fun:malloc
> +   fun:moz_xmalloc
> +   fun:operator new
> +   fun:XRE_main

I have no context but this is a worryingly generic suppression. I suggest you create a separate function that does nothing but allocate the StatisticsReporter and call it from main(). That way there would be, for example, a "fun:LeakyStatisticsReporterAllocator" entry between the "fun:operator new" and the "fun:XRE_main", which would make things clearer and less generic.
Attachment #8660398 - Flags: review?(n.nethercote)
This suppression should only over allocations directly from XRE_main(), which is a very simple function.  Before the change it was basically:
int XRE_main(int argc, char* argv[], const nsXREAppData* aAppData, uint32_t aFlags)
{
  XREMain main;
  int result = main.XRE_main(argc, argv, aAppData);
  mozilla::RecordShutdownEndTimeStamp();
  return result;
}

The suppression is limited to allocations with new call directly from XRE_main.  I tested that a leak in main.XRE_main() is reported.  And I was hesitant to do the stub-function trick (though I coded that as well) because the compiler might inline it given how simple it is.  

All that said, it's trivial to dump it into a stub function if you prefer.
(In reply to Randell Jesup [:jesup] from comment #59)
> All that said, it's trivial to dump it into a stub function if you prefer.

That would save a future person looking at the valgrind suppressions from having the same reaction as me :)  Thank you.
Green try with separate stub: https://treeherder.mozilla.org/#/jobs?repo=try&revision=23cab817081b
and a red one with a deliberate leak elsewhere; carrying forward r=njn
Georg, are you planning to request an uplift to 42? Thanks
Flags: needinfo?(gfritzsche)
Moving to jesup who should have a better overview here now :)
Flags: needinfo?(gfritzsche)
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Approval Request Comment
[Feature/regressing bug #]: N/A

[User impact if declined]: hard-to-exploit UAF in shutdown, typically when doing a restart installing an update.

[Describe test coverage new/current, TreeHerder]: extremely hard repeating the bug to test in automation, or even manually.  Normal shutdown is well tested.

[Risks and why]: Almost none.  Just leaks what is effectively a global.

[String/UUID change made/needed]: none
Attachment #8660399 - Flags: approval-mozilla-aurora?
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Group: toolkit-core-security → core-security-release
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Let's take it.
Attachment #8660399 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

[Triage Comment]
42 is now in beta.
Attachment #8660399 - Flags: approval-mozilla-aurora+ → approval-mozilla-beta+
Flags: needinfo?(benjamin)
Whiteboard: [post-critsmash-triage]
Randell, do you want to request uplift for this to ESR? Thanks.
Flags: needinfo?(rjesup)
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

See Aurora approval request
Flags: needinfo?(rjesup)
Attachment #8660399 - Flags: approval-mozilla-esr38?
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report

Fix for shutdown and restart leak. Approved for esr-38
Attachment #8660399 - Flags: approval-mozilla-esr38? → approval-mozilla-esr38+
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main42+][adv-esr38.4+]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.