Closed
Bug 1193038
Opened 9 years ago
Closed 9 years ago
UAF in Telemetry::Accumulate(); code appears to not be thread-safe
Categories
(Toolkit :: Telemetry, defect)
Toolkit
Telemetry
Tracking
()
People
(Reporter: jesup, Assigned: jesup)
References
Details
(4 keywords, Whiteboard: [post-critsmash-triage][adv-main42+][adv-esr38.4+])
Attachments
(3 files, 3 obsolete files)
1.51 KB,
patch
|
jesup
:
review+
|
Details | Diff | Splinter Review |
967 bytes,
patch
|
n.nethercote
:
review+
|
Details | Diff | Splinter Review |
3.66 KB,
patch
|
jesup
:
review+
Sylvestre
:
approval-mozilla-beta+
lizzard
:
approval-mozilla-esr38+
|
Details | Diff | Splinter Review |
Repeated crashes in the field with 0x5a signatures.
Called from
Telemetry::Accumulate(Telemetry::DNS_FAILED_LOOKUP_TIME, millis);
https://crash-stats.mozilla.com/report/index/aabb2399-64d7-4ec7-a888-027b02150805
There is no obvious mention anywhere in the Telemetry code/includes about the thread-safety aspects; thus one should presume it's callable from any thread (and in this case is, from a DNS resolver thread, STS thread, etc). However, I see no locking around access to shared telemetry structures, which presumably led to this set of crashes. I would imagine it's also prone to TSAN issues.
The only good thing is that few of the crashes are on Release since most telemetry is disabled on Release (IIRC). But not all....
Others include Telemetry::Accumulate(Telemetry::STS_NUMBER_OF_ONSOCKETREADY_CALLS, numberOfOnSocketReadyCalls)
(https://crash-stats.mozilla.com/report/index/c77f5b7c-cc39-44d3-9664-030ff2150807)
More crashes, all ugly:
https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29
Assignee | ||
Updated•9 years ago
|
Component: Untriaged → Telemetry
Product: Core → Toolkit
Comment 1•9 years ago
|
||
Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
That sounds like we're calling into Telemetry *really* late, though; histograms aren't freed until very very late in the shutdown process...or at least they didn't used to be...
Assignee | ||
Comment 2•9 years ago
|
||
(In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
>
> That sounds like we're calling into Telemetry *really* late, though;
> histograms aren't freed until very very late in the shutdown process...or at
> least they didn't used to be...
The first one looks to be *early*, perhaps profilemanager? thread 0 is in XRE_main() /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/NtCreateUserProcess().
The second one I link to is crashing in Add() to the array (and has >70 active threads).
Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the first - in LaunchChild(), with a smallish number of threads active, and short uptimes (6, 24, etc). There are some others like the second (which had uptime = 15519)
Hmmmm I wonder if it *is* the switch to the real profile
Assignee | ||
Comment 3•9 years ago
|
||
Also, bug 1142079 smells like wallpapering over a race like the Add() race (second example above).
Comment 4•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #0)
> The only good thing is that few of the crashes are on Release since most
> telemetry is disabled on Release (IIRC). But not all....
Telemetry will default to enabled on release soon (41 or 42), but we will only record from a limited set of probes.
Comment 5•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> (In reply to Randell Jesup [:jesup] from comment #0)
> > The only good thing is that few of the crashes are on Release since most
> > telemetry is disabled on Release (IIRC). But not all....
>
> Telemetry will default to enabled on release soon (41 or 42), but we will
> only record from a limited set of probes.
Meaning that default-on on release should be not - or much less - affected.
The breakdown for the uptime range here puts about half of the crashes in the first minute:
https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29
Note that in absolute numbers per release, it is not ranking particularly high.
Updated•9 years ago
|
Keywords: csectype-race
Assignee | ||
Comment 6•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #2)
> (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
> >
> > That sounds like we're calling into Telemetry *really* late, though;
> > histograms aren't freed until very very late in the shutdown process...or at
> > least they didn't used to be...
>
>
> The first one looks to be *early*, perhaps profilemanager? thread 0 is in
> XRE_main()
> /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/
> NtCreateUserProcess().
>
> The second one I link to is crashing in Add() to the array (and has >70
> active threads).
>
> Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the
> first - in LaunchChild(), with a smallish number of threads active, and
> short uptimes (6, 24, etc). There are some others like the second (which
> had uptime = 15519)
>
> Hmmmm I wonder if it *is* the switch to the real profile
The call sequence seen in at least half these (and most of the 0x5a's) is LaunchChild() called within "if (appInitiatedRestart) {"
So, this would correlate strongly with installing updates and restarting.
This gets set if XRE_mainRun() returns a "please restart" result, so at this point most or all threads should be shut down (and certainly all nsThreads). Clearly, however, this is not the case. We're launching a new child, but haven't exited yet, and in the meantime other threads may still be doing stuff. And clearly a bunch of threads are still running. Some are OS things (darn, can't see the thread names). Some are not:
thread 11 was XPCJSRuntime.cpp in WatchdogMain() - and interestingly, it's in a Sleep() (really a CondVar wait) inside an if (!self->ShuttingDown).
Also the sandbox broker service, the toolkit nsTerminator (which may make sense). There are ~19 threads though; most unclear what they are.
So at a minimum we'd want to either not clear out the telemetry array ever or block usage of it after it's freed (which likely would involve a lock/etc). We also need to figure out why the DNS Host Resolver is still running in shutdown. And the other crashes strongly imply the lack of locking here is causing Add() to fail due to (presumed) races and crash with randomish addresses (not 0x5a, but not null ptrs, and a number are EXEC crashes)
cc: mcmanus & jduell because the HostResolver is part of the problem here.
(In reply to Georg Fritzsche [:gfritzsche] from comment #5)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #4)
> > (In reply to Randell Jesup [:jesup] from comment #0)
> > > The only good thing is that few of the crashes are on Release since most
> > > telemetry is disabled on Release (IIRC). But not all....
> >
> > Telemetry will default to enabled on release soon (41 or 42), but we will
> > only record from a limited set of probes.
>
> Meaning that default-on on release should be not - or much less - affected.
Sure, but it'd be hard to be sure none of them can trigger the UAFs. Also, as mentioned, some of these are EXEC crashes not-at-shutdown, which can be especially dangerous.
> The breakdown for the uptime range here puts about half of the crashes in
> the first minute:
> https://crash-stats.mozilla.com/report/
> list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATe
> lemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29
>
> Note that in absolute numbers per release, it is not ranking particularly
> high.
Sure, but if made exploitable that doesn't matter (much). The restart one would likely be very hard to exploit, though. The other 1/2 are more worrying.
Assignee | ||
Comment 7•9 years ago
|
||
Who owns this code and can dig down (especially on the non-restart crashers)? sec-crit that's going to be active in 41 or maybe 42...
Flags: needinfo?(gfritzsche)
Comment 8•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #2)
> (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1)
> > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565.
> >
> > That sounds like we're calling into Telemetry *really* late, though;
> > histograms aren't freed until very very late in the shutdown process...or at
> > least they didn't used to be...
>
>
> The first one looks to be *early*, perhaps profilemanager? thread 0 is in
> XRE_main()
> /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/
> NtCreateUserProcess().
>
> The second one I link to is crashing in Add() to the array (and has >70
> active threads).
Can we narrow down where exactly? Somewhere in Histogram::Add()?
This would kill the StatisticsRecorder (which holds histograms) before the |if (appInitiatedRestart)| handling:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.cc#1200
If we are not running into any leak detection issues, we should be fine with just leaving the StatisticsRecorder around (at least until we find out why other threads are still writing into it wildly).
Alternatively we could skip histogram writes after shutdown - we are not using the data anymore anyway.
Flags: needinfo?(gfritzsche)
Comment 9•9 years ago
|
||
The StatisticsRecorder allocation doesn't explain the very early crash though, this location should be early enough and the StatisticsRecorder seems to handle calls before initialization ok:
https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350
Assignee | ||
Comment 10•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #9)
> The StatisticsRecorder allocation doesn't explain the very early crash
> though, this location should be early enough and the StatisticsRecorder
> seems to handle calls before initialization ok:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350
Later analysis showed that the "early" crashes really are "restart" crashes, while trying to start the new process, and after shutting down/exiting XRE_main. At least some of those are tied to threads still trying to log Telemetry *after* XRE_main has exited (i.e. thread leaks) - but it shows that the telemetry framework is vulnerable to stuff getting freed and used later.
The other group (non-restart) are more concerning, since the restart ones are likely *very* hard to attack. But never say never...
Comment 11•9 years ago
|
||
These ones look like they may not be startup or shutdown related:
https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads
Comment 12•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> This would kill the StatisticsRecorder (which holds histograms) before the
> |if (appInitiatedRestart)| handling:
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
> https://dxr.mozilla.org/mozilla-central/rev/
> 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.
> cc#1200
So those seem pretty clear.
I don't really see what the exact issue is with the other ones, e.g. comment 11 - dmajor, do you have any tips/input on the issue here?
Flags: needinfo?(dmajor)
Comment 13•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #12)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #8)
> > This would kill the StatisticsRecorder (which holds histograms) before the
> > |if (appInitiatedRestart)| handling:
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416
> > https://dxr.mozilla.org/mozilla-central/rev/
> > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.
> > cc#1200
>
> So those seem pretty clear.
>
> I don't really see what the exact issue is with the other ones, e.g. comment
> 11 - dmajor, do you have any tips/input on the issue here?
E.g. i'm wondering:
* whether we can pinpoint more exactly what we are crashing on and
* whether we have something to reason that the others are during shutdown too
Comment 14•9 years ago
|
||
> https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
There's nothing wrong with that one. (It was the other process that failed, and we just happened to be in Telemetry::Accumulate at the time that we snapshotted both processes.)
> https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads
This one has a dead Histogram object. The crash happens when trying to access the vtable for the BucketIndex call inside Histogram::Add.
The main thread is in LaunchChild. Perhaps the comment "// Restart the app after XPCOM has been shut down cleanly." is relevant? Do we shut down telemetry as part of that? http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/xre/nsAppRunner.cpp#l4208
Flags: needinfo?(dmajor)
Comment 15•9 years ago
|
||
Tracking for 41+;leaving 40 for Sylvestre to decide.
tracking-firefox40:
--- → ?
tracking-firefox41:
--- → +
tracking-firefox42:
--- → +
tracking-firefox43:
--- → +
tracking-firefox-esr31:
--- → ?
Assignee | ||
Comment 16•9 years ago
|
||
(In reply to David Major [:dmajor] from comment #14)
> > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
>
> There's nothing wrong with that one. (It was the other process that failed,
> and we just happened to be in Telemetry::Accumulate at the time that we
> snapshotted both processes.)
Is there a way to separate those out? I'll note we have a bunch of crashes in this vsync code (perhaps it's hitting that code *a lot*.
> > https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads
>
> This one has a dead Histogram object. The crash happens when trying to
> access the vtable for the BucketIndex call inside Histogram::Add.
>
> The main thread is in LaunchChild. Perhaps the comment "// Restart the app
> after XPCOM has been shut down cleanly." is relevant? Do we shut down
> telemetry as part of that?
> http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/
> xre/nsAppRunner.cpp#l4208
Yes; we've clearly found a case where update-and-restart can UAF (these are ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff mentioned above; however it also highlights we have some cases where threads (HostResolver for example) are escaping getting shut down properly (perhaps due to the 1-second sleep timers).
Others to check for validity:
https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9-0d72e2150815#allthreads
(https://crash-stats.mozilla.com/report/index/8723432f-f927-4e7d-9f4d-5a4d52150811)
https://crash-stats.mozilla.com/search/?signature=~Telemetry%3A%3A&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
is a search for all that crashed in Telemetry::
/me wishes for a "search crashstats for X in the crashing thread's backtrace"
Flags: needinfo?(dmajor)
Comment 17•9 years ago
|
||
Try push to see if just leaking the StatisticsRecorder triggers anything on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c
Comment 18•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #16)
> Yes; we've clearly found a case where update-and-restart can UAF (these are
> ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff
> mentioned above; however it also highlights we have some cases where threads
> (HostResolver for example) are escaping getting shut down properly (perhaps
> due to the 1-second sleep timers).
Do we actually have a defined time where those should be shutdown?
Should any of those still be active after XRE_mainRun() [0]?
If not we will just make those Telemetry crashes go away and they might still run into undefined behavior elsewhere.
0: https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4389
Assignee | ||
Comment 19•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #18)
> (In reply to Randell Jesup [:jesup] from comment #16)
> > Yes; we've clearly found a case where update-and-restart can UAF (these are
> > ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff
> > mentioned above; however it also highlights we have some cases where threads
> > (HostResolver for example) are escaping getting shut down properly (perhaps
> > due to the 1-second sleep timers).
>
> Do we actually have a defined time where those should be shutdown?
> Should any of those still be active after XRE_mainRun() [0]?
> If not we will just make those Telemetry crashes go away and they might
> still run into undefined behavior elsewhere.
nsThreads should be all gone by end of xpcom-shutdown-threads, certainly before this. I have a patch (unlanded) that asserts on leaked nsThreads I need to un-bitrot; bsmedberg wasn't ready to land it a while back, but we used it to flag a bunch of leaks and kill them. Bug 988464; I have an un-birotted version (not on the bug right now) I'm going to push through another Try run.
pthreads can leak past there, but need to make sure that they shouldn't do anything... especially when we return from main(). Best not to leak them either... In the past there were webrtc pthread leaks that did periodic processing that would occasionally hit a null vtable entry during post-main atexit processing (just because a timer woke them up).
Comment 20•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #16)
> (In reply to David Major [:dmajor] from comment #14)
> > > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads
> >
> > There's nothing wrong with that one. (It was the other process that failed,
> > and we just happened to be in Telemetry::Accumulate at the time that we
> > snapshotted both processes.)
>
> Is there a way to separate those out? I'll note we have a bunch of crashes
> in this vsync code (perhaps it's hitting that code *a lot*.
I guess you can ignore anything with an ipc_channel_error annotation. Bug 1186092 may make things better eventually.
Comment 21•9 years ago
|
||
> Others to check for validity:
> https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9-
> 0d72e2150815#allthreads
That one I don't understand; the registers in the minidump don't agree with the reported exception address.
I don't think looking at this across a bunch of different variations is going to give you anything actionable. It's entirely possible that there are random unrelated crashes mixed in. I'd say find a couple stacks that are most prominent and go after those.
Flags: needinfo?(dmajor)
Assignee | ||
Comment 22•9 years ago
|
||
This one has no ipc_channel_error annotation, it's an illegal instruction, and no indication it's in shutdown:
https://crash-stats.mozilla.com/report/index/58ace115-56f3-4ebf-b8f2-3a4bb2150813#allthreads
I do think some sort of fix for the statisticsrecorder issue is a big part of the answer here. I have a try running for bug 988464, which may help with other aspects. We should work to land some sort of fix for the primary crash here, and also probably address the HostResolver running-that-late issue (which can spin to a separate bug).
I do suspect from the set of crashes we see (such as the one above) that there are some more bugs here (as dmajor suggests), but if not, great. I *do* want us to carefully document and vet the thread-safety aspects of this code! If we don't (even if it's "all safe" today), we've left a grenade under the furniture that the next person to rearrange it will uncover.
Comment 23•9 years ago
|
||
So, the try push is fine, except some leaks that we can annotate/suppress.
However, the StatisticsRecorder had issues with crashes late in shutdown too per bug 1100501.
We need to avoid what bug 1100501, comment 135 describes.
To temporarily fix this issue (until the run-away threads are fixed), we'd need to find a point where those runaway threads should be done, but before jemalloc enters a shutdown state.
Alternatively public Telemetry accumulation functions should all be protected by checks on sTelemetry [0], but nulling that still leaves race potential.
0: https://dxr.mozilla.org/mozilla-central/rev/d5cf4e7900df6b2351bf3677b49fb70bedf68b99/toolkit/components/telemetry/Telemetry.cpp#744
Comment 24•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #22)
> and also probably address the HostResolver running-that-late
> issue (which can spin to a separate bug).
I filed bug 1196237 for the HostResolver.
Updated•9 years ago
|
Comment 25•9 years ago
|
||
Sorry guys, I'm late to this bug, my mail filters were hiding bugmail form this bug.
I agree we need to document and verify Telemetry threading behaviour. My team is really busy right now but I could document threading next week. Someone else will need to head up fixing the issues found so far, I'll help as much as I can.
Comment 26•9 years ago
|
||
Lowering severity slightly because races at startup or shutdown--the majority of these--would be very hard to turn into a useful exploit. With e10s and workers and telemetry leaking into Release it's still quite plausible there's an exploit lurking.
Keywords: sec-critical → sec-high
Comment 27•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #6)
> Sure, but it'd be hard to be sure none of them can trigger the UAFs. Also,
> as mentioned, some of these are EXEC crashes not-at-shutdown, which can be
> especially dangerous.
These appear to be the same crash. EXCEPTION_ACCESS_VIOLATION_READ at 0x5a5a5a8e is the UAF while the memory has been poisoned and not yet re-used, EXCEPTION_ACCESS_VIOLATION_EXEC is when the UAF points at memory that _has_ been reused and we're interpreting part of it as a function pointer. From my POV they're the same (that is, potentially exploitable). This kind of poisoning is good for us because it forces more crashes so we can find and fix the UAFs, but it's not protecting the user.
Assignee | ||
Comment 28•9 years ago
|
||
While likely a separate issue from the shutdown crash which this is mostly about, some of the signatures pulled up by this are definitely non-shutdown crashes with writes to non-nullptr addresses in ::Accumulate():
cec7adac-303a-4e6f-9e78-fce142150820
(https://crash-stats.mozilla.com/report/index/cec7adac-303a-4e6f-9e78-fce142150820#allthreads)
and
c05545b5-cc4f-47f7-bb89-129782150817
(https://crash-stats.mozilla.com/report/index/c05545b5-cc4f-47f7-bb89-129782150817#allthreads)
and there are more. These do not appear to be IPC-shutdown-invoked crashes (no indication in the metadata about ipc failure).
A close look at how those crashed *might* give clues to what's clearly another hole in the Telemetry code here, and one that's hit (though not often) during normal operation (not shutdown). Likely those should spin off to a separate bug and leave this one on the shutdown issue, though I suspect both are fallout from running this multithreaded code lock-free (and Atomic-free)
Comment 29•9 years ago
|
||
Benjamin, can you help find an owner for this bug (and its potential spin-off)? Thanks.
Flags: needinfo?(benjamin)
Assignee | ||
Comment 30•9 years ago
|
||
gfritzsche - You seem to have a patch for the base portion of this. Can you take it?
Flags: needinfo?(gfritzsche)
Updated•9 years ago
|
Group: core-security → toolkit-core-security
Comment 31•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #30)
> gfritzsche - You seem to have a patch for the base portion of this. Can you
> take it?
The patch doesn't take it all the way, otherwise i'd be happy to get this going.
See comment 23 for the issue against this patch.
The other problem is that - for the shutdown path - it only moves the whole issue to a later time.
If we still have runaway threads then they might trigger undefined behavior elsewhere too.
So i'm torn on this bug here. Anyone have a good/clear idea on this?
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 32•9 years ago
|
||
IIRC Patrick fixed the nsHostResolver thread problem that caused most of the crashes when we freed this, but there are other leaked threads (I'm slowly nailing them in another set of bugs, so I can land a patch to make nsThread leakage assert).
Note that the biggest crash cause is restart, and we crash while sitting launching the new process. That shouldn't run afoul of jemalloc shutdown.
I'd say let's go ahead with this. It certainly won't hurt.
Comment 33•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #32)
> IIRC Patrick fixed the nsHostResolver thread problem that caused most of the
> crashes when we freed this, but there are other leaked threads (I'm slowly
> nailing them in another set of bugs, so I can land a patch to make nsThread
> leakage assert).
Great, that sounds good.
> Note that the biggest crash cause is restart, and we crash while sitting
> launching the new process. That shouldn't run afoul of jemalloc shutdown.
Ah, you're thinking of keeping that storage alive past that launch point?
Sure, that's no problem.
Comment 34•9 years ago
|
||
Attachment #8656664 -
Flags: review?(rjesup)
Updated•9 years ago
|
Assignee: nobody → gfritzsche
Status: NEW → ASSIGNED
Comment 35•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea8415e524ed
Note that i'm out on PTO from tomorrow afternoon.
Assignee | ||
Comment 36•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #33)
> (In reply to Randell Jesup [:jesup] from comment #32)
> > Note that the biggest crash cause is restart, and we crash while sitting
> > launching the new process. That shouldn't run afoul of jemalloc shutdown.
>
> Ah, you're thinking of keeping that storage alive past that launch point?
> Sure, that's no problem.
So moving it to after the relaunch wasn't exactly my suggestion (though it will certainly help); I was responding to your comment that we need to deal with comment 23, which links to a comment about use after jemalloc-shutdown. I was suggesting we do what you had mentioned earlier - leak the object and add it to suppressions. I'd add a "and add some debug MOZ_ASSERTS if possible to flag use of Accumulate/etc in late shutdown (check a global if need be)." to the wish-list
Assignee | ||
Comment 37•9 years ago
|
||
Comment on attachment 8656664 [details] [diff] [review]
Move StatisticsRecorder cleanup after app relaunch
Review of attachment 8656664 [details] [diff] [review]:
-----------------------------------------------------------------
r+ because it's an improvement.
::: toolkit/xre/nsAppRunner.cpp
@@ +4449,5 @@
> return rv == NS_ERROR_LAUNCHED_CHILD_PROCESS ? 0 : 1;
> }
>
> + mStatisticsRecorder = nullptr;
> +
better than what we have, for sure. But still we should look at just removing this and suppressing the leak.
Attachment #8656664 -
Flags: review?(rjesup) → review+
Comment 38•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #36)
> (In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment
> #33)
> > (In reply to Randell Jesup [:jesup] from comment #32)
> > > Note that the biggest crash cause is restart, and we crash while sitting
> > > launching the new process. That shouldn't run afoul of jemalloc shutdown.
> >
> > Ah, you're thinking of keeping that storage alive past that launch point?
> > Sure, that's no problem.
>
> So moving it to after the relaunch wasn't exactly my suggestion (though it
> will certainly help); I was responding to your comment that we need to deal
> with comment 23, which links to a comment about use after jemalloc-shutdown.
> I was suggesting we do what you had mentioned earlier - leak the object and
> add it to suppressions. I'd add a "and add some debug MOZ_ASSERTS if
> possible to flag use of Accumulate/etc in late shutdown (check a global if
> need be)." to the wish-list
Ok, i guess the straight leaking would work. Sorry, i don't have time for figuring out the suppressions anymore before i leave, could look into that when i'm back.
What about landing the reviewed patch here, should we be stealthy about landing it?
Comment 39•9 years ago
|
||
FWIW, a trivial leaking version without supressions was in this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c
Seems too late to take a fix in 41 for this. If a patch is ready for uplift in the next 24 hrs, please let me know and we can reconsider.
Assignee | ||
Comment 41•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da814562fe6 looks good, with the leak suppressed (a 1-byte leak...)
Assignee | ||
Comment 42•9 years ago
|
||
gfritzche's patch, r=jesup. I added the suppression
Attachment #8659078 -
Flags: review+
Assignee | ||
Comment 43•9 years ago
|
||
Comment on attachment 8659078 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Does this suppress everything allocated by XRE_main only, or stuff allocated by anything called from it? (Hopefully the first -- and I presume this suppresses all leaks directly allocated in that function, so in theory it could suppress other real leaks, not that I care in this case.)
Attachment #8659078 -
Flags: review?(continuation)
Comment 44•9 years ago
|
||
Comment on attachment 8659078 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Review of attachment 8659078 [details] [diff] [review]:
-----------------------------------------------------------------
::: toolkit/xre/nsAppRunner.cpp
@@ +3090,5 @@
> nsCOMPtr<nsIRemoteService> mRemoteService;
> #endif
>
> UniquePtr<ScopedXPCOMStartup> mScopedXPCOM;
> + base::StatisticsRecorder* mStatisticsRecorder;
Considering nothing is actually reading mStatisticsRecorder, you could make it a local variable in XRE_main (*not* XREMain::XRE_main), would would allow you to leak it and limit the effects of the lsan suppression, since it would then apply to XRE_main, not XREMain::XRE_main.
Assignee | ||
Comment 45•9 years ago
|
||
Attachment #8659095 -
Flags: review?(mh+mozilla)
Attachment #8659095 -
Flags: review?(continuation)
Comment 46•9 years ago
|
||
Comment on attachment 8659095 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Review of attachment 8659095 [details] [diff] [review]:
-----------------------------------------------------------------
copy/paste from irc:
> that's what I was suggesting, yes
> although, I just had a new nasty idea, which may of may not be worth if we have more of this "intentionally leak" scheme
> my nasty idea being something like UniquePtr, but that would be LeakedPtr. There would be a MakeLeaked<> template like there is MakeUnique, and then we could possily generically lsan-suppress all things leaked from MakeLeaked<*>
Attachment #8659095 -
Flags: review?(mh+mozilla) → review+
Updated•9 years ago
|
Attachment #8659095 -
Flags: review?(continuation) → review-
Updated•9 years ago
|
Attachment #8659078 -
Flags: review?(continuation) → review-
Comment 47•9 years ago
|
||
That will match any stack that contains XRE_main in any frame. What you want is to call MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT on the object you are intentionally leaking, which is what is described in comment 46.
Assignee | ||
Comment 48•9 years ago
|
||
Assignee: gfritzsche → rjesup
Attachment #8659078 -
Attachment is obsolete: true
Attachment #8659095 -
Attachment is obsolete: true
Attachment #8659304 -
Flags: review?(continuation)
Comment 49•9 years ago
|
||
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report
Review of attachment 8659304 [details] [diff] [review]:
-----------------------------------------------------------------
r=me for the LEAK_OBJECT stuff.
::: toolkit/xre/nsAppRunner.cpp
@@ +4471,5 @@
> + // A initializer to initialize histogram collection, a chromium
> + // thing used by Telemetry (and effectively a global; it's all static).
> + // Note: purposely leaked
> + base::StatisticsRecorder* statistics_recorder = new base::StatisticsRecorder();
> + MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder);
You should #include mozilla/MemoryChecking.h explicitly in this file, if it isn't there already.
Attachment #8659304 -
Flags: review?(continuation) → review+
Assignee | ||
Comment 50•9 years ago
|
||
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report
Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: longer exposure (though low risk of that being exploited)
[Describe test coverage new/current, TreeHerder]: Shutdown is tested.
[Risks and why]: very low risk
[String/UUID change made/needed]: none
[Security approval request comment]
How easily could an exploit be constructed based on the patch? Highly unlikely
Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? Somewhat - clear we're leaking, so one could look for why that's important.
Which older supported branches are affected by this flaw? all
If not all supported branches, which bug introduced the flaw?
Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Trivial backporting.
How likely is this patch to cause regressions; how much testing does it need? No likelihood of regressions. No significant testing needed. At most, do some update-and-restarts.
Attachment #8659304 -
Flags: sec-approval?
Attachment #8659304 -
Flags: approval-mozilla-aurora?
Comment 51•9 years ago
|
||
Comment on attachment 8659304 [details] [diff] [review]
fooPurposely leak StatisticsReport object and suppress the leak report
a=dveditz for sec-approval and aurora landing.
Attachment #8659304 -
Flags: sec-approval?
Attachment #8659304 -
Flags: sec-approval+
Attachment #8659304 -
Flags: approval-mozilla-aurora?
Attachment #8659304 -
Flags: approval-mozilla-aurora+
Assignee | ||
Comment 52•9 years ago
|
||
Assignee | ||
Comment 53•9 years ago
|
||
Note: this code is still a mess thread-safety-wise, at least documenting it and TSAN.
Assignee | ||
Comment 54•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a09b707e39b (needed unused << statistics_recorder)
Comment 55•9 years ago
|
||
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/56f6468c7e60 because suppressing the very generic leak in Valgrind made us nervous.
Assignee | ||
Comment 56•9 years ago
|
||
Tested with this patch (https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf91734da97f), and with a patch to introduce a leak in something called from a deeper function to ensure reporting is still working (https://treeherder.mozilla.org/#/jobs?repo=try&revision=7cac87e190aa)
Attachment #8660398 -
Flags: review?(n.nethercote)
Assignee | ||
Comment 57•9 years ago
|
||
Attachment #8659304 -
Attachment is obsolete: true
Attachment #8660399 -
Flags: review+
Comment 58•9 years ago
|
||
Comment on attachment 8660398 [details] [diff] [review]
Suppress leak in valgrind as well
Review of attachment 8660398 [details] [diff] [review]:
-----------------------------------------------------------------
::: build/valgrind/cross-architecture.sup
@@ +39,5 @@
> + Memcheck:Leak
> + fun:malloc
> + fun:moz_xmalloc
> + fun:operator new
> + fun:XRE_main
I have no context but this is a worryingly generic suppression. I suggest you create a separate function that does nothing but allocate the StatisticsReporter and call it from main(). That way there would be, for example, a "fun:LeakyStatisticsReporterAllocator" entry between the "fun:operator new" and the "fun:XRE_main", which would make things clearer and less generic.
Attachment #8660398 -
Flags: review?(n.nethercote)
Assignee | ||
Comment 59•9 years ago
|
||
This suppression should only over allocations directly from XRE_main(), which is a very simple function. Before the change it was basically:
int XRE_main(int argc, char* argv[], const nsXREAppData* aAppData, uint32_t aFlags)
{
XREMain main;
int result = main.XRE_main(argc, argv, aAppData);
mozilla::RecordShutdownEndTimeStamp();
return result;
}
The suppression is limited to allocations with new call directly from XRE_main. I tested that a leak in main.XRE_main() is reported. And I was hesitant to do the stub-function trick (though I coded that as well) because the compiler might inline it given how simple it is.
All that said, it's trivial to dump it into a stub function if you prefer.
Comment 60•9 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #59)
> All that said, it's trivial to dump it into a stub function if you prefer.
That would save a future person looking at the valgrind suppressions from having the same reaction as me :) Thank you.
Updated•9 years ago
|
Attachment #8660398 -
Flags: review+
Assignee | ||
Comment 61•9 years ago
|
||
Green try with separate stub: https://treeherder.mozilla.org/#/jobs?repo=try&revision=23cab817081b
and a red one with a deliberate leak elsewhere; carrying forward r=njn
Assignee | ||
Comment 62•9 years ago
|
||
Comment 63•9 years ago
|
||
Georg, are you planning to request an uplift to 42? Thanks
Flags: needinfo?(gfritzsche)
Comment 64•9 years ago
|
||
Moving to jesup who should have a better overview here now :)
Flags: needinfo?(gfritzsche)
Assignee | ||
Comment 65•9 years ago
|
||
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Approval Request Comment
[Feature/regressing bug #]: N/A
[User impact if declined]: hard-to-exploit UAF in shutdown, typically when doing a restart installing an update.
[Describe test coverage new/current, TreeHerder]: extremely hard repeating the bug to test in automation, or even manually. Normal shutdown is well tested.
[Risks and why]: Almost none. Just leaks what is effectively a global.
[String/UUID change made/needed]: none
Attachment #8660399 -
Flags: approval-mozilla-aurora?
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Updated•9 years ago
|
Group: toolkit-core-security → core-security-release
Comment 67•9 years ago
|
||
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Let's take it.
Attachment #8660399 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•9 years ago
|
tracking-firefox-esr31:
? → ---
Comment 68•9 years ago
|
||
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
[Triage Comment]
42 is now in beta.
Attachment #8660399 -
Flags: approval-mozilla-aurora+ → approval-mozilla-beta+
Comment 69•9 years ago
|
||
Target Milestone: --- → mozilla43
Updated•9 years ago
|
Flags: needinfo?(benjamin)
Updated•9 years ago
|
Whiteboard: [post-critsmash-triage]
Comment 70•9 years ago
|
||
Randell, do you want to request uplift for this to ESR? Thanks.
Flags: needinfo?(rjesup)
Assignee | ||
Comment 71•9 years ago
|
||
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
See Aurora approval request
Flags: needinfo?(rjesup)
Attachment #8660399 -
Flags: approval-mozilla-esr38?
Comment 72•9 years ago
|
||
Comment on attachment 8660399 [details] [diff] [review]
Purposely leak StatisticsReport object and suppress the leak report
Fix for shutdown and restart leak. Approved for esr-38
Attachment #8660399 -
Flags: approval-mozilla-esr38? → approval-mozilla-esr38+
Comment 73•9 years ago
|
||
Updated•9 years ago
|
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main42+][adv-esr38.4+]
Updated•8 years ago
|
Group: core-security-release
You need to log in
before you can comment on or make changes to this bug.
Description
•