1193038 - UAF in Telemetry::Accumulate(); code appears to not be thread-safe

Assignee

Description

•

9 years ago

Repeated crashes in the field with 0x5a signatures. Called from Telemetry::Accumulate(Telemetry::DNS_FAILED_LOOKUP_TIME, millis); https://crash-stats.mozilla.com/report/index/aabb2399-64d7-4ec7-a888-027b02150805 There is no obvious mention anywhere in the Telemetry code/includes about the thread-safety aspects; thus one should presume it's callable from any thread (and in this case is, from a DNS resolver thread, STS thread, etc). However, I see no locking around access to shared telemetry structures, which presumably led to this set of crashes. I would imagine it's also prone to TSAN issues. The only good thing is that few of the crashes are on Release since most telemetry is disabled on Release (IIRC). But not all.... Others include Telemetry::Accumulate(Telemetry::STS_NUMBER_OF_ONSOCKETREADY_CALLS, numberOfOnSocketReadyCalls) (https://crash-stats.mozilla.com/report/index/c77f5b7c-cc39-44d3-9664-030ff2150807) More crashes, all ugly: https://crash-stats.mozilla.com/report/list?product=Firefox&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

9 years ago

Component: Untriaged → Telemetry

Product: Core → Toolkit

Nathan Froyd [:froydnj]

Comment 1

•

9 years ago

Just to confirm: this sort of thing does show up in TSan, cf bug 1141565. That sounds like we're calling into Telemetry *really* late, though; histograms aren't freed until very very late in the shutdown process...or at least they didn't used to be...

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 2

•

9 years ago

(In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1) > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565. > > That sounds like we're calling into Telemetry *really* late, though; > histograms aren't freed until very very late in the shutdown process...or at > least they didn't used to be... The first one looks to be *early*, perhaps profilemanager? thread 0 is in XRE_main() /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/NtCreateUserProcess(). The second one I link to is crashing in Add() to the array (and has >70 active threads). Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the first - in LaunchChild(), with a smallish number of threads active, and short uptimes (6, 24, etc). There are some others like the second (which had uptime = 15519) Hmmmm I wonder if it *is* the switch to the real profile

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 3

•

9 years ago

Also, bug 1142079 smells like wallpapering over a race like the Add() race (second example above).

Georg Fritzsche [:gfritzsche]

Comment 4

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #0) > The only good thing is that few of the crashes are on Release since most > telemetry is disabled on Release (IIRC). But not all.... Telemetry will default to enabled on release soon (41 or 42), but we will only record from a limited set of probes.

Georg Fritzsche [:gfritzsche]

Comment 5

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #4) > (In reply to Randell Jesup [:jesup] from comment #0) > > The only good thing is that few of the crashes are on Release since most > > telemetry is disabled on Release (IIRC). But not all.... > > Telemetry will default to enabled on release soon (41 or 42), but we will > only record from a limited set of probes. Meaning that default-on on release should be not - or much less - affected. The breakdown for the uptime range here puts about half of the crashes in the first minute: https://crash-stats.mozilla.com/report/list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATelemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29 Note that in absolute numbers per release, it is not ranking particularly high.

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Keywords: csectype-race

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 6

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #2) > (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1) > > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565. > > > > That sounds like we're calling into Telemetry *really* late, though; > > histograms aren't freed until very very late in the shutdown process...or at > > least they didn't used to be... > > > The first one looks to be *early*, perhaps profilemanager? thread 0 is in > XRE_main() > /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/ > NtCreateUserProcess(). > > The second one I link to is crashing in Add() to the array (and has >70 > active threads). > > Most of the crashes (especially the 0x5a5a5a8e ones) are similar to the > first - in LaunchChild(), with a smallish number of threads active, and > short uptimes (6, 24, etc). There are some others like the second (which > had uptime = 15519) > > Hmmmm I wonder if it *is* the switch to the real profile The call sequence seen in at least half these (and most of the 0x5a's) is LaunchChild() called within "if (appInitiatedRestart) {" So, this would correlate strongly with installing updates and restarting. This gets set if XRE_mainRun() returns a "please restart" result, so at this point most or all threads should be shut down (and certainly all nsThreads). Clearly, however, this is not the case. We're launching a new child, but haven't exited yet, and in the meantime other threads may still be doing stuff. And clearly a bunch of threads are still running. Some are OS things (darn, can't see the thread names). Some are not: thread 11 was XPCJSRuntime.cpp in WatchdogMain() - and interestingly, it's in a Sleep() (really a CondVar wait) inside an if (!self->ShuttingDown). Also the sandbox broker service, the toolkit nsTerminator (which may make sense). There are ~19 threads though; most unclear what they are. So at a minimum we'd want to either not clear out the telemetry array ever or block usage of it after it's freed (which likely would involve a lock/etc). We also need to figure out why the DNS Host Resolver is still running in shutdown. And the other crashes strongly imply the lack of locking here is causing Add() to fail due to (presumed) races and crash with randomish addresses (not 0x5a, but not null ptrs, and a number are EXEC crashes) cc: mcmanus & jduell because the HostResolver is part of the problem here. (In reply to Georg Fritzsche [:gfritzsche] from comment #5) > (In reply to Georg Fritzsche [:gfritzsche] from comment #4) > > (In reply to Randell Jesup [:jesup] from comment #0) > > > The only good thing is that few of the crashes are on Release since most > > > telemetry is disabled on Release (IIRC). But not all.... > > > > Telemetry will default to enabled on release soon (41 or 42), but we will > > only record from a limited set of probes. > > Meaning that default-on on release should be not - or much less - affected. Sure, but it'd be hard to be sure none of them can trigger the UAFs. Also, as mentioned, some of these are EXEC crashes not-at-shutdown, which can be especially dangerous. > The breakdown for the uptime range here puts about half of the crashes in > the first minute: > https://crash-stats.mozilla.com/report/ > list?product=Firefox&range_unit=days&range_value=28&signature=mozilla%3A%3ATe > lemetry%3A%3AAccumulate%28mozilla%3A%3ATelemetry%3A%3AID%2C+unsigned+int%29 > > Note that in absolute numbers per release, it is not ranking particularly > high. Sure, but if made exploitable that doesn't matter (much). The restart one would likely be very hard to exploit, though. The other 1/2 are more worrying.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 7

•

9 years ago

Who owns this code and can dig down (especially on the non-restart crashers)? sec-crit that's going to be active in 41 or maybe 42...

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 8

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #2) > (In reply to Nathan Froyd [:froydnj][:nfroyd] from comment #1) > > Just to confirm: this sort of thing does show up in TSan, cf bug 1141565. > > > > That sounds like we're calling into Telemetry *really* late, though; > > histograms aren't freed until very very late in the shutdown process...or at > > least they didn't used to be... > > > The first one looks to be *early*, perhaps profilemanager? thread 0 is in > XRE_main() > /LaunchChild()/WinLoanchChild()/CreasteProcessW()/CreateProcessInternalW()/ > NtCreateUserProcess(). > > The second one I link to is crashing in Add() to the array (and has >70 > active threads). Can we narrow down where exactly? Somewhere in Histogram::Add()? This would kill the StatisticsRecorder (which holds histograms) before the |if (appInitiatedRestart)| handling: https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416 https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram.cc#1200 If we are not running into any leak detection issues, we should be fine with just leaving the StatisticsRecorder around (at least until we find out why other threads are still writing into it wildly). Alternatively we could skip histogram writes after shutdown - we are not using the data anymore anyway.

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 9

•

9 years ago

The StatisticsRecorder allocation doesn't explain the very early crash though, this location should be early enough and the StatisticsRecorder seems to handle calls before initialization ok: https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 10

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #9) > The StatisticsRecorder allocation doesn't explain the very early crash > though, this location should be early enough and the StatisticsRecorder > seems to handle calls before initialization ok: > https://dxr.mozilla.org/mozilla-central/rev/ > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4350 Later analysis showed that the "early" crashes really are "restart" crashes, while trying to start the new process, and after shutting down/exiting XRE_main. At least some of those are tied to threads still trying to log Telemetry *after* XRE_main has exited (i.e. thread leaks) - but it shows that the telemetry framework is vulnerable to stuff getting freed and used later. The other group (non-restart) are more concerning, since the restart ones are likely *very* hard to attack. But never say never...

Georg Fritzsche [:gfritzsche]

Comment 11

•

9 years ago

These ones look like they may not be startup or shutdown related: https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads

Georg Fritzsche [:gfritzsche]

Comment 12

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #8) > This would kill the StatisticsRecorder (which holds histograms) before the > |if (appInitiatedRestart)| handling: > https://dxr.mozilla.org/mozilla-central/rev/ > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416 > https://dxr.mozilla.org/mozilla-central/rev/ > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram. > cc#1200 So those seem pretty clear. I don't really see what the exact issue is with the other ones, e.g. comment 11 - dmajor, do you have any tips/input on the issue here?

Flags: needinfo?(dmajor)

Georg Fritzsche [:gfritzsche]

Comment 13

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #12) > (In reply to Georg Fritzsche [:gfritzsche] from comment #8) > > This would kill the StatisticsRecorder (which holds histograms) before the > > |if (appInitiatedRestart)| handling: > > https://dxr.mozilla.org/mozilla-central/rev/ > > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4416 > > https://dxr.mozilla.org/mozilla-central/rev/ > > 0876695d1abdeb363a780bda8b6cc84f20ba51c9/ipc/chromium/src/base/histogram. > > cc#1200 > > So those seem pretty clear. > > I don't really see what the exact issue is with the other ones, e.g. comment > 11 - dmajor, do you have any tips/input on the issue here? E.g. i'm wondering: * whether we can pinpoint more exactly what we are crashing on and * whether we have something to reason that the others are during shutdown too

(Away)

Comment 14

•

9 years ago

> https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads There's nothing wrong with that one. (It was the other process that failed, and we just happened to be in Telemetry::Accumulate at the time that we snapshotted both processes.) > https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads This one has a dead Histogram object. The crash happens when trying to access the vtable for the BucketIndex call inside Histogram::Add. The main thread is in LaunchChild. Perhaps the comment "// Restart the app after XPCOM has been shut down cleanly." is relevant? Do we shut down telemetry as part of that? http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/xre/nsAppRunner.cpp#l4208

Flags: needinfo?(dmajor)

Liz Henry (:lizzard) (relman/hg->git project)

Comment 15

•

9 years ago

Tracking for 41+;leaving 40 for Sylvestre to decide.

tracking-firefox40: --- → ?

tracking-firefox41: --- → +

tracking-firefox42: --- → +

tracking-firefox43: --- → +

tracking-firefox-esr31: --- → ?

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 16

•

9 years ago

(In reply to David Major [:dmajor] from comment #14) > > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads > > There's nothing wrong with that one. (It was the other process that failed, > and we just happened to be in Telemetry::Accumulate at the time that we > snapshotted both processes.) Is there a way to separate those out? I'll note we have a bunch of crashes in this vsync code (perhaps it's hitting that code *a lot*. > > https://crash-stats.mozilla.com/report/index/c3fed497-7244-4607-9115-0fd132150817#allthreads > > This one has a dead Histogram object. The crash happens when trying to > access the vtable for the BucketIndex call inside Histogram::Add. > > The main thread is in LaunchChild. Perhaps the comment "// Restart the app > after XPCOM has been shut down cleanly." is relevant? Do we shut down > telemetry as part of that? > http://hg.mozilla.org/releases/mozilla-beta/annotate/cc69ace5aa5b/toolkit/ > xre/nsAppRunner.cpp#l4208 Yes; we've clearly found a case where update-and-restart can UAF (these are ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff mentioned above; however it also highlights we have some cases where threads (HostResolver for example) are escaping getting shut down properly (perhaps due to the 1-second sleep timers). Others to check for validity: https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9-0d72e2150815#allthreads (https://crash-stats.mozilla.com/report/index/8723432f-f927-4e7d-9f4d-5a4d52150811) https://crash-stats.mozilla.com/search/?signature=~Telemetry%3A%3A&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature is a search for all that crashed in Telemetry:: /me wishes for a "search crashstats for X in the crashing thread's backtrace"

Flags: needinfo?(dmajor)

Georg Fritzsche [:gfritzsche]

Comment 17

•

9 years ago

Try push to see if just leaking the StatisticsRecorder triggers anything on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c

Georg Fritzsche [:gfritzsche]

Comment 18

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #16) > Yes; we've clearly found a case where update-and-restart can UAF (these are > ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff > mentioned above; however it also highlights we have some cases where threads > (HostResolver for example) are escaping getting shut down properly (perhaps > due to the 1-second sleep timers). Do we actually have a defined time where those should be shutdown? Should any of those still be active after XRE_mainRun() [0]? If not we will just make those Telemetry crashes go away and they might still run into undefined behavior elsewhere. 0: https://dxr.mozilla.org/mozilla-central/rev/0876695d1abdeb363a780bda8b6cc84f20ba51c9/toolkit/xre/nsAppRunner.cpp#4389

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 19

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] from comment #18) > (In reply to Randell Jesup [:jesup] from comment #16) > > Yes; we've clearly found a case where update-and-restart can UAF (these are > > ~50% of the telemetry crashes). This would be the StatisticsRecorder stuff > > mentioned above; however it also highlights we have some cases where threads > > (HostResolver for example) are escaping getting shut down properly (perhaps > > due to the 1-second sleep timers). > > Do we actually have a defined time where those should be shutdown? > Should any of those still be active after XRE_mainRun() [0]? > If not we will just make those Telemetry crashes go away and they might > still run into undefined behavior elsewhere. nsThreads should be all gone by end of xpcom-shutdown-threads, certainly before this. I have a patch (unlanded) that asserts on leaked nsThreads I need to un-bitrot; bsmedberg wasn't ready to land it a while back, but we used it to flag a bunch of leaks and kill them. Bug 988464; I have an un-birotted version (not on the bug right now) I'm going to push through another Try run. pthreads can leak past there, but need to make sure that they shouldn't do anything... especially when we return from main(). Best not to leak them either... In the past there were webrtc pthread leaks that did periodic processing that would occasionally hit a null vtable entry during post-main atexit processing (just because a timer woke them up).

(Away)

Comment 20

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #16) > (In reply to David Major [:dmajor] from comment #14) > > > https://crash-stats.mozilla.com/report/index/586a840f-94a5-4699-b978-5214c2150816#allthreads > > > > There's nothing wrong with that one. (It was the other process that failed, > > and we just happened to be in Telemetry::Accumulate at the time that we > > snapshotted both processes.) > > Is there a way to separate those out? I'll note we have a bunch of crashes > in this vsync code (perhaps it's hitting that code *a lot*. I guess you can ignore anything with an ipc_channel_error annotation. Bug 1186092 may make things better eventually.

(Away)

Comment 21

•

9 years ago

> Others to check for validity: > https://crash-stats.mozilla.com/report/index/dcb8cc99-a601-4d61-81a9- > 0d72e2150815#allthreads That one I don't understand; the registers in the minidump don't agree with the reported exception address. I don't think looking at this across a bunch of different variations is going to give you anything actionable. It's entirely possible that there are random unrelated crashes mixed in. I'd say find a couple stacks that are most prominent and go after those.

Flags: needinfo?(dmajor)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 22

•

9 years ago

This one has no ipc_channel_error annotation, it's an illegal instruction, and no indication it's in shutdown: https://crash-stats.mozilla.com/report/index/58ace115-56f3-4ebf-b8f2-3a4bb2150813#allthreads I do think some sort of fix for the statisticsrecorder issue is a big part of the answer here. I have a try running for bug 988464, which may help with other aspects. We should work to land some sort of fix for the primary crash here, and also probably address the HostResolver running-that-late issue (which can spin to a separate bug). I do suspect from the set of crashes we see (such as the one above) that there are some more bugs here (as dmajor suggests), but if not, great. I *do* want us to carefully document and vet the thread-safety aspects of this code! If we don't (even if it's "all safe" today), we've left a grenade under the furniture that the next person to rearrange it will uncover.

Georg Fritzsche [:gfritzsche]

Comment 23

•

9 years ago

So, the try push is fine, except some leaks that we can annotate/suppress. However, the StatisticsRecorder had issues with crashes late in shutdown too per bug 1100501. We need to avoid what bug 1100501, comment 135 describes. To temporarily fix this issue (until the run-away threads are fixed), we'd need to find a point where those runaway threads should be done, but before jemalloc enters a shutdown state. Alternatively public Telemetry accumulation functions should all be protected by checks on sTelemetry [0], but nulling that still leaves race potential. 0: https://dxr.mozilla.org/mozilla-central/rev/d5cf4e7900df6b2351bf3677b49fb70bedf68b99/toolkit/components/telemetry/Telemetry.cpp#744

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Blocks: 1196237

Georg Fritzsche [:gfritzsche]

Comment 24

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #22) > and also probably address the HostResolver running-that-late > issue (which can spin to a separate bug). I filed bug 1196237 for the HostResolver.

Matt Wobensmith [:mwobensmith][:matt:]

Updated

•

9 years ago

status-firefox39: affected → wontfix

status-firefox40: affected → wontfix

tracking-firefox40: ? → ---

Vladan Djeric (:vladan)

Comment 25

•

9 years ago

Sorry guys, I'm late to this bug, my mail filters were hiding bugmail form this bug. I agree we need to document and verify Telemetry threading behaviour. My team is really busy right now but I could document threading next week. Someone else will need to head up fixing the issues found so far, I'll help as much as I can.

Daniel Veditz [:dveditz]

Comment 26

•

9 years ago

Lowering severity slightly because races at startup or shutdown--the majority of these--would be very hard to turn into a useful exploit. With e10s and workers and telemetry leaking into Release it's still quite plausible there's an exploit lurking.

Keywords: sec-critical → sec-high

Daniel Veditz [:dveditz]

Comment 27

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #6) > Sure, but it'd be hard to be sure none of them can trigger the UAFs. Also, > as mentioned, some of these are EXEC crashes not-at-shutdown, which can be > especially dangerous. These appear to be the same crash. EXCEPTION_ACCESS_VIOLATION_READ at 0x5a5a5a8e is the UAF while the memory has been poisoned and not yet re-used, EXCEPTION_ACCESS_VIOLATION_EXEC is when the UAF points at memory that _has_ been reused and we're interpreting part of it as a function pointer. From my POV they're the same (that is, potentially exploitable). This kind of poisoning is good for us because it forces more crashes so we can find and fix the UAFs, but it's not protecting the user.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 28

•

9 years ago

While likely a separate issue from the shutdown crash which this is mostly about, some of the signatures pulled up by this are definitely non-shutdown crashes with writes to non-nullptr addresses in ::Accumulate(): cec7adac-303a-4e6f-9e78-fce142150820 (https://crash-stats.mozilla.com/report/index/cec7adac-303a-4e6f-9e78-fce142150820#allthreads) and c05545b5-cc4f-47f7-bb89-129782150817 (https://crash-stats.mozilla.com/report/index/c05545b5-cc4f-47f7-bb89-129782150817#allthreads) and there are more. These do not appear to be IPC-shutdown-invoked crashes (no indication in the metadata about ipc failure). A close look at how those crashed *might* give clues to what's clearly another hole in the Telemetry code here, and one that's hit (though not often) during normal operation (not shutdown). Likely those should spin off to a separate bug and leave this one on the shutdown issue, though I suspect both are fallout from running this multithreaded code lock-free (and Atomic-free)

Randell Jesup [:jesup] (needinfo me)

Assignee

Updated

•

9 years ago

Comment 29

•

9 years ago

Benjamin, can you help find an owner for this bug (and its potential spin-off)? Thanks.

Flags: needinfo?(benjamin)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 30

•

9 years ago

gfritzsche - You seem to have a patch for the base portion of this. Can you take it?

Flags: needinfo?(gfritzsche)

BMO Automation

Updated

•

9 years ago

Group: core-security → toolkit-core-security

Georg Fritzsche [:gfritzsche]

Comment 31

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #30) > gfritzsche - You seem to have a patch for the base portion of this. Can you > take it? The patch doesn't take it all the way, otherwise i'd be happy to get this going. See comment 23 for the issue against this patch. The other problem is that - for the shutdown path - it only moves the whole issue to a later time. If we still have runaway threads then they might trigger undefined behavior elsewhere too. So i'm torn on this bug here. Anyone have a good/clear idea on this?

Flags: needinfo?(gfritzsche)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 32

•

9 years ago

IIRC Patrick fixed the nsHostResolver thread problem that caused most of the crashes when we freed this, but there are other leaked threads (I'm slowly nailing them in another set of bugs, so I can land a patch to make nsThread leakage assert). Note that the biggest crash cause is restart, and we crash while sitting launching the new process. That shouldn't run afoul of jemalloc shutdown. I'd say let's go ahead with this. It certainly won't hurt.

Georg Fritzsche [:gfritzsche]

Comment 33

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #32) > IIRC Patrick fixed the nsHostResolver thread problem that caused most of the > crashes when we freed this, but there are other leaked threads (I'm slowly > nailing them in another set of bugs, so I can land a patch to make nsThread > leakage assert). Great, that sounds good. > Note that the biggest crash cause is restart, and we crash while sitting > launching the new process. That shouldn't run afoul of jemalloc shutdown. Ah, you're thinking of keeping that storage alive past that launch point? Sure, that's no problem.

Georg Fritzsche [:gfritzsche]

Comment 34

•

9 years ago

Attached patch Move StatisticsRecorder cleanup after app relaunch — Details — Splinter Review

Attachment #8656664 - Flags: review?(rjesup)

Georg Fritzsche [:gfritzsche]

Updated

•

9 years ago

Assignee: nobody → gfritzsche

Status: NEW → ASSIGNED

Georg Fritzsche [:gfritzsche]

Comment 35

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ea8415e524ed Note that i'm out on PTO from tomorrow afternoon.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 36

•

9 years ago

(In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment #33) > (In reply to Randell Jesup [:jesup] from comment #32) > > Note that the biggest crash cause is restart, and we crash while sitting > > launching the new process. That shouldn't run afoul of jemalloc shutdown. > > Ah, you're thinking of keeping that storage alive past that launch point? > Sure, that's no problem. So moving it to after the relaunch wasn't exactly my suggestion (though it will certainly help); I was responding to your comment that we need to deal with comment 23, which links to a comment about use after jemalloc-shutdown. I was suggesting we do what you had mentioned earlier - leak the object and add it to suppressions. I'd add a "and add some debug MOZ_ASSERTS if possible to flag use of Accumulate/etc in late shutdown (check a global if need be)." to the wish-list

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 37

•

9 years ago

Comment on attachment 8656664 [details] [diff] [review] Move StatisticsRecorder cleanup after app relaunch Review of attachment 8656664 [details] [diff] [review]: ----------------------------------------------------------------- r+ because it's an improvement. ::: toolkit/xre/nsAppRunner.cpp @@ +4449,5 @@ > return rv == NS_ERROR_LAUNCHED_CHILD_PROCESS ? 0 : 1; > } > > + mStatisticsRecorder = nullptr; > + better than what we have, for sure. But still we should look at just removing this and suppressing the leak.

Attachment #8656664 - Flags: review?(rjesup) → review+

Georg Fritzsche [:gfritzsche]

Comment 38

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #36) > (In reply to Georg Fritzsche [:gfritzsche] [away Sep 5 - 13] from comment > #33) > > (In reply to Randell Jesup [:jesup] from comment #32) > > > Note that the biggest crash cause is restart, and we crash while sitting > > > launching the new process. That shouldn't run afoul of jemalloc shutdown. > > > > Ah, you're thinking of keeping that storage alive past that launch point? > > Sure, that's no problem. > > So moving it to after the relaunch wasn't exactly my suggestion (though it > will certainly help); I was responding to your comment that we need to deal > with comment 23, which links to a comment about use after jemalloc-shutdown. > I was suggesting we do what you had mentioned earlier - leak the object and > add it to suppressions. I'd add a "and add some debug MOZ_ASSERTS if > possible to flag use of Accumulate/etc in late shutdown (check a global if > need be)." to the wish-list Ok, i guess the straight leaking would work. Sorry, i don't have time for figuring out the suppressions anymore before i leave, could look into that when i'm back. What about landing the reviewed patch here, should we be stealthy about landing it?

Georg Fritzsche [:gfritzsche]

Comment 39

•

9 years ago

FWIW, a trivial leaking version without supressions was in this try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ae61a71843c

Ritu Kothari (:ritu) (Inactive, please n-i to RyanVM, jcristau, or pascal)

Comment 40

•

9 years ago

Seems too late to take a fix in 41 for this. If a patch is ready for uplift in the next 24 hrs, please let me know and we can reconsider.

status-firefox41: affected → wontfix

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 41

•

9 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=9da814562fe6 looks good, with the leak suppressed (a 1-byte leak...)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 42

•

9 years ago

Attached patch Purposely leak StatisticsReport object and suppress the leak report (obsolete) — Details — Splinter Review

gfritzche's patch, r=jesup. I added the suppression

Attachment #8659078 - Flags: review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 43

•

9 years ago

Comment on attachment 8659078 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Does this suppress everything allocated by XRE_main only, or stuff allocated by anything called from it? (Hopefully the first -- and I presume this suppresses all leaks directly allocated in that function, so in theory it could suppress other real leaks, not that I care in this case.)

Attachment #8659078 - Flags: review?(continuation)

Mike Hommey [:glandium]

Comment 44

•

9 years ago

Comment on attachment 8659078 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Review of attachment 8659078 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/xre/nsAppRunner.cpp @@ +3090,5 @@ > nsCOMPtr<nsIRemoteService> mRemoteService; > #endif > > UniquePtr<ScopedXPCOMStartup> mScopedXPCOM; > + base::StatisticsRecorder* mStatisticsRecorder; Considering nothing is actually reading mStatisticsRecorder, you could make it a local variable in XRE_main (*not* XREMain::XRE_main), would would allow you to leak it and limit the effects of the lsan suppression, since it would then apply to XRE_main, not XREMain::XRE_main.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 45

•

9 years ago

Attached patch Purposely leak StatisticsReport object and suppress the leak report (obsolete) — Details — Splinter Review

Alternative https://treeherder.mozilla.org/#/jobs?repo=try&revision=e39ae201963f

Attachment #8659095 - Flags: review?(mh+mozilla)

Attachment #8659095 - Flags: review?(continuation)

Mike Hommey [:glandium]

Comment 46

•

9 years ago

Comment on attachment 8659095 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Review of attachment 8659095 [details] [diff] [review]: ----------------------------------------------------------------- copy/paste from irc: > that's what I was suggesting, yes > although, I just had a new nasty idea, which may of may not be worth if we have more of this "intentionally leak" scheme > my nasty idea being something like UniquePtr, but that would be LeakedPtr. There would be a MakeLeaked<> template like there is MakeUnique, and then we could possily generically lsan-suppress all things leaked from MakeLeaked<*>

Attachment #8659095 - Flags: review?(mh+mozilla) → review+

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Attachment #8659095 - Flags: review?(continuation) → review-

Andrew McCreight [:mccr8]

Updated

•

9 years ago

Attachment #8659078 - Flags: review?(continuation) → review-

Andrew McCreight [:mccr8]

Comment 47

•

9 years ago

That will match any stack that contains XRE_main in any frame. What you want is to call MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT on the object you are intentionally leaking, which is what is described in comment 46.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 48

•

9 years ago

Attached patch fooPurposely leak StatisticsReport object and suppress the leak report (obsolete) — Details — Splinter Review

Assignee: gfritzsche → rjesup

Attachment #8659078 - Attachment is obsolete: true

Attachment #8659095 - Attachment is obsolete: true

Attachment #8659304 - Flags: review?(continuation)

Andrew McCreight [:mccr8]

Comment 49

•

9 years ago

Comment on attachment 8659304 [details] [diff] [review] fooPurposely leak StatisticsReport object and suppress the leak report Review of attachment 8659304 [details] [diff] [review]: ----------------------------------------------------------------- r=me for the LEAK_OBJECT stuff. ::: toolkit/xre/nsAppRunner.cpp @@ +4471,5 @@ > + // A initializer to initialize histogram collection, a chromium > + // thing used by Telemetry (and effectively a global; it's all static). > + // Note: purposely leaked > + base::StatisticsRecorder* statistics_recorder = new base::StatisticsRecorder(); > + MOZ_LSAN_INTENTIONALLY_LEAK_OBJECT(statistics_recorder); You should #include mozilla/MemoryChecking.h explicitly in this file, if it isn't there already.

Attachment #8659304 - Flags: review?(continuation) → review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 50

•

9 years ago

Comment on attachment 8659304 [details] [diff] [review] fooPurposely leak StatisticsReport object and suppress the leak report Approval Request Comment [Feature/regressing bug #]: N/A [User impact if declined]: longer exposure (though low risk of that being exploited) [Describe test coverage new/current, TreeHerder]: Shutdown is tested. [Risks and why]: very low risk [String/UUID change made/needed]: none [Security approval request comment] How easily could an exploit be constructed based on the patch? Highly unlikely Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? Somewhat - clear we're leaking, so one could look for why that's important. Which older supported branches are affected by this flaw? all If not all supported branches, which bug introduced the flaw? Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Trivial backporting. How likely is this patch to cause regressions; how much testing does it need? No likelihood of regressions. No significant testing needed. At most, do some update-and-restarts.

Attachment #8659304 - Flags: sec-approval?

Attachment #8659304 - Flags: approval-mozilla-aurora?

Daniel Veditz [:dveditz]

Comment 51

•

9 years ago

Comment on attachment 8659304 [details] [diff] [review] fooPurposely leak StatisticsReport object and suppress the leak report a=dveditz for sec-approval and aurora landing.

Attachment #8659304 - Flags: sec-approval?

Attachment #8659304 - Flags: sec-approval+

Attachment #8659304 - Flags: approval-mozilla-aurora?

Attachment #8659304 - Flags: approval-mozilla-aurora+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 52

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/2c9fe69fd9b4

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 53

•

9 years ago

Note: this code is still a mess thread-safety-wise, at least documenting it and TSAN.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 54

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/9a09b707e39b (needed unused << statistics_recorder)

Phil Ringnalda (:philor)

Comment 55

•

9 years ago

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/56f6468c7e60 because suppressing the very generic leak in Valgrind made us nervous.

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 56

•

9 years ago

Attached patch Suppress leak in valgrind as well — Details — Splinter Review

Tested with this patch (https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf91734da97f), and with a patch to introduce a leak in something called from a deeper function to ensure reporting is still working (https://treeherder.mozilla.org/#/jobs?repo=try&revision=7cac87e190aa)

Attachment #8660398 - Flags: review?(n.nethercote)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 57

•

9 years ago

Attached patch Purposely leak StatisticsReport object and suppress the leak report — Details — Splinter Review

Attachment #8659304 - Attachment is obsolete: true

Attachment #8660399 - Flags: review+

Nicholas Nethercote [inactive]

Comment 58

•

9 years ago

Comment on attachment 8660398 [details] [diff] [review] Suppress leak in valgrind as well Review of attachment 8660398 [details] [diff] [review]: ----------------------------------------------------------------- ::: build/valgrind/cross-architecture.sup @@ +39,5 @@ > + Memcheck:Leak > + fun:malloc > + fun:moz_xmalloc > + fun:operator new > + fun:XRE_main I have no context but this is a worryingly generic suppression. I suggest you create a separate function that does nothing but allocate the StatisticsReporter and call it from main(). That way there would be, for example, a "fun:LeakyStatisticsReporterAllocator" entry between the "fun:operator new" and the "fun:XRE_main", which would make things clearer and less generic.

Attachment #8660398 - Flags: review?(n.nethercote)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 59

•

9 years ago

This suppression should only over allocations directly from XRE_main(), which is a very simple function. Before the change it was basically: int XRE_main(int argc, char* argv[], const nsXREAppData* aAppData, uint32_t aFlags) { XREMain main; int result = main.XRE_main(argc, argv, aAppData); mozilla::RecordShutdownEndTimeStamp(); return result; } The suppression is limited to allocations with new call directly from XRE_main. I tested that a leak in main.XRE_main() is reported. And I was hesitant to do the stub-function trick (though I coded that as well) because the compiler might inline it given how simple it is. All that said, it's trivial to dump it into a stub function if you prefer.

Nicholas Nethercote [inactive]

Comment 60

•

9 years ago

(In reply to Randell Jesup [:jesup] from comment #59) > All that said, it's trivial to dump it into a stub function if you prefer. That would save a future person looking at the valgrind suppressions from having the same reaction as me :) Thank you.

Nicholas Nethercote [inactive]

Updated

•

9 years ago

Attachment #8660398 - Flags: review+

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 61

•

9 years ago

Green try with separate stub: https://treeherder.mozilla.org/#/jobs?repo=try&revision=23cab817081b and a red one with a deliberate leak elsewhere; carrying forward r=njn

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 62

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/346b9ee524d1

Sylvestre Ledru [:Sylvestre]

Comment 63

•

9 years ago

Georg, are you planning to request an uplift to 42? Thanks

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 64

•

9 years ago

Moving to jesup who should have a better overview here now :)

Flags: needinfo?(gfritzsche)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 65

•

9 years ago

Comment on attachment 8660399 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Approval Request Comment [Feature/regressing bug #]: N/A [User impact if declined]: hard-to-exploit UAF in shutdown, typically when doing a restart installing an update. [Describe test coverage new/current, TreeHerder]: extremely hard repeating the bug to test in automation, or even manually. Normal shutdown is well tested. [Risks and why]: Almost none. Just leaks what is effectively a global. [String/UUID change made/needed]: none

Attachment #8660399 - Flags: approval-mozilla-aurora?

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 66

•

9 years ago

https://hg.mozilla.org/mozilla-central/rev/346b9ee524d1

status-firefox43: affected → fixed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Updated

•

9 years ago

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED

Daniel Veditz [:dveditz]

Updated

•

9 years ago

Group: toolkit-core-security → core-security-release

Sylvestre Ledru [:Sylvestre]

Comment 67

•

9 years ago

Comment on attachment 8660399 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Let's take it.

Attachment #8660399 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Sylvestre Ledru [:Sylvestre]

Updated

•

9 years ago

tracking-firefox-esr31: ? → ---

Sylvestre Ledru [:Sylvestre]

Comment 68

•

9 years ago

Comment on attachment 8660399 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report [Triage Comment] 42 is now in beta.

Attachment #8660399 - Flags: approval-mozilla-aurora+ → approval-mozilla-beta+

Carsten Book [:Tomcat]

Comment 69

•

9 years ago

https://hg.mozilla.org/releases/mozilla-beta/rev/48b17faad125

status-firefox42: affected → fixed

Target Milestone: --- → mozilla43

Benjamin Smedberg

Updated

•

9 years ago

Flags: needinfo?(benjamin)

Matt Wobensmith [:mwobensmith][:matt:]

Updated

•

9 years ago

Whiteboard: [post-critsmash-triage]

Liz Henry (:lizzard) (relman/hg->git project)

Comment 70

•

9 years ago

Randell, do you want to request uplift for this to ESR? Thanks.

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Assignee

Comment 71

•

9 years ago

Comment on attachment 8660399 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report See Aurora approval request

Flags: needinfo?(rjesup)

Attachment #8660399 - Flags: approval-mozilla-esr38?

Liz Henry (:lizzard) (relman/hg->git project)

Comment 72

•

9 years ago

Comment on attachment 8660399 [details] [diff] [review] Purposely leak StatisticsReport object and suppress the leak report Fix for shutdown and restart leak. Approved for esr-38

Attachment #8660399 - Flags: approval-mozilla-esr38? → approval-mozilla-esr38+

Carsten Book [:Tomcat]

Comment 73

•

9 years ago

https://hg.mozilla.org/releases/mozilla-esr38/rev/9f88a1fdc112

status-firefox-esr38: affected → fixed

Al Billings [:abillings - ex-MoCo]

Updated

•

9 years ago

Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main42+][adv-esr38.4+]

Daniel Veditz [:dveditz]

Updated

•

8 years ago

Group: core-security-release

Move StatisticsRecorder cleanup after app relaunch 9 years ago Georg Fritzsche [:gfritzsche] 1.51 KB, patch	jesup : review+	Details \| Diff \| Splinter Review
Purposely leak StatisticsReport object and suppress the leak report 9 years ago Randell Jesup [:jesup] (needinfo me) 2.80 KB, patch	jesup : review+ mccr8 : review-	Details \| Diff \| Splinter Review
Purposely leak StatisticsReport object and suppress the leak report 9 years ago Randell Jesup [:jesup] (needinfo me) 3.48 KB, patch	glandium : review+ mccr8 : review-	Details \| Diff \| Splinter Review
fooPurposely leak StatisticsReport object and suppress the leak report 9 years ago Randell Jesup [:jesup] (needinfo me) 3.08 KB, patch	mccr8 : review+ dveditz : approval-mozilla-aurora+ dveditz : sec-approval+	Details \| Diff \| Splinter Review
Suppress leak in valgrind as well 9 years ago Randell Jesup [:jesup] (needinfo me) 967 bytes, patch	n.nethercote : review+	Details \| Diff \| Splinter Review
Purposely leak StatisticsReport object and suppress the leak report 9 years ago Randell Jesup [:jesup] (needinfo me) 3.66 KB, patch	jesup : review+ Sylvestre : approval-mozilla-beta+ lizzard : approval-mozilla-esr38+	Details \| Diff \| Splinter Review