1342420 - Crash in mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run

Reporter

Description

•

7 years ago

This bug was filed from the Socorro interface and is 
report bp-60c71aa5-f0df-4e3b-98b6-eae892170221.
=============================================================
Crashing Thread (23)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run() 	obj-firefox/dist/include/mozilla/TaskDispatcher.h:191
1 	xul.dll 	mozilla::TaskQueue::Runner::Run() 	xpcom/threads/TaskQueue.cpp:236
2 	xul.dll 	nsThreadPool::Run() 	xpcom/threads/nsThreadPool.cpp:225
3 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
4 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:311
5 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:338
6 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
7 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:205
8 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:465
9 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
10 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
11 	ucrtbase.dll 	_o__CIpow 	
12 	kernel32.dll 	BaseThreadInitThunk 	
13 	ntdll.dll 	__RtlUserThreadStart 	
14 	ntdll.dll 	_RtlUserThreadStart

these cross-platform crashes have been around for a while. the crashing address of a portion of them indicates a uaf situation.

Daniel Veditz [:dveditz]

Comment 1

•

7 years ago

The majority of crashes with this signature are a null deref, but there are a decent number of UAF-looking crash addresses too.

Group: core-security → dom-core-security

Component: Untriaged → IPC

Keywords: csectype-uaf, sec-high

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 2

•

7 years ago

I think this code is typically used by media, so I'm going to move it there. Hopefully that's the right component.

Component: IPC → Audio/Video

Gian-Carlo Pascutto [:gcp]

Comment 3

•

7 years ago

I had to do some digging to make the link to Audio/Video code from the stack. But it seems correct:

1) The relation between the crashing code and Audio/Video is TaskQueue, which is used by our video decoders.

2) The crash stack shows one thread inside msmpeg2vdec.dll 	CH264DecoderSoftware::GetDataFromStream(char**, int*, InputSampleInfo_t*) which indicates an H264 video decode is ongoing.

Jesup, my best guess is that something is adding a Runnable but not keeping it alive long enough or so? Got an idea who can own this?

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 4

•

7 years ago

gcp's comment about CH264DecoderSoftware::GetDataFromStream() in msmpeg2vdec.dll implies strongly this has something to do with Platform Decoders.

Flags: needinfo?(rjesup) → needinfo?(jyavenard)

Randell Jesup [:jesup] (needinfo me)

Updated

•

7 years ago

Component: Audio/Video → Audio/Video: Playback

Randell Jesup [:jesup] (needinfo me)

Comment 5

•

7 years ago

This is an old bug (42b7 at least).  Looking at some older crashes, like https://crash-stats.mozilla.com/report/index/18813b11-fec2-45f5-9373-b3e332170328 and https://crash-stats.mozilla.com/report/index/ef44c07e-8c06-4a5c-9b76-bdcb52170328, they seem to imply that mTasks is invalid (this is on a "for (size_t i = 0; i < mTasks->mStateChangeTasks.Length(); ++i)"), and some of the other reports hit on the mTasks->mRegularTasks.Length(), and a bunch more hit on one of the mTasks->...[i]->Run() lines.

Several of these i looked at had nothing running that was media-related (other than cubeb idle in the background; just means that audio was used sometime in the session).

This all makes me think it's a flaw somewhere in TaskDispatcher - mTasks lifetime?  Or a cache issue (missing locking for multi-thread access)? I haven't looked deeply at the code in TaskDispatcher yet.  Since mStateChangeTasks and mRegularTasks are both nsTArray<nsCOMPtr<nsIRunnable>>'s, it seems unlikely that it's a runnable that has been freed.

CC'd a number of people who've touched TaskDispatcher.h for their awareness/thoughts.

Given some of these are EXEC crashes (widen search to a few months and I see them; didn't look at the 1-week view), and it's a clear UAF with apparent reallocation in many cases, I'm tempted to bump it to sec-critical - al, dan?

Updated affected list.  No 55 crashes so far, but not high frequency in nightly/aurora in any rev.

status-firefox54: ? → affected

status-firefox55: --- → ?

status-firefox-esr45: --- → affected

status-firefox-esr52: --- → affected

Flags: needinfo?(n.nethercote)

Flags: needinfo?(jyavenard)

Flags: needinfo?(dveditz)

Flags: needinfo?(bobbyholley)

Flags: needinfo?(abillings)

Nicholas Nethercote [inactive]

Comment 6

•

7 years ago

I've only touched TaskDispatcher.h a little bit. I don't really know anything about it, sorry.

Flags: needinfo?(n.nethercote)

Bobby Holley (:bholley)

Comment 7

•

7 years ago

You want JW here.

Flags: needinfo?(jwwang)

Flags: needinfo?(dveditz)

Flags: needinfo?(bobbyholley)

Flags: needinfo?(abillings)

JW Wang [:jwwang] [:jw_wang]

Comment 8

•

7 years ago

We have
https://crash-stats.mozilla.com/report/index/978fff99-50b0-406f-9d39-a0ecf2170329
https://crash-stats.mozilla.com/report/index/6bfbd5fb-b02f-4292-8549-9db5b2170329
https://crash-stats.mozilla.com/report/index/9edc3dbe-9217-485d-ae27-4f7da2170330

which look like a null deref and crashed on the main thread.

I will add some null checks in the hope of reducing null deref crash so other UAF issues can stand out.

Flags: needinfo?(jwwang)

JW Wang [:jwwang] [:jw_wang]

Comment 9

•

7 years ago

Attached patch 1342420_P1_add_null_checks.patch — Details — Splinter Review

Add null checks to avoid null deref.

Assignee: nobody → jwwang

Status: NEW → ASSIGNED

Attachment #8853361 - Flags: review?(bobbyholley)

Andrew McCreight [:mccr8]

Comment 10

•

7 years ago

Thanks for looking at this. It might be better to file a separate public bug for the null deref crashes.

Bobby Holley (:bholley)

Updated

•

7 years ago

Attachment #8853361 - Flags: review?(bobbyholley) → review+

Randell Jesup [:jesup] (needinfo me)

Comment 11

•

7 years ago

Comment on attachment 8853361 [details] [diff] [review]
1342420_P1_add_null_checks.patch

Review of attachment 8853361 [details] [diff] [review]:
-----------------------------------------------------------------

null-check will slightly help stability... but it doesn't solve anything.  Most likely the null-derefs are "the memory was reallocated and set to null".  We can search for non-null crashes today.  And the null-check doesn't help the security aspect of the bug, I'm afraid.

JW Wang [:jwwang] [:jw_wang]

Comment 12

•

7 years ago

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1338025#c7, the crash address might be incorrect for some reason. The null checks makes debugging easier and less confusing.

Randell Jesup [:jesup] (needinfo me)

Comment 13

•

7 years ago

If we really can't rely on the crash address, we need to either flag those value/platform combos in crash-stats, or omit them (or put <unknown>) -- but if 0 or fffff are "unknown", we'll have to consider a lot more bugs as potential sec issues.  

Ted?  Is this true?  Can we do anything?

Flags: needinfo?(ted)

(not currently active) Ted Mielczarek

Comment 14

•

7 years ago

I have Breakpad patches but they're complicated--it's basically pulling in a better disassembler and disassembling the faulting instruction to try to figure out the actual fault address. The x86-64 fault literally doesn't give out the faulting address when it's a non-canonical pointer.

Flags: needinfo?(ted)

Julien Cristau [:jcristau]

Comment 15

•

7 years ago

Too late for firefox 52, mass-wontfix.

status-firefox52: affected → wontfix

JW Wang [:jwwang] [:jw_wang]

Updated

•

7 years ago

Depends on: 1353610

JW Wang [:jwwang] [:jw_wang]

Comment 16

•

7 years ago

(In reply to Andrew McCreight [:mccr8] from comment #10)
> Thanks for looking at this. It might be better to file a separate public bug
> for the null deref crashes.

Bug is filed for null checks.

Daniel Veditz [:dveditz]

Updated

•

7 years ago

Keywords: testcase-wanted

Stephanie Ouillon [:arroway] (needinfo me)

Comment 17

•

7 years ago

Sec-high triage: this bug hasn't been updated for one month. JW, is there any update?

Flags: needinfo?(jwwang)

JW Wang [:jwwang] [:jw_wang]

Comment 18

•

7 years ago

No, I don't have new findings. The crashes so far all look like random memory corruption to me.

Flags: needinfo?(jwwang)

Anthony Jones (:ajones, :kentuckyfriedtakahe, :k17e)

Updated

•

7 years ago

Priority: -- → P1

Randell Jesup [:jesup] (needinfo me)

Comment 19

•

7 years ago

I'm pretty concerned about this one, since it's an EXEC crash on a runnable, and has been happening regularly

Very likely this is a runnable that was somehow released or overwritten - I'll place my bets on released, even though we're holding a ref to it.  Now the event queue should hold a ref to it, so if it is released somehow there's a refcounting mistake.  The other option would be the PerThreadTaskGroup object itself somehow to be overwritten or destroyed.  That's passed around in UniquePtr's, so it should be correct.

In the TaskDispatcher, these runnables are held in a nsTArray<nsCOMPtr<nsIRunnable>>, so it shouldn't get released out from under us, and thus the assumption that there's a refcounting mistake in some other code, or something overwrote memory (and does so very regularly).  If the PerThreadTaskGroup were destroyed, all the runnables would be likely freed, but if so I'm unclear how we'd get into here.

Also, as noted, many of these crashes are EXECs to these pointers, so this might be exploitable.

I also don't see any obvious stacks in the media code (we have media decoder/etc threads, but those are part of pools that hang around normally).  It's certainly possible we're running media code, though.

This doesn't appear to be truly random memory corruption, but that's not impossible.

Does anyone have any thoughts, or additional tripwires/release-asserts we could land?

Frederik Braun [:freddy]

Comment 20

•

7 years ago

(In reply to Randell Jesup [:jesup] from comment #19)
> Does anyone have any thoughts, or additional tripwires/release-asserts we
> could land?


Who did you mean to address with this question? Please forward the needinfo accordingly.

Flags: needinfo?(rjesup)

Randell Jesup [:jesup] (needinfo me)

Comment 21

•

7 years ago

> ho did you mean to address with this question? Please forward the needinfo accordingly.
Anyone on this bug, really.  It's very unclear where this is coming from.

Flags: needinfo?(rjesup) → needinfo?(wmccloskey)

JW Wang [:jwwang] [:jw_wang]

Comment 22

•

7 years ago

https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3AAutoTaskDispatcher%3A%3ATaskGroupRunnable%3A%3ARun&date=%3E%3D2017-01-20T06%3A06%3A47.000Z&date=%3C2017-07-20T06%3A06%3A47.000Z

There are no reports after 54.0.1, I guess it's been fixed by some other bugs.

Randell Jesup [:jesup] (needinfo me)

Comment 23

•

7 years ago

(In reply to JW Wang [:jwwang] [:jw_wang] from comment #22)
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&signature=mozilla%3A%3AAutoTaskDispatcher%3A%3ATaskGroupRunn
> able%3A%3ARun&date=%3E%3D2017-01-20T06%3A06%3A47.000Z&date=%3C2017-07-
> 20T06%3A06%3A47.000Z
> 
> There are no reports after 54.0.1, I guess it's been fixed by some other
> bugs.

Still happening in 55b7 (reports from beta and nightly are always much lower in volume).  About 10 or 12 in the last month or so in 55.
https://crash-stats.mozilla.com/report/index/d0637fc6-b9d5-4316-b24d-aab6d0170705 (clear UAF)

status-firefox55: ? → affected

status-firefox56: --- → ?

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 24

•

7 years ago

The code looks correct to me. All I can think to do is to add some thread safety assertions, since this code is threaded and it doesn't have any assertions right now. However, I don't actually see any threading bugs just from inspection.

Flags: needinfo?(wmccloskey)

Frederik Braun [:freddy]

Comment 25

•

7 years ago

Bill, can *you* add those thread safety assertions?

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Comment 26

•

7 years ago

I'm pretty busy. I might have time in a couple weeks. This is really JW's area, though.

Flags: needinfo?(wmccloskey) → needinfo?(jwwang)

JW Wang [:jwwang] [:jw_wang]

Comment 27

•

7 years ago

Turned out my plate is full of media tasks and won't have time for XPCOM stuff. Maybe some other XPCOM guy is interested in taking this bug.

Assignee: jwwang → nobody

Status: ASSIGNED → NEW

Flags: needinfo?(jwwang)

JW Wang [:jwwang] [:jw_wang]

Updated

•

7 years ago

Component: Audio/Video: Playback → XPCOM

Emma Humphries ☕️🎸🧞‍♀️✨ (she/they) [:emceeaich] (Pacific Time) use needinfo

Updated

•

7 years ago

Keywords: stale-bug

Stephanie Ouillon [:arroway] (needinfo me)

Comment 28

•

7 years ago

Nathan, do you know who could work on this bug?

Flags: needinfo?(nfroyd)

Daniel Veditz [:dveditz]

Comment 29

•

7 years ago

The null derefs seemed to go away in more recent versions(In reply to JW Wang [:jwwang] [:jw_wang] from comment #16)
> (In reply to Andrew McCreight [:mccr8] from comment #10)
> > Thanks for looking at this. It might be better to file a separate public bug for the null deref crashes.
> 
> Bug is filed for null checks.

The initial comment was filed on a UAF crash and explicitly mentions UAF as the reason for filing the bug.

Assignee: nobody → nfroyd

Keywords: stale-bug

Nathan Froyd [:froydnj]

Comment 30

•

7 years ago

I don't have much to add; the following is just me trying to write everything out so maybe somebody else can poke holes in this logic.

OK, so we have a UAF / null deref in this code:

// State change tasks get run all together before any code is run, so
// that all state changes are made in an atomic unit.
for (size_t i = 0; i < mTasks->mStateChangeTasks.Length(); ++i) {
mTasks->mStateChangeTasks[i]->Run();
}

mTasks is a UniquePtr; mStateChangeTasks is an array of nsCOMPtrs. So, assuming those are used correctly, we shouldn't have any issues. The only way mStateChangeTasks gets touched is in AutoTaskDispatcher::AddStateChangeTask(), where we pass in an existing strong reference. AddStateChangeTask is only called by AbstractThread::DispatchStateChange, which in turn is only called by places that allocate a runnable and immediately pass it in. So, that looks like a dead end, and the likely UAF is mTasks?

TaskGroupRunnable::mTasks is a UniquePtr<PerThreadTaskGroup>, so we should start by figuring out how data flows into that. TaskGroupRunnable is constructed in AutoTaskDispatcher::DispatchTaskGroup, which receives a UniquePtr of the appropriate type. DispatchTaskGroup is called in two places:

1. ~AutoTaskDispatcher;
2. AutoTaskDispatcher::DispatchTasksFor.

Both of these places are moving UniquePtr<PerThreadTaskGroup> out of AutoTaskDispatcher::mTaskGroups, which is an nsTArray<UniquePtr<PerThreadTaskGroup>>. That seems safe enough (though we have suspicious looking array indexing crashes in ~AutoTaskDispatcher...). How does mTaskGroups get added to? In one of:

1. AutoTaskDispatcher::EnsureTaskGroup; or
2. AutoTaskDispatcher::AddTask.

EnsureTaskGroup is only called in AutoTaskDispatcher::AddStateChangeTask. So where would AddStateChangeTask or AddTask get called from? Well, where would AutoTaskDispatcher get used? It gets used in two places:

1. TaskQueue::AutoTaskGuard (subclassed);
2. EventTargetWrapper, as a Maybe<> member.

Assuming I understand things correctly, EventTargetWrapper and its use of AutoTaskDispatcher can only be used on cycle-collected threads, because when we construct EventTargetWrapper::mTailDispatcher in EventTargetWrapper::TailDispatcher, we call nsContentUtils::RunInStableState, which calls into the CC...if we weren't on a CC thread, we'd crash. But what if we *weren't* on a CC thread when TailDispatcher was called for the nth time, and we handed out a reference to memory that FireTailDispatcher was about to destroy on a different, CC'd thread? That seems like a semi-plausible scenario; a lot of the crashes are happening on non-CC'd threads (generally threads in an nsThreadPool).

That leaves TaskQueue::AutoTaskGuard. AutoTaskGuard's constructor takes a TaskQueue*, and sets the TaskQueue's mTailDispatcher member to the address AutoTaskGuard being constructed. So maybe TaskQueue is handing out pointers/references to this thing that lives on the stack, and those pointers are being used--as an AutoTaskDispatcher *after* the AutoTaskGuard has been deconstructed? The AutoTaskDispatcher's members would still be reasonable values (subject to the vagaries of stack use by other functions), but they'd point to freed memory, perhaps, and we'd get hosed?

The only problem with that theory (and the one around EventTargetWrapper, above), is that TaskQueue::TailDispatcher (and EventTargetWrapper::TailDispatcher) is only called to immediately do something with the result; the result of TailDispatcher is not stored away for later use. So the problematic scenario is something like:

1. Call TailDispatcher, get reference back;
2. Destroy the underlying AutoTaskDispatcher on a different thread;
3. Use the TailDispatcher...which is overwritten with freed memory at this point?

That scenario doesn't seem to work for TaskQueue::AutoTaskGuard, since AutoTaskGuard is stack-scoped, so I think that leaves EventTargetWrapper as the likeliest place for problems? We could test that theory by doing some combination of:

1. converting the MOZ_ASSERT in EventTargetWrapper::TailDispatcher into a MOZ_RELEASE_ASSERT; and/or
2. Making Maybe::reset() actually scribble over the destroyed memory with the poison value.

Actually, the second one sounds like a really good idea, irrespective of whether it does something in this bug. I'll file that as a separate, probably security-sensitive bug?

Flags: needinfo?(nfroyd)

Daniel Veditz [:dveditz]

Comment 31

•

6 years ago

Nathan: did you file that other bug? If so please link to it from here or add it to "See Also".

Do we have any other ways forward on this bug?

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 32

•

6 years ago

(In reply to Daniel Veditz [:dveditz] from comment #31)
> Nathan: did you file that other bug? If so please link to it from here or
> add it to "See Also".

Bug 1414901, added to "See Also".

> Do we have any other ways forward on this bug?

I can convert the MOZ_ASSERT in EventTargetWrapper::TailDispatcher into a release or diagnostic assert, I guess?  We get a few hits on this crash each week on beta and nightly, so that might provide a little information.

Bug 1414901 would help some here, too.

Flags: needinfo?(nfroyd)

Comment 33

•

6 years ago

The most recent (and likely most interesting) crash is on 60.0b7 from a 2018-03-29 06:17:37 build
https://crash-stats.mozilla.com/report/index/8540ccb4-b2fe-4fd3-8e49-909290180329

There's nothing else much. What else do you think we need to do with the bugs mentioned above fixed, Nathan?

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 34

•

6 years ago

(In reply to Frederik Braun [:freddyb] from comment #33)
> The most recent (and likely most interesting) crash is on 60.0b7 from a
> 2018-03-29 06:17:37 build
> https://crash-stats.mozilla.com/report/index/8540ccb4-b2fe-4fd3-8e49-
> 909290180329
> 
> There's nothing else much. What else do you think we need to do with the
> bugs mentioned above fixed, Nathan?

I can't tell from that crash if we're crashing somewhere in a poisoned region or not.  Did we remove the poison address information from crash reports? =/

Also, why does that crash not show up when searching for the signature?

Flags: needinfo?(nfroyd)

(not currently active) Ted Mielczarek

Comment 35

•

6 years ago

The poison info annotations are still there:
https://dxr.mozilla.org/mozilla-central/rev/c23c7481957f7b982cffc0ce1d25979c69ca2c2f/toolkit/xre/nsAppRunner.cpp#4455

...but apparently we don't send the parent process ones for child processes (which sorta makes sense, as we could ostensibly choose a different address per process):
https://dxr.mozilla.org/mozilla-central/rev/c23c7481957f7b982cffc0ce1d25979c69ca2c2f/toolkit/crashreporter/nsExceptionHandler.cpp#321

I guess we should make child processes explicitly annotate these as well.

Sylvestre Ledru [:Sylvestre]

Comment 36

•

6 years ago

Moving to p3 because no activity for at least 24 weeks.

Priority: P1 → P3

Andrew McCreight [:mccr8]

Updated

•

5 years ago

Keywords: stalled

Ryan VanderMeulen [:RyanVM]

Updated

•

5 years ago

status-firefox53: affected → wontfix

status-firefox54: affected → wontfix

status-firefox55: affected → wontfix

status-firefox56: ? → wontfix

status-firefox-esr45: affected → wontfix

status-firefox-esr52: affected → wontfix

Andrew McCreight [:mccr8]

Updated

•

3 years ago

Assignee: froydnj+bz → nobody

BMO Automation

Updated

•

2 years ago

Severity: critical → S2

Tom Ritter [:tjr] (OOTO until 4/30 at least)

Comment 37

•

4 months ago

We're working to close out stalled bugs that cannot be actioned. There are still a few coming in every release but at this point we're going to close it out and can reopen or refile and reinvestigate if we want to.

Status: NEW → RESOLVED

Closed: 4 months ago

Resolution: --- → INCOMPLETE

BugBot (nomail) [:suhaib / :marco/ :calixte]

Comment 38

•

4 months ago

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled

Daniel Veditz [:dveditz]

Updated

•

3 months ago

Group: dom-core-security