Closed Bug 1342420 Opened 7 years ago Closed 4 months ago

Crash in mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run

Categories

(Core :: XPCOM, defect, P3)

40 Branch
defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr45 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- wontfix
firefox53 --- wontfix
firefox54 --- wontfix
firefox55 --- wontfix
firefox56 --- wontfix

People

(Reporter: philipp, Unassigned)

References

Details

(4 keywords)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-60c71aa5-f0df-4e3b-98b6-eae892170221.
=============================================================
Crashing Thread (23)
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::AutoTaskDispatcher::TaskGroupRunnable::Run() 	obj-firefox/dist/include/mozilla/TaskDispatcher.h:191
1 	xul.dll 	mozilla::TaskQueue::Runner::Run() 	xpcom/threads/TaskQueue.cpp:236
2 	xul.dll 	nsThreadPool::Run() 	xpcom/threads/nsThreadPool.cpp:225
3 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1067
4 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:311
5 	xul.dll 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:338
6 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:225
7 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:205
8 	xul.dll 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:465
9 	nss3.dll 	_PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c:397
10 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c:95
11 	ucrtbase.dll 	_o__CIpow 	
12 	kernel32.dll 	BaseThreadInitThunk 	
13 	ntdll.dll 	__RtlUserThreadStart 	
14 	ntdll.dll 	_RtlUserThreadStart

these cross-platform crashes have been around for a while. the crashing address of a portion of them indicates a uaf situation.
The majority of crashes with this signature are a null deref, but there are a decent number of UAF-looking crash addresses too.
Group: core-security → dom-core-security
Component: Untriaged → IPC
I think this code is typically used by media, so I'm going to move it there. Hopefully that's the right component.
Component: IPC → Audio/Video
I had to do some digging to make the link to Audio/Video code from the stack. But it seems correct:

1) The relation between the crashing code and Audio/Video is TaskQueue, which is used by our video decoders.

2) The crash stack shows one thread inside msmpeg2vdec.dll 	CH264DecoderSoftware::GetDataFromStream(char**, int*, InputSampleInfo_t*) which indicates an H264 video decode is ongoing.

Jesup, my best guess is that something is adding a Runnable but not keeping it alive long enough or so? Got an idea who can own this?
Flags: needinfo?(rjesup)
gcp's comment about CH264DecoderSoftware::GetDataFromStream() in msmpeg2vdec.dll implies strongly this has something to do with Platform Decoders.
Flags: needinfo?(rjesup) → needinfo?(jyavenard)
Component: Audio/Video → Audio/Video: Playback
This is an old bug (42b7 at least).  Looking at some older crashes, like https://crash-stats.mozilla.com/report/index/18813b11-fec2-45f5-9373-b3e332170328 and https://crash-stats.mozilla.com/report/index/ef44c07e-8c06-4a5c-9b76-bdcb52170328, they seem to imply that mTasks is invalid (this is on a "for (size_t i = 0; i < mTasks->mStateChangeTasks.Length(); ++i)"), and some of the other reports hit on the mTasks->mRegularTasks.Length(), and a bunch more hit on one of the mTasks->...[i]->Run() lines.

Several of these i looked at had nothing running that was media-related (other than cubeb idle in the background; just means that audio was used sometime in the session).

This all makes me think it's a flaw somewhere in TaskDispatcher - mTasks lifetime?  Or a cache issue (missing locking for multi-thread access)? I haven't looked deeply at the code in TaskDispatcher yet.  Since mStateChangeTasks and mRegularTasks are both nsTArray<nsCOMPtr<nsIRunnable>>'s, it seems unlikely that it's a runnable that has been freed.

CC'd a number of people who've touched TaskDispatcher.h for their awareness/thoughts.

Given some of these are EXEC crashes (widen search to a few months and I see them; didn't look at the 1-week view), and it's a clear UAF with apparent reallocation in many cases, I'm tempted to bump it to sec-critical - al, dan?

Updated affected list.  No 55 crashes so far, but not high frequency in nightly/aurora in any rev.
Flags: needinfo?(n.nethercote)
Flags: needinfo?(jyavenard)
Flags: needinfo?(dveditz)
Flags: needinfo?(bobbyholley)
Flags: needinfo?(abillings)
I've only touched TaskDispatcher.h a little bit. I don't really know anything about it, sorry.
Flags: needinfo?(n.nethercote)
You want JW here.
Flags: needinfo?(jwwang)
Flags: needinfo?(dveditz)
Flags: needinfo?(bobbyholley)
Flags: needinfo?(abillings)
We have
https://crash-stats.mozilla.com/report/index/978fff99-50b0-406f-9d39-a0ecf2170329
https://crash-stats.mozilla.com/report/index/6bfbd5fb-b02f-4292-8549-9db5b2170329
https://crash-stats.mozilla.com/report/index/9edc3dbe-9217-485d-ae27-4f7da2170330

which look like a null deref and crashed on the main thread.

I will add some null checks in the hope of reducing null deref crash so other UAF issues can stand out.
Flags: needinfo?(jwwang)
Add null checks to avoid null deref.
Assignee: nobody → jwwang
Status: NEW → ASSIGNED
Attachment #8853361 - Flags: review?(bobbyholley)
Thanks for looking at this. It might be better to file a separate public bug for the null deref crashes.
Attachment #8853361 - Flags: review?(bobbyholley) → review+
Comment on attachment 8853361 [details] [diff] [review]
1342420_P1_add_null_checks.patch

Review of attachment 8853361 [details] [diff] [review]:
-----------------------------------------------------------------

null-check will slightly help stability... but it doesn't solve anything.  Most likely the null-derefs are "the memory was reallocated and set to null".  We can search for non-null crashes today.  And the null-check doesn't help the security aspect of the bug, I'm afraid.
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1338025#c7, the crash address might be incorrect for some reason. The null checks makes debugging easier and less confusing.
If we really can't rely on the crash address, we need to either flag those value/platform combos in crash-stats, or omit them (or put <unknown>) -- but if 0 or fffff are "unknown", we'll have to consider a lot more bugs as potential sec issues.  

Ted?  Is this true?  Can we do anything?
Flags: needinfo?(ted)
I have Breakpad patches but they're complicated--it's basically pulling in a better disassembler and disassembling the faulting instruction to try to figure out the actual fault address. The x86-64 fault literally doesn't give out the faulting address when it's a non-canonical pointer.
Flags: needinfo?(ted)
Too late for firefox 52, mass-wontfix.
Depends on: 1353610
(In reply to Andrew McCreight [:mccr8] from comment #10)
> Thanks for looking at this. It might be better to file a separate public bug
> for the null deref crashes.

Bug is filed for null checks.
Sec-high triage: this bug hasn't been updated for one month. JW, is there any update?
Flags: needinfo?(jwwang)
No, I don't have new findings. The crashes so far all look like random memory corruption to me.
Flags: needinfo?(jwwang)
I'm pretty concerned about this one, since it's an EXEC crash on a runnable, and has been happening regularly

Very likely this is a runnable that was somehow released or overwritten - I'll place my bets on released, even though we're holding a ref to it.  Now the event queue should hold a ref to it, so if it is released somehow there's a refcounting mistake.  The other option would be the PerThreadTaskGroup object itself somehow to be overwritten or destroyed.  That's passed around in UniquePtr's, so it should be correct.

In the TaskDispatcher, these runnables are held in a nsTArray<nsCOMPtr<nsIRunnable>>, so it shouldn't get released out from under us, and thus the assumption that there's a refcounting mistake in some other code, or something overwrote memory (and does so very regularly).  If the PerThreadTaskGroup were destroyed, all the runnables would be likely freed, but if so I'm unclear how we'd get into here.

Also, as noted, many of these crashes are EXECs to these pointers, so this might be exploitable.

I also don't see any obvious stacks in the media code (we have media decoder/etc threads, but those are part of pools that hang around normally).  It's certainly possible we're running media code, though.

This doesn't appear to be truly random memory corruption, but that's not impossible.

Does anyone have any thoughts, or additional tripwires/release-asserts we could land?
(In reply to Randell Jesup [:jesup] from comment #19)
> Does anyone have any thoughts, or additional tripwires/release-asserts we
> could land?


Who did you mean to address with this question? Please forward the needinfo accordingly.
Flags: needinfo?(rjesup)
> ho did you mean to address with this question? Please forward the needinfo accordingly.
Anyone on this bug, really.  It's very unclear where this is coming from.
Flags: needinfo?(rjesup) → needinfo?(wmccloskey)
(In reply to JW Wang [:jwwang] [:jw_wang] from comment #22)
> https://crash-stats.mozilla.com/signature/
> ?product=Firefox&signature=mozilla%3A%3AAutoTaskDispatcher%3A%3ATaskGroupRunn
> able%3A%3ARun&date=%3E%3D2017-01-20T06%3A06%3A47.000Z&date=%3C2017-07-
> 20T06%3A06%3A47.000Z
> 
> There are no reports after 54.0.1, I guess it's been fixed by some other
> bugs.

Still happening in 55b7 (reports from beta and nightly are always much lower in volume).  About 10 or 12 in the last month or so in 55.
https://crash-stats.mozilla.com/report/index/d0637fc6-b9d5-4316-b24d-aab6d0170705 (clear UAF)
The code looks correct to me. All I can think to do is to add some thread safety assertions, since this code is threaded and it doesn't have any assertions right now. However, I don't actually see any threading bugs just from inspection.
Flags: needinfo?(wmccloskey)
Bill, can *you* add those thread safety assertions?
Flags: needinfo?(wmccloskey)
I'm pretty busy. I might have time in a couple weeks. This is really JW's area, though.
Flags: needinfo?(wmccloskey) → needinfo?(jwwang)
Turned out my plate is full of media tasks and won't have time for XPCOM stuff. Maybe some other XPCOM guy is interested in taking this bug.
Assignee: jwwang → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(jwwang)
Component: Audio/Video: Playback → XPCOM
Nathan, do you know who could work on this bug?
Flags: needinfo?(nfroyd)
The null derefs seemed to go away in more recent versions(In reply to JW Wang [:jwwang] [:jw_wang] from comment #16)
> (In reply to Andrew McCreight [:mccr8] from comment #10)
> > Thanks for looking at this. It might be better to file a separate public bug for the null deref crashes.
> 
> Bug is filed for null checks.

The initial comment was filed on a UAF crash and explicitly mentions UAF as the reason for filing the bug.
Assignee: nobody → nfroyd
Keywords: stale-bug
I don't have much to add; the following is just me trying to write everything out so maybe somebody else can poke holes in this logic.

OK, so we have a UAF / null deref in this code:

        // State change tasks get run all together before any code is run, so
        // that all state changes are made in an atomic unit.
        for (size_t i = 0; i < mTasks->mStateChangeTasks.Length(); ++i) {
          mTasks->mStateChangeTasks[i]->Run();
        }

mTasks is a UniquePtr; mStateChangeTasks is an array of nsCOMPtrs.  So, assuming those are used correctly, we shouldn't have any issues.  The only way mStateChangeTasks gets touched is in AutoTaskDispatcher::AddStateChangeTask(), where we pass in an existing strong reference.  AddStateChangeTask is only called by AbstractThread::DispatchStateChange, which in turn is only called by places that allocate a runnable and immediately pass it in.  So, that looks like a dead end, and the likely UAF is mTasks?

TaskGroupRunnable::mTasks is a UniquePtr<PerThreadTaskGroup>, so we should start by figuring out how data flows into that.  TaskGroupRunnable is constructed in AutoTaskDispatcher::DispatchTaskGroup, which receives a UniquePtr of the appropriate type.  DispatchTaskGroup is called in two places:

1. ~AutoTaskDispatcher;
2. AutoTaskDispatcher::DispatchTasksFor.

Both of these places are moving UniquePtr<PerThreadTaskGroup> out of AutoTaskDispatcher::mTaskGroups, which is an nsTArray<UniquePtr<PerThreadTaskGroup>>.  That seems safe enough (though we have suspicious looking array indexing crashes in ~AutoTaskDispatcher...).  How does mTaskGroups get added to? In one of:

1. AutoTaskDispatcher::EnsureTaskGroup; or
2. AutoTaskDispatcher::AddTask.

EnsureTaskGroup is only called in AutoTaskDispatcher::AddStateChangeTask.  So where would AddStateChangeTask or AddTask get called from?  Well, where would AutoTaskDispatcher get used?  It gets used in two places:

1. TaskQueue::AutoTaskGuard (subclassed);
2. EventTargetWrapper, as a Maybe<> member.

Assuming I understand things correctly, EventTargetWrapper and its use of AutoTaskDispatcher can only be used on cycle-collected threads, because when we construct EventTargetWrapper::mTailDispatcher in EventTargetWrapper::TailDispatcher, we call nsContentUtils::RunInStableState, which calls into the CC...if we weren't on a CC thread, we'd crash.  But what if we *weren't* on a CC thread when TailDispatcher was called for the nth time, and we handed out a reference to memory that FireTailDispatcher was about to destroy on a different, CC'd thread?  That seems like a semi-plausible scenario; a lot of the crashes are happening on non-CC'd threads (generally threads in an nsThreadPool).

That leaves TaskQueue::AutoTaskGuard.  AutoTaskGuard's constructor takes a TaskQueue*, and sets the TaskQueue's mTailDispatcher member to the address AutoTaskGuard being constructed.  So maybe TaskQueue is handing out pointers/references to this thing that lives on the stack, and those pointers are being used--as an AutoTaskDispatcher *after* the AutoTaskGuard has been deconstructed?  The AutoTaskDispatcher's members would still be reasonable values (subject to the vagaries of stack use by other functions), but they'd point to freed memory, perhaps, and we'd get hosed?

The only problem with that theory (and the one around EventTargetWrapper, above), is that TaskQueue::TailDispatcher (and EventTargetWrapper::TailDispatcher) is only called to immediately do something with the result; the result of TailDispatcher is not stored away for later use.  So the problematic scenario is something like:

1. Call TailDispatcher, get reference back;
2. Destroy the underlying AutoTaskDispatcher on a different thread;
3. Use the TailDispatcher...which is overwritten with freed memory at this point?

That scenario doesn't seem to work for TaskQueue::AutoTaskGuard, since AutoTaskGuard is stack-scoped, so I think that leaves EventTargetWrapper as the likeliest place for problems?  We could test that theory by doing some combination of:

1. converting the MOZ_ASSERT in EventTargetWrapper::TailDispatcher into a MOZ_RELEASE_ASSERT; and/or
2. Making Maybe::reset() actually scribble over the destroyed memory with the poison value.

Actually, the second one sounds like a really good idea, irrespective of whether it does something in this bug.  I'll file that as a separate, probably security-sensitive bug?
Flags: needinfo?(nfroyd)
Nathan: did you file that other bug? If so please link to it from here or add it to "See Also".

Do we have any other ways forward on this bug?
Flags: needinfo?(nfroyd)
(In reply to Daniel Veditz [:dveditz] from comment #31)
> Nathan: did you file that other bug? If so please link to it from here or
> add it to "See Also".

Bug 1414901, added to "See Also".

> Do we have any other ways forward on this bug?

I can convert the MOZ_ASSERT in EventTargetWrapper::TailDispatcher into a release or diagnostic assert, I guess?  We get a few hits on this crash each week on beta and nightly, so that might provide a little information.

Bug 1414901 would help some here, too.
Flags: needinfo?(nfroyd)
See Also: → 1414901
The most recent (and likely most interesting) crash is on 60.0b7 from a 2018-03-29 06:17:37 build
https://crash-stats.mozilla.com/report/index/8540ccb4-b2fe-4fd3-8e49-909290180329

There's nothing else much. What else do you think we need to do with the bugs mentioned above fixed, Nathan?
Flags: needinfo?(nfroyd)
(In reply to Frederik Braun [:freddyb] from comment #33)
> The most recent (and likely most interesting) crash is on 60.0b7 from a
> 2018-03-29 06:17:37 build
> https://crash-stats.mozilla.com/report/index/8540ccb4-b2fe-4fd3-8e49-
> 909290180329
> 
> There's nothing else much. What else do you think we need to do with the
> bugs mentioned above fixed, Nathan?

I can't tell from that crash if we're crashing somewhere in a poisoned region or not.  Did we remove the poison address information from crash reports? =/

Also, why does that crash not show up when searching for the signature?
Flags: needinfo?(nfroyd)
The poison info annotations are still there:
https://dxr.mozilla.org/mozilla-central/rev/c23c7481957f7b982cffc0ce1d25979c69ca2c2f/toolkit/xre/nsAppRunner.cpp#4455

...but apparently we don't send the parent process ones for child processes (which sorta makes sense, as we could ostensibly choose a different address per process):
https://dxr.mozilla.org/mozilla-central/rev/c23c7481957f7b982cffc0ce1d25979c69ca2c2f/toolkit/crashreporter/nsExceptionHandler.cpp#321

I guess we should make child processes explicitly annotate these as well.
Moving to p3 because no activity for at least 24 weeks.
Priority: P1 → P3
Keywords: stalled
Assignee: froydnj+bz → nobody
Severity: critical → S2

We're working to close out stalled bugs that cannot be actioned. There are still a few coming in every release but at this point we're going to close it out and can reopen or refile and reinvestigate if we want to.

Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → INCOMPLETE

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit BugBot documentation.

Keywords: stalled
Group: dom-core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: