Closed Bug 1440937 Opened 7 years ago Closed 3 years ago

Crash in mozilla::gfx::DrawTargetCaptureImpl::ReplayToDrawTarget

Categories

(Core :: Graphics: Layers, defect, P1)

58 Branch
All
Windows
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- wontfix
firefox-esr91 --- unaffected
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox69 --- wontfix

People

(Reporter: philipp, Assigned: rhunt)

References

Details

(5 keywords)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is report bp-b3d723e3-8b08-43a7-b5a7-f077f0180224. ============================================================= Top 10 frames of crashing thread: 0 xul.dll mozilla::gfx::DrawTargetCaptureImpl::ReplayToDrawTarget gfx/2d/DrawTargetCapture.cpp:347 1 xul.dll mozilla::gfx::DrawTarget::DrawCapturedDT gfx/2d/DrawTarget.cpp:187 2 xul.dll mozilla::layers::PaintThread::AsyncPaintContents gfx/layers/PaintThread.cpp:333 3 xul.dll mozilla::detail::RunnableFunction<<lambda_a689607eeabb0cdd3694b3306746dc18> >::Run xpcom/threads/nsThreadUtils.h:529 4 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1040 5 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:517 6 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:364 7 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:319 8 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:299 9 xul.dll nsThread::ThreadFunc xpcom/threads/nsThread.cpp:423 ============================================================= content crashes with this signature start appearing in firefox 58 and higher - with rather low volume though.
The crashes I checked point at a line changed in 58 as part of bug 1395478. matt: dvander is not around, please find someone to investigate this. (sorry to pick on you but you reviewed some of the patches) Some of the crashes are EXCEPTION_ACCESS_VIOLATION_EXEC so this might warrant upgrading to sec-critical.
Assignee: nobody → matt.woodrow
Group: core-security → gfx-core-security
Has STR: --- → no
Flags: needinfo?(matt.woodrow)
Priority: -- → P1
Flags: needinfo?(matt.woodrow) → needinfo?(rhunt)
Ryan any thoughts here?
This seems like the backing store of the DrawTargetCapture is being mutated while we replay it on the paint thread. This should only be possible if the capture that we send to the paint thread is being used on the main thread somehow. But this really shouldn't be happening. The DrawTarget that we use for painting should never be retained by any of the painting code. Before OMTP we had an assert for this for a long time [1]. Admittedly it was debug only, so maybe we weren't hitting it in release? But that seems unlikely to me. I've wrote a patch that adds an equivalent assertion to make sure we always transfer sole ownership of the DrawTargetCapture to the paint thread. I'm not confident this will fix the crashes, but it will eliminate this possibility. [1] https://searchfox.org/mozilla-central/rev/a23c3959b62cd3f616435e02810988ef5bac9031/gfx/layers/client/TextureClient.cpp#569
Flags: needinfo?(rhunt)
Attached patch move-capture.patch — — Splinter Review
Attachment #9008741 - Flags: review?(matt.woodrow)
If there truly is racy mutation of the DrawTargetCapture going on, I would expect it to be the cause of bug 1414389 as well. Unfortunately attempts to add an assertion to detect that in that bug didn't yield anything useful.
Attachment #9008741 - Flags: review?(matt.woodrow) → review+
Comment on attachment 9008741 [details] [diff] [review] move-capture.patch [Security approval request comment] How easily could an exploit be constructed based on the patch? Not easily, it adds a diagnostic assertion to prove ownership and changes some pointers to be unique. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? No Which older supported branches are affected by this flaw? From the tracking flags I see esr60 If not all supported branches, which bug introduced the flaw? Enabling of OMTP Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? Just a diagnostic for the current branch How likely is this patch to cause regressions; how much testing does it need? Unlikely, it passes try so it should be stable enough for nightly.
Attachment #9008741 - Flags: sec-approval?
Comment on attachment 9008741 [details] [diff] [review] move-capture.patch sec-approval+ for trunk.
Attachment #9008741 - Flags: sec-approval? → sec-approval+
Group: gfx-core-security → core-security-release
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Apologies, this should have been leave-open. I don't expect this patch to resolve the issue, just narrow down the possibilities.
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Ryan, what needs to happen next?
Flags: needinfo?(rhunt)
I haven't seen any crashes beyond 62.02 [1]. I don't know if that's because it's such low frequency or if something has changed. I landed an assertion in this bug that should be hit before this crash happens, and I haven't seen any reports with it. So I'd lean towards this being resolved, but I've just been waiting for more time to pass. [1] https://crash-stats.mozilla.com/signature/?product=Firefox&signature=mozilla%3A%3Agfx%3A%3ADrawTargetCaptureImpl%3A%3AReplayToDrawTarget&date=%3E%3D2018-11-02T11%3A39%3A37.000Z&date=%3C2018-11-09T10%3A39%3A37.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#reports
Flags: needinfo?(rhunt)
i think the signature might have shifted slightly.
Crash Signature: [@ mozilla::gfx::DrawTargetCaptureImpl::ReplayToDrawTarget] → [@ mozilla::gfx::DrawTargetCaptureImpl::ReplayToDrawTarget] [@ mozilla::gfx::DrawTarget::DrawCapturedDT ]
Adding Jessie (new graphics engineering manager) to all sec-crit and sec-high graphics bugs
Ryan, has enough time passed for us to know if this is resolved or not?
(In reply to [:philipp] from comment #12) > i think the signature might have shifted slightly. Ah, you're right. It looks like this is crash is still occurring, and the switch from MSVC to clang has changed the signature [1]. The signature is now in DrawCapturedDT but it looks like the crash is still in ReplayToDrawTarget. I'm assuming the function is inlined and clang doesn't give us any debuginfo here. I don't really have any new ideas idea from looking at these crash reports. The only theory I've had is that the storage for the capture list was being mutated, causing the iterators in ReplayToDrawTarget to become invalidated. This should be impossible though because we release assert that the DrawTargetCapture we send to the paint thread has only one reference [2], and the paint thread doesn't mutate the capture list ever. Another theory I've had is that there's some UB/logic error in CaptureCommandList causing our iteration or allocation to go awry. I can't rule this out because it's a C++ custom allocator, but nothing has stood out to me. Additionally, if that was the case I'd expect the crash rate to be much more frequent and reproducible. So the puzzle pieces I have here are: 1. DrawTargetCapture::ReplayToDrawTarget 2. Direct2D and Skia 3. Bad read access 4. Near null or wild ptr 5. Crash rate and code seems like race condition Anything obvious I'm missing would be appreciated. [1] https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Agfx%3A%3ADrawTarget%3A%3ADrawCapturedDT&date=%3E%3D2018-08-26T10%3A32%3A55.000Z&date=%3C2018-11-26T08%3A32%3A55.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-version&_sort=-date&page=1 [2] https://searchfox.org/mozilla-central/rev/d35199ef15ffa57d190886e20d5df6471faf0870/gfx/layers/PaintThread.cpp#209
See Also: → 1414389
I got access to some minidumps of crashes and have been taking closer looks. The change to clang lost us inlining information for the stacks of these crashes but I was able to confirm the crashes are still in ReplayToDrawTarget from the minidumps. Specifically we're crashing when dereferencing the pointer to the vtable of the DrawCommand. Occasionally we successfully access the vtable, and then crash with an execution violation at a garbage address. These DrawCommands are stored on the capture command list, so this indicates that memory is still accessible but corrupted somehow. It could be a UAF from a race in the OMTP code, or I think it could also be from some other code. The command lists can get quite large (up to 25MiB) so it's a large heap target. The minidumps from bug 1414389 are similar. They're crashing when dereferencing the pointer to the vtable of a Pattern object that is stored in the command list. As a measure to detect corruption and reduce chance of potential exploits, Matt Woodrow recommended adding some redundancy to the command list. I'm thinking we could lower the offset between commands to 16bits, then use the other 16bits to store a duplicate of the offset. While iterating we can validate the offsets have not been changed. We could also negate the redundant encoding to prevent corruption that sprays the same values from being undetected. Another mitigation would be to harden the command list, by changing all commands to be POD and validate references to resources. We could then assert when invalid command lists are encountered or even recover from them.
Attached patch redundant-command-list.patch — — Splinter Review
Attachment #9028655 - Flags: review?(matt.woodrow)
Assignee: matt.woodrow → rhunt
Attachment #9028655 - Flags: review?(matt.woodrow) → review+
Is the leave-open keyword still needed?
(In reply to Julien Cristau [:jcristau] from comment #18) > Is the leave-open keyword still needed? Yes, the patches that have landed here didn't resolve the issue, but did narrow down the potential causes.
Comment on attachment 9028655 [details] [diff] [review] redundant-command-list.patch [Security Approval Request] How easily could an exploit be constructed based on the patch?: Not easily. This patch adds hardening against corruption in the OMTP command buffer. We don't know exactly how corruption might be happening, so this patch is to just detect if it's happening. Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No Which older supported branches are affected by this flaw?: Unsure of the flaw, so potentially all If not all supported branches, which bug introduced the flaw?: None Do you have backports for the affected branches?: No If not, how different, hard to create, and risky will they be?: Not hard, but also not necessary right now. I just want to get more information here. How likely is this patch to cause regressions; how much testing does it need?: This code is tricky, but any logic error in it should be easily detected in automated tests (crashing) and they are green.
Attachment #9028655 - Flags: sec-approval?
Comment on attachment 9028655 [details] [diff] [review] redundant-command-list.patch sec-approval+ for checking this into mozilla-central.
Attachment #9028655 - Flags: sec-approval? → sec-approval+
I don't think it can be. The patch is another diagnostic to try and figure out what's going on with these crashes. I'll ni? myself to be sure I reevaluate this promptly.
Flags: needinfo?(rhunt)
Flags: needinfo?(rhunt)

Any chance you can look at this again for 66 or even 67?

The diagnostic assert is getting triggered, which indicates that we are experiencing memory corruption of the capture command list after writing it.

We write an 'offset' value in the stream and the bit-flipped version immediately after. While replaying the command stream, we validate the offset is followed by a bit-flipped version.

Here's a search of crashes in the last three months in 66 of this assert happening inside of DrawTargetCaptureImpl::ReplayToDrawTarget [1]. I see 43 crashes. There are 30 other crashes in 66 in this function that aren't because of this assert [2].

My take is that this means we're experiencing some ambient memory corruption that is corrupting our command stream, causing these crashes. These streams can become fairly large and are contiguous, maybe making them probabilistic-ally likely to be hit?

I'm not sure of a mitigation for this, besides just leaving this assert to prevent security sensitive crashes from happening. We might be able to get even less crashes by hardening the command stream further.

[1] https://crash-stats.mozilla.com/search/?moz_crash_reason=~MOZ_RELEASE_ASSERT%28advance%20%3D%3D%20redundant%29&signature=~mozilla%3A%3Agfx%3A%3ADrawTargetCaptureImpl%3A%3AReplayToDrawTarget&version=66&version=66.0&version=66.0b10&version=66.0b9&version=66.0b8&version=66.0b7&version=66.0b6&version=66.0b5&version=66.0b4&version=66.0b3&version=66.0b2&version=66.0b1&version=66.0b0&version=66.0a1&product=Firefox&date=%3E%3D2018-11-25T18%3A26%3A00.000Z&date=%3C2019-02-25T18%3A26%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature
[2] https://crash-stats.mozilla.com/search/?moz_crash_reason=%21~MOZ_RELEASE_ASSERT%28advance%20%3D%3D%20redundant%29&signature=~mozilla%3A%3Agfx%3A%3ADrawTargetCaptureImpl%3A%3AReplayToDrawTarget&version=66&version=66.0&version=66.0b10&version=66.0b9&version=66.0b8&version=66.0b7&version=66.0b6&version=66.0b5&version=66.0b4&version=66.0b3&version=66.0b2&version=66.0b1&version=66.0b0&version=66.0a1&product=Firefox&date=%3E%3D2018-11-25T18%3A26%3A00.000Z&date=%3C2019-02-25T18%3A26%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Flags: needinfo?(rhunt)

Is there further action to be taken here, Ryan?

Flags: needinfo?(rhunt)
Keywords: stalled

I'm at a bit of a loss here of how to proceed, so I would say this is stalled.

Flags: needinfo?(rhunt)
See Also: → 1560182

The leave-open keyword is there and there is no activity for 6 months.
:rhunt, maybe it's time to close this bug?

Flags: needinfo?(rhunt)
Flags: needinfo?(rhunt)

Removing employee no longer with company from CC list of private bugs.

AFAICT, this code no longer exists. Can we close this out?

Flags: needinfo?(bhood)

Happily. :)

Status: REOPENED → RESOLVED
Closed: 6 years ago3 years ago
Flags: needinfo?(bhood)
Resolution: --- → FIXED
Resolution: FIXED → INCOMPLETE
Target Milestone: mozilla64 → ---

Since the bug is closed, the stalled keyword is now meaningless.
For more information, please visit auto_nag documentation.

Keywords: stalled
Resolution: INCOMPLETE → WORKSFORME
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: