Closed Bug 1653902 Opened 5 years ago Closed 5 years ago

Crash in [@ memcpy | mozilla::layers::PlanarYCbCrImage::BuildSurfaceDescriptorBuffer]

Categories

(Core :: Audio/Video: Playback, defect, P3)

All
Windows
defect

Tracking

()

RESOLVED FIXED
89 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- fixed

People

(Reporter: gsvelto, Assigned: mattwoodrow)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug is for crash report bp-e8eb97d1-76cb-4043-9f67-4ad520200717.

Top 10 frames of crashing thread:

0 vcruntime140.dll memcpy f:\dd\vctools\crt\vcruntime\src\string\i386\memcpy.asm:600
1 xul.dll mozilla::layers::PlanarYCbCrImage::BuildSurfaceDescriptorBuffer gfx/layers/ImageContainer.cpp:509
2 xul.dll mozilla::RemoteVideoDecoderParent::ProcessDecodedData dom/media/ipc/RemoteVideoDecoder.cpp:380
3 xul.dll mozilla::RemoteDecoderParent::DecodeNextSample::<unnamed-tag>::operator dom/media/ipc/RemoteDecoderParent.cpp:118
4 xul.dll mozilla::MozPromise<nsTArray<RefPtr<mozilla::MediaData> >, mozilla::MediaResult, 1>::ThenValue<`lambda at /builds/worker/checkouts/gecko/dom/media/ipc/RemoteDecoderParent.cpp:104:7'>::DoResolveOrRejectInternal xpcom/threads/MozPromise.h:837
5 xul.dll mozilla::MozPromise<mozilla::TrackInfo::TrackType, mozilla::MediaResult, 1>::ThenValueBase::ResolveOrRejectRunnable::Run xpcom/threads/MozPromise.h:410
6 xul.dll mozilla::TaskQueue::Runner::Run xpcom/threads/TaskQueue.cpp:158
7 xul.dll nsThreadPool::Run xpcom/threads/nsThreadPool.cpp:299
8 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1234
9 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:332

Not a recent regression since I could find crashes going back to buildid 20200303214128 (for the nightly channel). It seems like the copy here is writing past the end of a page since all the crashes are EXCEPTION_ACCESS_VIOLATION_WRITE to an address that ends with 0x1000. What is odd is that's always 0x1000 and not another multiple of 4KiB.

It looks like these crashes are only ever occurring on Windows. Is it safe to assume we're writing outside our buffer, or could it be that the memcpy happens to cross a page boundary and that causes this exception? I can't see anything that would suggest that behavior in the Microsoft memcpy docs, so that leaves writing outside the buffer unless there's some other way this is happening I haven't been able to think of so far.

That implies either the destination address is wrong or the length (here height * stride) is wrong. That all comes from the RDD side of things, so I'll poke around that code to see if anything looks wrong or could account for this code path only happening on Windows.

Flags: needinfo?(gsvelto)

(In reply to Jon Bauman [:jbauman:] from comment #1)

It looks like these crashes are only ever occurring on Windows. Is it safe to assume we're writing outside our buffer, or could it be that the memcpy happens to cross a page boundary and that causes this exception?

We're definitely writing past the end of a buffer and that's triggering the crash. The only "peculiar" thing about it is that this happens always at a 4KiB page boundary whose address ends with 0x1000.

That implies either the destination address is wrong or the length (here height * stride) is wrong. That all comes from the RDD side of things, so I'll poke around that code to see if anything looks wrong or could account for this code path only happening on Windows.

It might be possible to have a look at the contents of those variables by downloading a raw minidump and opening it in Visual Studio. I'll give it a try tomorrow morning if I have some spare cycles. Leaving the NI? for now.

It might be possible to have a look at the contents of those variables by downloading a raw minidump and opening it in Visual Studio.

I'd appreciate that as I don't have a Windows setup. In the meantime, I'll dig more into the code to trace back the origins of the values of interest to see if I can locate anything unexpected.

Working backwards from PlanarYCbCrImage::BuildSurfaceDescriptorBuffer, we can see that the call originates inside a block where we don't have a valid GPU-accelerated surface descriptor. I'm not sure if that could account for the Windows-specific nature of this issue, but it's worth considering.

In BuildSurfaceDescriptorBuffer, the destination buffer passed to CopyPlane is derived from the aSdBuffer parameter added to the yOffset.

The yOffset is declared locally uninitialized, then passed by reference to ComputeYCbCrOffsets where it is unconditionally set to zero, so I think we can ignore it for the sake of this analysis.

The buffer comes from the aSdBuffer parameter, via the memOrShmem intermediate. Since in ProcessDecodedData, the sdBuffer argument is set from the buffer local which is of type ShmemBuffer, I think it's reasonable to conclude that we should be taking the MemoryOrShmem::TShmem case in BuildSurfaceDescriptorBuffer. This code looks correct, though I'm not familiar with the underlying abstraction such that a mistake would necessarily jump out. It is worth noting that Shmem provides a Size method, so it would be possible to add some checking here and see if things match on the ProcessDecodedData and BuildSurfaceDescriptorBuffer side.

Finally, the lengths passed to CopyPlane come from the local pdata, which is a pointer to the mData field of the PlanarYCbCrImage type. That means it comes from the image receiver of the call to BuildSurfaceDescriptorBuffer. That is derived from calling AsPlanarYCbCrImage on the mImage member of the VideoData local. That implementation is trivial, so if there's something wrong, it's a problem in the mImage member itself. To see where it's initialized, we look back to its initialization from a static cast of an element of the aData parameter. This is the most error-prone part of the code I've seen so far. Perhaps it would be worth making the assertion that data has type VIDEO_DATA into a MOZ_RELEASE_ASSERT. If that fails, we're definitely passing garbage into BuildSurfaceDescriptorBuffer.

To understand where bad data may make its way into ProcessDecodedData, we have to go back a frame to RemoteDecoderParent::DecodeNextSample where the aValue promise parameter is resolved into the array of MediaData elements that ProcessDecodedData loops over.

That's all I have time for at the moment, but so far, I'm not seeing anything particularly obvious.

I pulled down a minidump and analyzed it but it didn't do much good: most of the locals on the stack have been optimized away and the minidump captured only very small parts of the objects involved so I don't have access to the dimensions of the buffers.

Flags: needinfo?(gsvelto)

Seeing as we don't have any way to reproduce this ourselves, and it appears to happen fairly rarely (around 1x/day), I'm not sure what more to do here. I'm open to suggestions for other things to investigate, but currently don't have any promising leads.

Severity: -- → S3
Priority: -- → P3

I took a look at a minidump, and it seems that the Shmem was allocated with a size of 0 (which succeeds, unfortunately).

I think that could happen if PlanarYCbCrImage::AdoptData was called, such that we mostly had valid data in the PlanarYCbCrImage, except for mBufferSize.

It might be better for BuildSurfaceDescriptorBuffer to take an allocator callback, so that it can request the specific destination size that we want (instead of trying to use the source destination size). We're already computing our own dest offsets (using ComputeYCbCrOffsets), so we should probably be using ComputeYCbCrBufferSize with those new values.

Looks like this is currently the #12 topcrash in the RDD process

Assignee: nobody → matt.woodrow
Status: NEW → ASSIGNED
Pushed by mwoodrow@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/8e8a7f94124a Allocate a buffer for BuildSurfaceDescriptorBuffer using the computed destination layout. r=jbauman
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: