Closed Bug 1176363 Opened 9 years ago Closed 9 years ago

crash in SkPixelRef::lockPixels()

Categories

(Core :: WebRTC, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
mozilla42
Tracking Status
firefox41 --- wontfix
firefox42 --- fixed

People

(Reporter: potch, Assigned: pehrsons)

References

Details

(Keywords: crash)

Crash Data

Attachments

(3 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-18a57c8e-1d15-4ac1-9a36-eb10c2150617.
=============================================================

This usually happens a few minutes after calling canvas.captureStream().

Here's the code I used: http://codepen.io/anon/pen/Ejvqbw
Is this a GFX bug?  (Or should someone else answer the question?)
backlog: --- → webRTC+
Rank: 10
Flags: needinfo?(jgilbert)
Priority: -- → P1
George is the skia person I believe.
Flags: needinfo?(jgilbert) → needinfo?(gwright)
This is a thread safety issue with the way captureStream is implemented.

Moz2D isn't really threadsafe, and taking a snapshot (via CanvasRenderingContext2D::GetSurfaceSnapshot) returns a copy-on-write lightweight snapshot that refers back to the original DrawTarget (each DrawTarget implements this separately).

This mechanism isn't threadsafe, so have the snapshot accessed on the ImageBridge thread (the crashing thread) is racing with access to the DrawTarget for canvas drawing on the main thread.
Aha.  Thanks Matt.  Is there some way to add a thread-safety assertion to the snapshots?  And is there a workaround to resolve the problem without causing a threading issue?  

Related question: how does the impact of such workarounds affect performance?
Flags: needinfo?(matt.woodrow)
I'll also note I tried it on my Mac and got no crashes in 10 minutes.  Andreas, did we make any chances in how we grab frames since this was reported?
Flags: needinfo?(pehrsons)
By the looks of things we actually use runnables to avoid releasing off the main thread and hitting those assertions. Threading in Moz2D is something we need to look into further.

The best fix is to change the implementation of TimerDriver::TakeSnapshot() to allocate a new SourceSurfaceRawData and copy the pixels from the canvas snapshot into that. That way we guarantee that we have ownership of the data we're using.

This will likely perform worse as its making an extra copy, but there isn't much choice right now.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(gwright)
Ah, I wish I had known this earlier. As Randell mentioned, some thread safety assertions would have been very nice.

I'll have a patch up real soon. It can hopefully fix some of the other issues we've been seeing with captureStream as well.

(In reply to Randell Jesup [:jesup] from comment #5)
> I'll also note I tried it on my Mac and got no crashes in 10 minutes. 
> Andreas, did we make any chances in how we grab frames since this was
> reported?

No changes, no. It's probably just rare.
Flags: needinfo?(pehrsons)
Bug 1176363 - Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow
Attachment #8629783 - Flags: review?(matt.woodrow)
See Also: → 1165928
Comment on attachment 8629783 [details]
MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow

https://reviewboard.mozilla.org/r/12645/#review11133

Ship It!
Attachment #8629783 - Flags: review?(matt.woodrow) → review+
Assignee: nobody → pehrsons
Status: NEW → ASSIGNED
Keywords: checkin-needed
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/80da8a74fad2 for frequent Win8 mochitest-gl and less frequent mochitest-1 assertion failures like https://treeherder.mozilla.org/logviewer.html#?job_id=11416725&repo=mozilla-inbound or the ones I retriggered up for you on your try push.
OK, I figured out what is racing now.

The stack trace in most failures (like comment 12) is:
> 04:44:35     INFO -  Thread 0 (crashed) <- main thread
> 04:44:35     INFO -   0  xul.dll!mozilla::gfx::DataSourceSurface::Unmap() [2D.h:f90e28d039c1 : 475 + 0x53]
> 04:44:35     INFO -   1  xul.dll!mozilla::gfx::DrawTargetD2D1::OptimizeSourceSurface(mozilla::gfx::SourceSurface *) [DrawTargetD2D1.cpp:f90e28d039c1 : 1584 + 0x1d]
> 04:44:35     INFO -   2  xul.dll!nsLayoutUtils::SurfaceFromElement(mozilla::dom::HTMLVideoElement *,unsigned int,mozilla::gfx::DrawTarget *) [nsLayoutUtils.cpp:f90e28d039c1 : 6918 + 0x10]
> 04:44:35     INFO -   3  xul.dll!nsLayoutUtils::SurfaceFromElement(mozilla::dom::Element *,unsigned int,mozilla::gfx::DrawTarget *) [nsLayoutUtils.cpp:f90e28d039c1 : 6947 + 0x7]
> 04:44:35     INFO -   4  xul.dll!mozilla::dom::CanvasRenderingContext2D::DrawImage(mozilla::dom::HTMLImageElementOrHTMLCanvasElementOrHTMLVideoElement const &,double,double,double,double,double,double,double,double,unsigned char,mozilla::ErrorResult &) [CanvasRenderingContext2D.cpp:f90e28d039c1 : 4379 + 0x16]
> 04:44:35     INFO -   5  xul.dll!mozilla::dom::CanvasRenderingContext2DBinding::drawImage [CanvasRenderingContext2DBinding.cpp:f90e28d039c1 : 4126 + 0x9]
> ...

I was debugging on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4f063845c0b

One of the M-gl tests failed, but due to timing changes gave me the other stack trace, http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/pehrsons@gmail.com-b4f063845c0b/try-win64-debug/try_win8_64-debug_test-mochitest-gl-bm119-tests1-windows-build1085.txt.gz:
> 05:59:39     INFO -  Thread 34 (crashed)
> 05:59:39     INFO -   0  xul.dll!mozilla::gfx::DataSourceSurface::Unmap() [2D.h:b4f063845c0b : 477 + 0x53]
> 05:59:39     INFO -   1  xul.dll!mozilla::gfx::CreatePartialBitmapForSurface [HelpersD2D.h:b4f063845c0b : 620 + 0x13]
> 05:59:39     INFO -   2  xul.dll!mozilla::gfx::DrawTargetD2D1::GetImageForSurface(mozilla::gfx::SourceSurface *,mozilla::gfx::Matrix &,mozilla::gfx::ExtendMode,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const *) [DrawTargetD2D1.cpp:b4f063845c0b : 1552 + 0x3d]
> 05:59:39     INFO -   3  xul.dll!mozilla::gfx::DrawTargetD2D1::CopySurface(mozilla::gfx::SourceSurface *,mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> const &,mozilla::gfx::IntPointTyped<mozilla::gfx::UnknownUnits> const &) [DrawTargetD2D1.cpp:b4f063845c0b : 329 + 0x2d]
> 05:59:39     INFO -   4  xul.dll!mozilla::layers::CairoImage::GetTextureClient(mozilla::layers::CompositableClient *) [ImageContainer.cpp:b4f063845c0b : 654 + 0x3b]
> 05:59:39     INFO -   5  xul.dll!mozilla::layers::ImageClientSingle::UpdateImage(mozilla::layers::ImageContainer *,unsigned int) [ImageClient.cpp:b4f063845c0b : 154 + 0x8]
> 05:59:39     INFO -   6  xul.dll!mozilla::layers::UpdateImageClientNow [ImageBridgeChild.cpp:b4f063845c0b : 408 + 0x11]
> 05:59:39     INFO -   7  xul.dll!RunnableFunction<void (*)(mozilla::layers::ImageClient *,mozilla::layers::ImageContainer *),Tuple2<mozilla::layers::ImageClient *,nsRefPtr<mozilla::layers::ImageContainer> > >::Run() [task.h:b4f063845c0b : 420 + 0x9]
> ...


So it's a race between
* drawing the video element (with our now self-owned raw datasourcesurface) to a 2d canvas on the main thread,
* and compositing the video element with the same raw surface on the image client bridge thread (?)

So it basically boils down to an existing issue on Windows that I just happen to exercise a tad bit more than existing tests.

Matt, any ideas for fixes?
Flags: needinfo?(matt.woodrow)
See Also: → 1166432
Flags: needinfo?(matt.woodrow)
Attachment #8633709 - Flags: review?(bas)
Attachment #8633769 - Flags: review?(matt.woodrow)
Attachment #8633709 - Flags: review?(bas) → review+
Comment on attachment 8633769 [details] [diff] [review]
Stop using DrawTargets off the main thread

Review of attachment 8633769 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/d3d11/TextureD3D11.cpp
@@ +491,5 @@
>    }
>  
>    gfxWindowsPlatform* windowsPlatform = gfxWindowsPlatform::GetPlatform();
> +  ID3D11Device* d3d11device = windowsPlatform->GetD3D11DeviceForCurrentThread();
> +  bool haveD3d11Backend = windowsPlatform->GetContentBackend() == BackendType::DIRECT2D1_1 || NS_IsMainThread();

Comment about why using d3d11 is ok (even with d2d 1.1) off the main thread.

@@ +502,5 @@
>                                    D3D11_BIND_RENDER_TARGET | D3D11_BIND_SHADER_RESOURCE);
>  
>      newDesc.MiscFlags = D3D11_RESOURCE_MISC_SHARED;
>  
> +    if (!NS_IsMainThread()) {

Comment here about sync object handling this for us on the main thread.

::: gfx/layers/d3d11/TextureD3D11.h
@@ +63,5 @@
>    virtual bool CanExposeDrawTarget() const override { return true; }
>  
>    virtual gfx::DrawTarget* BorrowDrawTarget() override;
>  
> +  virtual void UpdateFromSurface(gfx::DataSourceSurface* aSurface) override;

Comment about how we use this for off-main-thread updates, and that BorrowDrawTarget() is not safe for this.
Attachment #8633769 - Flags: review?(matt.woodrow) → review+
This is increasingly a Windows bug.
OS: Mac OS X → Windows
Andreas, feel free to land my patch if you want to land your stuff. We can just leave the bug open until Bas lands with the review comments fixed.
Thanks for the work here guys. I'm running our patches on try now Matt: https://treeherder.mozilla.org/#/jobs?repo=try&revision=83fd58f54a8f
Comment on attachment 8629783 [details]
MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow

Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow
Attachment #8629783 - Attachment description: MozReview Request: Bug 1176363 - Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow → MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow
Attachment #8629783 - Flags: review+
Comment on attachment 8629783 [details]
MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow

Rebased and added part number. Carrying r=mattwoodrow.
Attachment #8629783 - Flags: review+
Rebased Matt's patch and fixed a typo (Atomic.h -> Atomics.h). Carrying forward r=bas.
Attachment #8633709 - Attachment is obsolete: true
Attachment #8634648 - Flags: review+
Martin, see linux M-e10s-2 on [1], `application crashed [@ mozilla::TransportLayerShutdown]` is going crazy now. Maybe some timing changed with these patches but they seem unrelated to the crash. Is this on your radar?

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=83fd58f54a8f
Flags: needinfo?(martin.thomson)
I think that I want your patches if they are that effective.I've landed a fix for the crash, but not the underlying problem, which I've been trying to reproduce unsuccessfully for most of this week.
Flags: needinfo?(martin.thomson)
Please land part 1 from reviewboard and part 2 from bugzilla.

Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=83fd58f54a8f

The failing test was since disabled, so we should be good here.

Leaving open for bas to land his patch when it's ready.

Thanks!
Depends on: 1185011
Update with a bunch of fixes and made to pass try.
Attachment #8639136 - Flags: review?(matt.woodrow)
Comment on attachment 8639136 [details] [diff] [review]
Part 1: Stop using DrawTargets off the main thread v2

Review of attachment 8639136 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ImageContainer.cpp
@@ +649,5 @@
> +  RefPtr<DataSourceSurface> dataSurf = surface->GetDataSurface();
> +  textureClient->UpdateFromSurface(dataSurf);
> +
> +  if (NS_IsMainThread()) {
> +    textureClient->SyncWithObject(forwarder->GetSyncObject());

This is a bit weird because the TextureClient implicitly decides whether to use the sync object (or keyedmutex), but we're required to explicitly tell it to sync (sometimes).

It'd be a bit clearer if these two things matched, or if we had a textureClient->NeedsSync() method to check (rather than NS_IsMainThread which isn't very obvious).

Anyway, a comment about why we only do this for the main thread would be a great start.
Attachment #8639136 - Flags: review?(matt.woodrow) → review+
https://hg.mozilla.org/mozilla-central/rev/efb3e9a0b0b8
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Comment on attachment 8629783 [details]
MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow

Approval Request Comment
[Feature/regressing bug #]: Canvas CaptureStream
[User impact if declined]: A live 30FPS canvas captureStream causes Firefox to crash after a couple of minutes. Depends a bit on the platform (timing sensitive) but has shown up frequently in automation, and been reported by users. Canvas captureStreams are protected by a pref and are not available by default.
[Describe test coverage new/current, TreeHerder]: Good coverage in automation that shows the issue as largely fixed. (occasional crashes have occurred, perhaps once a month, but not several a day like before this fix)
[Risks and why]: low. This patch only affects code behind a pref. Accompanying patches (part 2 on this bug and bug 1185011) are in very central code and thus well tested in automation.
[String/UUID change made/needed]: none
Attachment #8629783 - Flags: approval-mozilla-aurora?
Comment on attachment 8634648 [details] [diff] [review]
Part 2. Allow mapping of SourceSurfaceRawData from multiple threads

Approval Request Comment
See comment 35.
Attachment #8634648 - Flags: approval-mozilla-aurora?
Matt, I was unable to find this crash in FF41 in the past 28 days. Do you see it? I am just wondering whether this change can just ride the train instead of getting uplifted to Aurora. Thoughts?
Flags: needinfo?(matt.woodrow)
I was never able to reproduce this, so unsure.
Flags: needinfo?(matt.woodrow)
The test failures in the dependent bugs definitely affect Aurora41.
Comment on attachment 8629783 [details]
MozReview Request: Bug 1176363 - Part 1. Make a raw copy of each Canvas::CaptureStream frame. r=mattwoodrow

Seems safe to uplift to Aurora as 1) this has been in m-c for a week and 2) by default CanvasStream is pref'd off.
Attachment #8629783 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #8634648 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
I was able to uplift Andreas' patches, but Bas' patch needs rebasing.

https://hg.mozilla.org/releases/mozilla-aurora/rev/ef164895b708
https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1c96567f25
Flags: needinfo?(bas)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #41)
> I was able to uplift Andreas' patches, but Bas' patch needs rebasing.
> 
> https://hg.mozilla.org/releases/mozilla-aurora/rev/ef164895b708
> https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1c96567f25

I only requested aurora approval for those because I don't think Bas' patch is necessary. As I take it it just asserts that no-one is using DrawTargets in the wrong way and thus is mainly needed for new offenders - existing issues would probably pop up as intermittent oranges anyway.

I'll let Bas chip in as well.
The crashes didn't go away for good until Bas' patch landed.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #43)
> The crashes didn't go away for good until Bas' patch landed.

Ah, got it. That's why there were some stray oranges for a few days..
See Also: → 1168762
See Also: → 1179551
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #42)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #41)
> > I was able to uplift Andreas' patches, but Bas' patch needs rebasing.
> > 
> > https://hg.mozilla.org/releases/mozilla-aurora/rev/ef164895b708
> > https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1c96567f25
> 
> I only requested aurora approval for those because I don't think Bas' patch
> is necessary. As I take it it just asserts that no-one is using DrawTargets
> in the wrong way and thus is mainly needed for new offenders - existing
> issues would probably pop up as intermittent oranges anyway.
> 
> I'll let Bas chip in as well.

No, my patch fixes the actual problem :-). Matt's patches just make the crash significantly less likely to occur. Having said that, my patches are significantly more risky. They've had some testing but they at least need bug 1190950 uplifted as well. I'll leave it up to release management to determine if we want to do that uplift.
Flags: needinfo?(rkothari)
Flags: needinfo?(pehrsons)
Flags: needinfo?(bas)
Attachment #8633769 - Attachment is obsolete: true
(In reply to Bas Schouten (:bas.schouten) from comment #45)
> No, my patch fixes the actual problem :-).

Right, not sure why I thought they were just asserting...
Flags: needinfo?(pehrsons)
Bas, can you please nominate the patch(s) from bug 1190950 for uplift to Beta? Without that tracking flag "approval-mozilla-aurora/beta" set to "?" I cannot do much.
Flags: needinfo?(rkothari) → needinfo?(bas)
Done.
Flags: needinfo?(bas)
Depends on: 1190950
Bas, were you still going to look at this rebase for Beta41?
Flags: needinfo?(bas)
Hi Bas and Matt,

I have a question about the main thread assertion in TextureClient::BorrowDrawTarget()[1].

If we have off-main-thread painting, we still need to call the BorrowDrawTarget() at painting thread and hit the assertion.

With [2], if we borrow the drawTarget and get the snapshot at main thread, we still have the data race problem when we access the snapshot at another thread.

Any suggestion for calling borrowTarget() at painting thread?

[1]
https://hg.mozilla.org/mozilla-central/annotate/a6786bf8d71d4cf40c3d40e06d8e3c9866863475/gfx/layers/client/TextureClient.cpp#l836

[2]
https://bugzilla.mozilla.org/attachment.cgi?id=8639136
Flags: needinfo?(matt.woodrow)
Jerry, we should move this to a new bug, since fixing it might take a bit of work.

BorrowDrawTarget asserts that it's only called on the main thread because Moz2D isn't really threadsafe and some backends have issues with using it across threads (especially d2d).

We're only going to be using a DrawTarget (and snapshots of it) from a single thread, so this should be solvable fairly easy. D2D in particular will need to be setup with a separate device for each thread that it is used on.
Flags: needinfo?(matt.woodrow)
Flags: needinfo?(bas)
No longer depends on: slow-canvas-to-webgl
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: