Closed Bug 1411401 Opened 7 years ago Closed 6 years ago

MinGW build does not run without `gfx.direct2d.disabled` and `layers.acceleration.disabled` due to weird MinGW stuff

Categories

(Core :: Graphics, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox57 --- wontfix
firefox58 --- affected

People

(Reporter: tjr, Assigned: tjr)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tor][gfx-noted])

Attachments

(1 file)

I don't know if I'll be able to summarize this well, but the gist of it is that there's a calling convention mismatch between the MS ABI and MinGW.

Background: 
https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/?style=threaded&viewmonth=201412&viewday=23 <- This thread is the best resource
https://lists.torproject.org/pipermail/tor-talk/2013-June/028564.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64384

One culprit we know causing this problem is ID2D1RenderTarget::GetSize. Because there are only a few uses of this in the codebase, we're going to add a wrapper function that can normalize things for MinGW.
Priority: -- → P3
Whiteboard: [tor] → [tor][gfx-noted]
There is a solution for that in Wine widl:
https://source.winehq.org/git/wine.git/commitdiff/b42a15513eaa973b40ab967014b311af64acbb98
It's not perfect as discussed here:
https://www.winehq.org/pipermail/wine-devel/2017-July/118470.html
but it should do the trick. mingw-w64 d2d headers are not generated from .idl files, but we could have something similar there.

As a side note, that's one of those bugs what would be irrelevant in clang with mingw-w64.
If I understand correctly, our options are:

1: write and carry a simple-ish local patch for MinGW. That patch will add the object pointer as a method argument. Then we patch our cairo code to work with that new function definition when compiling for MinGW.

2: develop and carry a complicated patch for gcc

3: patch our cairo code to do something crazy

I went for the third option (attached). Browser still crashes, because I'm not sure I did it correctly or if there's another crasher after the GetSize one; but if I understand the problem correctly, something similar to this might work, without the need to patch MinGW or gcc?
Actually, I think this could be fixed in upstream mingw-w64, without changes to cairo nor local patches. The solution would be like something that Wine does, here is example of Wine widl generated GetSize:

#ifdef WIDL_EXPLICIT_AGGREGATE_RETURNS
    virtual D2D1_SIZE_F* STDMETHODCALLTYPE GetSize(
        D2D1_SIZE_F *__ret) = 0;
    D2D1_SIZE_F STDMETHODCALLTYPE GetSize(
        )
    {
        D2D1_SIZE_F __ret;
        return *GetSize(&__ret);
    }
#else
    virtual D2D1_SIZE_F STDMETHODCALLTYPE GetSize(
        ) = 0;
#endif

We could do the same for mingw-w64. Inline wrapper hides ABI incompatibility and avoids requirement to change callers.

(This solution will not work for code that implements those interfaces by inheriting interface by a class, but there is probably no such case in cairo anyway).
Does taking the address of this virtual function with "&" do the right thing? May want to dive into the disassembly. I'm worried we might have to do something gross like going into the object's vtable by hand.

Although if Jacek has a cleaner way then all the better. :-)
(In reply to Jacek Caban from comment #4)
> We could do the same for mingw-w64. Inline wrapper hides ABI incompatibility
> and avoids requirement to change callers.

Gave it a shot here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7287b5aa27b5cac6496f71059341805baa766cac&selectedJob=139625965

I'm going to need to work with dmajor to learn how to debug this though (as it still crashes).
There are more similar functions that need fixup. Grepping thought header from Wine, I can see that those may be needed (I didn't check which ones are used in cairo):

ID2D1Bitmap::GetSize
ID2D1Bitmap::GetPixelSize
ID2D1Bitmap::GetPixelFormat
ID2D1SolidColorBrush::GetColor
ID2D1LinearGradientBrush::GetStartPoint
ID2D1LinearGradientBrush::GetEndPoint
ID2D1RadialGradientBrush::GetCenter
ID2D1RadialGradientBrush::GetGradientOriginOffset
ID2D1Layer::GetSize
ID2D1RenderTarget::GetPixelFormat
ID2D1RenderTarget::GetSize
ID2D1RenderTarget::GetPixelSize

I hope that helps.
(In reply to Tom Ritter [:tjr] from comment #8)
> New try run:

Does it work?
The new build seems to be asserting here: https://dxr.mozilla.org/mozilla-central/source/js/src/jsfun.h#142

I guess that's progress?
I'm not sure if it matters for these builds, but just in case, could you add this patch to your stack until Microsoft updates their debugger? It will help in case we ever need to record these builds with Time Travel Debugging. https://hg.mozilla.org/try/rev/b1ceb5a9a7f1071f9f710f9e9ba5bae14e696fa2
Hm. I tried debugging in gdb, but hit a lot of walls: https://pastebin.mozilla.org/9073549 I'm asking in #mingw-w64 for help.

I'm not sure if using cygwin's gdb is doomed to failure; not sure about the architecture mismatches and if those are an issue. I'm not sure why gdb doesn't think it was compiled with debugging information. And I'm not sure about https://searchfox.org/mozilla-central/source/xpcom/build/PoisonIOInterposerBase.cpp#226 and why this is causes asserts when running under gdb :)
> I did a new push for the sake of freshness:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=97c14f720b24b14970f71be1c76570d027c5f566

Aaaand, of course it fails in some new and exciting way. I don't see any instructions nearby in memory that would help me place it. If only I had some .pdb's :-(
Attachment #8921747 - Attachment is obsolete: true
Blocks: 1330608
Priority: P3 → P1
I spent some more time on this today.

My latest build is https://treeherder.mozilla.org/#/jobs?repo=try&revision=d87c79b35f208c0cb9e77c98152896aa9fe355fc
It does not contain the sandbox which is now enabled.

I can't debug it in gdb on Windows, I'm hitting https://searchfox.org/mozilla-central/source/xpcom/build/PoisonIOInterposerBase.cpp#226.

I can actually get the browser to get pretty far along, although it will pop a dialog box saying instruction at XXX cannot reference memory at YYY (and then continue). It will then crash a little while later at this seemingly inexplicable assembly code:
> 052D8C80  mov         eax,0  
> 052D8C85  mov         dword ptr [eax],69h

I think Tor has some experience debugging on Windows with gdb - do either of you recognize this PoisonIOInterposerBase thing?
Flags: needinfo?(richard)
Flags: needinfo?(gk)
(In reply to Tom Ritter [:tjr] from comment #15)
> It will then crash a little while later at this seemingly
> inexplicable assembly code:
> > 052D8C80  mov         eax,0  
> > 052D8C85  mov         dword ptr [eax],69h

This looks like MOZ_NoReturn, which is the ending point of MOZ_ASSERT/MOZ_CRASH/etc: https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/mfbt/Assertions.h#214

So it looks like you crashed on line 69h==105 of some file. I don't suppose there's any nearby assembly that might point to a __FILE__ string, or an assert text?
(In reply to David Major [:dmajor] from comment #16)
> (In reply to Tom Ritter [:tjr] from comment #15)
> > It will then crash a little while later at this seemingly
> > inexplicable assembly code:
> > > 052D8C80  mov         eax,0  
> > > 052D8C85  mov         dword ptr [eax],69h
> 
> This looks like MOZ_NoReturn, which is the ending point of
> MOZ_ASSERT/MOZ_CRASH/etc:
> https://searchfox.org/mozilla-central/rev/
> 74b7ffee403c7ffd05b8b476c411cbf11d134eb9/mfbt/Assertions.h#214
> 
> So it looks like you crashed on line 69h==105 of some file. I don't suppose
> there's any nearby assembly that might point to a __FILE__ string, or an
> assert text?


More Assembly: 

> 05178C64  mov         dword ptr [esp+8],69h  
> 05178C6C  mov         dword ptr [esp+4],0A18403Ch  
> 05178C74  mov         dword ptr [esp],0A1840D4h  
> 05178C7B  call        02B43240  
> 05178C80  mov         eax,0  
> 05178C85  mov         dword ptr [eax],69h  

0A18403Ch is the string in this assertion https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/layout/svg/nsSVGIntegrationUtils.cpp#68

0A1840D4h is the string in this assertion which has a line number match! https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/layout/svg/nsSVGIntegrationUtils.cpp#106

I will have to dig in to that...
Great! (FWIW this is my usual method of debugging builds that I don't have symbols for.)
We talked about it at the meeting yesterday. Clearing ni. Tom, if you need anything in addition to that please ping us again.
Flags: needinfo?(richard)
Flags: needinfo?(gk)
Okay this seems to have magically fixed itself, kinda. (For the record, up until now the sandbox had always been disabled, it was only recently we enabled it.)

I stubbed out the Poison Interposer, and that helped the debug build.

================================================================================================
https://treeherder.mozilla.org/#/jobs?repo=try&revision=29e0334d2c02f4bfcf4c2a30cbf61bb7a4b252f8
Sandbox Disabled, Poison Disabled 
------------------------------------------------------------------------------------------------
debug

Sometimes crashes, sometimes doesn't. See these messages either time.
 
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to initialize GPU process (t=9.91905) [GFX1-]: Failed to initialize GPU process
[GFX2-]: Could not acquire an MLGDevice!
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to initialize GPU process (t=9.91905) |[1][GFX1-]: [D3D11] failed to get compositor device. (t=9.98167) [GFX1-]: [D3D11] failed to get compositor device.
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to initialize GPU process (t=9.91905) |[1][GFX1-]: [D3D11] failed to get compositor device. (t=9.98167) |[2][GFX1-]: [D3D11] Failed to init compositor with reason: FEATURE_FAILURE_D3D11_NO_DEVICE (t=9.98167) [GFX1-]: [D3D11] Failed to init compositor with reason: FEATURE_FAILURE_D3D11_NO_DEVICE
 
------------------------------------------------------------------------------------------------
opt - Has not crashed.
 
================================================================================================
https://treeherder.mozilla.org/#/jobs?repo=try&revision=157fa82a6236b31ead51d00d8ea695a3db14b294  
Sandbox Disabled, Poison Enabled 

debug - Crashes on startup with Asserts in PoisonIOInterposerBase
opt - Has not crashed
 
================================================================================================
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dffb8d630f3a3a7324830c0ae33aafd01c330b40
Sandbox Enabled, Poison Disabled - This one is the most interesting
------------------------------------------------------------------------------------------------
debug - Crashes
 
Early on (before I've gone through the profile selection window) I see
[GFX2-]: Failure initializing MLGDeviceD3D11: Could not create copy blend state (80070057)
[GFX2-]: Failure initializing MLGDeviceD3D11: Could not create copy blend state (80070057)
[GFX2-]: Could not acquire an MLGDevice!
 
After the profile selection window I see
[GFX2-]: Failure initializing MLGDeviceD3D11: Could not create copy blend state (80070057)
[GPU 9384, Main Thread] WARNING: Shutting down GPU process early due to a crash!: file /builds/worker/workspace/build/src/gfx/ipc/GPUParent.cpp, line 439
 
Running it in gdb I see (before I select a profile in the profile window):
 
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create remote compositor (t=9.01102) [GFX1-]: Failed to create remote compositor
[GFX2-]: Could not acquire an MLGDevice!
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create remote compositor (t=9.01102) |[1][GFX1-]: [D3D11] failed to get compositor device. (t=9.07364) [GFX1-]: [D3D11] failed to get compositor device.
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create remote compositor (t=9.01102) |[1][GFX1-]: [D3D11] failed to get compositor device. (t=9.07364) |[2][GFX1-]: [D3D11] Failed to init compositor with reason: FEATURE_FAILURE_D3D11_NO_DEVICE (t=9.07364) [GFX1-]: [D3D11] Failed to init compositor with reason: FEATURE_FAILURE_D3D11_NO_DEVICE
Crash Annotation GraphicsCriticalError: |[0][GFX1-]: Failed to create remote compositor (t=9.01102) |[1][GFX1-]: [D3D11] failed to get compositor device. (t=9.07364) |[2][GFX1-]: [D3D11] Failed to init compositor with reason: FEATURE_FAILURE_D3D11_NO_DEVICE (t=9.07364) |[3][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=10.7841) [GFX1-]: Receive IPC close with reason=AbnormalShutdown
 
I also some of
 
[GPU 3028, Main Thread] WARNING: NS_ENSURE_TRUE(InitStaticMembers()) failed: file /builds/worker/workspace/build/src/modules/libpref/Preferences.cpp, line 3929  (and other lines)
 
 
If I set security.sandbox.content.level
 
to 0: I see the same GPU errors, but it seems to run.
 
to 1: It doesn't run. Nothing that jumps out at me.
 
 
------------------------------------------------------------------------------------------------
opt - Definetly a subprocess crash or error happening. No newtab page, just a white content window.
Attempting to debug this doesn't yield a crash I can find (using simple gdb and another try with 'set follow-fork-mode child')
 
================================================================================================
https://treeherder.mozilla.org/#/jobs?repo=try&revision=130bc92976e2a20026c08213cfa45ec054fbc095
Sandbox Enabled Poison Enabled

debug - Crashes on startup with Asserts in PoisonIOInterposerBase
opt - same as no-poison opt
 
 
================================================================================================
Conclusions
 - I should stub out the Poison Interposer in MinGW builds (both debug and opt)
 - Something is definitely going screwy with the Sandbox.


I sent in a new try run here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=52164b634bc2801d32628654e71d938807c7ebd6

It has the time traveling debugging patch, enables and sets the content sandbox to level 1 by default, and stubs out the poison interposer.
Blocks: 1441558
Blocks: 1443838
Depends on: 1443856
Blocks: 1434316
Blocks: 1460668
No longer depends on: 1443856
If I see this correctly then we should get this patch with the mingw-w64 bump to commit ee9fc3d0b8c8868280e38384edd968067b8e71fc. However, even building with that one crashes my Tor Browser when I set both prefs to "false". It seems it is not crashing anymore as early as loading the first window, but still, as soon, as I try to load actual content the browser stops working.
(In reply to Georg Koppen from comment #22)
> If I see this correctly then we should get this patch with the mingw-w64
> bump to commit ee9fc3d0b8c8868280e38384edd968067b8e71fc. However, even
> building with that one crashes my Tor Browser when I set both prefs to
> "false". It seems it is not crashing anymore as early as loading the first
> window, but still, as soon, as I try to load actual content the browser
> stops working.

No; this patch had never been upstreamed into MinGW due to some structural changes that were requested. Upstreaming is possible; I'm hoping Jacek will be able to clean it up for me as part of the mingw-clang work he's going to do.
The commit that Georg mentioned is actually upstreamed version of your patch. It's not enabled by default, it requires explicitly defining WIDL_EXPLICIT_AGGREGATE_RETURNS (eg. -DWIDL_EXPLICIT_AGGREGATE_RETURNS in CXXFLAGS). It would be possible to define that on mingw-w64 side, but it's not clear to me if we want that.

As for clang, it shouldn't be needed. I'd expect clang to do the right thing here without hacks.
Awesome. I'm going to mark this as fixed since it's going to be rolled up into the mozconfigs added in Bug 1434316
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: