Closed Bug 1788592 Opened 3 months ago Closed 3 months ago

Crash in [@ __delayLoadHelper2 | _tailMerge_oleaut32.dll | _tailMerge_d3dcompiler_47.dll | <T>::operator() | __crt_seh_guarded_call<T>::operator()<T>]

Categories

(Core :: Audio/Video: GMP, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
106 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox-esr102 --- unaffected
firefox104 --- unaffected
firefox105 --- unaffected
firefox106 blocking fixed

People

(Reporter: aryx, Assigned: pehrsons)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

Starting with Firefox 106.0a1 20220830210405 when the WebRTC update in bug 1766646 etc. landed, we get these crashes which mention GMP in there stack. See also bug 1788174.

45 crashes from 30 installations, only on Windows 10 and 11.

Crash report: https://crash-stats.mozilla.org/report/index/91e02fb4-1584-4f04-bd40-cfa4d0220901

Reason: unknown 0x3228369022

Top 10 frames of crashing thread:

0 kernelbase.dll RaiseException 
1 xul.dll __delayLoadHelper2 /builds/worker/workspace/obj-build/toolkit/library/build/d:/agent/_work/2/s/src/vctools/delayimp/delayhlp.cpp:312
2 xul.dll _tailMerge_oleaut32.dll 
3 xul.dll _tailMerge_d3dcompiler_47.dll 
4 xul.dll _tailMerge_d3dcompiler_47.dll 
5 ucrtbase.dll <lambda_f03950bc5685219e0bcd2087efbe011e>::operator 
6 ucrtbase.dll __crt_seh_guarded_call<int>::operator<<lambda_7777bce6b2f8c936911f934f8298dc43>, <lambda_f03950bc5685219e0bcd2087efbe011e>&, <lambda_3883c3dff614d5e0c5f61bb1ac94921c> > 
7 ucrtbase.dll execute_onexit_table 
8 xul.dll dllmain_crt_process_detach /builds/worker/workspace/obj-build/toolkit/library/build/d:/agent/_work/2/s/src/vctools/crt/vcstartup/src/startup/dll_dllmain.cpp:106
9 xul.dll dllmain_dispatch /builds/worker/workspace/obj-build/toolkit/library/build/d:/agent/_work/2/s/src/vctools/crt/vcstartup/src/startup/dll_dllmain.cpp:212
Flags: needinfo?(mfroman)

The bug is marked as tracked for firefox106 (nightly). We have limited time to fix this, the soft freeze is in 14 days. However, the bug still isn't assigned.

:jimm, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)

https://hg.mozilla.org/mozilla-central/file/af1fc1e6eb24573a5ebad1754b9d4917e934a5f9/dom/media/gmp/GMPChild.cpp#l609

    case MsgDropped:
      _exit(0);  // Don't trigger a crash report.

🤔 (This is indeed the line where we are crashing)

Flags: needinfo?(gsvelto)

Andreas, this looks like it could be related to https://bugzilla.mozilla.org/show_bug.cgi?id=1785030 since that is the most recent change to GMPChild.cpp. Can you take a look?

Flags: needinfo?(mfroman) → needinfo?(apehrson)

Bug 1785030 was identical but for fake-cdm. This is widevinecdm. The fix was specifically preloading oleaut32 for fake-cdm so doing the same for other plugins would likely help. Bug 1788675 is the same issue but for fakeopenh264.

I figured this out the hard way and am far from an expert. I still wonder what exit handler is trying to use oleaut32, and why this got triggered by the webrtc update.

Flags: needinfo?(apehrson)
See Also: → 1785030

This is a huge volume for the nightly channel, setting as release blocker.

Severity: S2 → S1
Keywords: regression, topcrash
Priority: -- → P1
Assignee: nobody → apehrson
Status: NEW → ASSIGNED
Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/39520d3ed068
Pre-load oleaut32.dll for all gmp plugins. r=bobowen,media-playback-reviewers,alwu
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 106 Branch

Not seeing the same issues in this morning's Nightly with Netflix.

Flags: needinfo?(jmathies)
Depends on: 1789713

I took a look at the minidump and wasn't able to figure out what exit handler is being called. The addresses on the stack don't make much sense. I might try in a debugger tomorrow.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)

🤔 (This is indeed the line where we are crashing)

It seems that we're triggering some at-exit callback and that's crashing the issue, see this line in the stack:

7 	ucrtbase.dll 	execute_onexit_table

Also the last error value is ERROR_ACCESS_DENIED which suggests some Windows function must have failed before the crash.

Flags: needinfo?(gsvelto)

The at-exit callback is doing:

VariantClear(vtMissing)

it comes from the _variant_t destructor. So presumably we're getting a _variant_t vtMissing when we didn't have that before.

We're throwing an exception because LoadLibrary fails with 0x5

FWIW, I was debugging this using https://jsfiddle.net/jib1/02j8q6h9/show to trigger the use of OpenH264 instead of Netflix because Widevine gets upset if you attach a debugger.

Hello,

(In reply to Jeff Muizelaar [:jrmuizel] from comment #12)

The at-exit callback is doing:

VariantClear(vtMissing)

it comes from the _variant_t destructor. So presumably we're getting a _variant_t vtMissing when we didn't have that before.

I was able to reproduce Jeff's experiments with his help. I can provide additional information about why this happens.

The call we see here is a direct consequence of the library comsupp.lib being statically linked when building xul.dll, and xul.dll being dynamically loaded in plugin-container.exe. Indeed, vtMissing and the corresponding dynamic_atexit_destructor_for_'vtMissing' are defined in comutil.obj which is a part of comsupp.lib. Exiting the process results in calling the dynamic_atexit_destructor_for_'vtMissing', which calls the delay-import for VariantClear, resulting in trying to load OLEAUT32.DLL, and failing at that because the sandbox is active and we cannot read the library file on disk.

So the main question now is what part of the first sentence of the previous paragraph is new compared to older versions and why.

So there are two ways to know if comsupp.lib was statically linked into xul.dll for a given version of Firefox you have at hand:

  • if you have the xul.pdb file with debug symbols, that's as simple as checking if grep vtMissing xul.pdb gives a match;
  • if you don't, I wrote a quick and dirty script that will give you a probable answer for x64 builds, and you'll need to pip3 install pefile to use it.

Here is the script's output for Release and Nightly:

$ python3 comsuppdetector.py /c/Program\ Files/Mozilla\ Firefox/xul.dll
No potential `dynamic_atexit_destructor_for_'vtMissing'' was found.
It is safe to assume that comsupp.lib was not statically linked into 'C:/Program Files/Mozilla Firefox/xul.dll'.
$ python3 comsuppdetector.py /c/Program\ Files/Firefox\ Nightly/xul.dll
Found a potential instance of `dynamic_atexit_destructor_for_'vtMissing'' at 0x1854a8940, vtMissing would be at 0x1863a6008.
It is likely that comsupp.lib was statically linked into 'C:/Program Files/Firefox Nightly/xul.dll'.

So it seems that a recent change indeed makes comsupp.lib to be statically linked into xul.dll, leading to the current situation. [:rkraesig] mentioned that comsupp.lib is linked implicitly when including comdef.h, so by mixing this information with the script I should be able to find the culprit commit. We can then determine whether there was a good reason for the linking and we should keep it, or there was no good reason and we should add a continuous integration check that comsupp.lib doesn't get statically linked into xul.dll by mistake.

Well, we are sure bug 1766646 caused this. But it consists of 5k+ upstream libwebrtc commits, so the question is what in there triggered this. I'm not sure how well you can bisect that range of commits and expect it to compile.

Also note that we include comdef.h also in m-b (before the regression). So do we happen not to use it there and it gets optimized out? Or is this dynamic atexit handler conditional on something?

By following the new instances of #include <comdef.h>, I found the line that causes the situation described above. I believe you are correct that in past builds, the dependency to comsupp.lib was probably already present but somehow optimized away. It seems that since the webrtc update the dependency is now kept, and that the reason for this is that the webrtc code now actually uses one of its definitions, namely the _bstr_t class.

By removing the use of _bstr_t in third_party/libwebrtc/modules/desktop_capture/win/desktop_capture_utils.cc, comsupp.lib doesn't get linked anymore, and all the chain of consequences from above disappears. Moreover, in this specific case, the use of _bstr_t doesn't seem required at all, I think it should work to use a standard %ls or Microsoft %S wide string specifier with rtc::SimpleStringBuilder::AppendFormat, which uses vsnprintf under the hood. This means that with the current code base, technically speaking we could probably remove the dependency to comsupp.lib if we wanted.

Forgetting about this specific case though, we need to think more generally whether (1) we want to allow the use of what is defined in comsupp.lib in xul.dll even though we don't have a strict requirement for it right now, or (2) we want to ban comsupp.lib usage in xul.dll until the day we realize that we now have a strict requirement for it. In case (1), we should keep the pre-loading of oleaut32.dll and consider it a required dependency for using comsupp.lib, while in case (2), we should remove the pre-loading and add continuous integration checks that there is no vtMissing in xul.dll.

Blocks: 1790331

Here are some final notes on this. I was able to reproduce the same behavior with a custom executable file built from a new project within Visual Studio:

  • Just including comdef.h without using anything specific doesn't actually lead to linking with comsupp.lib, which gets optimized away.
  • Adding a use of _bstr_t makes the linking happen.

I have updated my detector script, and here is the new output for the different cases I tested.

Including comdef.h without using _bstr_t:

>python comsuppdetector.py comsupptests\x64\Release\comsupptests.exe
No import for oleaut32.dll was found.
It is safe to assume that neither comsupp.lib, comsuppw.lib, comsuppd.lib, nor comsuppwd.lib were statically linked into 'comsupptests\x64\Release\comsupptests.exe'.

>python comsuppdetector.py comsupptests\x64\Debug\comsupptests.exe
No import for oleaut32.dll was found.
It is safe to assume that neither comsupp.lib, comsuppw.lib, comsuppd.lib, nor comsuppwd.lib were statically linked into 'comsupptests\x64\Debug\comsupptests.exe'.

Including comdef.h and using _bstr_t:

>python comsuppdetector.py comsupptests\x64\Release\comsupptests.exe
Found a potential instance of `dynamic_atexit_destructor_for_'vtMissing'' at 0x1400029d0, vtMissing would be at 0x140005008.
It is likely that comsupp.lib or comsuppw.lib was statically linked into 'comsupptests\x64\Release\comsupptests.exe'.

>python comsuppdetector.py comsupptests\x64\Debug\comsupptests.exe
Found a potential instance of _variant_t::~_variant_t at 0x140014f10.
It is likely that comsuppd.lib or comsuppwd.lib was statically linked into 'comsupptests\x64\Debug\comsupptests.exe'.

And the output for shipped versions of Firefox:

>python comsuppdetector.py "C:\Program Files\Mozilla Firefox\xul.dll"
It appears safe to assume that neither comsupp.lib, comsuppw.lib, comsuppd.lib, nor comsuppwd.lib were statically linked into 'C:\Program Files\Mozilla Firefox\xul.dll'.

>python comsuppdetector.py "C:\Program Files\Firefox Nightly\xul.dll"
Found a potential instance of `dynamic_atexit_destructor_for_'vtMissing'' at 0x1854b79e0, vtMissing would be at 0x1863b5008.
It is likely that comsupp.lib or comsuppw.lib was statically linked into 'C:\Program Files\Firefox Nightly\xul.dll'.

I had a chat with bobowen about this and we agreed keeping the preloading of oleaut32.dll is the right short-term fix. Bug 1790331 will update the comments landed with this bug to better describe the reason it is needed.

You need to log in before you can comment on or make changes to this bug.