Open Bug 1529125 Opened 6 years ago Updated 2 years ago

IOInterposer fails assertion in UnregisterCurrentThread for new test

Categories

(Core :: XPCOM, defect)

62 Branch
defect

Tracking

()

REOPENED
mozilla67
Tracking Status
firefox67 --- affected

People

(Reporter: gregtatum, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

I am working on Bug 1527737, and I am writing a new xpcshell test for the profiler and collecting FileIO markers. The test is failing only on debug builds, and it appears to be an error in the logic of the IOInterposer.

I set some breakpoints in RegisterCurrentThread, and UnregisterCurrentThread and observed the following behavior, which appears to be a bug.

thread #2: tid = 0x5790ed XUL`mozilla::IOInterposer::RegisterCurrentThread(aIsMainThread=false) at IOInterposer.cpp:501, name = 'Gecko_IOThread', stop reason = step over
  sThreadLocalDataInitialized == false, do not create local storage.

thread #6: tid = 0x57961c XUL`mozilla::IOInterposer::RegisterCurrentThread(aIsMainThread=false) at IOInterposer.cpp:501, name = 'Socket Thread', stop reason = step over
  sThreadLocalDataInitialized == false, do not create local storage.

thread #1: tid = 0x5790d0 XUL`mozilla::IOInterposer::RegisterCurrentThread(aIsMainThread=true) at IOInterposer.cpp:505, queue = 'com.apple.main-thread', stop reason = step over
  sThreadLocalDataInitialized == true, create local storage.
  
thread #21: tid = 0x579b4d XUL`mozilla::IOInterposer::RegisterCurrentThread(aIsMainThread=false) at IOInterposer.cpp:505, name = 'Timer Thread', stop reason = step over
  sThreadLocalDataInitialized == true, create local storage.
  
thread #6: tid = 0x57961c XUL`mozilla::IOInterposer::UnregisterCurrentThread() at IOInterposer.cpp:513, name = 'Socket Thread', stop reason = breakpoint 4.1
  sThreadLocalDataInitialized == true, try to destruct

Failing try run on debug only: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cd5a551bd3c2df463aec59e3e0f04b63e540d52

Blocks: 1527737

Ok, I reproduced this assertion failure in a debug build:

STR:

  • Run a debug build
  • Go to https://profiler.firefox.com/
  • Install the Gecko Profiler Add-on and open the menu
  • Enable "Main thread IO" under features
  • Click "Start"
  • Click "Discard and Stop"
  • Wait a second and the tab crashes
Assertion failure: curThreadData, at /Users/gregorytatum/dev/gecko/xpcom/build/IOInterposer.cpp:505
#01: nsThread::ThreadFunc(void*)[/Users/gregorytatum/dev/gecko/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/XUL +0x1472da]
#02: _pt_root[/Users/gregorytatum/dev/gecko/obj-ff-dbg/dist/NightlyDebug.app/Contents/MacOS/libnss3.dylib +0x16e8b0]
#03: _pthread_body[/usr/lib/system/libsystem_pthread.dylib +0x3305]
#04: _pthread_start[/usr/lib/system/libsystem_pthread.dylib +0x626f]

Looking in searchfox, it appears that IOInterposer::RegisterCurrentThread is being called before initialization, so we are missing those registrations.

I'm not sure what the desired behavior is, should we ignore this, or somehow keep a list of these, and only register them once the interposer is initiated?

Markus: do you have thoughts on how we should approach this? I've only been looking at this from the FileIO perspective, but there is other IO types that are being interposed here. I'm not sure who's consuming that data without digging into it further.

Flags: needinfo?(mstange)

Ok, I took another read-through today, and wrote a patch that retains references for the threads before initialization. This allows my tests from Bug 1527737 to pass, and my STR does not trigger the assertion anymore. The other approach I can think of is to just drop the assertion. :froydnj I flagged you for review, as I saw your review flag on a lot of pieces here. Let me know if there is someone I can redirect it to.

Add an RAII guarded initialization for the IO interposer to the
initialization process for xpc shell tests. This ensures that whenever
an xpcshell tests users the IOInterposer, that it will correctly
catch all registered threads, and will not miss any.

Depends on D20416

Add an RAII guarded initialization for the IO interposer to the
initialization process for content processes. This ensures that whenever
a content process uses the IOInterposer, that it will correctly catch
all registered threads, and will not miss any.

Depends on D20732

Attachment #9045774 - Attachment description: Bug 1529125 - Add IOInterposerInit to xpc shell tests → Bug 1529125 - Add IOInterposerInit to xpc shell tests; r?froydnj
Attachment #9045775 - Attachment description: Bug 1529125 - Add IOInterposerInit to the initialization of content processes → Bug 1529125 - Add IOInterposerInit to the initialization of content processes; r?froydnj
Attachment #9045254 - Attachment is obsolete: true
Attachment #9045774 - Attachment is obsolete: true
Attachment #9045775 - Attachment is obsolete: true

Add an RAII guarded initialization for the IO interposer to the
initialization process for xpc shell tests. This ensures that whenever
an xpcshell tests uses the IOInterposer, that it will correctly
catch all registered threads, and will not miss any.

Add an RAII guarded initialization for the IO interposer to the
initialization process for content processes. This ensures that whenever
a content process uses the IOInterposer, that it will correctly catch
all registered threads, and will not miss any.

Depends on D20736

Flags: needinfo?(mstange)

:froydnj Also, assuming this is the approach we go with, do you have a recommended try run for this? My normal aggressive try run for a profiler change is:

mach try -b do -p all -u xpcshell,gtest,mochitest-dt -t none

Flags: needinfo?(nfroyd)
Assignee: nobody → gtatum

(In reply to Greg Tatum [:gregtatum] from comment #9)

:froydnj Also, assuming this is the approach we go with, do you have a recommended try run for this? My normal aggressive try run for a profiler change is:

mach try -b do -p all -u xpcshell,gtest,mochitest-dt -t none

My usual try run is -b do -p all -u all[x64] -t none to hit all the Linux x86-64 tests, but I think the changes under discussion here do warrant a larger set of tests. I think I'd maybe just run all mochitests for a little more coverage, and I'd possibly run all of the talos tests, just to make sure there aren't weird interactions with IOInterposer and our performance tests.

Flags: needinfo?(nfroyd)

It turns out sMasterList was showing up as a leak, so I ended up not using the RAII guard, and manually changed the call to IOInterposer::Clear to be earlier in the child process shutdown process. It's passing locally now, so doing a quick full debug build and try run, before re-requesting review.

I'm getting failures with leak checks now in Windows debug, plus some other mysterious assertion errors around the Windows implementation of PoisonIOInterposer. I'm going to try narrow the scope of this patch a bit, and attempt to land the xpcshell part only, then follow-up with more investigation into the windows error with adding the initialization to the content process.

15:12:41     INFO - PID 9568 | Assertion failure: false (MOZ_ASSERT_UNREACHABLE: Unrecognized opcode sequence), at z:/build/build/src/obj-firefox/dist/include/mozilla/interceptor/PatcherDetour.h:840
15:13:01     INFO - PID 9568 | #01: static int mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess,128> >,long (*)(void **, unsigned long, _OBJECT_ATTRIBUTES *, _IO_STATUS_BLOCK *, _LARGE_INTEGER *, unsigned long, unsigned long, unsigned long, unsigned long, void *, unsigned long)>::InitOnceCallback(union _RTL_RUN_ONCE *, void *, void * *) [mozglue/misc/nsWindowsDllInterceptor.h:194]
15:13:01     INFO - 
15:13:01     INFO - PID 9568 | #02: ntdll.dll + 0x2285f
15:13:01     INFO - 
15:13:01     INFO - PID 9568 | #03: KERNELBASE.dll + 0x539ca
15:13:01     INFO - 
15:13:01     INFO - PID 9568 | #04: mozilla::InitPoisonIOInterposer() [xpcom/build/PoisonIOInterposerWin.cpp:352]
15:13:01     INFO - 
15:13:01     INFO - PID 9568 | #05: mozilla::IOInterposer::Init() [xpcom/build/IOInterposer.cpp:406]
15:13:01     INFO - 
15:13:01     INFO - PID 9568 | #06: XRE_InitChildProcess(int,char * * const,XREChildData const *) [toolkit/xre/nsEmbedFunctions.cpp:382]
15:13:01     INFO - 
15:13:02     INFO - PID 9568 | #07: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:265]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #08: wmain [toolkit/xre/nsWindowsWMain.cpp:129]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #09: static int __scrt_common_main_seh() [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:288]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #10: KERNEL32.DLL + 0x13034
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #11: ntdll.dll + 0x71461
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #01: static int mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess,128> >,long (*)(void **, unsigned long, _OBJECT_ATTRIBUTES *, _IO_STATUS_BLOCK *, _LARGE_INTEGER *, unsigned long, unsigned long, unsigned long, unsigned long, void *, unsigned long)>::InitOnceCallback(union _RTL_RUN_ONCE *, void *, void * *) [mozglue/misc/nsWindowsDllInterceptor.h:194]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #02: ntdll.dll + 0x2285f
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #03: KERNELBASE.dll + 0x539ca
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #04: mozilla::InitPoisonIOInterposer() [xpcom/build/PoisonIOInterposerWin.cpp:352]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #05: mozilla::IOInterposer::Init() [xpcom/build/IOInterposer.cpp:406]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #06: XRE_InitChildProcess(int,char * * const,XREChildData const *) [toolkit/xre/nsEmbedFunctions.cpp:382]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #07: NS_internal_main(int,char * *,char * *) [browser/app/nsBrowserApp.cpp:265]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #08: wmain [toolkit/xre/nsWindowsWMain.cpp:129]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #09: static int __scrt_common_main_seh() [f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:288]
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #10: KERNEL32.DLL + 0x13034
15:13:02     INFO - 
15:13:02     INFO - PID 9568 | #11: ntdll.dll + 0x71461

Here is the XPC shell try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=daf79d5536f88855b4544cde827e944a59414701

Pushed by gtatum@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1bc807fb4c89 Add IOInterposerInit to xpc shell tests; r=froydnj
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla67

I'm re-opening this, as I only landed the xpc shell part of the patches, and need to still get the content process working on Windows.

(In reply to Greg Tatum [:gregtatum] from comment #15)

15:12:41 INFO - PID 9568 | Assertion failure: false (MOZ_ASSERT_UNREACHABLE: Unrecognized opcode sequence), at z:/build/build/src/obj-firefox/dist/include/mozilla/interceptor/PatcherDetour.h:840
15:13:01 INFO - PID 9568 | #01: static int mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess,128> >,long (*)(void **, unsigned long, _OBJECT_ATTRIBUTES *, _IO_STATUS_BLOCK *, _LARGE_INTEGER *, unsigned long, unsigned long, unsigned long, unsigned long, void *, unsigned long)>::InitOnceCallback(union _RTL_RUN_ONCE *, void *, void * *) [mozglue/misc/nsWindowsDllInterceptor.h:194]
15:13:01 INFO -
15:13:01 INFO - PID 9568 | #02: ntdll.dll + 0x2285f
15:13:01 INFO -
15:13:01 INFO - PID 9568 | #03: KERNELBASE.dll + 0x539ca
15:13:01 INFO -
15:13:01 INFO - PID 9568 | #04: mozilla::InitPoisonIOInterposer() [xpcom/build/PoisonIOInterposerWin.cpp:352]
15:13:01 INFO -
15:13:01 INFO - PID 9568 | #05: mozilla::IOInterposer::Init() [xpcom/build/IOInterposer.cpp:406]
15:13:01 INFO -
15:13:01 INFO - PID 9568 | #06: XRE_InitChildProcess(int,char * * const,XREChildData const *) [toolkit/xre/nsEmbedFunctions.cpp:382]

Aaron, the hg history suggests you touched the code triggering this assertion in bug 1432653. Do you have suggestions about what needs to be done to fix it?

Flags: needinfo?(aklotz)

This is nsWindowsDllInterceptor telling us that the instruction sequence at the beginning of the function being hooked is something that we do not yet support.

Seeing a disassembly (or even better, a minidump) of the function in question would be helpful here.

Assuming the line numbers in PoisonIOInterposerWin.cpp haven't changed, NtReadFile would be the function whose disassembly I'd want to look at (preferably side by side with the bytes of the function).

Flags: needinfo?(aklotz)

(Also knowing the version number for ntdll.dll would be useful)

I'm seeing this assertion on this try job: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b83546bbafd008708e18df6602cec454d58989f2&searchStr=Windows%2C10%2Cx64%2Cdebug&selectedJob=231325635

The minidump c96d5b4b-fe2e-43dc-9569-219142cd4bfc.dmp might be what you need.

ntdll.dll version is 6.2.17134.254

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

(In reply to Florian Quèze [:florian] from comment #22)

I'm seeing this assertion on this try job: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b83546bbafd008708e18df6602cec454d58989f2&searchStr=Windows%2C10%2Cx64%2Cdebug&selectedJob=231325635

The minidump c96d5b4b-fe2e-43dc-9569-219142cd4bfc.dmp might be what you need.

ntdll.dll version is 6.2.17134.254

Flags: needinfo?(aklotz)

I'm removing the needinfo from :aklotz, and moving it to Bug 1533808.

Flags: needinfo?(aklotz)
Depends on: 1533808

Also, I've landed the XPC shell part of this patch, but am waiting on Bug 1533808 before I can land the content process part of the patch.

Pushed by fqueze@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3719f1901ffe Add IOInterposerInit to the initialization of content processes; r=froydnj

Apparently we need to do something specific for the rdd (media decoding) process.

Removing the needinfo on Greg as he is traveling this week and I don't expect him to have time for this.

Flags: needinfo?(gtatum) → needinfo?(florian)
Backout by shindli@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/14f7842cde06 Backed out changeset 3719f1901ffe for mda assertion failures CLOSED TREE

The assertion on Windows is:

GECKO(880) | Assertion failure: ok, at z:/build/build/src/obj-firefox/dist/include/mozilla/interceptor/Trampoline.h:350
PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::interceptor::VMSharingPolicyShared<mozilla::interceptor::MMPolicyInProcess,128>::Items()]

Landing the patch here makes the mda test jobs on Windows Debug 64bit perma fail with the assertion in comment 30. I don't understand what the assertion means, so I'm afraid I'll need more help to move this forward.

Flags: needinfo?(florian)
Flags: needinfo?(davidp99)
Flags: needinfo?(aklotz)

I wonder if, given David's new patch, we end up in a scenario where we have reserved trampoline space but never end up committing anything. Perhaps that is why VirtualProtect fails.

David, thoughts?

Flags: needinfo?(aklotz)

I don't see a dangling Commit but I don't see another cause. Still looking.

Flags: needinfo?(davidp99)

(In reply to David Parks (dparks) [:handyman] from comment #33)

I don't see a dangling Commit but I don't see another cause. Still looking.

Hi David,

It looks like mozilla-group:active_scm_level_1 has scope generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win* so section 5.1 of the interactive loaner process shouldn't require any action. Make sure to assign the scopes generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win10-64-gpu (section 4) and generic-worker:os-group:aws-provisioner-v1/gecko-t-win10-64-gpu/Administrators (section 5.2) to the task, and to add the osGroups property (also section 5.2). Please report if you have any problems, providing the list of scopes you see under this page and the ClientId you are using (also shown on the same page).

Many thanks!

Flags: needinfo?(davidp99)
Depends on: 1546545
Flags: needinfo?(davidp99)
Pushed by florian@queze.net: https://hg.mozilla.org/integration/mozilla-inbound/rev/46179401c77b Add IOInterposerInit to the initialization of content processes; r=froydnj

Backed out for talos xperf crashes on Windows:

https://hg.mozilla.org/integration/mozilla-inbound/rev/0f8f181540096810ec288bcb0954a798c05491ee

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel%2Crunnable&group_state=expanded&revision=46179401c77b3f40291dea4cf636314399785c14

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=243672292&repo=mozilla-inbound

16:36:42 INFO - TEST-INFO | started process 5580 (Z:\task_1556639211\build\application\firefox\firefox -wait-for-browser -no-deelevate -profile c:\users\task_1556639211\appdata\local\temp\tmpujv3fx\profile)
16:36:56 INFO - PID 5580 | __WARNTimeout (1/3) exceeded on http://127.0.0.1:49333/tests/tp5n/thesartorialist.blogspot.com/thesartorialist.blogspot.com/index.html__WARN
16:37:06 INFO - PID 5580 | __WARNTimeout (2/3) exceeded on http://127.0.0.1:49333/tests/tp5n/thesartorialist.blogspot.com/thesartorialist.blogspot.com/index.html__WARN
16:37:16 INFO - PID 5580 | __FAILTimeout in tp5n__FAIL
16:37:16 INFO - PID 5580 | __FAILTimeout (3/3) exceeded on http://127.0.0.1:49333/tests/tp5n/thesartorialist.blogspot.com/thesartorialist.blogspot.com/index.html__FAIL
16:37:16 INFO - PID 5580 | console.error: "You should not call finishTest without having first initted the Profiler"
17:06:42 INFO - Timeout waiting for test completion; killing browser...
17:06:44 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/ODjp3r4mQ1KXMoxFh-b7aA/artifacts/public/build/target.crashreporter-symbols.zip
17:06:49 INFO - mozcrash Copy/paste: Z:\task_1556639211\build\win32-minidump_stackwalk.exe c:\users\task_1556639211\appdata\local\temp\tmpujv3fx\profile\minidumps\893003da-9996-4f87-8e36-7119db37a5a0.dmp c:\users\task_1556639211\appdata\local\temp\tmpjspbc6
17:06:51 INFO - mozcrash Saved minidump as Z:\task_1556639211\build\blobber_upload_dir\893003da-9996-4f87-8e36-7119db37a5a0.dmp
17:06:51 INFO - PROCESS-CRASH | tp5n | application crashed [unknown top frame]
17:06:51 INFO - Crash dump filename: c:\users\task_1556639211\appdata\local\temp\tmpujv3fx\profile\minidumps\893003da-9996-4f87-8e36-7119db37a5a0.dmp
17:06:51 INFO - Operating system: Windows NT
17:06:51 INFO - 6.1.7601 Service Pack 1
17:06:51 INFO - CPU: x86
17:06:51 INFO - GenuineIntel family 6 model 63 stepping 2
17:06:51 INFO - 8 CPUs
17:06:51 INFO - GPU: UNKNOWN
17:06:51 INFO - No crash
17:06:51 INFO - Process uptime: 1800 seconds
17:06:51 INFO - Thread 0
17:06:51 INFO - 0 ntdll.dll!KiFastSystemCallRet + 0x0
17:06:51 INFO - eip = 0x779370b4 esp = 0x001cf6f8 ebp = 0x001cf764 ebx = 0x00000000
17:06:51 INFO - esi = 0x00000038 edi = 0x00000000 eax = 0x000000f8 ecx = 0x804ddf8d
17:06:51 INFO - edx = 0x001cf6bc efl = 0x00000246
17:06:51 INFO - Found by: given as instruction pointer in context
17:06:51 INFO - 1 ntdll.dll!ZwWaitForSingleObject + 0xc
17:06:51 INFO - eip = 0x77936a24 esp = 0x001cf6fc ebp = 0x001cf764 ebx = 0x00000000
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 2 KERNELBASE.dll!ResCKeGetCacheIndices + 0x1f
17:06:51 INFO - eip = 0x75ac179c esp = 0x001cf700 ebp = 0x001cf764 ebx = 0x00000000
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 3 KERNELBASE.dll!_except_handler4 + 0x78
17:06:51 INFO - eip = 0x75ade165 esp = 0x001cf75c ebp = 0x001cf77c
17:06:51 INFO - Found by: stack scanning
17:06:51 INFO - 4 kernel32.dll! ?? ?? ::FNODOBFM::`string' + 0x48
17:06:51 INFO - eip = 0x7785baa2 esp = 0x001cf784 ebp = 0x001cf790 ebx = 0x001cf77c
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 5 firefox.exe!mozilla::LauncherMain(int &,wchar_t * * const,mozilla::StaticXREAppData const &) [LauncherProcessWin.cpp:46179401c77b3f40291dea4cf636314399785c14 : 350 + 0x9]
17:06:51 INFO - eip = 0x010b431d esp = 0x001cf798 ebp = 0x001cf918
17:06:51 INFO - Found by: previous frame's frame pointer
17:06:51 INFO - 6 firefox.exe!wmain [nsWindowsWMain.cpp:46179401c77b3f40291dea4cf636314399785c14 : 101 + 0xf]
17:06:51 INFO - eip = 0x010b1038 esp = 0x001cf920 ebp = 0x001cf948 ebx = 0x7ffdd000
17:06:51 INFO - esi = 0x64fff0b0 edi = 0x002a1ed8
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 7 firefox.exe!static int __scrt_common_main_seh() [exe_common.inl : 288 + 0x1c]
17:06:51 INFO - eip = 0x010ea81c esp = 0x001cf950 ebp = 0x001cf990 ebx = 0x7ffdd000
17:06:51 INFO - esi = 0x64fff0b0 edi = 0x002a1ed8
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 8 kernel32.dll!RtlMultiByteToUnicodeN + 0x73c
17:06:51 INFO - eip = 0x77863c45 esp = 0x001cf998 ebp = 0x001cf99c
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - 9 ntdll.dll!__RtlUserThreadStart + 0x27
17:06:51 INFO - eip = 0x779537f5 esp = 0x001cf9a4 ebp = 0x001cf9dc
17:06:51 INFO - Found by: previous frame's frame pointer
17:06:51 INFO - 10 ntdll.dll!_RtlUserThreadStart + 0x1b
17:06:51 INFO - eip = 0x779537c8 esp = 0x001cf9e4 ebp = 0x001cf9f4
17:06:51 INFO - Found by: call frame info
17:06:51 INFO - Loaded modules:
17:06:51 INFO - 0x010b0000 - 0x0112efff firefox.exe 68.0.0.7059 (main)
17:06:51 INFO - 0x64ef0000 - 0x6500dfff ucrtbase.dll 10.0.17134.12
17:06:51 INFO - 0x670a0000 - 0x6718afff dbghelp.dll 6.1.7601.17514
17:06:51 INFO - 0x67620000 - 0x67622fff api-ms-win-crt-utility-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67630000 - 0x67632fff api-ms-win-crt-environment-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67640000 - 0x67642fff api-ms-win-crt-filesystem-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67650000 - 0x676bffff msvcp140.dll 14.15.26706.0
17:06:51 INFO - 0x67730000 - 0x67732fff api-ms-win-crt-time-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67740000 - 0x67744fff api-ms-win-crt-math-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67750000 - 0x67752fff api-ms-win-crt-locale-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x67760000 - 0x677a1fff mozglue.dll 68.0.0.7059
17:06:51 INFO - 0x69150000 - 0x69153fff api-ms-win-crt-convert-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x69160000 - 0x69163fff api-ms-win-crt-stdio-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x69170000 - 0x69172fff api-ms-win-crt-heap-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x69180000 - 0x69193fff VCRUNTIME140.dll 14.15.26706.0
17:06:51 INFO - 0x6b6a0000 - 0x6b6a3fff api-ms-win-crt-string-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x6b6b0000 - 0x6b6b2fff api-ms-win-core-synch-l1-2-0.dll 10.0.17134.12
17:06:51 INFO - 0x71030000 - 0x71032fff api-ms-win-core-timezone-l1-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x717c0000 - 0x717c2fff api-ms-win-core-file-l2-1-0.dll 10.0.17134.12
17:06:51 INFO - 0x71920000 - 0x71922fff api-ms-win-core-processthreads-l1-1-1.dll 10.0.17134.12
17:06:51 INFO - 0x72d20000 - 0x72d22fff api-ms-win-core-file-l1-2-0.dll 10.0.17134.12
17:06:51 INFO - 0x73230000 - 0x73233fff api-ms-win-crt-runtime-l1-1-0.dll 10.0.17134.12
17:06:59 INFO - 0x734a0000 - 0x734a2fff api-ms-win-core-localization-l1-2-0.dll 10.0.17134.12
17:06:59 INFO - 0x74f30000 - 0x74f38fff version.dll 6.1.7600.16385
17:06:59 INFO - 0x75940000 - 0x7598bfff apphelp.dll 6.1.7601.17514
17:06:59 INFO - 0x75ac0000 - 0x75b09fff KERNELBASE.dll 6.1.7601.17514
17:06:59 INFO - 0x76fa0000 - 0x7704bfff msvcrt.dll 7.0.7600.16385
17:06:59 INFO - 0x77050000 - 0x77068fff sechost.dll 6.1.7600.16385
17:06:59 INFO - 0x77070000 - 0x77110fff rpcrt4.dll 6.1.7601.17514
17:06:59 INFO - 0x771f0000 - 0x7728ffff advapi32.dll 6.1.7601.17514
17:06:59 INFO - 0x77810000 - 0x778e3fff kernel32.dll 6.1.7601.17514
17:06:59 INFO - 0x778f0000 - 0x77a2bfff ntdll.dll 6.1.7601.17514
17:06:59 INFO - rmtree() failed for "('c:\users\task_1556639211\appdata\local\temp\tmpujv3fx',)". Reason: The process cannot access the file because it is being used by another process (13). Retrying...
17:06:59 INFO - rmtree() failed for "('c:\users\task_1556639211\appdata\local\temp\tmpujv3fx',)". Reason: The process cannot access the file because it is being used by another process (13). Retrying...
17:06:59 INFO - rmtree() failed for "('c:\users\task_1556639211\appdata\local\temp\tmpujv3fx',)". Reason: The process cannot access the file because it is being used by another process (13). Retrying...
17:06:59 INFO - rmtree() failed for "('c:\users\task_1556639211\appdata\local\temp\tmpujv3fx',)". Reason: The process cannot access the file because it is being used by another process (13). Retrying...
17:06:59 INFO - rmtree() failed for "('c:\users\task_1556639211\appdata\local\temp\tmpujv3fx',)". Reason: The process cannot access the file because it is being used by another process (13). Retrying...
17:06:59 INFO - Exception while removing profile directory: c:\users\task_1556639211\appdata\local\temp\tmpujv3fx
17:06:59 INFO - [Error 32] The process cannot access the file because it is being used by another process: 'c:\users\task_1556639211\appdata\local\temp\tmpujv3fx\profile\cert9.db'
17:06:59 INFO - TEST-UNEXPECTED-ERROR | tp5n | Found crashes after test run, terminating test
17:06:59 ERROR - Traceback (most recent call last):
17:06:59 INFO - File "Z:\task_1556639211\build\tests\talos\talos\run_tests.py", line 300, in run_tests
17:06:59 INFO - talos_results.add(mytest.runTest(browser_config, test))
17:06:59 INFO - File "Z:\task_1556639211\build\tests\talos\talos\ttest.py", line 64, in runTest
17:06:59 INFO - return self._runTest(browser_config, test_config, setup)
17:06:59 INFO - File "Z:\task_1556639211\build\tests\talos\talos\ttest.py", line 219, in _runTest
17:06:59 INFO - test_config['name'])
17:06:59 INFO - File "Z:\task_1556639211\build\tests\talos\talos\ttest.py", line 48, in check_for_crashes
17:06:59 INFO - raise TalosCrash('Found crashes after test run, terminating test')
17:06:59 INFO - TalosCrash: Found crashes after test run, terminating test

Flags: needinfo?(gtatum)
Flags: needinfo?(gtatum)

When running that test locally with the patch applied, I can reproduce the failure and see this in the terminal:
13:20:18 INFO - PID 8912 | ###!!! [Parent][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
13:20:18 INFO - PID 8912 |
13:20:18 INFO - PID 8912 | [GFX1-]: Failed to initialize GPU process
13:20:18 INFO - PID 8912 |
13:20:18 INFO - PID 8912 | ###!!! [Parent][MessageChannel] Error: (msgtype=0x4C001A,name=PGPU::Msg_ShutdownVR) Channel error: cannot send/recv

(In reply to Florian Quèze [:florian] from comment #37)

I tried making the patch apply only to content processes rather than all child processes. That made the messages in comment 37 disappear, but the tests still failed, without displaying any page, so I think we failed to start content processes. I wonder if xperf deadlocks with the IOInterposer.

(In reply to Florian Quèze [:florian] from comment #38)

I wonder if xperf deadlocks with the IOInterposer.

Sorry to bother you again with this, but I'm stuck again. Do you have any idea about how to verify this or make it actionable?

Flags: needinfo?(aklotz)

Running this tests in a debug build gives some potentially interesting output:

Lots of:
[Child 6396, Main Thread] WARNING: Resource acquired is being released in non-LIFO order; why?
: file c:/buildhg/mozilla/xpcom/threads/BlockingResourceBase.cpp, line 329
--- Mutex : mozilla::mscom::LiveSet::mMutex (currently acquired)
calling context
[stack trace unavailable]

Sometimes there's a stack:
#01: `anonymous namespace'::MainThreadIOLoggerImpl::sIOThreadFunc (c:\buildhg\mozilla\xpcom\build\MainThreadIOLogger.cpp:114)
#02: _PR_NativeRunThread (c:\buildhg\mozilla\nsprpub\pr\src\threads\combined\pruthr.c:406)
#03: pr_root (c:\buildhg\mozilla\nsprpub\pr\src\md\windows\w95thred.c:138)
#04: o_strcat_s[C:\WINDOWS\System32\ucrtbase.dll +0x1e3fe]
#05: BaseThreadInitThunk[C:\WINDOWS\System32\KERNEL32.DLL +0x14034]
#06: patched_BaseThreadInitThunk (c:\buildhg\mozilla\mozglue\build\WindowsDllBlocklist.cpp:713)
#07: RtlUserThreadStart[C:\WINDOWS\SYSTEM32\ntdll.dll +0x73691]

Assertion failure: sLogModuleManager != nullptr, at c:/buildhg/mozilla/xpcom/base/Logging.cpp:540

Several times this error:
minidump.cc:1571: ERROR: MinidumpThread has a memory region problem

If anybody wants to try to reproduce locally, the command to run this test is:
mach talos-test --suite xperf

(In reply to Florian Quèze [:florian] from comment #40)

Running this tests in a debug build gives some potentially interesting output:

Lots of:
[Child 6396, Main Thread] WARNING: Resource acquired is being released in non-LIFO order; why?
: file c:/buildhg/mozilla/xpcom/threads/BlockingResourceBase.cpp, line 329
--- Mutex : mozilla::mscom::LiveSet::mMutex (currently acquired)
calling context
[stack trace unavailable]

This definitely supports the theory if there being a deadlock. If you disable this line in theory you'll get a stack, let me know if that doesn't build though. Unfortunately that code tends to bitrot.

Sometimes there's a stack:
#01: `anonymous namespace'::MainThreadIOLoggerImpl::sIOThreadFunc (c:\buildhg\mozilla\xpcom\build\MainThreadIOLogger.cpp:114)
#02: _PR_NativeRunThread (c:\buildhg\mozilla\nsprpub\pr\src\threads\combined\pruthr.c:406)
#03: pr_root (c:\buildhg\mozilla\nsprpub\pr\src\md\windows\w95thred.c:138)
#04: o_strcat_s[C:\WINDOWS\System32\ucrtbase.dll +0x1e3fe]
#05: BaseThreadInitThunk[C:\WINDOWS\System32\KERNEL32.DLL +0x14034]
#06: patched_BaseThreadInitThunk (c:\buildhg\mozilla\mozglue\build\WindowsDllBlocklist.cpp:713)
#07: RtlUserThreadStart[C:\WINDOWS\SYSTEM32\ntdll.dll +0x73691]

Assertion failure: sLogModuleManager != nullptr, at c:/buildhg/mozilla/xpcom/base/Logging.cpp:540

This implies that MainThreadIOLoggerImpl is trying to log before we setup the log module manager. We should probably try to make sure logging / profiling is setup before this thing.

(In reply to Eric Rahm [:erahm] from comment #41)

(In reply to Florian Quèze [:florian] from comment #40)

Running this tests in a debug build gives some potentially interesting output:

Lots of:
[Child 6396, Main Thread] WARNING: Resource acquired is being released in non-LIFO order; why?
: file c:/buildhg/mozilla/xpcom/threads/BlockingResourceBase.cpp, line 329
--- Mutex : mozilla::mscom::LiveSet::mMutex (currently acquired)
calling context
[stack trace unavailable]

This definitely supports the theory if there being a deadlock. If you disable this line in theory you'll get a stack, let me know if that doesn't build though.

Thanks! When I disable that line, the first occurence fails to generate a stack, I get:
[Child 1672, Main Thread] WARNING: Resource acquired is being released in non-LIFO order; why?
: file c:/buildhg/mozilla/xpcom/threads/BlockingResourceBase.cpp, line 329
--- Mutex : mozilla::mscom::LiveSet::mMutex (currently acquired)
calling context

ERROR: SymInitialize: Param´┐¢tre incorrect.

 #01: ??? (???:???)

(and 23 more similar broken stack frames)

Then all the other cases that previous didn't have a stack now show a stack in a11y code, looking like this:
#01: mozilla::a11y::CreateHolderFromAccessible (c:\buildhg\mozilla\accessible\ipc\win\COMPtrTypes.cpp:50)
#02: mozilla::a11y::DocAccessible::DoInitialUpdate (c:\buildhg\mozilla\accessible\generic\DocAccessible.cpp:1372)
#03: mozilla::a11y::DocAccessibleWrap::DoInitialUpdate (c:\buildhg\mozilla\accessible\windows\msaa\DocAccessibleWrap.cpp:135)
#04: mozilla::a11y::NotificationController::WillRefresh (c:\buildhg\mozilla\accessible\base\NotificationController.cpp:662)
#05: nsRefreshDriver::Tick (c:\buildhg\mozilla\layout\base\nsRefreshDriver.cpp:1894)
#06: mozilla::RefreshDriverTimer::TickRefreshDrivers (c:\buildhg\mozilla\layout\base\nsRefreshDriver.cpp:319)
#07: mozilla::RefreshDriverTimer::Tick (c:\buildhg\mozilla\layout\base\nsRefreshDriver.cpp:344)
#08: mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver (c:\buildhg\mozilla\layout\base\nsRefreshDriver.cpp:708)
#09: mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync (c:\buildhg\mozilla\layout\base\nsRefreshDriver.cpp:604)
#10: mozilla::layout::VsyncChild::RecvNotify (c:\buildhg\mozilla\layout\ipc\VsyncChild.cpp:67)
#11: mozilla::layout::PVsyncChild::OnMessageReceived (c:\buildhg\mozilla\obj-browser-dbg\ipc\ipdl\PVsyncChild.cpp:168)
#12: mozilla::ipc::PBackgroundChild::OnMessageReceived (c:\buildhg\mozilla\obj-browser-dbg\ipc\ipdl\PBackgroundChild.cpp:4061)
#13: mozilla::ipc::MessageChannel::DispatchAsyncMessage (c:\buildhg\mozilla\ipc\glue\MessageChannel.cpp:2152)
#14: mozilla::ipc::MessageChannel::DispatchMessage (c:\buildhg\mozilla\ipc\glue\MessageChannel.cpp:2078)
#15: mozilla::ipc::MessageChannel::RunMessage (c:\buildhg\mozilla\ipc\glue\MessageChannel.cpp:1938)
#16: mozilla::ipc::MessageChannel::MessageTask::Run (c:\buildhg\mozilla\ipc\glue\MessageChannel.cpp:1970)
#17: nsThread::ProcessNextEvent (c:\buildhg\mozilla\xpcom\threads\nsThread.cpp:1167)
#18: NS_ProcessNextEvent (c:\buildhg\mozilla\xpcom\threads\nsThreadUtils.cpp:486)
#19: nsSyncStreamListener::Available (c:\buildhg\mozilla\netwerk\base\nsSyncStreamListener.cpp:120)
#20: NS_ImplementChannelOpen (c:\buildhg\mozilla\netwerk\base\nsNetUtil.cpp:990)
#21: nsBaseChannel::Open (c:\buildhg\mozilla\netwerk\base\nsBaseChannel.cpp:619)
#22: nsMessageManagerScriptExecutor::TryCacheLoadAndCompileScript (c:\buildhg\mozilla\dom\base\nsFrameMessageManager.cpp:1289)
#23: nsMessageManagerScriptExecutor::LoadScriptInternal (c:\buildhg\mozilla\dom\base\nsFrameMessageManager.cpp:1218)
#24: mozilla::dom::BrowserChild::RecvLoadRemoteScript (c:\buildhg\mozilla\dom\ipc\BrowserChild.cpp:2154)

Not exactly what I expected. Might be a red herring. But given that there's code that opens a file at the top of the stack, it might still be related.

The only way that I could see that a11y stack triggering I/O is the first time that code path is taken - COM might be lazily loading some DLLs.

Flags: needinfo?(aklotz)

Have you tried running this under WinDbg? If you reproduce it there, you can query the kernel for the wait chain.

Once you hit the deadlock, break into the debugger, switch to one of the stuck threads, and then type:

!analyze -hang -v

(In reply to Aaron Klotz [:aklotz] from comment #44)

Have you tried running this under WinDbg?

No, and I think anybody already familiar with WinDbg would do this much faster than it would take me to figure out how to do it, so any help here would be appreciated.

The steps to reproduce are simple:

  • make a local build with the patch from this bug applied.
  • run: mach talos-test --suite xperf

It's the child processes that deadlock, the parent process is working fine.

Assignee: gtatum → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: