Open Bug 1816848 Opened 3 years ago Updated 3 months ago

Startup crashes caused by incorrect static local initialization with a corrupted xul.dll following update

Categories

(Toolkit :: Application Update, defect)

Unspecified
Windows 10
defect

Tracking

()

Tracking Status
firefox-esr115 --- affected
firefox-esr128 --- affected
firefox125 --- wontfix
firefox128 --- affected
firefox129 --- affected
firefox130 --- affected

People

(Reporter: gsvelto, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, Whiteboard: [win:stability][tbird crash])

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/077ee1f9-d099-4d90-9ec3-cf8420230112

Reason: EXCEPTION_ACCESS_VIOLATION_WRITE

Top 10 frames of crashing thread:

0  ntdll.dll  RtlAcquireSRWLockExclusive  
1  xul.dll  mozilla::OffTheBooksMutex::Lock  xpcom/threads/Mutex.h:65
1  xul.dll  mozilla::detail::BaseAutoLock<mozilla::OffTheBooksMutex&>::BaseAutoLock  xpcom/threads/Mutex.h:236
1  xul.dll  nsThread::AddToThreadList  xpcom/threads/nsThread.cpp:330
1  xul.dll  nsThread::InitCommon  xpcom/threads/nsThread.cpp:537
2  xul.dll  nsThread::InitCurrentThread  xpcom/threads/nsThread.cpp:640
3  xul.dll  nsThreadManager::Init  xpcom/threads/nsThreadManager.cpp:304
4  xul.dll  mozilla::mscom::EnsureMTA::EnsureMTA  ipc/mscom/EnsureMTA.cpp:72
4  xul.dll  mozilla::mscom::ProcessRuntime::ProcessRuntime  ipc/mscom/ProcessRuntime.cpp:79
5  xul.dll  XREMain::XRE_main  toolkit/xre/nsAppRunner.cpp:5898

Note: there are no crashes under this signature yet because we first need to land bug 1816846 for them to appear; the stack above, however, represents the problem. All the crashes like this one are caught by Windows Error Reporting and seem to happen very early during startup, before the static variable holding the mutex has been initialized. While this might seem very odd, it may not be impossible given our rather peculiar architecture.

The severity field is not set for this bug.
:Jamie, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jteh)

(In reply to Gabriele Svelto [:gsvelto] from comment #0)

All the crashes like this one are caught by Windows Error Reporting and seem to happen very early during startup, before the static variable holding the mutex has been initialized. While this might seem very odd it might not be impossible with our very peculiar architecture.

Can you say more about any thoughts you might have regarding how our architecture might trigger this? Given that we call a function here and it supposedly initialises a static, I can't understand why the static wouldn't be initialised yet. Do we tweak something that would prevent the C++11 magic-static thread safety from working correctly at this point during startup?

Severity: -- → S3
Flags: needinfo?(jteh) → needinfo?(gsvelto)

The static function is within xul.dll, but if you look at the modules tab in the crash report you'll notice that we're missing both the library version and the "Signed by" field. This might indicate that xul.dll isn't fully loaded yet, or that we ran into some trouble while loading it. Either way, it seems like we haven't run the static initializers (yet), or not in the order we expected. Creating the mutex depends on initializing another static (which is done here).

As we get more crash reports for this signature I hope to be able to glean more information about this behavior.

Flags: needinfo?(gsvelto)

(In reply to Gabriele Svelto [:gsvelto] from comment #3)

Creating the mutex depends on initializing another static (which is done here).

Isn't this only true for debug though? (in release we seem to do nothing here)

(In reply to Gabriele Svelto [:gsvelto] from comment #3)

Either way it seems like we haven't run the static initializers (yet) or not in the order we expected.

This is quite new for me, but looking at the crash, this variable is not initialized by the static initializers (listed as e.g. xul!mozilla::dom::FetchParent::`dynamic initializer for 'sActorTable'). It is initialized at function call time by fairly complex code inserted by the compiler that seems to do what's described here:

In the C++11 standard, block scope variables with static or thread storage duration must be zero-initialized before any other initialization takes place. Initialization occurs when control first passes through the declaration of the variable. If an exception is thrown during initialization, the variable is considered uninitialized, and initialization is re-attempted the next time control passes through the declaration. If control enters the declaration concurrently with initialization, the concurrent execution blocks while initialization is completed. The behavior is undefined if control re-enters the declaration recursively during initialization. [...]

Here is the relevant contiguous sequence of assembly instructions from xul!nsThread::InitCommon (quite long, it's all inlined, sorry).

xul!nsThread::InitCommon [/builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp @ 466]:
// ...
// mThreadId = uint32_t(PlatformThread::CurrentId());
// Note that this happens before the initialization of sGetStackLimits!
00000001`8083aee9 ff15f1349005    call    qword ptr [xul!_imp_GetCurrentThreadId (00000001`8613e3e0)]
00000001`8083aeef 8986b4000000    mov     dword ptr [rsi+0B4h],eax

// static const StaticDynamicallyLinkedFunctionPtr<GetCurrentThreadStackLimitsFn> sGetStackLimits(L"kernel32.dll", "GetCurrentThreadStackLimits");
// The next instructions determine whether the variable sGetStackLimits is already initialized
00000001`8083aef5 8b05253d9c05    mov     eax,dword ptr [xul!nsThread::InitCommon::sGetStackLimits+0x10 (00000001`861fec20)]
00000001`8083aefb 8b0d475a9b05    mov     ecx,dword ptr [xul!_tls_index (00000001`861f0948)]
00000001`8083af01 65488b142558000000 mov   rdx,qword ptr gs:[58h]
00000001`8083af0a 488b0cca        mov     rcx,qword ptr [rdx+rcx*8]
00000001`8083af0e 3b8104000000    cmp     eax,dword ptr [rcx+4]
// If not, we jump to initialization code, which will then jump back here
00000001`8083af14 0f8fe5000000    jg      xul!nsThread::InitCommon+0x12f (00000001`8083afff)
// Now sGetStackLimits should be initialized

// if (sGetStackLimits) {
00000001`8083af1a 488b05f73c9c05  mov     rax,qword ptr [xul!nsThread::InitCommon::sGetStackLimits+0x8 (00000001`861fec18)]
00000001`8083af21 4885c0          test    rax,rax
00000001`8083af24 742a            je      xul!nsThread::InitCommon+0x80 (00000001`8083af50)
// ULONG_PTR stackBottom, stackTop;
// sGetStackLimits(&stackBottom, &stackTop);
// mStackBase = reinterpret_cast<void*>(stackBottom);
// mStackSize = stackTop - stackBottom;
00000001`8083af26 488d7c2430      lea     rdi,[rsp+30h]
00000001`8083af2b 488d5c2428      lea     rbx,[rsp+28h]
00000001`8083af30 4889f9          mov     rcx,rdi
00000001`8083af33 4889da          mov     rdx,rbx
00000001`8083af36 ff15cca0cb05    call    qword ptr [xul!__guard_dispatch_icall_fptr (00000001`864f5008)]
00000001`8083af3c 488b07          mov     rax,qword ptr [rdi]
00000001`8083af3f 488986a8000000  mov     qword ptr [rsi+0A8h],rax
00000001`8083af46 8b0b            mov     ecx,dword ptr [rbx]
00000001`8083af48 29c1            sub     ecx,eax
00000001`8083af4a 898eb0000000    mov     dword ptr [rsi+0B0h],ecx
// }
// Here starts (inlined) AddToThreadList(); and (inlined) OffTheBooksMutexAutoLock mal(ThreadListMutex());

// The next instructions determine whether the variable sMutex is already initialized
00000001`8083af50 8b05f2599b05    mov     eax,dword ptr [xul!_tls_index (00000001`861f0948)]
00000001`8083af56 65488b0c2558000000 mov   rcx,qword ptr gs:[58h]
00000001`8083af5f 488b04c1        mov     rax,qword ptr [rcx+rax*8]
00000001`8083af63 c780b800000000000000 mov dword ptr [rax+0B8h],0
00000001`8083af6d 8b0d853c9c05    mov     ecx,dword ptr [xul!nsThread::ThreadListMutex::sMutex+0x8 (00000001`861febf8)]
00000001`8083af73 3b8804000000    cmp     ecx,dword ptr [rax+4]
// If not, we jump to initialization code, which will then jump back here
00000001`8083af79 0f8f35010000    jg      xul!nsThread::InitCommon+0x1e4 (00000001`8083b0b4)
// Now sMutex should be initialized

// Calls ((mozglue!mozilla::detail::MutexImpl*)sMutex.mRawPtr)->lock(), but somehow mRawPtr is nullptr
00000001`8083af7f 488b3d6a3c9c05  mov     rdi,qword ptr [xul!nsThread::ThreadListMutex::sMutex (00000001`861febf0)]
00000001`8083af86 4889f9          mov     rcx,rdi
00000001`8083af89 ff1579269005    call    qword ptr [xul!_imp_?lockMutexImpldetailmozillaIEAAXXZ (00000001`8613d608)]

// ...

The initialization code called from there for sMutex looks like this:

// Deal with potential parallel initializations
00000001`8083b0b4 488d0d3d3b9c05  lea     rcx,[xul!nsThread::ThreadListMutex::sMutex+0x8 (00000001`861febf8)]
00000001`8083b0bb e860078001      call    xul!_Init_thread_header (00000001`8203b820)
00000001`8083b0c0 833d313b9c05ff  cmp     dword ptr [xul!nsThread::ThreadListMutex::sMutex+0x8 (00000001`861febf8)],0FFFFFFFFh
00000001`8083b0c7 0f85b2feffff    jne     xul!nsThread::InitCommon+0xaf (00000001`8083af7f)

// Real initialization starts
00000001`8083b0cd b930000000      mov     ecx,30h
00000001`8083b0d2 ff1560279005    call    qword ptr [xul!_imp_moz_xmalloc (00000001`8613d838)]
00000001`8083b0d8 4889c7          mov     rdi,rax

// Call mozglue!mozilla::detail::MutexImpl::MutexImpl
00000001`8083b0db 4889c1          mov     rcx,rax
00000001`8083b0de ff15f4209005    call    qword ptr [xul!_imp_??0MutexImpldetailmozillaQEAAXZ (00000001`8613d1d8)]

00000001`8083b0e4 48893d053b9c05  mov     qword ptr [xul!nsThread::ThreadListMutex::sMutex (00000001`861febf0)],rdi

// Real initialization finished
00000001`8083b0eb 488d0d063b9c05  lea     rcx,[xul!nsThread::ThreadListMutex::sMutex+0x8 (00000001`861febf8)]
00000001`8083b0f2 e8c1078001      call    xul!_Init_thread_footer (00000001`8203b8b8)

// Jump back
00000001`8083b0f7 e983feffff      jmp     xul!nsThread::InitCommon+0xaf (00000001`8083af7f)

There is a fair amount of weirdness with respect to xul.dll in those crashes. In addition to what [:gsvelto] mentioned in comment 3, I also noted that in all the crashes, xul.dll seems to be loaded at the preferred base address 0x180000000 instead of a randomized base address (all other modules are loaded at presumably randomized base addresses). I have no idea what could have led to this situation in the first place. I wonder if _tls_index might be poorly initialized for the DLL as well, as I think that could potentially explain the crash (as explained in extra detail below). But I still don't know what could be the cause for that here, unfortunately.

thread_safe_statics.cpp, part of the CRT, defines helpers for implementing safe-thread static local variable initialization like described in comment 4 which are then used by the compiler. It is available publicly e.g. in VC/Tools/MSVC/<version>/crt/src/vcruntime/ under the Visual Studio installation folder (also available online e.g. here).

This code uses epochs, which are 32-bit signed numbers going from INT_MIN up to -2 (-1 and 0 have a special meaning):

static int const uninitialized = 0;
static int const being_initialized = -1;
static int const epoch_start = INT_MIN;

extern "C"
{
    int _Init_global_epoch = epoch_start;
    __declspec(thread) int _Init_thread_epoch = epoch_start;
}

_Init_global_epoch keeps track of the progress of variable initialization within the module (in our case, xul.dll), whereas _Init_thread_epoch is a thread-local variable tracking how much of that progress has already been observed by the current thread. They are updated by the _Init_thread_header and _Init_thread_footer helpers (check the source for more details).

Here is what the offending code for the crash looks like in pseudo-code instead of raw assembly like previously:

// The static local variable sMutex from nsThread::ThreadListMutex()
// gets an associated epoch as a static global.
static int _sMutex_epoch = uninitialized;

void nsThread::InitCommon() {
  // ...

  // Skip if _sMutex_epoch is already initialized and the
  // current thread is already ahead of its initialization epoch.
  if (!(_Init_thread_epoch >= _sMutex_epoch)) {
    _Init_thread_header(&_sMutex_epoch);

    if (_sMutex_epoch == being_initialized) {
      // Only the first thread reaching _Init_thread_header
      // goes through this path. Other threads are stuck in
      // _Init_thread_header until _sMutex_epoch is initialized.
      auto otbMutex = moz_xmalloc(0x30);
      otbMutex->mozilla::detail::MutexImpl::MutexImpl();
      sMutex.mRawPtr = otbMutex;

      _Init_thread_footer(&_sMutex_epoch);
    }
  }

  // This crashes because sMutex.mRawPtr is nullptr.
  sMutex.mRawPtr->mozilla::detail::MutexImpl::lock();

  // ...
}

Because it is __declspec(thread), accessing _Init_thread_epoch is itself more complicated than it seems, as detailed in this very good series of blogposts (here part 4 of 8). It requires reading the ThreadLocalStoragePointer in the current thread's TEB, then indexing it with the _tls_index value for xul.dll, then adding the offset for _Init_thread_epoch. This supposes that _tls_index for xul.dll and the ThreadLocalStoragePointer were set appropriately at process initialization / thread initialization / module load, which is quite complex (see part 5 and part 7 if interested).

Here the crashes look like _Init_thread_epoch >= _sMutex_epoch could have evaluated to true, which could potentially be a consequence of reading from the wrong _tls_index. This was a common problem before Vista (see part 6) but shouldn't be anymore (see part 7).

(In reply to Yannis Juglaret [:yannis] from comment #5)

There is a fair amount of weirdness with respect to xul.dll in those crashes. I addition to what [:gsvelto] mentioned in comment 3, I also noted that in all the crashes, xul.dll seems to be loaded at the preferred base address 0x180000000 instead of a randomized base address (all other modules are loaded at presumably randomized base addresses).

Could we be dealing with people who have explicitly disabled ASLR or have a similar non-standard Windows setup? I'm assuming that with ASLR on, Windows loads a DLL directly at a randomized address.

I believe that bug 1851045 part 3 has changed the signature from [@ nsThread::AddToThreadList] (which still has ESR volume) to [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] for which I initially filed duplicate bug 1892983. As explained there, we are still crashing while adding the main thread to the thread list, but now because nsThreadManager::get().mThreadList.sentinel.mPrev appears to be nullptr, which is confusing because the constructor should have initialized it to point to &nsThreadManager::get().mThreadList.sentinel.

Here are the similarities between the two signatures:

  • they occur on the same code path (nsThreadManager::Init called from mozilla::mscom::EnsureMTA::EnsureMTA);
  • xul.dll is loaded at the default preferred address 0x180000000;
  • they seem like they could be explained by a C++ runtime bug where a block scope variable with static storage is not getting initialized (previously, sMutex from nsThread::ThreadListMutex(); now, sInstance from nsThreadManager::get());
  • the evolution of the crash volume over time looks somewhat similar.

These crash signatures currently have high volume coming from few users. Based on user comments and the number of crashes per user, it seems their Firefox setup always runs into the crash. I'm increasing severity based on the volume.

Severity: S3 → S2
Crash Signature: [@ nsThread::AddToThreadList] → [@ nsThread::AddToThreadList] [@ mozilla::LinkedListElement<T>::setPreviousUnsafe]
Summary: Crash in [@ nsThread::AddToThreadList] → Crash in [@ nsThread::AddToThreadList] and [@ mozilla::LinkedListElement<T>::setPreviousUnsafe]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release

For more information, please visit BugBot documentation.

Keywords: topcrash

This bug seems to cause the install time to be 0 in most crashes. I believe that this results in invalid reporting of the number of unique users affected, which we should therefore consider to be closer to the number of unique crashes. User comments show a variety of languages being used, which also does not match the low reported number of affected users. These comments say that Firefox stopped working after updating. This looks like a serious issue.

It turns out that this is likely not a new issue with 124. This is a longer-term issue that we finally started to notice fully with 124, thanks to the landing of bug 1874235, which included the following diff:

         let install_time = ApplicationInformation::get_install_time(
             &crash_reports_dir,
             &application_data.build_id,
-        )?;
+        )
+        .unwrap_or("0".to_string());

With the old code, the WER module was not reporting these crashes: when it failed to find an install time, it would just exit. With the new code (so since 124), these crashes are now being reported, but with an install time of "0" that is incorrectly interpreted as coming from the same user by our crash infrastructure.

So the bump in volume is likely not an actual bump. The same volume likely existed but was not being reported. The volume is still concerning though.

Whiteboard: [win:stability]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 20 desktop browser crashes on release (startup)

For more information, please visit BugBot documentation.

Because only xul.dll is weird in these crashes (no ASLR, no initialization of block-scoped static variables), I want to put XPCOMGlueLoad and mozilla::ReadAheadLib at the top of the list of suspects at the moment. AFAIK xul.dll is the only library that goes through this prefetching code. So I'll add some diagnostic code there next week. If the suspicion is confirmed, that would give a strong incentive to reland bug 1878993 and use the system prefetcher, so that we can remove our legacy prefetching code (assuming no performance regression).

Thanks to the install time of 0, I found the attached signature which seems to be where people crash in Nightly and Early Beta. Example crash here. Call stack:

00 ntdll!RtlpWaitOnCriticalSection+0xa6
01 ntdll!RtlpEnterCriticalSectionContended+0x1c4
02 ntdll!RtlEnterCriticalSection+0x42
03 xul!AutoCriticalSection::AutoCriticalSection+0xd [/builds/worker/checkouts/gecko/xpcom/base/nsWindowsHelpers.h @ 27] 
04 xul!mozilla::interceptor::VMSharingPolicyShared::Reserve+0x122 [/builds/worker/workspace/obj-build/dist/include/mozilla/interceptor/VMSharingPolicies.h @ 211] 
05 xul!mozilla::interceptor::WindowsDllDetourPatcher<mozilla::interceptor::VMSharingPolicyShared>::DoReserve+0x131 [/builds/worker/workspace/obj-build/dist/include/mozilla/interceptor/PatcherDetour.h @ 558] 
06 xul!mozilla::interceptor::WindowsDllDetourPatcher<mozilla::interceptor::VMSharingPolicyShared>::AddHook+0x249 [/builds/worker/workspace/obj-build/dist/include/mozilla/interceptor/PatcherDetour.h @ 458] 
07 xul!mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>::AddDetour+0x3f [/builds/worker/workspace/obj-build/dist/include/nsWindowsDllInterceptor.h @ 497] 
08 xul!mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>::AddHook+0x110 [/builds/worker/workspace/obj-build/dist/include/nsWindowsDllInterceptor.h @ 435] 
09 xul!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(_OBJECT_ATTRIBUTES *, (anonymous namespace)::_FILE_NETWORK_OPEN_INFORMATION *)>::Apply+0x5 [/builds/worker/workspace/obj-build/dist/include/nsWindowsDllInterceptor.h @ 180] 
0a xul!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(_OBJECT_ATTRIBUTES *, (anonymous namespace)::_FILE_NETWORK_OPEN_INFORMATION *)>::InitOnceCallback+0x22 [/builds/worker/workspace/obj-build/dist/include/nsWindowsDllInterceptor.h @ 203] 
0b ntdll!RtlRunOnceExecuteOnce+0x90
0c KERNELBASE!InitOnceExecuteOnce+0xb
0d xul!mozilla::interceptor::FuncHook<mozilla::interceptor::WindowsDllInterceptor<mozilla::interceptor::VMSharingPolicyShared>,long (*)(void **, unsigned long, _OBJECT_ATTRIBUTES *, _IO_STATUS_BLOCK *, _LARGE_INTEGER *, unsigned long, unsigned long, unsigned long, unsigned long, void *, unsigned long)>::Set+0x5d [/builds/worker/workspace/obj-build/dist/include/nsWindowsDllInterceptor.h @ 131] 
0e xul!mozilla::InitPoisonIOInterposer+0x3d2 [/builds/worker/checkouts/gecko/xpcom/build/PoisonIOInterposerWin.cpp @ 482] 
0f xul!mozilla::IOInterposer::Init+0x588 [/builds/worker/checkouts/gecko/xpcom/build/IOInterposer.cpp @ 411] 
10 xul!mozilla::AutoIOInterposer::Init+0x5 [/builds/worker/workspace/obj-build/dist/include/mozilla/IOInterposer.h @ 269] 
11 xul!XREMain::XRE_main+0x279 [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 5921] 
12 xul!XRE_main+0x85 [/builds/worker/checkouts/gecko/toolkit/xre/nsAppRunner.cpp @ 6015] 
13 firefox!do_main+0xd2 [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 227] 
14 firefox!NS_internal_main+0x38e [/builds/worker/checkouts/gecko/browser/app/nsBrowserApp.cpp @ 445] 
15 firefox!wmain+0x2464 [/builds/worker/checkouts/gecko/toolkit/xre/nsWindowsWMain.cpp @ 151] 
16 firefox!invoke_main+0x22 [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 90] 
17 firefox!__scrt_common_main_seh+0x10c [D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl @ 288] 
18 kernel32!BaseThreadInitThunk+0x14
19 ntdll!RtlUserThreadStart+0x21

This crashing path is reached through IOInterposer code which is disabled in Late Beta and Release. The crash can also be explained by a block scope variable with static storage not getting initialized, and xul.dll is loaded at 0x180000000 here too.

Crash Signature: [@ nsThread::AddToThreadList] [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] → [@ nsThread::AddToThreadList] [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | AutoCriticalSection::AutoCriticalSection ]

Adding what looks like the signature for 32-bit early beta.

Crash Signature: [@ nsThread::AddToThreadList] [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | AutoCriticalSection::AutoCriticalSection ] → [@ nsThread::AddToThreadList] [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] [@ RtlpWaitOnCriticalSection | RtlpEnterCriticalSectionContended | RtlEnterCriticalSection | AutoCriticalSection::AutoCriticalSection ] [@ mozilla::interceptor::ReadOnly…

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Whiteboard: [win:stability] → [win:stability][tbird crash]

Thanks to bug 1897479 we now have Nightly crashes with single-step data for LoadLibrary, under the signature mozilla::GetBootstrap::<T>::operator(). Thanks to :wsmwk we also have an actual user on Reddit, whom I have asked whether they could record a WinDbg time-travel execution, which will hopefully make investigation easier.

Crash Signature: mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] → mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] [@ mozilla::GetBootstrap::<T>::operator()]

(In reply to Yannis Juglaret [:yannis] from comment #18)

Thanks to bug 1897479 we now have Nightly crashes with single-step data for LoadLibrary, in signature mozilla::GetBootstrap::<T>::operator(). Thanks to :wsmwk we also have an actual user on Reddit, to whom I have asked if they could record a WinDbg time-travel execution, which will hopefully make investigation easier.

Did you get the windbg stack? Because the user reports "we figured out that the issue was that for whatever reason my xul.dll file in the Thunderbird install folder was corrupt, and replacing it with the fresh correct version resolved the problem"

Flags: needinfo?(yjuglaret)
Summary: Crash in [@ nsThread::AddToThreadList] and [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] → Startup Crash in [@ nsThread::AddToThreadList] and [@ mozilla::LinkedListElement<T>::setPreviousUnsafe]

Ah, I just noticed the steps listed on Reddit.

Flags: needinfo?(yjuglaret)

I received a time-travel recording from the user on Reddit, and what I found there made me suspect that their xul.dll file on disk was corrupt. So they sent me their copy of the file. The file is indeed corrupt, in a very specific way that perfectly explains the crash. More precisely: the size of the file is correct and the first 119 MB are correct, but every byte from 0x7700000 (119 MB) to the end of the file is zero. By applying the exact same corruption to my own xul.dll file, I can reproduce the issue on my machine.

This perfectly explains the weirdness in this crash signature, as detailed in comment 5, for the following reasons:

  • the corruption makes the Authenticode signature at the end of the file invalid, but we do not actively check the signature at run time, so this does not take us off the normal path;
  • the corruption leaves the most critical sections mostly unaffected because these sections are at the start of the file (in particular .text), so we do not crash immediately when running code from xul.dll;
  • the corruption yields an illegal all-zero .reloc section, so the Windows loader conservatively falls back to not applying ASLR, yet that does not make loading fail either;
  • the corruption yields an all-zero .tls section, and _Init_thread_epoch lives there; therefore _Init_thread_epoch is initialized to 0 instead of INT32_MIN, which breaks the invariants of thread_safe_statics.cpp in the CRT in a way that makes any block-scope static local initialization code be believed to have already run (_Init_thread_epoch >= _sMutex_epoch will always be true).

This is tremendous progress already. Now we should answer the following questions:

  • How did these users get a corrupt xul.dll? Can this be an issue in our updater?
  • Is it the corruption point at 119 MB for everybody, or does this vary across users?

This makes for an interesting case. How hard is testing the Authenticode signatures? We could do that in the crash reporter client and inform the user that they need to reinstall Firefox if verification fails.

The example code below is able to detect that the corrupt file's Authenticode signature is not valid:

#include <windows.h>
#include <softpub.h>

#include <cstdio>

#pragma comment(lib, "wintrust")

int main() {
	WINTRUST_FILE_INFO fileInfo = { sizeof(fileInfo) };
	fileInfo.pcwszFilePath = LR"(C:\Program Files\Mozilla Thunderbird\xul.dll)";

	WINTRUST_DATA trustData = { sizeof(trustData) };
	trustData.dwUnionChoice = WTD_CHOICE_FILE;
	trustData.pFile = &fileInfo;

	trustData.dwUIChoice = WTD_UI_NONE;
	trustData.fdwRevocationChecks = WTD_REVOKE_NONE;
	trustData.dwStateAction = WTD_STATEACTION_VERIFY;
	trustData.dwProvFlags = WTD_CACHE_ONLY_URL_RETRIEVAL;
	GUID policyGuid = WINTRUST_ACTION_GENERIC_VERIFY_V2;

	auto result = WinVerifyTrust(static_cast<HWND>(INVALID_HANDLE_VALUE), &policyGuid, &trustData);
	bool hasValidSignature = result == 0;

	// Release any state data that the verification allocated.
	trustData.dwStateAction = WTD_STATEACTION_CLOSE;
	WinVerifyTrust(static_cast<HWND>(INVALID_HANDLE_VALUE), &policyGuid, &trustData);

	if (hasValidSignature) {
		wprintf(L"Valid signature for %s\n", fileInfo.pcwszFilePath);
	}
	else {
		wprintf(L"Invalid signature for %s (result: %lx)\n", fileInfo.pcwszFilePath, result);
	}

	return 0;
}

This example is inspired by our in-tree code for SignedBinary::VerifySignature, used by DllServices::GetBinaryOrgName. In our specific case here, we only need the part after the comment // First, try the binary itself.

This example yields a result value of TRUST_E_NOSIGNATURE (0x800b0100) for the user's corrupt DLL. The corrupt file has a Checksum value in IMAGE_OPTIONAL_HEADER64 and an IMAGE_DIRECTORY_ENTRY_SECURITY entry in IMAGE_DATA_DIRECTORY, like a legitimately signed file, but the entry points to 0x29a0 bytes located at the end of the file, which are all zero because of the corruption. So the entry is not valid because it does not respect the expected format ("The security information begins with a 32-bit length (which is ignored) and a 32-bit flags field which should be set to 0x00020200."), and therefore WinVerifyTrust treats the file as having no signature.

Note that this code yields the same result value of TRUST_E_NOSIGNATURE (0x800b0100) for our unsigned DLLs, such as those served on our symbol server, even without corrupting them. Our unsigned files have a Checksum of 0 and no IMAGE_DIRECTORY_ENTRY_SECURITY entry in their IMAGE_DATA_DIRECTORY, so we could check these fields to distinguish a corrupt signed DLL from an unsigned DLL if required.

Our signed DLLs yield a result value of 0.

This is interesting. The all-zero tail on the file definitely sounds like a download that failed to finish (I'm not familiar with the updater code, but it's common practice to pre-allocate a file when downloading). I think we've thrown around the idea of having the crash reporter check baked-in hashes of distributed files to detect corruption. One could do that in the updater as well, but corruption might occur by other means.

Some comments indicate a correlation with updates. Graphing by version also shows this correlation, with high crash volume at release time followed by a decline. Changing component accordingly. Other comments indicate a correlation with unsolicited hard shutdowns such as power cuts, system crashes, and forced shutdowns. So it seems users end up in this situation when we get hard-killed mid-update, and we would need some kind of post-update integrity check to avoid this.

Component: IPC: MSCOM → Application Update
Product: Core → Toolkit

I believe that this situation will be addressed via Bug 1891600. I don't really have a plan to address this in any other way in the meantime. In theory there are things we could try, but we'd kinda just be throwing things at the wall to see what sticks since we don't know the root cause of the updater failure.

When we were discussing this today, we realized that, since we do manage to get a crash report out of this situation at least sometimes, it could potentially be valuable to try to get more information on this problem that way. It's hard to be certain that the crash reporter will be in a functional state, but that's probably uncertainty we can live with. Potentially we could have the crash reporter verify our binaries' Authenticode signatures to determine whether the files are intact.

@gsvelto, you work on the crash reporter. What do you think about this idea?

Flags: needinfo?(gsvelto)

(In reply to Robin Steuber (they/them) [:bytesized] from comment #27)

we don't know the root cause of the updater failure.

Is there any data that we could collect in the crash dumps that could inform us about what the updater thinks the current status is?

Based on the user comments mentioned in comment 25 I believe these crashes could be a consequence of disk write caching. Notably (from File Caching):

The frequency at which flushing occurs is an important consideration that balances system performance with system reliability. If the system flushes the cache too often, the number of large write operations flushing incurs will degrade system performance significantly. If the system is not flushed often enough, then the likelihood is greater that either system memory will be depleted by the cache, or a sudden system failure (such as a loss of power to the computer) will happen before the flush. In the latter instance, the cached data will be lost.

[...]

Some applications, such as virus-checking software, require that their write operations be flushed to disk immediately; Windows provides this ability through write-through caching. A process enables write-through caching for a specific I/O operation by passing the FILE_FLAG_WRITE_THROUGH flag into its call to CreateFile. With write-through caching enabled, data is still written into the cache, but the cache manager writes the data immediately to disk rather than incurring a delay by using the lazy writer. A process can also force a flush of a file it has opened by calling the FlushFileBuffers function.

These crashes could be caused by sudden system failures occurring during or shortly after an update, before the data written by the updater has been flushed to disk. The updater doesn't appear to be using write-through caching at the moment, so I don't think it can currently guarantee that all the data has correctly reached the disk. [:bobowen] noted that, if this is indeed what's happening, it would make sense for xul.dll to be the most commonly affected file (as the current crashes suggest), since it's our biggest file by far. Even after bug 1891600, I don't think the updater would be able to guarantee that it reached a point where we are safe against system failures, unless it uses write-through caching.
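For illustration, here is a minimal, portable sketch of the durability point being made. `os.fsync` is the POSIX/Python analogue of Win32 `FlushFileBuffers`; the updater itself is C++ and would call `FlushFileBuffers` or open files with `FILE_FLAG_WRITE_THROUGH` directly. The file name is made up.

```python
# Minimal sketch: closing a file (or exiting the process) does not
# guarantee the bytes are on the platters; an explicit flush does.
# os.fsync plays the role FlushFileBuffers plays on Windows.
import os
import tempfile

def write_durably(path, data):
    with open(path, "wb") as f:
        f.write(data)          # lands in the OS write cache
        f.flush()              # drain Python's userspace buffer first
        os.fsync(f.fileno())   # force the OS cache out to the disk
    # Without the fsync, a power failure after close() can still leave
    # the file truncated or zero-filled on disk.

path = os.path.join(tempfile.mkdtemp(), "xul.dll")
write_durably(path, b"updated contents")
```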

(In reply to Yannis Juglaret [:yannis] from comment #28)

(In reply to Robin Steuber (they/them) [:bytesized] from comment #27)

we don't know the root cause of the updater failure.

Is there any data that we could collect in the crash dumps that could inform us about what the updater thinks the current status is?

I don't really understand what you have in mind, but it doesn't seem likely that we could query any sort of updater status from a crashed Firefox process. The updater has already exited by the time Firefox crashes.

These crashes could be caused by sudden system failures occurring during or shortly after an update, before the data written by the updater has been flushed to disk. The updater doesn't appear to be using write-through caching at the moment, so I don't think it can currently guarantee that all the data has correctly reached the disk. [:bobowen] noted that, if this is indeed what's happening, it would make sense for xul.dll to be the most commonly affected file (as the current crashes suggest), since it's our biggest file by far. Even after bug 1891600, I don't think the updater would be able to guarantee that it reached a point where we are safe against system failures, unless it uses write-through caching.

I feel like I'm missing something here. The only sort of system failure that is being called out explicitly here is a power failure. But if the power fails while the updater is running, it doesn't seem like the kind of caching we use really matters. That sort of thing definitely matters for long-running applications. If you call fwrite and then don't close the file handle for an hour, that data may well not get flushed to the disk for an hour. But if you call fwrite and then exit nearly immediately afterwards (like the updater does), I believe that the data should get flushed to the disk when the handle is closed and the program exits.

So, like, yeah. If there is a power failure while the updater is running (as things are now), we may have written (hypothetically) 60% of the data, of which only 40% has made it to the disk so far. If we used write-through caching, how does this change that situation? It seems to me that it means we write somewhere between 40% and 60% of the data to the disk. That doesn't seem like an improvement that actually helps anyone; Firefox will probably still be unable to launch.

In both cases, we are racing to get the bytes onto the disk as quickly as possible. In the current case, the window for something to go wrong is basically updater launch through updater exit. In the write-through caching case, the window for something to go wrong is basically updater launch through the last fwrite call. But the time between the last fwrite call and updater exit is so small that this seems unlikely to gain us much, if anything.

In fact, as your "File Caching" quote points out, flushing to the disk more frequently decreases performance, meaning that the update would take longer. Consequently, the window for some sort of catastrophic system failure to overlap with the update increases rather than decreases.

So, to me, it doesn't seem like rethinking caching would help us here. But let me know if I'm missing something.

(In reply to Yannis Juglaret [:yannis] from comment #28)

Even after bug 1891600, I don't think the updater would be able to guarantee that it reached a point where we would be safe against system failures, unless it uses write-through caching.

Versioned installation directories will make it impossible to update in-place. You will always be making a copy of the installation into a versioned directory and then updating that. Then we do an atomic swap of the launcher binary.

So yeah, the failures could still happen. But the old version of Firefox will still be functional and the swap to the new version won't have happened. So users won't actually see the failure unless they look through their update history.

Yes, my question is whether there is any kind of updater status or log that the Firefox instance could query so that we can include it in the crash data; it might help the investigation of this issue (e.g. knowing whether we crashed mid-update or whether the updater thinks the update finished).

No, the bytes are not flushed to disk automatically after you close a handle or exit a process. They will reach the disk at different, nondeterministic points, whenever the OS decides it is a good time to do some I/O. Meanwhile, if you try to read these bytes you'll still find the values you expect, but only because the OS maintains this abstraction for you and knows they should be fetched from the system cache. If there is a system failure before all I/O has finished, the system cache will be lost and the data on disk will be left corrupt. On Windows we would need an explicit call to FlushFileBuffers to ensure that the data gets flushed to disk. And yes, that would mean longer updates, but if it means guaranteed non-corrupt installs, that sounds like a good compromise to me.

However, flushing to disk alone is indeed not enough. The updater must also be written in a way that ensures it can always fall back to the previous version if the update did not complete. I think bug 1891600 can provide this guarantee if we flush the new installation files to disk before doing the atomic swap. If the current installer cannot provide this guarantee, that would be a good reason to wait for bug 1891600.

My main point here is that even after bug 1891600, even once the updater thinks it is done, and even if the new Firefox has already started running, a system failure could still lead to a corrupt install after reboot, because the bytes could still be in the system cache, unless we explicitly flushed the files to disk. And I believe that flushing will be required as part of the solution to fix the crashes we see in this bug.

In addition to Raymond Chen's blog entries that I've linked throughout this post because they seemed relevant for the specific point of discussion, he also has one about CopyFile leading to a file filled with zeros (so quite similar to our users' xul.dll), and one about tricks to flush while using Microsoft APIs like CopyFile.

(In reply to Robin Steuber (they/them) [:bytesized] from comment #27)

@gsvelto, you work on the crash reporter. What do you think about this idea?

It's a good idea. We were thinking of doing something along those lines for bug 1821414, and Authenticode signatures sound like a good option we could use on Windows.

Flags: needinfo?(gsvelto)

Just to clarify, if there is no power failure we check the integrity of the files after download, right? Can we just write a marker that indicates that this check has succeeded and not use the new files if it failed? With bug 1891600, this marker would indicate that the old version should be used; without it, the user could be warned that they have to redownload.

Flags: needinfo?(yjuglaret)

Redirecting the question about the normal route to :bytesized. Note that until we flush to disk, the power (or system) failure case is hard to reason about. We can check some files, get the impression that everything is fine, and create a file that supposedly confirms that everything is fine. But if we then run into a system failure without having flushed to disk, after reboot we can be in a situation where the confirmation file has been written to the disk yet heavy files like xul.dll have not been fully pushed to the disk. There is no guarantee on the order in which data reaches the disk (you can think of this as what the progress UI shows when you copy multiple files to a USB disk), so if we were to redo the check at that point, we wouldn't find the same results as before.
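The ordering constraint described above can be sketched as follows. This is a portable illustration with made-up file names; a real implementation on Windows would use FlushFileBuffers rather than os.fsync.

```python
# Sketch of the required ordering: a success marker must not be written
# until the payload files have been forced to disk, otherwise the small
# marker can reach the platters before the heavy files it vouches for.
import os
import tempfile

def durable_write(path, data):
    # Write the bytes and force them out of the OS cache before returning.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

d = tempfile.mkdtemp()

# 1. Flush the heavy payload files first.
durable_write(os.path.join(d, "xul.dll"), b"updated xul contents")

# 2. Only then write (and flush) the confirmation marker. If the machine
#    dies before this point, no marker exists, so a later check correctly
#    reports the update as incomplete instead of trusting cached reads.
durable_write(os.path.join(d, "update-ok.marker"), b"ok")
```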

Flags: needinfo?(yjuglaret) → needinfo?(bytesized)

(In reply to Simon Friedberger (:simonf) from comment #33)

Just to clarify, if there is no power failure we check the integrity of the files after download, right?

No, I don't believe so. With a partial update, we verify the integrity of the existing files before we start and then patch them deterministically. With a complete update, we basically throw out everything in the existing installation and replace it with the files we pull out of the update MAR.

I'm not especially keen on adding this either, since it is common that users would be actively waiting for Firefox to start at this point. If we are concerned about this, I would be much more interested in improving our MAR tests in automation.

Can we just write a marker that indicates that this check has succeeded and not use the new files if it failed? With bug 1891600 this makes sense to indicate that the old version should be used and without it, the user could be warned that they have to redownload.

Something like this is already part of the plan for Bug 1891600. The very last file written by the updater will be the version selector binary. Until that file replacement occurs, launching Firefox will result in the old version being launched, since that is where the old version selector binary will point. No re-download is necessary. We will simply clean up the failed update and report the error back to Firefox so it can handle it appropriately (possibly by asking the user to download the installer and update manually).
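A minimal sketch of this publish-last behavior, under the assumption that the selector can be modeled as a small file replaced atomically (the real plan uses a version selector binary; all names here are illustrative):

```python
# Hypothetical sketch of the versioned-directory plan: the new version is
# staged completely first, and the selector is replaced atomically as the
# very last step. Until that replace happens, launches still get v1.
import os
import tempfile

def current_version(root):
    with open(os.path.join(root, "selector")) as f:
        return f.read()

root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, "v1"))
with open(os.path.join(root, "selector"), "w") as f:
    f.write("v1")

# Stage the new version fully (copying and flushing files would go here).
os.mkdir(os.path.join(root, "v2"))

# Last step: atomic publish. A crash anywhere before os.replace leaves the
# selector pointing at v1, so the old version still launches cleanly.
tmp = os.path.join(root, "selector.tmp")
with open(tmp, "w") as f:
    f.write("v2")
os.replace(tmp, os.path.join(root, "selector"))
print(current_version(root))  # → v2
```

Per the preceding comments, the staged files (and the selector itself) would also need to be flushed to disk before the replace for the guarantee to hold through a power failure.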

Flags: needinfo?(bytesized)
Depends on: 1926447
Blocks: 1926447
No longer depends on: 1926447
Duplicate of this bug: 1991378
Crash Signature: mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] [@ mozilla::GetBootstrap::<T>::operator()] → mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] [@ mozilla::GetBootstrap::<T>::operator()] [@ google::protobuf::EncodedDescriptorDatabase::Add ]
Summary: Startup Crash in [@ nsThread::AddToThreadList] and [@ mozilla::LinkedListElement<T>::setPreviousUnsafe] → Startup crashes caused by incorrect static local initialization with a corrupted xul.dll following update
See Also: → 1991427

(In reply to Robin Steuber (she/her) [:bytesized] from comment #35)

Something like this is already part of the plan for Bug 1891600. The very last file written by the updater will be the version selector binary.

(In reply to Yannis Juglaret [:yannis] from comment #34)

Note that until we flush to disk, the power (or system) failure case is hard to reason about. We can check some files, have the impression that everything is fine, create a file that supposedly confirms that everything is fine, but if we then run into a system failure without having flushed to disk, after reboot we can be in a situation where the confirmation file has been written to the disk yet the heavy files like xul.dll have not been fully pushed to the disk (there is no guarantee on the order of writing to the disk, you can think of this as what the progress UI shows when you copy multiples files to an USB disk) and if we were to redo a check at that point we wouldn't find the same results as before.

(In reply to Robin Steuber (she/her) [:bytesized] from comment #35)

Until that file replacement occurs, launching Firefox will result in the old version being launched, since that is where the old version selector binary will point. No re-download is necessary. We will simply clean up the failed update and report the error back to Firefox so it can handle it appropriately (possibly by asking the user to download the installer and update manually).

I am new to this bug and might have a partial understanding, but IIUC the mechanism :bytesized proposes in bug 1891600 would still potentially be subject to what Yannis describes in comment 34? I think we should try to identify a good moment to explicitly flush, and should probably do so before replacing the version selector binary (or whatever else is the point of no return that makes us use the updated version). At least I see no other way to exclude this potential cause. I assume we do not really know whether this is the root cause of all of this, but it should be straightforward to try. Maybe we could have a nightly/beta-only pref for this to see if it helps?

:yannis, please redirect if needed: google::protobuf::EncodedDescriptorDatabase::Add increased in volume in Fx144 but mozilla::LinkedListElement<T>::setPreviousUnsafe decreased. Any idea whether there is anything of concern in Fx144?

Flags: needinfo?(yjuglaret)

Hi :dmeehan! The transfer from one signature to the other is something that we were expecting. In bug 1991378 we have seen that corrupting xul.dll in the same way as detailed in comment 21 now resulted in crashing with google::protobuf::EncodedDescriptorDatabase::Add rather than mozilla::LinkedListElement<T>::setPreviousUnsafe. This change in signature originates with the landing of bug 1930954. So this issue is "as bad as before", neither better nor worse in 144 compared to previous versions.

Flags: needinfo?(yjuglaret)
Crash Signature: mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] [@ mozilla::GetBootstrap::<T>::operator()] [@ google::protobuf::EncodedDescriptorDatabase::Add ] → mozilla::interceptor::ReadOnlyTargetBytes<T>::IsValidAtOffset] [@ mozilla::GetBootstrap::<T>::operator()] [@ google::protobuf::EncodedDescriptorDatabase::Add ] [@ google::protobuf::internal::InitProtobufDefaultsSlow ] [@ shutdownhang | google::protob…