Closed Bug 1351838 Opened 3 years ago Closed 3 years ago

BHR stackwalker and profiler stackwalker both call Windows CRT code that requires a critical section

Categories

(Core :: Gecko Profiler, enhancement)

52 Branch
enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: overholt, Assigned: dmajor)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 1 obsolete file)

One of my content processes was hung and dmajor helped figure out that the Telemetry stackwalker and the profiler stackwalker were both calling some Windows CRT function that was deadlocking on a critical section.

If it'll help, I can probably make it happen again and collect whatever info is useful :)
From my recollection of dmajor's description of his recollection:
The profiler suspended the main thread while the main thread was in some Windows function that grabbed the CRT memory allocator lock. Then the profiler itself tried to grab that same lock, by doing logging through some stdlib operator<< method that somehow chose to allocate memory through the CRT allocator instead of through jemalloc.
The background hang reporter might also have suspended and un-suspended the main thread briefly during this scenario, but that part doesn't seem relevant to the deadlock.
The take-away from this seems to be: Don't do anything on the sampler thread while another thread is suspended.
Blocks: 1329181
That should have been: don't do anything *interesting* on the sampler thread while another thread is suspended.

This is not very precise, and we'll need to gather more information about the relevant functions on Windows. Not sure how - we might even have to wait for Andrew to run into this hang again.
I just dumped the call tree of SuspendAndSampleAndResumeThread in my debugger and I don't see any logging or operator<< or anything in the CRT other than memset.

Andrew, if you could grab some stacks with "~*k" if this happens again, it would be helpful in jogging my memory!
Doh, you hit the function table lock, more similar to bug 1349444.

Markus, what do you think about doing AcquireStackWalkWorkaroundLock in our hook of LdrLoadDll? (And, for the shutdown bug, LdrUnloadDll?)
Flags: needinfo?(dmajor) → needinfo?(mstange)
Oh, if we already hook those functions, then that sounds like a great solution!
Flags: needinfo?(mstange)
Ok, let's move to bug 1349444 and leave this one open for the CRT issue.
Attached file threads.txt
dmajor, I grabbed all the thread's stacks with a currently-hung content process. Do you recall if it's the same situation we discussed a month or so ago?
Flags: needinfo?(dmajor)
Bingo -- thanks!

Callstack for Thread 1 (Thread Id: 4820 (0x12d4)):
 Index  Function
--------------------------------------------------------------------------------
 1      ucrtbase.dll!__acrt_GetLocaleInfoA()
 2      ucrtbase.dll!__acrt_locale_initialize_time()
 3      ucrtbase.dll!wcstol()
 4      ucrtbase.dll!_malloc_base()
 5      ucrtbase.dll!iswctype()
 6      ucrtbase.dll!iswctype()
 7      ucrtbase.dll!_malloc_base()
 8      ucrtbase.dll!iswctype()
 9      ucrtbase.dll!_wsetlocale()
 10     ucrtbase.dll!setlocale()
 11     ucrtbase.dll!_mbstowcs_s_l()
 12     ucrtbase.dll!setlocale()
 13     ucrtbase.dll!setlocale()
 14     msvcp140.dll!std::_Locinfo::_Locinfo_dtor(std::_Locinfo * pLocinfo)
 15     msvcp140.dll!std::_Locinfo::~_Locinfo()
 16     msvcp140.dll!std::numpunct<char>::_Getcat(const std::locale::facet * * _Ppf=0x00000006f83f9b10, const std::locale * _Ploc=0x00000006f83f9b50)
 17     msvcp140.dll!std::use_facet<std::numpunct<char> >(const std::locale & _Loc={...})
 18     msvcp140.dll!std::num_put<char,std::ostreambuf_iterator<char,std::char_traits<char> > >::_Iput(std::ostreambuf_iterator<char,std::char_traits<char> > _Dest={...}, std::ios_base & _Iosbase={...}, char _Fill=' ', char * _Buf=0x00000006f83f9c40, unsigned __int64 _Count=0x0000000000000010)
 19     msvcp140.dll!std::num_put<char,std::ostreambuf_iterator<char,std::char_traits<char> > >::do_put(std::ostreambuf_iterator<char,std::char_traits<char> > _Dest, std::ios_base & _Iosbase={...}, char _Fill=' ', const void * _Val=0x000001a6268a3140)
 20     msvcp140.dll!std::basic_ostream<char,std::char_traits<char> >::operator<<(const void * _Val=0x000001a6268a3140)
 21     xul.dll!mozilla::gfx::DrawTargetD2D1::operator std::basic_string<char,std::char_traits<char>,std::allocator<char> >()


Callstack for Thread 26 (Thread Id: 10476 (0x28ec)):
 Index  Function
--------------------------------------------------------------------------------
 1      ntdll.dll!NtWaitForAlertByThreadId()
 2      ntdll.dll!RtlpWaitOnAddressWithTimeout()
 3      ntdll.dll!RtlpWaitOnCriticalSection()
 4      ntdll.dll!RtlpEnterCriticalSectionContended()
 5      ntdll.dll!RtlEnterCriticalSection()
 6      ucrtbase.dll!__acrt_update_thread_locale_data()
 7      ucrtbase.dll!_LocaleUpdate::_LocaleUpdate(struct __crt_locale_pointers * const)
 8      ucrtbase.dll!__stdio_common_vsprintf()
 9      xul.dll!vsnprintf(char * const _Buffer=0x00000006fbff13c0, const unsigned __int64 _BufferCount=0x0000000000000080, const char * const _Format=0x00007ff98ed9bdb8, char * _ArgList=0x00000006fbff1390)
 10     xul.dll!SprintfLiteral<128>(char[128] & buffer=..., const char * format=0x00007ff98ed9bdb8, ...)
 11     xul.dll!AddPseudoEntry(const mozilla::BaseAutoLock<PSMutex> & aBuffer=0x000001a61daa6d00, ProfileBuffer * entry={...}, volatile js::ProfileEntry &)
 12     xul.dll!MergeStacksIntoProfile(const mozilla::BaseAutoLock<PSMutex> & aBuffer=0x000001a61daa6d00, ProfileBuffer * aSample={...}, const TickSample & aNativeStack={...}, NativeStack &)
 13     xul.dll!DoNativeBacktrace(const mozilla::BaseAutoLock<PSMutex> & aBuffer=0x000001a61daa6d00, ProfileBuffer * aSample={...}, const TickSample &)
 14     xul.dll!Tick(const mozilla::BaseAutoLock<PSMutex> & aBuffer=0x000001a61daa6d00, ProfileBuffer * aSample={...}, const TickSample &)
 15     xul.dll!SamplerThread::SuspendAndSampleAndResumeThread(const mozilla::BaseAutoLock<PSMutex> & aSample={...}, TickSample &)
 16     xul.dll!SamplerThread::Run()
Assignee: nobody → dmajor
Flags: needinfo?(dmajor)
WinDbg says only the main thread is suspended.

The BHR thread is also trying to acquire that CRT critical section, but I think that the appearance of this stack is just fallout from the fact that the main thread is hung and can't release the CS. So I don't think the BHR thread is to blame here. (This agrees with my recollection of comment 1.)

  12  Id: 1100.1c94 Suspend: 1 Teb: 00000006`f81e0000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 00000006`f9dfe648 00007ff9`d3f7beab ntdll!NtWaitForAlertByThreadId+0x14
01 00000006`f9dfe650 00007ff9`d3f7bd11 ntdll!RtlpWaitOnAddressWithTimeout+0x43
02 00000006`f9dfe680 00007ff9`d3f54571 ntdll!RtlpWaitOnCriticalSection+0x1a1
03 00000006`f9dfe730 00007ff9`d3f54490 ntdll!RtlpEnterCriticalSectionContended+0xd1
04 00000006`f9dfe760 00007ff9`c3247bf5 ntdll!RtlEnterCriticalSection+0x40
05 (Inline Function) --------`-------- MSVCP140!std::_Lockit::{ctor}+0xa [f:\dd\vctools\crt\crtw32\stdcpp\xlock.cpp @ 67]
06 00000006`f9dfe790 00007ff9`c323e44d MSVCP140!std::locale::_Init+0x25 [f:\dd\vctools\crt\crtw32\stdcpp\locale0.cpp @ 165]
07 (Inline Function) --------`-------- MSVCP140!std::locale::{ctor}+0x7 [f:\dd\vctools\crt\crtw32\stdhpp\xlocale @ 330]
08 00000006`f9dfe7d0 00007ff9`c323f6cb MSVCP140!std::ios_base::_Init+0x5d [f:\dd\vctools\crt\crtw32\stdhpp\xiosbase @ 548]
09 (Inline Function) --------`-------- MSVCP140!std::basic_ios<char,std::char_traits<char> >::init+0x8 [f:\dd\vctools\crt\crtw32\stdhpp\ios @ 171]
0a 00000006`f9dfe810 00007ff9`8d431f30 MSVCP140!std::basic_ostream<char,std::char_traits<char> >::basic_ostream<char,std::char_traits<char> >+0x7b [f:\dd\vctools\crt\crtw32\stdhpp\ostream @ 55]
0b 00000006`f9dfe860 00007ff9`8e681279 xul!std::basic_ostringstream<char,std::char_traits<char>,std::allocator<char> >::basic_ostringstream<char,std::char_traits<char>,std::allocator<char> >+0x48 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\vs2015u3\vc\include\sstream @ 492]
0c 00000006`f9dfe890 00007ff9`8e79a291 xul!SharedLibraryInfo::GetInfoForSelf+0x2c1 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\tools\profiler\core\shared-libraries-win32.cc @ 171]
0d 00000006`f9dff120 00007ff9`8d05d575 xul!mozilla::Telemetry::GetStackAndModules+0x149 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\toolkit\components\telemetry\telemetry.cpp @ 3172]
0e 00000006`f9dff200 00007ff9`8d00f2b1 xul!mozilla::HangMonitor::GetChromeHangReport+0xe1 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\threads\hangmonitor.cpp @ 179]
0f 00000006`f9dff7a0 00007ff9`c27b4fb2 xul!mozilla::HangMonitor::ThreadMain+0x2f40a9 [c:\builds\moz2_slave\m-cen-w64-ntly-000000000000000\build\src\xpcom\threads\hangmonitor.cpp @ 243]
Hah, I just hit this hang locally too. My lock was taken by different code but it's the same issue:

00 000000a8`aebfb2a0 00007ff8`f296fb83 KERNELBASE!LCMapStringEx+0x292
01 000000a8`aebfb430 00007ff8`f2971fc9 ucrtbase!wsetlocale+0x433
02 000000a8`aebfb8e0 00007ff8`f297e9d4 ucrtbase!_acrt_LCMapStringA+0x71
03 000000a8`aebfb960 00007ff8`f297e55f ucrtbase!_acrt_locale_initialize_ctype+0x1e4
04 000000a8`aebfba30 00007ff8`f296f692 ucrtbase!wcstol+0x21f
05 000000a8`aebfbda0 00007ff8`f296ec1a ucrtbase!malloc_base+0x142
06 000000a8`aebfbfe0 00007ff8`f296eac3 ucrtbase!iswctype+0x8fa
07 000000a8`aebfc030 00007ff8`f296f738 ucrtbase!iswctype+0x7a3
08 000000a8`aebfc060 00007ff8`f296ea71 ucrtbase!malloc_base+0x1e8
09 000000a8`aebfc0c0 00007ff8`f296f8c3 ucrtbase!iswctype+0x751
0a 000000a8`aebfc0f0 00007ff8`f29704b7 ucrtbase!wsetlocale+0x173
0b 000000a8`aebfc1a0 00007ff8`f29707bc ucrtbase!setlocale+0x147
0c 000000a8`aebfc1e0 00007ff8`f29703f7 ucrtbase!mbstowcs_s_l+0x2cc
0d 000000a8`aebfc260 00007ff8`f29703b7 ucrtbase!setlocale+0x87
0e 000000a8`aebfc290 00007ff8`dec978a0 ucrtbase!setlocale+0x47
0f 000000a8`aebfc2d0 00007ff8`dec89d7e MSVCP140!std::_Locinfo::_Locinfo_dtor+0x20
10 000000a8`aebfc300 00007ff8`dec93acd MSVCP140!std::_Locinfo::~_Locinfo+0xe
11 000000a8`aebfc330 00007ff8`dec9706e MSVCP140!std::num_put<char,std::ostreambuf_iterator<char,std::char_traits<char> > >::_Getcat+0x8d
12 000000a8`aebfc3d0 00007ff8`decc379d MSVCP140!std::use_facet<std::num_put<char,std::ostreambuf_iterator<char,std::char_traits<char> > > >+0xce
13 000000a8`aebfc430 00007ff8`eeb1f3df MSVCP140!std::basic_ostream<char,std::char_traits<char> >::operator<<+0x6d
14 000000a8`aebfc4e0 00007ff8`eeb1fe09 mozglue!mozToString+0x5b
15 000000a8`aebfc610 00007ff8`eeb2017e mozglue!blink::Decimal::toString+0x159
16 000000a8`aebfc6e0 00007ff8`b67cc858 mozglue!blink::Decimal::toString+0x2a
17 000000a8`aebfc760 00007ff8`b67cffb3 xul!mozilla::dom::HTMLInputElement::SanitizeValue+0x484
This comment sums it up very well. :-)

  // WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING
  //
  // The profiler's "critical section" begins here.  In the critical section,
  // we must not do any dynamic memory allocation, nor try to acquire any lock
  // or any other unshareable resource.  This is because the thread to be
  // sampled has been suspended at some entirely arbitrary point, and we have
  // no idea which unsharable resources (locks, essentially) it holds.  So any
  // attempt to acquire any lock, including the implied locks used by the
  // malloc implementation, risks deadlock.
This looked nicer in my first attempt with size-templated strcpy_s/strcat_s, but then I realized this function isn't limited to Windows builds.
Attachment #8862952 - Flags: review?(nfroyd)
Comment on attachment 8862952 [details] [diff] [review]
Don't sprintf in the profiler since it may take a lock

Review of attachment 8862952 [details] [diff] [review]:
-----------------------------------------------------------------

::: tools/profiler/core/platform.cpp
@@ +698,5 @@
>    char combinedStringBuffer[SAMPLER_MAX_STRING_LENGTH];
>  
>    if (entry.isCopyLabel() || dynamicString) {
>      if (dynamicString) {
> +      // Avoid sprintf because it can take a lock on Windows.

maybe add ', and this function runs during the profiler's "critical section", see the comment in SamplerThread::SuspendAndSampleAndResumeThread' here
Comment on attachment 8862952 [details] [diff] [review]
Don't sprintf in the profiler since it may take a lock

Review of attachment 8862952 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, curious what you think about the below.

::: tools/profiler/core/platform.cpp
@@ +703,5 @@
> +      size_t srcLen = strlen(sampleLabel) + strlen(dynamicString) + 2; // Space+null
> +      if (srcLen <= ArrayLength(combinedStringBuffer)) {
> +        strcpy(combinedStringBuffer, sampleLabel);
> +        strcat(combinedStringBuffer, " ");
> +        strcat(combinedStringBuffer, dynamicString);

WDYT about converting all this to use memcpy instead?

  size_t sampleLabelLen = strlen(sampleLabel);
  size_t dynamicStringLen = strlen(dynamicString);
  size_t srcLen = ...;
  if (...) {
    memcpy(&combinedStringBuffer[0], sampleLabel, sampleLabelLen);
    memcpy(&combinedStringBuffer[sampleLabelLen], " ", 1);
    memcpy(&combinedStringBuffer[sampleLabelLen+1], dynamicString, dynamicStringLen);
    combinedStringBuffer[srcLen - 1] = '\0';
  }

I know some people and static analyzers have a strong aversion to strcpy and strcat, at least, and this might (?) be faster because we have explicit lengths, rather than looking for terminating zeroes.
Attachment #8862952 - Flags: review?(nfroyd) → review+
Attached patch sprintf-podcopySplinter Review
This is based on the original PodCopy version from Markus at https://reviewboard.mozilla.org/r/112778/diff/4/#c160840

As much as I'd like to merge this with the code at profiler_get_backtrace_noalloc, I have too many patches in flight and just don't have the energy.
Attachment #8862952 - Attachment is obsolete: true
Attachment #8863493 - Flags: review?(nfroyd)
Comment on attachment 8863493 [details] [diff] [review]
sprintf-podcopy

Review of attachment 8863493 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with the below change.

::: tools/profiler/core/platform.cpp
@@ +698,5 @@
>    char combinedStringBuffer[SAMPLER_MAX_STRING_LENGTH];
>  
>    if (entry.isCopyLabel() || dynamicString) {
>      if (dynamicString) {
> +      // Create a string that is sampleLabel + ' ' + annotationString.

I think you had a comment here previously explaining that snprintf can malloc under Windows, and we don't want to do that.  Please put that back in.
Attachment #8863493 - Flags: review?(nfroyd) → review+
Pushed by dmajor@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a63da0aaaa0d
Don't sprintf within the profiler since it may take a lock. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/a63da0aaaa0d
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Blocks: 1366030
You need to log in before you can comment on or make changes to this bug.