Closed Bug 1547519 Opened 1 year ago Closed 5 months ago

mingwclang build: application crashed [@ RtlpLowFragHeapFree + 0x2c]

Categories

(Firefox Build System :: General: Unsupported Platforms, defect)

defect
Not set

Tracking

(firefox-esr6870+ fixed, firefox71 fixed)

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr68 70+ fixed
firefox71 --- fixed

People

(Reporter: tjr, Assigned: tjr)

References

Details

Attachments

(4 files)

For a sample, see the h5 test in https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ca6024f54a968e3b51420a1e2b009a040bd023d&selectedJob=243113787

Crash reason: EXCEPTION_ACCESS_VIOLATION_READ
Crash address: 0xffffffffaf9314c7
Assertion: Unknown assertion type 0x00000000
Process uptime: 31 seconds

Thread 0 (crashed)
0 ntdll.dll!RtlpLowFragHeapFree + 0x2c
eip = 0x778c2d37 esp = 0x0013d724 ebp = 0x0013d758 ebx = 0x0a3d9ed0
esi = 0xaf9314c3 edi = 0x0a3d9ec8 eax = 0xe5e50069 ecx = 0x00700000
edx = 0x0a3d9ed0 efl = 0x00010286
Found by: given as instruction pointer in context
1 ntdll.dll!RtlFreeHeap + 0x7e
eip = 0x778c2ce8 esp = 0x0013d760 ebp = 0x0013d770 ebx = 0x0a3d9ed0
Found by: call frame info
2 ucrtbase.dll!LdrpCleanUpTebData + 0x15
eip = 0x72c0f7fb esp = 0x0013d778 ebp = 0x0013d784 ebx = 0xfffffffe
Found by: call frame info
3 mozglue.dll + 0xa51dd
eip = 0x6ca051dd esp = 0x0013d78c ebp = 0x0013d7d4
Found by: previous frame's frame pointer
4 mozglue.dll + 0x89b93
eip = 0x6c9e9b93 esp = 0x0013d7dc ebp = 0x0013d850
Found by: previous frame's frame pointer
5 xul.dll + 0xfc9873
eip = 0x109a9873 esp = 0x0013d858 ebp = 0x0013d910
Found by: previous frame's frame pointer

There's a lot of these crashes in the tests, triggered by many different tests.

Getting this on file; no stacktrace or reproduction steps for it currently.

00 ntdll!RtlpLowFragHeapFree
01 ntdll!RtlFreeHeap
02 ucrtbase!free
03 mozglue!std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> >::~basic_string
04 mozglue!blink::Decimal::toString
05 mozglue!blink::Decimal::toDouble
06 xul!nsRangeFrame::GetValueAsFractionOfRange

Is there a malloc/free mismatch maybe? (These builds leave jemalloc enabled, right?)

(In reply to David Major [:dmajor] from comment #1)

00 ntdll!RtlpLowFragHeapFree
01 ntdll!RtlFreeHeap
02 ucrtbase!free
03 mozglue!std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> >::~basic_string
04 mozglue!blink::Decimal::toString
05 mozglue!blink::Decimal::toDouble
06 xul!nsRangeFrame::GetValueAsFractionOfRange

Is there a malloc/free mismatch maybe? (These builds leave jemalloc enabled, right?)

We do have jemalloc enabled yes. We had seen a mismatch with mingw-gcc in Bug 1466192 but in that instance it would crash on startup.

I'll have to see if I can figure out how to repro this locally and then do the same !heap analysis...

I believe nsRangeFrame is input type=range. I was able to repro this by opening layout/reftests/forms/input/range/max-prop.html directly (no mochitest environment needed) and playing with the range control.

I managed to catch this in a time travel trace.

Alloc:

00 006fdf20 6ba8b310 mozglue!je_malloc [/builds/worker/workspace/build/src/memory/build/malloc_decls.h @ 38] 
01 006fdf30 6ba7986e mozglue!moz_xmalloc+0x10 [/builds/worker/workspace/build/src/memory/mozalloc/mozalloc.cpp @ 68] 
02 006fdf3c 6ba863db mozglue!std::__1::__libcpp_allocate+0xe [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/new @ 239] 
03 006fdf50 6ba79593 mozglue!std::__1::allocator<char>::allocate+0x1b [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/memory @ 1800] 
04 006fdf60 6ba74601 mozglue!std::__1::allocator_traits<std::__1::allocator<char> >::allocate+0x13 [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/memory @ 1549] 
05 006fdf8c 6ba751c2 mozglue!std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> >::__init<char *>+0x81 [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/string @ 2065] 
06 006fdfa8 6ba5f2a1 mozglue!std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> >::basic_string<char *>+0x22 [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/string @ 2097] 
07 006fdfd8 6ba5f7b5 mozglue!std::__1::basic_stringbuf<char,std::__1::char_traits<char>,std::__1::allocator<char> >::str+0xa1 [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/sstream @ 450] 
08 006fdfe8 6ba25fae mozglue!std::__1::basic_ostringstream<char,std::__1::char_traits<char>,std::__1::allocator<char> >::str+0x15 [/builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/sstream @ 853] 
09 006fe08c 6ba59e46 mozglue!mozToString+0x4e [/builds/worker/workspace/build/src/mfbt/decimal/moz-decimal-utils.h @ 77] 
0a 006fe0e0 6ba59c5e mozglue!blink::Decimal::toString+0x1b6 [/builds/worker/workspace/build/src/mfbt/decimal/Decimal.cpp @ 990] 
0b 006fe104 61993d8e mozglue!blink::Decimal::toString+0x1e [/builds/worker/workspace/build/src/mfbt/decimal/Decimal.cpp @ 1036] 
0c 006fe140 6089b72d xul!NumericInputTypeBase::ConvertNumberToString+0x2e [/builds/worker/workspace/build/src/dom/html/input/NumericInputTypes.cpp @ 113]  

Free:

00 006fe090 6ba751dd ucrtbase!free
01 006fe098 6ba59fdb mozglue!std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> >::~basic_string+0xd [/builds/worker/workspace/moz-toolchain/src/libcxx/include/string @ 2134] 
02 006fe0e0 6ba59c5e mozglue!blink::Decimal::toString+0x34b [/builds/worker/workspace/build/src/mfbt/decimal/Decimal.cpp @ 1030] 
03 006fe104 61993d8e mozglue!blink::Decimal::toString+0x1e [/builds/worker/workspace/build/src/mfbt/decimal/Decimal.cpp @ 1036] 
04 006fe140 6089b72d xul!NumericInputTypeBase::ConvertNumberToString+0x2e [/builds/worker/workspace/build/src/dom/html/input/NumericInputTypes.cpp @ 113] 

So we're obviously not using the right free here, but why?

In the free I see the source file as moz-toolchain/src/libcxx/include/string but I don't see that file in the allocation. In the free the type is std::__1::basic_string<char,std::__1::char_traits<char>,std::__1::allocator<char> > which is in the allocation stack where the source file is /builds/worker/workspace/build/src/clang/bin/../i686-w64-mingw32/include/c++/v1/string

That's probably because part of libc++ is a static library. moz-toolchain/src/libcxx/include/string is the file used to build the static library. I guess linker trick to replace malloc/free doesn't work in this case because libc++ is later on linker list. Maybe explicitly specifying -lc++ before mozalloc would help.

(In reply to Jacek Caban from comment #6)

I guess linker trick to replace malloc/free doesn't work in this
case because libc++ is later on linker list.

How come that problem isn't symmetric? __libcpp_allocate goes to jemalloc but ~basic_string (which I believe has an inlined call to __libcpp_deallocate) doesn't.

I'm having trouble navigating the maze of templates and ifdefs, but I wonder if it's trying to use sized dealloc, and we didn't override that?

Martin mentioned the following on irc:

libc++ stl allocations can be done inline by the headers (i.e. affected by jemalloc) or in the libc++ prebuilt lib (which lacks jemalloc). and if alloc is done by header inline but dealloc by libc++ prebuilt obj, you have the issue

I think that worked. Trying the steps in Comment 3 I wasn't able to repro it with this build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ba5b6b853b09c00ff896626e3d84c4ae4b9e890

I added tests to it to see what try says...

(In reply to Tom Ritter [:tjr] from comment #9)

I think that worked. Trying the steps in Comment 3 I wasn't able to repro it with this build: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7ba5b6b853b09c00ff896626e3d84c4ae4b9e890

I added tests to it to see what try says...

I only tested x64 but just noticed the issue is only on x86.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=636e881e3e24b69356077ec02f1a04d336e5b172 shows that it did not fix it.

Depends on: 1554063

We synced up about this at All Hands; https://treeherder.mozilla.org/#/jobs?repo=try&revision=daabceae4b91bb8ccbaf9bb022b2712ccc680f7f is the most recent build that has both dependencies in it. It still doesn't work; and at this point we're not entirely sure a path forward for a fix. It may require a linker patch..

In https://hg.mozilla.org/mozilla-central/rev/1254bad83887 / Bug 1559379 we landed a change that inadvertently changed this from a runtime error to a compile-time:

[task 2019-06-27T17:42:56.938Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __Znwj in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.939Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZnwjRKSt9nothrow_t in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.939Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __Znaj in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.940Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZnajRKSt9nothrow_t in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.941Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdlPv in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.941Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdlPvRKSt9nothrow_t in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.942Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdlPvj in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.942Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdaPv in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.943Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdaPvRKSt9nothrow_t in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)
[task 2019-06-27T17:42:56.943Z] 17:42:56 INFO - lld-link: error: duplicate symbol: __ZdaPvj in ../../memory/mozalloc/cxxalloc.o and in libc++.a(new.cpp.obj)

Accordingly we had to disable jemalloc on the x86 and x64 builds.

Martin dug into this for me. Conversation slightly edits to take out irrelevant bits

14:59:12 <wbs> tjr: Unified_cpp_memory_build0.o contains the functions je_malloc.o and je_free.o, and mozglue.def contains export lines that say "malloc=je_malloc", so the built dll does export a function called "malloc", which redirects to the symbol je_malloc inside the dll
14:59:50 <wbs> tjr: however, any code within mozglue.dll which isn't built with the jemalloc header, will still end up referencing plain unprefixed malloc/free, this includes libc++.a, and those calls will end up directly to the crt
15:04:35 <tjr> It sounds like the fix may just be to include the jemalloc header inside files built inside mozglue? (like decimal.cpp)?
15:05:18 <wbs> it's not necessarily enough
15:05:46 <wbs> as mozglue gets libc++.a statically linked in, whatever bits that are from other object files in libc++.a won't get the right redirections, unless you build a separate libc++_jemalloc.a
15:09:15 <tjr> 1) does " libc++_jemalloc.a" mean "build libc++ with a jemalloc already provided to it"?
15:09:15 <tjr> 2) What do you mean by "whatever bits that are from other object files in libc++.a"?
15:10:21 <wbs> with libc++_jemalloc.a, I mean build libc++.a with the headers that do something akin to -Dmalloc=je_malloc etc, which makes any malloc reference end up pointing to the je_malloc symbol (which you presumabily use for building the rest of the object files in mozglue.dll)
15:11:55 <wbs> and with 2), I mean that for something like "class string", some bits are written inline in the <string> header and will be inlined in the calling object files. for those, adding the redirection header while building the mozglue.dll object files will be enough. but in some cases, use of the <string> header won't be inline function calls, but calls to functions defined in object files within libc++.a (which hasn't been built with the jemalloc redirection headers/defines)

Jacek had another idea too:

15:22:12 <jacekc> tjr: so the problem manifest itself only in code using libc++ inside mozglue? it works fine for all other modules?
15:22:55 <jacekc> in that case maybe you could try link mozglue with its own importlib imported...
15:23:23 <tjr> That is the only manifestation we have observed; yes
15:26:39 <tjr> If you can tell me how I'd try that I'm happy to give it a shot...
15:27:01 <jacekc> so an interesting experiment would be to link it to itself. it may be tricky with makefiles, but if you have a local build, it should be easy
15:27:22 <jacekc> provoke mozglue.dll to be re-linked, copy the command line and add -lmozglue

So I have three things to work on

  1. Try the trick to relink mozglue with -lmozglue. Jacek tried this with a test program and it seemed to work so that's promising.

And if that doesn't work:

  1. Add jemalloc headers to all the mozglue code. See if this fixes everything we know about. If not the problem is probably libc++ so I then have to:

  2. Fix the libc++ compile. This may be as simple as something like -DCMAKE_CXX_FLAGS="-Dmalloc=je_malloc -Dfree=je_free ..." around https://searchfox.org/mozilla-central/rev/f372470e10c8cb0691681603a1d6324dee5b3b8a/taskcluster/scripts/misc/build-clang-8-mingw.sh#259

(In reply to Tom Ritter [:tjr] from comment #13)

  1. Try the trick to relink mozglue with -lmozglue. Jacek tried this with a test program and it seemed to work so that's promising.

I tested this, and it seems to work. The problems with it are:

  1. Conceptually, kind of a hack.
  2. It might be ugly getting the build system to do the re-link needed
  3. I'm not sure how it will interact with https://hg.mozilla.org/mozilla-central/rev/1254bad83887 / Bug 1559379 - I'll need to investigate this.

(In reply to Tom Ritter [:tjr] from comment #14)

  1. It might be ugly getting the build system to do the re-link needed

FWIW, you don't need to re-link. Currently importlib is while the library is linked, but it doesn't have to be. Since we have a .def file, it could be done using only dlltool and .def file before mozglue.dll is linked.

I managed to get this working with considerable help from Jacek and Martin. It required a few steps: https://hg.mozilla.org/try/rev/db7e6df47ad6d35e9754786a56689436311d73e7

  1. If I link mozglue with it's own importlib imported; then all malloc calls will get redirected to je_malloc as desired. To do this I first use dlltool to generate an initial importlib and include that in the linking step.

  2. the operator new/delete functions declared in Bug 1559379 :

On Windows, because of how some CRT static libraries are, a non-inlined operator new (thanks to some disabled STL wrapping) would end up linked against the system malloc, causing problems.

This doesn't have to be the case; as the prior step showed. So I have turned them back into forceinline for MinGW.

  1. Additionally, I needed to update dependentlibs.py because we just created an infinite recursive loop

glandium - what do you think of this patch? Is there a better way to accomplish what I'm doing in rules.mk; or should I just clean it up to only apply to MinGW?

Flags: needinfo?(mh+mozilla)

Rather than creating the importlib during the linking step for mozglue,
we use dlltool to create it, and then provide it during linking. This
allows mozglue to know that it should redirect calls to malloc to je_malloc
as specified in the .def file.

I'm not familiar enough with Firefox moz.build/makefiles to provide detailed instructions, but it seems to me that it should be possible to express that inside mozglue instead of global makefiles. Maybe as a separated static library that would happen to share .def file? mozglue would then link against that static library (from build system's point of view).

Small update; I think I have a better solution to this using attribute alias: https://hg.mozilla.org/try/rev/4730165728f0b88b769c260bb469e7f18d5fd99a

I need to test and confirm, then clean up that version so it's applicable uniformly.

Okay, I confirmed the alias technique works and no longer repros the crash. It seems strongly preferable to the dlltool approach, so I've updated the patch to use it; I think it's ready for review.

Flags: needinfo?(mh+mozilla)
Pushed by cbrindusan@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/37276c204412
Rename NS_STRINGIFY to MOZ_STRINGIFY, move to mfbt, and unify stragglers r=glandium
https://hg.mozilla.org/mozilla-central/rev/139e9fd6792d
Fix jemalloc redirections for MinGW build r=glandium
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

[Tracking Requested - why for this release]: This prevents crashes on MinGW clang builds.

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Tor Browser is based on ESR68 and will need this patch to avoid crashing.
  • User impact if declined: Tor will have to carry the patch themselves.
  • Fix Landed on Version: 70
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We rename a macro across the codebase, which affects everything; but if it broke something it'd be a compiler error. The rest of the changes are MinGW-only.
  • String or UUID changes made by this patch:
Attachment #9092085 - Flags: approval-mozilla-esr68?
Comment on attachment 9092085 [details] [diff] [review]
Bug 1547519 - Rename NS_STRINGIFY to MOZ_STRINGIFY, move to mfbt, and unify stragglers (esr68 rebase) r=glandium

Basically a no-op change for official builds. Makes life easier for Tor. Approved for 68.2esr.
Attachment #9092085 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
Attachment #9092086 - Flags: approval-mozilla-esr68? → approval-mozilla-esr68+
You need to log in before you can comment on or make changes to this bug.