Closed Bug 1420350 Opened 3 years ago Closed 3 years ago

MinGW crashes on startup with --enable-jemalloc

Categories

(Core :: Memory Allocator, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla62
Tracking Status
firefox-esr60 61+ fixed
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- fixed

People

(Reporter: tjr, Assigned: tjr)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

15:28 <glandium> it's only enabled on windows only with msvc and clang-cl, not mingw
15:29 <glandium> I did a try build with it forced on mingw, and that didn't go well when I tried to run it
15:29 <glandium> tjr: https://treeherder.mozilla.org/#/jobs?repo=try&revision=804518c2c5999bb9a05a3d4830d9866856e54586
15:29 <glandium> that crashes on startup
15:30 <tjr> a) does that mean mingw by default doesn't use mozjemalloc and just uses the system malloc?
15:30 <glandium> yes
15:30 <tjr> b) are you sure it's crashing because of mozjemalloc? cause mingw builds crash on startup no matter what right now
15:30 <glandium> I tried a mingw build from central, and it got me the UI
15:30 <tjr> We're tracing that down here presently: https://bugzilla.mozilla.org/show_bug.cgi?id=1411401#c10
15:30 <firebot> Bug 1411401 — NEW, tom@mozilla.com — MinGW build does not run without `gfx.direct2d.disabled` and `layers.acceleration.disabled` due to w
15:31 <glandium> and content process crashes, but at least it got up to the UI
15:31 <glandium> with mozjemalloc it didn't even go that far
I recall talking with Jacek about the reason for that but I don't remember the details.
Flags: needinfo?(jacek)
Summary: --enable-jemalloc doesn't work with MinGW → MinGW crashes on startup with --enable-jemalloc
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=804518c2c5999bb9a05a3d4830d9866856e54586

This build definitely seems to have some allocator mismatches. In various attempts I got complaints at:
> HEAP[firefox.exe]: Invalid address specified to RtlSizeHeap( 02F80000, 0490B080 )
> Assertion failure: arena, at /builds/worker/workspace/build/src/memory/build/mozjemalloc.cpp:3588
> Assertion failure: arena->mMagic == 0x947d3d24, at /builds/worker/workspace/build/src/memory/build/mozjemalloc.cpp:3589

Also, I see a lot of:
> Error loading c:\Users\dmajor\Desktop\asanjemalloc\firefox\nss3.dll: A
Yes, cut off at the "A". I bet there's some narrow-vs-wide string confusion happening. https://dxr.mozilla.org/mozilla-central/rev/da90245d47b17c750560dedb5cbe1973181166e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#60 uses wprintf with "%s" and a wide string. Cppreference says %ls should be used, but MSDN says it's %s in the Microsoft extension. Is MinGW doing something weird here?
(In reply to Georg Koppen from comment #2)
> I recall talking with Jacek about the reason for that but I don't remember
> the details.

Sorry, but I don't remember. I recall trying it some years ago, but I think it didn't build for some reason. If it builds now, it's a step further now.
Flags: needinfo?(jacek)
(In reply to David Major [:dmajor] from comment #3)
> Also, I see a lot of:
> > Error loading c:\Users\dmajor\Desktop\asanjemalloc\firefox\nss3.dll: A
> Yes, cut off at the "A". I bet there's some narrow-vs-wide string confusion
> happening.
> https://dxr.mozilla.org/mozilla-central/rev/
> da90245d47b17c750560dedb5cbe1973181166e3/xpcom/glue/standalone/nsXPCOMGlue.
> cpp#60 uses wprintf with "%s" and a wide string. Cppreference says %ls
> should be used, but MSDN says it's %s in the Microsoft extension. Is MinGW
> doing something weird here?

Yes, it has its own implementation meant to be standard compliant. It's controlled by __USE_MINGW_ANSI_STDIO, see:

https://github.com/mirror/mingw-w64/blob/master/mingw-w64-headers/crt/_mingw.h.in#L418

I'm not sure how it's enabled in this case. This could be disabled globally with something like -D__USE_MINGW_ANSI_STDIO=0.
> > HEAP[firefox.exe]: Invalid address specified to RtlSizeHeap( 02F80000, 0490B080 )

This happens during the DLL entry point function for nss3 (and probably others).

During the entry routine, the mingw build calls malloc to get some memory for the dllonexit table, then passes it into msvcrt!__dllonexit to register some exit functions. msvcrt!__dllonexit expects to be able to call msize() on the table and grow it if necessary. But nss3's original malloc went to jemalloc, so kaboom.

On msvc builds this doesn't become an issue. The entry point _DllMainCRTStartup is much more lightweight. I don't see it calling any allocation functions. It does call ucrtbase!_initialize_onexit_table, but there's no malloc there; the table is baked into the nss3 binary (nss3!module_local_at_quick_exit_table). And the entry point doesn't register any onexit functions that I can see.

Jacek, do you happen to know where the source for `_DllMainCRTStartup` (or whatever mingw calls its analogous thing) lives?
Flags: needinfo?(jacek)
I think I found it...

This malloc gets redirected to jemalloc: https://github.com/mirror/mingw-w64/blob/ed415ac93648a91589551f3f07d1e3a0c414ce30/mingw-w64-crt/crt/crtdll.c#L73

But maybe it shouldn't have used jemalloc, because that pointer gets passed to __dllonexit which assumes it comes from the CRT heap: https://github.com/mirror/mingw-w64/blob/6172d2f87e426462f89785753f1be553bd10fe2f/mingw-w64-crt/crt/atonexit.c#L47
Flags: needinfo?(jacek)
Is it intentional to call msvcrt!__dllonexit, as opposed to something self-implemented like this? https://github.com/mirror/mingw-w64/blob/2e64b9e4537d564478f17b873b2f655f518325ed/mingw-w64-crt/crt/ucrtbase_compat.c#L99
Attached patch patch.diff (obsolete) — Splinter Review
Actually, it seems that problematic allocation should not really be needed and the attached patch should work (that's totally untested, based only on reading mingw and Wine sources). The whole handling of atexit in mingw should really be rewritten. I will try to take a look at it soon.
You'd need to enable jemalloc to check if that solves the problem, though.
Oh you did.
(In reply to Tom Ritter [:tjr] from comment #10)
> Here's a build with Jacek's patch:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=5d7827d9b6ac72c59b0859572f7de7c4b2d03777

This build crashes before showing the UI.
> Actually, it seems that problematic allocation should not really be needed
> and the attached patch should work (that's totally untested, based only on
> reading mingw and Wine sources). 

It crashes in msvcrt!__dllonexit reading a null pointer.

Again I wonder: is it intentional to call msvcrt's version of this function, rather than mingw's?
Yeah, the patch was wrong, sorry about that. I better solution upstream for review:

https://sourceforge.net/p/mingw-w64/mailman/message/36186685/
https://sourceforge.net/p/mingw-w64/mailman/message/36186686/
https://sourceforge.net/p/mingw-w64/mailman/message/36186687/
https://sourceforge.net/p/mingw-w64/mailman/message/36186688/
https://sourceforge.net/p/mingw-w64/mailman/message/36186689/

I don't know why __dllonexit was used. It's an ancient code (predating mingw-w64 fork so it's not even easy to check how ancient). Rewrite of it nicely aligns with recent efforts to support ucrt-based builds. My patches introduce a compatibility implementation of new functions available in ucrtbase.dll. msvcrt.dll-based builds (like currently Firefox) will use mingw implementation with regular allocators. ucrt-based builds will use _initialize_onexit_table and alike just like MSVC does. I hope it will fix the problem.
As of commit cf0c9d637b6c1c9667bdf1e6580af03e36c3ddd1, those patches are in mingw-w64 git.
Okay dmajor tells me that crashes with something I fixed (but did not include) from Bug 1411401 so I'm going to re-run with that included.
I think this is fixed, but we won't know for sure until we sort out Bug 1411401. I'm putting the patch here for safekeeping though.
Interesting.  Without jemalloc gtest crashes like

> 22:52:37     INFO -  out of memory: 0x00000000656E6308 bytes requested
> 22:52:37     INFO -  Hit MOZ_CRASH() at /builds/worker/workspace/build/src/memory/mozalloc/mozalloc_abort.cpp:33
> 22:52:38     INFO -  mozcrash INFO | Copy/paste: Z:\task_1520376292\build\win32-minidump_stackwalk.exe Z:\task_1520376292\build\tests\gtest\f523332b-0dc7-4ced-9699-0b121c9f6853.dmp Z:\task_1520376292\build\symbols

With jemalloc it fails earlier:

> 17:57:17     INFO - Calling ['Z:\\task_1520443696\\build\\venv\\Scripts\\python', '-u', 'Z:\\task_1520443696\\build\\tests\\gtest\\rungtests.py', '--xre-path=Z:\\task_1520443696\\build\\application\\firefox', '--cwd=Z:\\task_1520443696\\build\\tests\\gtest', '--symbols-path=Z:\\task_1520443696\\build\\symbols', '--utility-path=tests/bin', 'Z:\\task_1520443696\\build\\application\\firefox\\firefox.exe'] with output_timeout 1000
> 17:57:18     INFO -  gtest INFO | Running gtest
> 17:57:18     INFO -  DLL blocklist was unable to intercept AppInit DLLs.
> 17:57:18     INFO -  [4008, Unnamed thread 2828040] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 194
> 17:57:18     INFO -  [4008, Unnamed thread 2828040] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 194
> 17:57:18     INFO -  [4008, Unnamed thread 2828040] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /builds/worker/workspace/build/src/xpcom/base/nsTraceRefcnt.cpp, line 194
> 17:57:18     INFO -  Assertion failure: arena, at /builds/worker/workspace/build/src/memory/build/mozjemalloc.cpp:3494
> 17:57:18    ERROR -  gtest TEST-UNEXPECTED-FAIL | gtest | test failed with return code 3221225477
> Assertion failure: arena, at /builds/worker/workspace/build/src/memory/build/mozjemalloc.cpp:3494

That could be a double-free. Or an attempt to deallocate a pointer not owned by mozjemalloc.
Alright I tested x86 and x64 opt builds.  Both work. (x64 needs Bug 1460720)
Comment on attachment 8956872 [details]
Bug 1420350 Enable jemalloc on MinGW

https://reviewboard.mozilla.org/r/225828/#review249804

::: build/moz.configure/memory.configure:25
(Diff revision 3)
>          return bool(value) or None
>  
>      if target.kernel in ('Darwin', 'Linux'):
>          return True
>  
> -    if target.kernel == 'WINNT' and c_compiler.type in ('msvc', 'clang-cl'):
> +    if target.kernel == 'WINNT':

Please merge with the previous condition.
Attachment #8956872 - Flags: review?(mh+mozilla)
Comment on attachment 8956872 [details]
Bug 1420350 Enable jemalloc on MinGW

https://reviewboard.mozilla.org/r/225828/#review250202
Attachment #8956872 - Flags: review?(mh+mozilla) → review+
Comment on attachment 8956872 [details]
Bug 1420350 Enable jemalloc on MinGW

[Approval Request Comment]

This is one of several MinGW Build patches I'd like to land in esr60 for Tor. It will prevent them from carrying their own patches for the lifetime of esr60 and will enable us to keep the MinGW build functioning and know if/when/how it was broken by new commits into esr60.

This commit only affects the MinGW build configuration, so it is low-risk.
Attachment #8956872 - Flags: approval-mozilla-esr60?
Pushed by dluca@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/8d15ab11df3b
Enable jemalloc on MinGW r=glandium
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/8d15ab11df3b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Comment on attachment 8956872 [details]
Bug 1420350 Enable jemalloc on MinGW

mingw fix for 60.1esr
Attachment #8956872 - Flags: approval-mozilla-esr60? → approval-mozilla-esr60+
You need to log in before you can comment on or make changes to this bug.