Closed
Bug 1333081
Opened 8 years ago
Closed 8 years ago
Application doesn't start when build with ac_add_options --disable-sandbox (Windows x64)
Categories
(Toolkit :: Startup and Profile System, defect, P5)
Toolkit
Startup and Profile System
Tracking
()
RESOLVED
FIXED
mozilla54
Tracking | Status | |
---|---|---|
firefox52 | --- | unaffected |
firefox53 | --- | affected |
firefox54 | --- | fixed |
People
(Reporter: jorgk-bmo, Assigned: glandium)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
+++ This bug was initially created as a clone of Bug #1332523 +++
From bug 1332523 comment #20:
Yesterday, after bug 1332523 had landed, I tried to compile Thunderbird with
ac_add_options --disable-sandbox
in my Mozconfig (which was really an undesired left-over).
The result was that TB didn't start, this was the stack:
mozglue.dll!arena_dalloc(void * ptr, unsigned __int64 offset) Line 4709 C
thunderbird.exe!mozilla::GetBootstrap(const char * aXPCOMFile) Line 417 C++
thunderbird.exe!InitXPCOMGlue(const char * argv0) Line 248 C++
thunderbird.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 297 C++
thunderbird.exe!wmain(int argc, wchar_t * * argv) Line 118 C++
I debugged it a bit and strangely it failed on the 'return' (!) statement here after successfully returning from (*func)(b):
https://dxr.mozilla.org/mozilla-central/rev/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#407
On the frozen console (couldn't copy/paste) I saw:
jemalloc.c:4710: Failed assertion: "arena != NULL"
Hit MOZ_CRASH() at jemalloc_config.cpp:163
So something went wrong in memory management.
I haven't tried this in Firefox, since it would cost me one hour to compile the potentially failing version and then another hour to return to a good version. However, TB and FF have a very similar main() so I assume FF wouldn't work either.
I'm not sure whether |ac_add_options --disable-sandbox| is still supported, if not, it should be removed instead of causing non-functioning builds.
Comment 1•8 years ago
|
||
You're going to need to debug this in detail. It's hard to tell whether this is related to --disable-sandbox at all, or something else entirely. But stepping through the code will tell you the exact control flow and what went wrong better than the stacktrace.
Priority: -- → P5
Reporter | ||
Comment 2•8 years ago
|
||
As I said, I debugged it a bit. The code at
https://dxr.mozilla.org/mozilla-central/rev/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#407
reads:
GetBootstrapType func = (GetBootstrapType)GetSymbol(sTop->libHandle, "XRE_GetBootstrap");
if (!func) {
return nullptr;
}
Bootstrap::UniquePtr b;
(*func)(b);
return b;
I dumped out the value of 'func', some pointer, and added debug before and after the call to (*func)(b);
In the end, the assert came within the 'return b;' which seemed to have triggered some memory management in jemalloc.c.
Sorry, I wouldn't know how to debug this further. I don't know which tricks are used for memory management. In fact, I was surprised that jemalloc.c was used, I hadn't come across this before.
I'm really a Thunderbird developer and I'm just reporting this since you seemed interested. After removing |ac_add_options --disable-sandbox| the program I build could be started and I'm afraid I have other duties on the Thunderbird project. Besides, someone who knows what they're doing and looking for should debug this, not someone who has no idea of those intricate internals.
I just have the vague feeling that sandboxing is somehow related to memory management, and recent reshuffling in bug 1332523 triggered this. I've had |ac_add_options --disable-sandbox| im my Mozconfig for a while (some months) and never had any problem until the landing of bug 1332523. I build TB every couple of days, so I would have noticed this earlier.
The fact that this appears to fail during a "return" suggests an RAII destructor, likely the Bootstrap::UniquePtr.
The "arena != NULL" complaint sounds like jemalloc trying to delete something that it didn't allocate, or possibly already freed.
Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that "return b" without a Move().
Flags: needinfo?(nfroyd)
Comment 4•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #3)
> The fact that this appears to fail during a "return" suggests an RAII
> destructor, likely the Bootstrap::UniquePtr.
>
> The "arena != NULL" complaint sounds like jemalloc trying to delete
> something that it didn't allocate, or possibly already freed.
>
> Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that
> "return b" without a Move().
No, that's perfectly legitimate; the compiler will automatically DTRT. Whatever pointer we allocated will be moved into the return value, leaving a null pointer in the local |b| variable.
Flags: needinfo?(nfroyd)
> > Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that
> > "return b" without a Move().
>
> No, that's perfectly legitimate; the compiler will automatically DTRT.
> Whatever pointer we allocated will be moved into the return value, leaving a
> null pointer in the local |b| variable.
I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)).
So, looking at the --disable-sandbox thing, IIRC in that config we glue the libraries together somehow, but my memory is fuzzy and it may have changed since I last looked. I wonder if jemalloc new+delete didn't get paired up properly? Though I'm still confused as to why we're deleting anything at all. That Bootstrap object ought to be merely moving around.
Reporter | ||
Comment 7•8 years ago
|
||
Maybe you guys have a faster machine that this humble TB developer and you can just compile without sandboxing and see whether it reproduces in FF.
Comment 8•8 years ago
|
||
(In reply to David Major [:dmajor] from comment #5)
> > No, that's perfectly legitimate; the compiler will automatically DTRT.
> > Whatever pointer we allocated will be moved into the return value, leaving a
> > null pointer in the local |b| variable.
>
> I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)).
I guess we could add one, but it's not really a property of the data structure; it's a property of the language. Besides, how would you test that, e.g. the local variable was nullptr?
Ok, I debugged locally. The UniquePtr stuff was a red herring; please ignore.
The crash is actually during the dtor for |std::string file(aXPCOMFile);|. We're asking jemalloc to free its buffer:
0:000> kL
# Child-SP RetAddr Call Site
00 00000092`71bff908 00007ff6`c16e1bbd mozglue!free_impl
01 (Inline Function) --------`-------- firefox!std::allocator<char>::deallocate+0xf
02 (Inline Function) --------`-------- firefox!std::_Wrap_alloc<std::allocator<char> >::deallocate+0xf
03 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Tidy+0x19
04 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::{dtor}+0x19
05 00000092`71bff910 00007ff6`c16e1181 firefox!mozilla::GetBootstrap+0x111
06 00000092`71bff9a0 00007ff6`c16e12cb firefox!InitXPCOMGlue+0x5d
07 00000092`71bffb00 00007ff6`c16e1a63 firefox!NS_internal_main+0xaf
08 00000092`71bffb70 00007ff6`c16e2e19 firefox!wmain+0x163
09 (Inline Function) --------`-------- firefox!invoke_main+0x22
0a 00000092`71bffbd0 00007ff8`4d0d8364 firefox!__scrt_common_main_seh+0x11d
0b 00000092`71bffc10 00007ff8`4fba70d1 KERNEL32!BaseThreadInitThunk+0x14
0c 00000092`71bffc40 00000000`00000000 ntdll!RtlUserThreadStart+0x21
But that buffer was allocated by the Windows system heap, not jemalloc:
0:000> !heap -p -a @rcx
address 0000023a78b65fc0 found in
_DPH_HEAP_ROOT @ 23a78431000
in busy allocation ( DPH_HEAP_BLOCK: UserAddr UserSize - VirtAddr VirtSize)
23a78447000: 23a78b65fc0 40 - 23a78b65000 2000
00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65
00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc
00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727
00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036
00007ff6c16e3295 firefox!operator new+0x0000000000000031
00007ff6c16e1fa0 firefox!std::_Allocate+0x0000000000000074
00007ff6c16e2020 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy+0x0000000000000074
00007ff6c16e22d9 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign+0x0000000000000095
00007ff6c16e1b0f firefox!mozilla::GetBootstrap+0x0000000000000063
00007ff6c16e1181 firefox!InitXPCOMGlue+0x000000000000005d
00007ff6c16e12cb firefox!NS_internal_main+0x00000000000000af
00007ff6c16e1a63 firefox!wmain+0x0000000000000163
00007ff6c16e2e19 firefox!__scrt_common_main_seh+0x000000000000011d
00007ff84d0d8364 KERNEL32!BaseThreadInitThunk+0x0000000000000014
00007ff84fba70d1 ntdll!RtlUserThreadStart+0x0000000000000021
So, somehow, std::_Dellocate got wired up to jemalloc, but the string allocation stuff didn't.
Assignee | ||
Comment 10•8 years ago
|
||
I'm ready to bet that porting the changes from bug 1332523 to nsBrowserApp.cpp to TB would fix it.
The only reasonable explanation I can think of is that what happens is that somehow it's trying to free() a stack pointer, the one for exePath. Which actually shouldn't be happening, because the std::string constructor is supposed to be a copy constructor, so freeing the std::string shouldn't try to free the original buffer, but a copy of it.
But that's only a theory. The best way to know for sure would be to print the address of the ptr the arena_dalloc call crashes for, and find what that pointer is for.
Assignee | ||
Comment 11•8 years ago
|
||
Didn't see comment 9. Oh my...
Assignee | ||
Comment 12•8 years ago
|
||
Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build.
Assignee | ||
Comment 13•8 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #12)
> Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build.
Doesn't work.
I'm mystified this actually doesn't bust Firefox the same way... The easy fix is to change the code in nsXPCOMGlue.cpp to match what used to be in nsBrowserApp.cpp and not use the STL for that.
On the long run, though, it should be possible to use the STL in non libxul code now, as long as libmozglue is linked.
Assignee: nobody → mh+mozilla
Comment hidden (mozreview-request) |
Comment 15•8 years ago
|
||
FWIW I was building Firefox (not TB), with --disable-sandbox.
With the patch I still get a mismatch. Here's the alloc:
00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65
00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc
00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727
00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036
00007ff74ecb26f5 firefox!operator new+0x0000000000000031 [f:\dd\vctools\crt\vcstartup\src\heap\new_scalar.cpp @ 19]
00007ff74ecb1afd firefox!mozilla::GetBootstrap+0x0000000000000051
And the free:
00 00000001`001ffb18 00007ff7`4ecb1b78 firefox!operator delete[]
01 (Inline Function) --------`-------- firefox!mozilla::DefaultDelete<char [0]>::operator()+0x8
02 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::reset+0xd
03 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::{dtor}+0xd
04 00000001`001ffb20 00007ff7`4ecb1181 firefox!mozilla::GetBootstrap+0xcc
05 00000001`001ffb50 00007ff7`4ecb12cb firefox!InitXPCOMGlue+0x5d
Comment 16•8 years ago
|
||
(In comment 15, "firefox!operator delete[]" goes to firefox!free, which goes to mozglue!free_impl.)
Comment 17•8 years ago
|
||
And in case anyone's worried about the scalar-new and array-delete mismatch: the CRT "operator new[]" does a tail call to "operator new", which doesn't show up in the stack.
Assignee | ||
Comment 18•8 years ago
|
||
David, can you try the attached patch? If that still crashes, I'll have to also change the new to use malloc and the UniquePtr to free with free(). But I think the mismatch was caused by new happening in the CRT through std::string. I'd expect an explicit new to be okay.
Flags: needinfo?(dmajor)
Comment 19•8 years ago
|
||
Yeah, comment 15 was based on having applied your patch from comment 14.
Flags: needinfo?(dmajor)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 24•8 years ago
|
||
Rebased on a revision of m-i that doesn't have valgrind regressions...
Comment 25•8 years ago
|
||
mozreview-review |
Comment on attachment 8829679 [details]
Bug 1333081 - Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows.
https://reviewboard.mozilla.org/r/106682/#review107924
So many hoops to jump through. Interested on your take on the below.
::: xpcom/glue/standalone/nsXPCOMGlue.cpp:405
(Diff revision 3)
> + strncpy(file.get(), aXPCOMFile, base_len);
> + strcpy(file.get() + base_len, XPCOM_DLL);
I wonder if it's not simpler to use `memcpy` here:
```
memcpy(file.get(), aXPCOMFile, base_len);
memcpy(file.get() + base_len, XPCOM_DLL, sizeof(XPCOM_DLL));
```
A little more verbose, but maybe more easily auditable?
Attachment #8829679 -
Flags: review?(nfroyd) → review+
Comment hidden (mozreview-request) |
Comment 27•8 years ago
|
||
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/9ada53480b75
Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows. r=froydnj
Updated•8 years ago
|
status-firefox52:
--- → unaffected
status-firefox54:
--- → affected
Comment 28•8 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Reporter | ||
Comment 29•8 years ago
|
||
Can someone kindly explain to me why this problem only showed up when building with "disable-sandbox".
(In reply to Mike Hommey [:glandium] from comment #10)
> I'm ready to bet that porting the changes from bug 1332523 to
> nsBrowserApp.cpp to TB would fix it.
How much would you like to bet? ;-)
We did port those changes to be able to compile (bug 1332898, mentioned in bug 1332523 comment #20).
That's when the trouble started ;-(
Comment 30•8 years ago
|
||
(In reply to Jorg K (GMT+1) from comment #29)
> Can someone kindly explain to me why this problem only showed up when
> building with "disable-sandbox".
Glandium knows the details better, but my high-level hand-wavey understanding is that without the sandbox we send a different set of code to the linker and it interacts differently with our jemalloc function-substitution wizardry. I dealt with something similar in bug 1023941 comment 32 -- that specific case is no longer relevant to our codebase, but it gives you an idea of what kind of horrible stuff goes on under the hood.
Reporter | ||
Comment 31•8 years ago
|
||
Thanks. I'm glad I didn't debug the "horrible stuff ... under the hood" myself as was suggested in comment #1 ;-)
You need to log in
before you can comment on or make changes to this bug.
Description
•