Closed Bug 1333081 Opened 5 years ago Closed 5 years ago

Application doesn't start when build with ac_add_options --disable-sandbox (Windows x64)

Categories

(Toolkit :: Startup and Profile System, defect, P5)

defect

Tracking

()

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- affected
firefox54 --- fixed

People

(Reporter: jorgk-bmo, Assigned: glandium)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1332523 +++

From bug 1332523 comment #20:

Yesterday, after bug 1332523 had landed, I tried to compile Thunderbird with
  ac_add_options --disable-sandbox
in my Mozconfig (which was really an undesired left-over).

The result was that TB didn't start, this was the stack:

mozglue.dll!arena_dalloc(void * ptr, unsigned __int64 offset) Line 4709	C
thunderbird.exe!mozilla::GetBootstrap(const char * aXPCOMFile) Line 417	C++
thunderbird.exe!InitXPCOMGlue(const char * argv0) Line 248	C++
thunderbird.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 297	C++
thunderbird.exe!wmain(int argc, wchar_t * * argv) Line 118	C++

I debugged it a bit and strangely it failed on the 'return' (!) statement here after successfully returning from (*func)(b):
https://dxr.mozilla.org/mozilla-central/rev/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#407

On the frozen console (couldn't copy/paste) I saw:
jemalloc.c:4710: Failed assertion: "arena != NULL"
Hit MOZ_CRASH() at jemalloc_config.cpp:163

So something went wrong in memory management.

I haven't tried this in Firefox, since it would cost me one hour to compile the potentially failing version and then another hour to return to a good version. However, TB and FF have a very similar main() so I assume FF wouldn't work either.

I'm not sure whether |ac_add_options --disable-sandbox| is still supported, if not, it should be removed instead of causing non-functioning builds.
You're going to need to debug this in detail. It's hard to tell whether this is related to --disable-sandbox at all, or something else entirely. But stepping through the code will tell you the exact control flow and what went wrong better than the stacktrace.
Priority: -- → P5
As I said, I debugged it a bit. The code at 
https://dxr.mozilla.org/mozilla-central/rev/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#407
reads:

  GetBootstrapType func = (GetBootstrapType)GetSymbol(sTop->libHandle, "XRE_GetBootstrap");
  if (!func) {
    return nullptr;
  }

  Bootstrap::UniquePtr b;
  (*func)(b);

  return b;

I dumped out the value of 'func', some pointer, and added debug before and after the call to (*func)(b);

In the end, the assert came within the 'return b;' which seemed to have triggered some memory management in jemalloc.c.

Sorry, I wouldn't know how to debug this further. I don't know which tricks are used for memory management. In fact, I was surprised that jemalloc.c was used, I hadn't come across this before.

I'm really a Thunderbird developer and I'm just reporting this since you seemed interested. After removing |ac_add_options --disable-sandbox| the program I build could be started and I'm afraid I have other duties on the Thunderbird project. Besides, someone who knows what they're doing and looking for should debug this, not someone who has no idea of those intricate internals.

I just have the vague feeling that sandboxing is somehow related to memory management, and recent reshuffling in bug 1332523 triggered this. I've had |ac_add_options --disable-sandbox| im my Mozconfig for a while (some months) and never had any problem until the landing of bug 1332523. I build TB every couple of days, so I would have noticed this earlier.
The fact that this appears to fail during a "return" suggests an RAII destructor, likely the Bootstrap::UniquePtr.

The "arena != NULL" complaint sounds like jemalloc trying to delete something that it didn't allocate, or possibly already freed.

Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that "return b" without a Move().
Flags: needinfo?(nfroyd)
(In reply to David Major [:dmajor] from comment #3)
> The fact that this appears to fail during a "return" suggests an RAII
> destructor, likely the Bootstrap::UniquePtr.
> 
> The "arena != NULL" complaint sounds like jemalloc trying to delete
> something that it didn't allocate, or possibly already freed.
> 
> Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that
> "return b" without a Move().

No, that's perfectly legitimate; the compiler will automatically DTRT.  Whatever pointer we allocated will be moved into the return value, leaving a null pointer in the local |b| variable.
Flags: needinfo?(nfroyd)
> > Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that
> > "return b" without a Move().
> 
> No, that's perfectly legitimate; the compiler will automatically DTRT. 
> Whatever pointer we allocated will be moved into the return value, leaving a
> null pointer in the local |b| variable.

I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)).
So, looking at the --disable-sandbox thing, IIRC in that config we glue the libraries together somehow, but my memory is fuzzy and it may have changed since I last looked. I wonder if jemalloc new+delete didn't get paired up properly? Though I'm still confused as to why we're deleting anything at all. That Bootstrap object ought to be merely moving around.
Maybe you guys have a faster machine that this humble TB developer and you can just compile without sandboxing and see whether it reproduces in FF.
(In reply to David Major [:dmajor] from comment #5)
> > No, that's perfectly legitimate; the compiler will automatically DTRT. 
> > Whatever pointer we allocated will be moved into the return value, leaving a
> > null pointer in the local |b| variable.
> 
> I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)).

I guess we could add one, but it's not really a property of the data structure; it's a property of the language.  Besides, how would you test that, e.g. the local variable was nullptr?
Ok, I debugged locally. The UniquePtr stuff was a red herring; please ignore.

The crash is actually during the dtor for |std::string file(aXPCOMFile);|. We're asking jemalloc to free its buffer:

0:000> kL
 # Child-SP          RetAddr           Call Site
00 00000092`71bff908 00007ff6`c16e1bbd mozglue!free_impl
01 (Inline Function) --------`-------- firefox!std::allocator<char>::deallocate+0xf
02 (Inline Function) --------`-------- firefox!std::_Wrap_alloc<std::allocator<char> >::deallocate+0xf
03 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Tidy+0x19
04 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::{dtor}+0x19
05 00000092`71bff910 00007ff6`c16e1181 firefox!mozilla::GetBootstrap+0x111
06 00000092`71bff9a0 00007ff6`c16e12cb firefox!InitXPCOMGlue+0x5d
07 00000092`71bffb00 00007ff6`c16e1a63 firefox!NS_internal_main+0xaf
08 00000092`71bffb70 00007ff6`c16e2e19 firefox!wmain+0x163
09 (Inline Function) --------`-------- firefox!invoke_main+0x22
0a 00000092`71bffbd0 00007ff8`4d0d8364 firefox!__scrt_common_main_seh+0x11d
0b 00000092`71bffc10 00007ff8`4fba70d1 KERNEL32!BaseThreadInitThunk+0x14
0c 00000092`71bffc40 00000000`00000000 ntdll!RtlUserThreadStart+0x21

But that buffer was allocated by the Windows system heap, not jemalloc:

0:000> !heap -p -a @rcx
    address 0000023a78b65fc0 found in
    _DPH_HEAP_ROOT @ 23a78431000
    in busy allocation (  DPH_HEAP_BLOCK:         UserAddr         UserSize -         VirtAddr         VirtSize)
                             23a78447000:      23a78b65fc0               40 -      23a78b65000             2000
    00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65
    00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc
    00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727
    00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036
    00007ff6c16e3295 firefox!operator new+0x0000000000000031
    00007ff6c16e1fa0 firefox!std::_Allocate+0x0000000000000074
    00007ff6c16e2020 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy+0x0000000000000074
    00007ff6c16e22d9 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign+0x0000000000000095
    00007ff6c16e1b0f firefox!mozilla::GetBootstrap+0x0000000000000063
    00007ff6c16e1181 firefox!InitXPCOMGlue+0x000000000000005d
    00007ff6c16e12cb firefox!NS_internal_main+0x00000000000000af
    00007ff6c16e1a63 firefox!wmain+0x0000000000000163
    00007ff6c16e2e19 firefox!__scrt_common_main_seh+0x000000000000011d
    00007ff84d0d8364 KERNEL32!BaseThreadInitThunk+0x0000000000000014
    00007ff84fba70d1 ntdll!RtlUserThreadStart+0x0000000000000021

So, somehow, std::_Dellocate got wired up to jemalloc, but the string allocation stuff didn't.
I'm ready to bet that porting the changes from bug 1332523 to nsBrowserApp.cpp to TB would fix it.

The only reasonable explanation I can think of is that what happens is that somehow it's trying to free() a stack pointer, the one for exePath. Which actually shouldn't be happening, because the std::string constructor is supposed to be a copy constructor, so freeing the std::string shouldn't try to free the original buffer, but a copy of it.

But that's only a theory. The best way to know for sure would be to print the address of the ptr the arena_dalloc call crashes for, and find what that pointer is for.
Didn't see comment 9. Oh my...
Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build.
(In reply to Mike Hommey [:glandium] from comment #12)
> Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build.

Doesn't work.

I'm mystified this actually doesn't bust Firefox the same way... The easy fix is to change the code in nsXPCOMGlue.cpp to match what used to be in nsBrowserApp.cpp and not use the STL for that.

On the long run, though, it should be possible to use the STL in non libxul code now, as long as libmozglue is linked.
Assignee: nobody → mh+mozilla
FWIW I was building Firefox (not TB), with --disable-sandbox.

With the patch I still get a mismatch. Here's the alloc:

    00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65
    00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc
    00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727
    00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036
    00007ff74ecb26f5 firefox!operator new+0x0000000000000031 [f:\dd\vctools\crt\vcstartup\src\heap\new_scalar.cpp @ 19]
    00007ff74ecb1afd firefox!mozilla::GetBootstrap+0x0000000000000051

And the free:

00 00000001`001ffb18 00007ff7`4ecb1b78 firefox!operator delete[]
01 (Inline Function) --------`-------- firefox!mozilla::DefaultDelete<char [0]>::operator()+0x8
02 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::reset+0xd
03 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::{dtor}+0xd
04 00000001`001ffb20 00007ff7`4ecb1181 firefox!mozilla::GetBootstrap+0xcc
05 00000001`001ffb50 00007ff7`4ecb12cb firefox!InitXPCOMGlue+0x5d
(In comment 15, "firefox!operator delete[]" goes to firefox!free, which goes to mozglue!free_impl.)
And in case anyone's worried about the scalar-new and array-delete mismatch: the CRT "operator new[]" does a tail call to "operator new", which doesn't show up in the stack.
David, can you try the attached patch? If that still crashes, I'll have to also change the new to use malloc and the UniquePtr to free with free(). But I think the mismatch was caused by new happening in the CRT through std::string. I'd expect an explicit new to be okay.
Flags: needinfo?(dmajor)
Yeah, comment 15 was based on having applied your patch from comment 14.
Flags: needinfo?(dmajor)
Can you try the updated patch?
Flags: needinfo?(dmajor)
Patch v2 works.
Flags: needinfo?(dmajor)
Rebased on a revision of m-i that doesn't have valgrind regressions...
Comment on attachment 8829679 [details]
Bug 1333081 - Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows.

https://reviewboard.mozilla.org/r/106682/#review107924

So many hoops to jump through.  Interested on your take on the below.

::: xpcom/glue/standalone/nsXPCOMGlue.cpp:405
(Diff revision 3)
> +  strncpy(file.get(), aXPCOMFile, base_len);
> +  strcpy(file.get() + base_len, XPCOM_DLL);

I wonder if it's not simpler to use `memcpy` here:

```
  memcpy(file.get(), aXPCOMFile, base_len);
  memcpy(file.get() + base_len, XPCOM_DLL, sizeof(XPCOM_DLL));
```

A little more verbose, but maybe more easily auditable?
Attachment #8829679 - Flags: review?(nfroyd) → review+
Pushed by mh@glandium.org:
https://hg.mozilla.org/integration/autoland/rev/9ada53480b75
Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/9ada53480b75
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
Can someone kindly explain to me why this problem only showed up when building with "disable-sandbox".

(In reply to Mike Hommey [:glandium] from comment #10)
> I'm ready to bet that porting the changes from bug 1332523 to
> nsBrowserApp.cpp to TB would fix it.
How much would you like to bet? ;-)
We did port those changes to be able to compile (bug 1332898, mentioned in bug 1332523 comment #20).
That's when the trouble started ;-(
(In reply to Jorg K (GMT+1) from comment #29)
> Can someone kindly explain to me why this problem only showed up when
> building with "disable-sandbox".

Glandium knows the details better, but my high-level hand-wavey understanding is that without the sandbox we send a different set of code to the linker and it interacts differently with our jemalloc function-substitution wizardry. I dealt with something similar in bug 1023941 comment 32 -- that specific case is no longer relevant to our codebase, but it gives you an idea of what kind of horrible stuff goes on under the hood.
Thanks. I'm glad I didn't debug the "horrible stuff ... under the hood" myself as was suggested in comment #1 ;-)
You need to log in before you can comment on or make changes to this bug.