Bugzilla

Reporter

Comment 2

•

8 years ago

As I said, I debugged it a bit. The code at https://dxr.mozilla.org/mozilla-central/rev/3cedab21a7e65e6a1c4c2294ecfb5502575a46e3/xpcom/glue/standalone/nsXPCOMGlue.cpp#407 reads: GetBootstrapType func = (GetBootstrapType)GetSymbol(sTop->libHandle, "XRE_GetBootstrap"); if (!func) { return nullptr; } Bootstrap::UniquePtr b; (*func)(b); return b; I dumped out the value of 'func', some pointer, and added debug before and after the call to (*func)(b); In the end, the assert came within the 'return b;' which seemed to have triggered some memory management in jemalloc.c. Sorry, I wouldn't know how to debug this further. I don't know which tricks are used for memory management. In fact, I was surprised that jemalloc.c was used, I hadn't come across this before. I'm really a Thunderbird developer and I'm just reporting this since you seemed interested. After removing |ac_add_options --disable-sandbox| the program I build could be started and I'm afraid I have other duties on the Thunderbird project. Besides, someone who knows what they're doing and looking for should debug this, not someone who has no idea of those intricate internals. I just have the vague feeling that sandboxing is somehow related to memory management, and recent reshuffling in bug 1332523 triggered this. I've had |ac_add_options --disable-sandbox| im my Mozconfig for a while (some months) and never had any problem until the landing of bug 1332523. I build TB every couple of days, so I would have noticed this earlier.

Comment 3

•

8 years ago

The fact that this appears to fail during a "return" suggests an RAII destructor, likely the Bootstrap::UniquePtr. The "arena != NULL" complaint sounds like jemalloc trying to delete something that it didn't allocate, or possibly already freed. Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that "return b" without a Move().

Flags: needinfo?(nfroyd)

Nathan Froyd [:froydnj]

Comment 4

•

8 years ago

(In reply to David Major [:dmajor] from comment #3) > The fact that this appears to fail during a "return" suggests an RAII > destructor, likely the Bootstrap::UniquePtr. > > The "arena != NULL" complaint sounds like jemalloc trying to delete > something that it didn't allocate, or possibly already freed. > > Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that > "return b" without a Move(). No, that's perfectly legitimate; the compiler will automatically DTRT. Whatever pointer we allocated will be moved into the return value, leaving a null pointer in the local |b| variable.

Flags: needinfo?(nfroyd)

Comment 5

•

8 years ago

> > Nathan, does the UniquePtr usage look kosher? I'm a bit suspicious of that > > "return b" without a Move(). > > No, that's perfectly legitimate; the compiler will automatically DTRT. > Whatever pointer we allocated will be moved into the return value, leaving a > null pointer in the local |b| variable. I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)).

Comment 6

•

8 years ago

So, looking at the --disable-sandbox thing, IIRC in that config we glue the libraries together somehow, but my memory is fuzzy and it may have changed since I last looked. I wonder if jemalloc new+delete didn't get paired up properly? Though I'm still confused as to why we're deleting anything at all. That Bootstrap object ought to be merely moving around.

Reporter

Comment 7

•

8 years ago

Maybe you guys have a faster machine that this humble TB developer and you can just compile without sandboxing and see whether it reproduces in FF.

Nathan Froyd [:froydnj]

Comment 8

•

8 years ago

(In reply to David Major [:dmajor] from comment #5) > > No, that's perfectly legitimate; the compiler will automatically DTRT. > > Whatever pointer we allocated will be moved into the return value, leaving a > > null pointer in the local |b| variable. > > I didn't see any examples of that in TestUniquePtr.cpp (hint hint :)). I guess we could add one, but it's not really a property of the data structure; it's a property of the language. Besides, how would you test that, e.g. the local variable was nullptr?

Comment 9

•

8 years ago

Ok, I debugged locally. The UniquePtr stuff was a red herring; please ignore. The crash is actually during the dtor for |std::string file(aXPCOMFile);|. We're asking jemalloc to free its buffer: 0:000> kL # Child-SP RetAddr Call Site 00 00000092`71bff908 00007ff6`c16e1bbd mozglue!free_impl 01 (Inline Function) --------`-------- firefox!std::allocator<char>::deallocate+0xf 02 (Inline Function) --------`-------- firefox!std::_Wrap_alloc<std::allocator<char> >::deallocate+0xf 03 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Tidy+0x19 04 (Inline Function) --------`-------- firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::{dtor}+0x19 05 00000092`71bff910 00007ff6`c16e1181 firefox!mozilla::GetBootstrap+0x111 06 00000092`71bff9a0 00007ff6`c16e12cb firefox!InitXPCOMGlue+0x5d 07 00000092`71bffb00 00007ff6`c16e1a63 firefox!NS_internal_main+0xaf 08 00000092`71bffb70 00007ff6`c16e2e19 firefox!wmain+0x163 09 (Inline Function) --------`-------- firefox!invoke_main+0x22 0a 00000092`71bffbd0 00007ff8`4d0d8364 firefox!__scrt_common_main_seh+0x11d 0b 00000092`71bffc10 00007ff8`4fba70d1 KERNEL32!BaseThreadInitThunk+0x14 0c 00000092`71bffc40 00000000`00000000 ntdll!RtlUserThreadStart+0x21 But that buffer was allocated by the Windows system heap, not jemalloc: 0:000> !heap -p -a @rcx address 0000023a78b65fc0 found in _DPH_HEAP_ROOT @ 23a78431000 in busy allocation ( DPH_HEAP_BLOCK: UserAddr UserSize - VirtAddr VirtSize) 23a78447000: 23a78b65fc0 40 - 23a78b65000 2000 00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65 00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc 00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727 00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036 00007ff6c16e3295 firefox!operator new+0x0000000000000031 00007ff6c16e1fa0 firefox!std::_Allocate+0x0000000000000074 00007ff6c16e2020 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::_Copy+0x0000000000000074 00007ff6c16e22d9 firefox!std::basic_string<char,std::char_traits<char>,std::allocator<char> >::assign+0x0000000000000095 00007ff6c16e1b0f firefox!mozilla::GetBootstrap+0x0000000000000063 00007ff6c16e1181 firefox!InitXPCOMGlue+0x000000000000005d 00007ff6c16e12cb firefox!NS_internal_main+0x00000000000000af 00007ff6c16e1a63 firefox!wmain+0x0000000000000163 00007ff6c16e2e19 firefox!__scrt_common_main_seh+0x000000000000011d 00007ff84d0d8364 KERNEL32!BaseThreadInitThunk+0x0000000000000014 00007ff84fba70d1 ntdll!RtlUserThreadStart+0x0000000000000021 So, somehow, std::_Dellocate got wired up to jemalloc, but the string allocation stuff didn't.

Assignee

Comment 10

•

8 years ago

I'm ready to bet that porting the changes from bug 1332523 to nsBrowserApp.cpp to TB would fix it. The only reasonable explanation I can think of is that what happens is that somehow it's trying to free() a stack pointer, the one for exePath. Which actually shouldn't be happening, because the std::string constructor is supposed to be a copy constructor, so freeing the std::string shouldn't try to free the original buffer, but a copy of it. But that's only a theory. The best way to know for sure would be to print the address of the ptr the arena_dalloc call crashes for, and find what that pointer is for.

Assignee

Comment 11

•

8 years ago

Didn't see comment 9. Oh my...

Assignee

Comment 12

•

8 years ago

Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build.

Assignee

Comment 13

•

8 years ago

(In reply to Mike Hommey [:glandium] from comment #12) > Try removing DISABLE_STL_WRAPPING from xpcom/glue/standalone/moz.build. Doesn't work. I'm mystified this actually doesn't bust Firefox the same way... The easy fix is to change the code in nsXPCOMGlue.cpp to match what used to be in nsBrowserApp.cpp and not use the STL for that. On the long run, though, it should be possible to use the STL in non libxul code now, as long as libmozglue is linked.

Assignee: nobody → mh+mozilla

Comment hidden (mozreview-request)

Comment 15

•

8 years ago

FWIW I was building Firefox (not TB), with --disable-sandbox. With the patch I still get a mismatch. Here's the alloc: 00007ff84fc0fd99 ntdll!RtlDebugAllocateHeap+0x000000000003bf65 00007ff84fbfdb7c ntdll!RtlpAllocateHeap+0x0000000000083fbc 00007ff84fb78097 ntdll!RtlpAllocateHeapInternal+0x0000000000000727 00007ff84cfe0d16 ucrtbase!malloc_base+0x0000000000000036 00007ff74ecb26f5 firefox!operator new+0x0000000000000031 [f:\dd\vctools\crt\vcstartup\src\heap\new_scalar.cpp @ 19] 00007ff74ecb1afd firefox!mozilla::GetBootstrap+0x0000000000000051 And the free: 00 00000001`001ffb18 00007ff7`4ecb1b78 firefox!operator delete[] 01 (Inline Function) --------`-------- firefox!mozilla::DefaultDelete<char [0]>::operator()+0x8 02 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::reset+0xd 03 (Inline Function) --------`-------- firefox!mozilla::UniquePtr<char [0],mozilla::DefaultDelete<char [0]> >::{dtor}+0xd 04 00000001`001ffb20 00007ff7`4ecb1181 firefox!mozilla::GetBootstrap+0xcc 05 00000001`001ffb50 00007ff7`4ecb12cb firefox!InitXPCOMGlue+0x5d

Comment 16

•

8 years ago

(In comment 15, "firefox!operator delete[]" goes to firefox!free, which goes to mozglue!free_impl.)

Comment 17

•

8 years ago

And in case anyone's worried about the scalar-new and array-delete mismatch: the CRT "operator new[]" does a tail call to "operator new", which doesn't show up in the stack.

Assignee

Comment 18

•

8 years ago

David, can you try the attached patch? If that still crashes, I'll have to also change the new to use malloc and the UniquePtr to free with free(). But I think the mismatch was caused by new happening in the CRT through std::string. I'd expect an explicit new to be okay.

Flags: needinfo?(dmajor)

Comment 19

•

8 years ago

Yeah, comment 15 was based on having applied your patch from comment 14.

Flags: needinfo?(dmajor)

Comment hidden (mozreview-request)

Assignee

Comment 21

•

8 years ago

Can you try the updated patch?

Flags: needinfo?(dmajor)

Comment 22

•

8 years ago

Patch v2 works.

Flags: needinfo?(dmajor)

Comment hidden (mozreview-request)

Ryan VanderMeulen [:RyanVM]

Assignee

Comment 24

•

8 years ago

Rebased on a revision of m-i that doesn't have valgrind regressions...

Nathan Froyd [:froydnj]

Comment 25

•

8 years ago

mozreview-review

Comment on attachment 8829679 [details] Bug 1333081 - Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows. https://reviewboard.mozilla.org/r/106682/#review107924 So many hoops to jump through. Interested on your take on the below. ::: xpcom/glue/standalone/nsXPCOMGlue.cpp:405 (Diff revision 3) > + strncpy(file.get(), aXPCOMFile, base_len); > + strcpy(file.get() + base_len, XPCOM_DLL); I wonder if it's not simpler to use `memcpy` here: ``` memcpy(file.get(), aXPCOMFile, base_len); memcpy(file.get() + base_len, XPCOM_DLL, sizeof(XPCOM_DLL)); ``` A little more verbose, but maybe more easily auditable?

Attachment #8829679 - Flags: review?(nfroyd) → review+

Comment hidden (mozreview-request)

Pulsebot

Comment 27

•

8 years ago

Pushed by mh@glandium.org: https://hg.mozilla.org/integration/autoland/rev/9ada53480b75 Avoid using the STL in nsXPCOMGlue.cpp to avoid allocator mismatch on Windows. r=froydnj

Updated

•

8 years ago

status-firefox52: --- → unaffected

status-firefox54: --- → affected

Phil Ringnalda (:philor)

Comment 28

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9ada53480b75

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox54: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla54

Reporter

Comment 29

•

8 years ago

Can someone kindly explain to me why this problem only showed up when building with "disable-sandbox". (In reply to Mike Hommey [:glandium] from comment #10) > I'm ready to bet that porting the changes from bug 1332523 to > nsBrowserApp.cpp to TB would fix it. How much would you like to bet? ;-) We did port those changes to be able to compile (bug 1332898, mentioned in bug 1332523 comment #20). That's when the trouble started ;-(

Comment 30

•

8 years ago

(In reply to Jorg K (GMT+1) from comment #29) > Can someone kindly explain to me why this problem only showed up when > building with "disable-sandbox". Glandium knows the details better, but my high-level hand-wavey understanding is that without the sandbox we send a different set of code to the linker and it interacts differently with our jemalloc function-substitution wizardry. I dealt with something similar in bug 1023941 comment 32 -- that specific case is no longer relevant to our codebase, but it gives you an idea of what kind of horrible stuff goes on under the hood.