Don't use --enable-trace-malloc on debug desktop tinderbox builds

RESOLVED FIXED in mozilla32

Status

defect
RESOLVED FIXED
5 years ago
2 years ago

People

(Reporter: njn, Assigned: njn)

Tracking

unspecified
mozilla32
x86_64
Linux
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(3 attachments, 4 obsolete attachments)

On TBPL, debug desktop builds are built with --enable-trace-malloc. This means they don't use jemalloc, because TraceMalloc is incompatible with jemalloc. But as far as I know, the TraceMalloc enabling serves no purpose, other than perhaps to provide compile coverage.

I propose changing these debug builds to use --enable-dmd. DMD is a tool that actually gets some use, and so it would be nice to have compile coverage. DMD is also compatible with jemalloc.
The alternative is bug 999869. (Not necessarily an alternative; an --enable-dmd build would be just as able to use trace-malloc.)
I'd prefer to get rid of trace-malloc, and I'm thinking about how to use DMD to achieve the rump functionality that people still use trace-malloc for.
Hmm:
https://tbpl.mozilla.org/?tree=Try&rev=a9eb7983884f

It fails to build on Windows, because apparently building DMD on WinXP with MOZ_OPTIMIZE also requires MOZ_PROFILING. Something to do with frame pointers; see bug 948621.
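For context (this is my reading of the frame-pointer note, not anything taken from bug 948621): DMD records an allocation stack for each heap block, and the cheap way to capture those stacks on x86 is to walk the saved-frame-pointer chain, which MOZ_OPTIMIZE without MOZ_PROFILING is free to omit. A rough sketch of that kind of walk, using a GCC/Clang builtin purely for illustration (the struct and function names are mine):

#include <stdio.h>

/* x86 frame layout when the compiler keeps frame pointers: the saved caller
 * frame pointer sits at [ebp], the return address just above it. */
struct frame {
    struct frame *caller;
    void *return_address;
};

static void walk_stack(void)
{
    /* GCC/Clang builtin, used here only to get a starting frame; a real
     * profiler would also validate each pointer against the stack bounds. */
    struct frame *fp = (struct frame *)__builtin_frame_address(0);
    for (int i = 0; i < 8 && fp != NULL; i++) {
        printf("frame %d: return address %p\n", i, fp->return_address);
        fp = fp->caller; /* breaks as soon as a frame omitted its frame pointer */
    }
}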

And then I get a failure in Mochi-4:

> 113 INFO TEST-UNEXPECTED-FAIL | /tests/dom/tests/mochitest/general/test_interfaces.html | If this is failing: DANGER, are you sure you want to expose the new interface DMDReportAndDump to all webpages as a property on the window (XBL: false)? Do not make a change to this file without a review from a DOM peer for that specific change!!! (or a JS peer for changes to ecmaGlobals)

Maybe I'll just settle for disabling trace-malloc for now.
Attachment #8425267 - Flags: review?(mh+mozilla)
Assignee: nobody → n.nethercote
Status: NEW → ASSIGNED
Try looks good for just disabling trace-malloc:
https://tbpl.mozilla.org/?tree=Try&rev=aa7a265b2bbb
Attachment #8425267 - Flags: review?(mh+mozilla) → review+
If you do this, as a followup we should turn off the "leak tests", since all they do is load a few pages and look for leaks:
http://mxr.mozilla.org/mozilla-central/source/build/leaktest.py.in

We have refcounted leak checks on all our debug unittests, so the only value here was for tracemalloc.
I thought we already turned them off, no?
Now I'm wondering if leaktest.py.in can be removed. Its only use of note is within the |leaktest| target in testing/testsuite-targets.mk, and I don't see any invocations of that target within m-c, though there could be some from other repos. Anybody know?
Nothing else uses it. I filed bug 1013670 to remove it. Apparently I missed the bug that stopped running them!
Sorry, had to back this out. We had failing jsreftests like https://tbpl.mozilla.org/php/getParsedLog.php?id=40074441&tree=Mozilla-Inbound starting with https://hg.mozilla.org/integration/mozilla-inbound/rev/efa8579fe73a (which is completely unrelated to that failure :/), so glandium suggested backing this out to fix these failing tests.

00:14 < glandium> Tomcat|sheriffduty: yeah, it touches absolutely no files that are used on mac except configure.in and js/src/configure.in, both on which it only touches non mac code paths
00:15 < glandium> Tomcat|sheriffduty: and the build log agrees the configure change didn't change how it's built
00:18 < glandium> and it's a crash in jemalloc...
00:18 < glandium> it could very well have been a latent issue that turning tracemalloc off turned on
00:21 < glandium> backing out 1013014 definitely would make this go away, since jemalloc would be disabled and as such, won't be able to crash
I've encountered that jsreftest issue before in bug 969692. Jemalloc in debug builds enables opt_junk, so it does similar work to what memory poisoning would do:

http://dxr.mozilla.org/mozilla-central/source/memory/mozjemalloc/jemalloc.c#5272
I wonder if we could just turn opt_junk off. It would paper over the problem, admittedly.
There are several options. One is to turn opt_junk off by default (still can be reenabled at runtime). Another is to turn opt_junk off for that test only with an environment variable.

We've never actually relied on opt_junk being on on debug builds, so I don't really care either way.
Another option is to remove MOZ_MEMORY_DEBUG from configure.in, which would have the side effect of forcing a reconfigure. This landing didn't force one, which is why the failure didn't happen on it, but on another push that did force a reconfigure. The alternative is to update CLOBBER, but that would trigger clobbers for local builds too, which aren't affected by this bug. Or trigger the clobberer on all branches at merge time (tedious).
IMO that opt_junk memset in huge_dalloc doesn't make a lot of sense. Everywhere else opt_junk controls whether to junk on *allocation*, not deallocation (there's opt_poison for that). And there's no need to junk/poison in huge_dalloc since we're going to unmap immediately afterwards.

In bug 860254 comment 29, we left that memset in place since nobody was using the opt_junk path. If that path is now being used, maybe we should revisit the decision to keep the memset.
> IMO that opt_junk memset in huge_dalloc doesn't make a lot of sense.
> Everywhere else opt_junk controls whether to junk on *allocation*, not
> deallocation (there's opt_poison for that).

Yes... 0xa5 is the junk value, and 0x5a is the poison value, except in this case.

> And there's no need to
> junk/poison in huge_dalloc since we're going to unmap immediately afterwards.

I agree. I'll remove it and do a test run on try.
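(For anyone skimming, a rough sketch of the convention being described; this is my illustration with made-up constant names, not mozjemalloc source:)

#include <stddef.h>
#include <string.h>

#define JUNK_BYTE   0xa5 /* opt_junk: written into memory as it's handed out */
#define POISON_BYTE 0x5a /* opt_poison: written into memory as it's freed */

/* Junk on allocation makes reads of uninitialized memory obvious. */
static void junk_on_alloc(void *ptr, size_t size, int opt_junk)
{
    if (opt_junk)
        memset(ptr, JUNK_BYTE, size);
}

/* Poison on free makes use-after-free obvious. The huge_dalloc memset under
 * discussion wrote the junk value on this path instead, and right before an
 * unmap it buys nothing either way. */
static void poison_on_free(void *ptr, size_t size, int opt_poison)
{
    if (opt_poison)
        memset(ptr, POISON_BYTE, size);
}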
(In reply to David Major [:dmajor] (UTC+12) from comment #17)
> IMO that opt_junk memset in huge_dalloc doesn't make a lot of sense.
> Everywhere else opt_junk controls whether to junk on *allocation*, not
> deallocation (there's opt_poison for that). And there's no need to
> junk/poison in huge_dalloc since we're going to unmap immediately afterwards.
> 
> In bug 860254 comment 29, we left that memset in place since nobody was
> using the opt_junk path. If that path is now being used, maybe we should
> revisit the decision to keep the memset.

r+

Note jemalloc3 made this memset conditional on whether the allocation is going to be unmapped (there are conditions where it won't be unmapped, but I think the only one, really, is using DSS chunks, and we don't do that anyway).
For reference, here's the patch I sent to try.
Please do a dummy configure.in change (like adding/removing a blank line) when landing to trigger a reconfigure.
glandium asked me to add an explicit --enable-jemalloc to the Windows debug mozconfigs.
Attachment #8426733 - Flags: review?(mh+mozilla)
Attachment #8426667 - Attachment is obsolete: true
Attachment #8425267 - Attachment is obsolete: true
Comment on attachment 8426733 [details] [diff] [review]
(attempt 2) - Disable Trace Malloc on TBPL debug builds

Review of attachment 8426733 [details] [diff] [review]:
-----------------------------------------------------------------

::: configure.in
@@ -3207,5 @@
>  if test "$ac_cv_cpp_dynamic_cast_void_ptr" = yes ; then
>     AC_DEFINE(HAVE_CPP_DYNAMIC_CAST_TO_VOID_PTR)
>  fi
>  
> -

Thank you :)
Attachment #8426733 - Flags: review?(mh+mozilla) → review+
Good news: the opt_junk change appears to have fixed the Mac jsreftest hangs: https://tbpl.mozilla.org/?tree=Try&rev=6240da929257

Bad news: win32 xpcshell tests are consistently failing in test_TelemetryLateWrites.js: https://tbpl.mozilla.org/?tree=Try&rev=ea44dc0cf6d0. On WinXP it's due to hitting jemalloc_crash(). On Win7 it's due to a timeout. On Win8 I'm not sure.
Since it's an easy patch, I went ahead and built it locally. It fails in different ways but always in arena_dalloc. 

Sometimes a crash on this dereference: arena = chunk->arena;
http://hg.mozilla.org/mozilla-central/file/a7d685480bf2/memory/mozjemalloc/jemalloc.c#l4576

Otherwise this assert trips: RELEASE_ASSERT(arena->magic == ARENA_MAGIC);
http://hg.mozilla.org/mozilla-central/file/a7d685480bf2/memory/mozjemalloc/jemalloc.c#l4578

Perhaps the argument to je_free() came from a different allocator? WinDbg claims that the address is in one of the standard (i.e. not jemalloc) heaps.

mozglue!arena_dalloc
mozglue!je_free
xul!std::_DebugHeapDelete<std::locale::facet>
xul!std::_Fac_node::~_Fac_node
xul!std::_Fac_node::`scalar deleting destructor'
xul!std::_DebugHeapDelete<std::_Fac_node>
xul!_Fac_tidy
xul!std::_Fac_tidy_reg_t::~_Fac_tidy_reg_t
xul!std::`dynamic atexit destructor for '_Fac_tidy_reg''
xul!_CRT_INIT
xul!__DllMainCRTStartup
xul!_DllMainCRTStartup
ntdll!LdrpCallInitRoutine
ntdll!LdrShutdownProcess
ntdll!RtlExitUserProcess
kernel32!ExitProcessStub
ntdll!RtlAllocateMemoryZone
MSVCR100D!__crtExitProcess
MSVCR100D!__freeCrtMemory
MSVCR100D!exit
xpcshell!__tmainCRTStartup
xpcshell!wmainCRTStartup
kernel32!BaseThreadInitThunk
ntdll!__RtlUserThreadStart
ntdll!_RtlUserThreadStart
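To make that concrete, here is a condensed sketch of the check the stack is tripping (the shape follows the jemalloc.c lines linked above; the chunk size and magic value are placeholders and the simplification is mine):

#include <stdint.h>
#include <assert.h>

#define CHUNK_SIZE  (1 << 20)  /* placeholder, not the real chunk size */
#define ARENA_MAGIC 0xdeadbeef /* placeholder, stands in for the real magic */

typedef struct arena_s { uint32_t magic; /* ... */ } arena_t;
typedef struct arena_chunk_s { arena_t *arena; /* ... */ } arena_chunk_t;

static void arena_dalloc_sketch(void *ptr)
{
    /* Round the pointer down to its chunk header. This only makes sense for
     * pointers that jemalloc handed out in the first place. */
    arena_chunk_t *chunk =
        (arena_chunk_t *)((uintptr_t)ptr & ~((uintptr_t)CHUNK_SIZE - 1));

    arena_t *arena = chunk->arena;       /* garbage for a CRT-heap pointer */
    assert(arena->magic == ARENA_MAGIC); /* the RELEASE_ASSERT in the log */
    /* ...actual deallocation would follow... */
}

A pointer from the CRT debug heap either faults on the chunk->arena dereference or fails the magic check, which lines up with the two failure modes above.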
So now the question is why these chunks of memory weren't allocated with jemalloc if they are properly hooked up to free with je_free.
gflags FTW, here's the allocation stack:

0:000> !heap -p -a 60c690
    address 0060c690 found in
    _HEAP @ 590000
      HEAP_ENTRY Size Prev Flags    UserPtr UserSize - state
        0060c658 000b 0000  [00]   0060c670    0003c - (busy)
        77e0dff2 ntdll!RtlAllocateHeap+0x00000274
        65858343 MSVCR100D!_heap_alloc_base+0x00000053
        6586697c MSVCR100D!_nh_malloc_dbg+0x000002dc
        6586671f MSVCR100D!_nh_malloc_dbg+0x0000007f
        658666cc MSVCR100D!_nh_malloc_dbg+0x0000002c
        65866681 MSVCR100D!_malloc_dbg+0x00000021
        66ec2f10 MSVCP100D!operator new+0x00000020
        66e6c08a MSVCP100D!std::locale::facet::operator new+0x0000001a
        66e6de11 MSVCP100D!std::ctype<char>::_Getcat+0x00000061
        5b4e5b68 xul!std::use_facet<std::ctype<char> >+0x0000004b
        5b4e6a10 xul!std::operator>><char,std::char_traits<char>,std::allocator<char> >+0x00000048
        5b8107d3 xul!`anonymous namespace'::ReadStack+0x000000bf
        5b810b58 xul!`anonymous namespace'::TelemetryImpl::ReadLateWritesStacks+0x000000f5
        5b810cec xul!`anonymous namespace'::nsFetchTelemetryData::Run+0x00000084
        59e9cd93 xul!nsThreadPool::Run+0x000001cc
        59e9e34b xul!nsThread::ProcessNextEvent+0x0000026b
        59e37dea xul!NS_ProcessNextEvent+0x00000044
        5a0fd618 xul!mozilla::ipc::MessagePumpForNonMainThreads::Run+0x000000ff
        5a0bcb91 xul!MessageLoop::RunInternal+0x0000003c
        5a0bddfc xul!MessageLoop::RunHandler+0x0000004d
        5a0bef1b xul!MessageLoop::Run+0x00000019
        59e9c7bc xul!nsThread::ThreadFunc+0x000000d2
        655ff04f nss3!_PR_NativeRunThread+0x000000cf
        65604a2f nss3!pr_root+0x0000000f
        6579a273 MSVCR100D!_beginthreadex+0x00000243
        6579a204 MSVCR100D!_beginthreadex+0x000001d4
        76c3338a kernel32!BaseThreadInitThunk+0x0000000e
        77dc9f72 ntdll!__RtlUserThreadStart+0x00000070
        77dc9f45 ntdll!_RtlUserThreadStart+0x0000001b
In fact, the answer to that might be in the big comment in mozglue/build/Makefile.in (and the linkage failure on win64 is related to that too)
So, combining comment 25 and comment 27, it would seem the CRT itself has an allocator mismatch, allocating with malloc_dbg and freeing with free instead of free_dbg.
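In CRT terms the pattern would be roughly this (an illustration of the mismatch only, not actual CRT or mozglue code):

#include <stdlib.h>
#include <crtdbg.h> /* MSVC debug-heap APIs */

void allocator_mismatch_demo(void)
{
    /* The allocation goes through the CRT debug heap
     * (the RtlAllocateHeap path in the gflags stack above). */
    void *p = _malloc_dbg(64, _NORMAL_BLOCK, __FILE__, __LINE__);

    /* The matching call would be _free_dbg(p, _NORMAL_BLOCK). Calling plain
     * free() instead hands the pointer to whatever free() resolves to; here
     * that's jemalloc's je_free, which dies in arena_dalloc on a foreign
     * pointer. */
    free(p);
}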
Actually, here's something more plausible:
mozglue/build/fixcrt.py is breaking free_dbg (which, in fact, is what the linkage failure on win64 is about).
Testing a fix:
https://tbpl.mozilla.org/?tree=Try&rev=61916fb86d9d

I'm not sure it will work for xpcshell, but at the very least it should fix the win64 linkage.
> I'm not sure it will work for xpcshell

In fact, thinking about it some more, I have serious doubts it will work, and I'm more and more inclined to go with comment 29 :-/
(In reply to Mike Hommey from comment #31)
> Testing a fix:
> https://tbpl.mozilla.org/?tree=Try&rev=61916fb86d9d
> 
> I'm not sure it will work for xpcshell, but at the very least it should fix
> the win64 linkage.

I added _frex_dbg=dumb_free_thunk to my mozglue.def.in to work around a debug win64 linking error I was having; maybe this fix will make it unnecessary?
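(For reference, as I understand the fixcrt rename, the stub being mapped in is just a free() that deliberately does nothing, so the CRT's own internal frees never reach jemalloc. A sketch, not the actual mozglue code:)

/* Mapping the renamed debug-CRT free at a do-nothing stub means late CRT
 * shutdown frees of pointers jemalloc never allocated become harmless no-ops,
 * at the cost of leaking those blocks at exit. */
static void dumb_free_thunk(void *ptr)
{
    (void)ptr; /* deliberately ignore the pointer */
}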
(In reply to neil@parkwaycc.co.uk from comment #33)
> I added _frex_dbg=dumb_free_thunk to my mozglue.def.in to work around a debug win64
> linking error I was having; maybe this fix will make it unnecessary?

Yes. See the try build, it's green on win64 debug. But as I feared, xpcshell is still orange.
Since this does fix a real problem, even though it doesn't fix the crashes... let's get this reviewed.
Attachment #8426918 - Flags: review?(khuey)
Assignee: n.nethercote → mh+mozilla
Assignee: mh+mozilla → n.nethercote
Let's fix dependencies, too.
Attachment #8426927 - Flags: review?(khuey)
Attachment #8426918 - Attachment is obsolete: true
Attachment #8426918 - Flags: review?(khuey)
Guess what, comment 29 was right :(

locale0_implib.obj imports free(), which ends up being jemalloc's, while it's not allocating with malloc().
The msvc_removed.lib rule is really awful, but short of rewriting it in Python or doing fancy-pants complex make rules, I have nothing better.

https://tbpl.mozilla.org/?tree=Try&rev=313cc2fd528e
Attachment #8427037 - Flags: review?(khuey)
Comment on attachment 8427037 [details] [diff] [review]
Also fixup locale0_implib.obj in mozcrt to avoid allocator mismatch in DebugHeapDelete

It's unfortunately not working because somehow the DebugHeapDelete@Vfacet@locale@std@@@std@@YAXPAVfacet@locale@0@@Z symbol in xul.dll is not coming from mozcrt.lib.
Attachment #8427037 - Flags: review?(khuey)
... and that symbol happens to come from Unified_cpp_ipc_chromium0.cpp, but in fact it comes from the xdebug header, so it's likely to be everywhere, and the linker just picks one... we're sort of screwed :(

I hope we can find a new workaround for this.
Here's the version I'm going to land, which glandium blessed via IRC.
Attachment #8426733 - Attachment is obsolete: true
Attachment #8427415 - Flags: review+
Attempt 2:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0c5047d370fc
Summary: Replace --enable-trace-malloc with --enable-dmd in debug desktop builds → Don't use --enable-trace-malloc on debug desktop tinderbox builds
Blocks: 1014976
https://hg.mozilla.org/mozilla-central/rev/0c5047d370fc
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
glandium, is your r+'d patch still relevant?
(In reply to Nicholas Nethercote [:njn] from comment #44)
> glandium, is your r+'d patch still relevant?

Mm? It's the one that landed, isn't it?
> > glandium, is your r+'d patch still relevant?
> 
> Mm? It's the one that landed, isn't it?

No, that was mine. AFAICT yours hasn't landed.
Ah, my patch with r+, not the patch with my r+. It's kind of still relevant but really only a theoretical problem now, and there are other problems with -MDd+jemalloc anyways.
Blocks: 1022563
Product: Core → Firefox Build System