Closed Bug 549561 Opened 14 years ago Closed 14 years ago

tracemalloc is very slow on win32

Categories

(Testing :: General, defect)

x86
Windows Server 2003
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: dbaron)

References

Details

Attachments

(6 files, 1 obsolete file)

The time to run the AliveTest for TraceMalloc, aka
  python leaktest.py -- --trace-malloc malloc.log --shutdown-leaks=sdleak.log
    in dir obj-firefox/_leaktest
is about 90 seconds on Linux and Mac, but over 70 minutes on the new hardware slaves for win32. I've also watched the Task Manager while the AliveTest is running, and the machine is mostly idle, with Firefox occasionally using < 5% of CPU. Could it be blocking on something? Could a developer take a debugger to one of the builds in
  http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-win32-debug/
and try to reproduce this?

Complete guess at the right place to file this; please fix as required.
I think this is largely a function of the Windows stack-walking code. It's possible that:
 * the locking may be too fine-grained
 * we may be doing some work in NS_StackWalk that we could defer until NS_DescribeCodeAddress
 * other things

Profiling is really the way to find out.
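To illustrate the second bullet (deferring work from NS_StackWalk until NS_DescribeCodeAddress), here is a minimal Python sketch of the idea: capture only raw program counters at allocation time and symbolize lazily, once per unique address, at report time. All names here (the symbol table, the lookup delay) are illustrative stand-ins, not the actual Mozilla code; on Windows the expensive step would be the PDB symbol query.

```python
import time

# Hypothetical symbol table; in the real code, symbol lookup on Windows
# means querying PDB symbol files, which is expensive.
SYMBOLS = {0x1000: "main", 0x2000: "malloc_hook", 0x3000: "calltree"}

def describe_code_address(pc):
    """Expensive lookup, analogous in spirit to NS_DescribeCodeAddress."""
    time.sleep(0.001)  # stand-in for a slow symbol-file query
    return SYMBOLS.get(pc, "<unknown>")

def walk_stack_eager(pcs):
    """Symbolize every frame at capture time (pays the cost per allocation)."""
    return [(pc, describe_code_address(pc)) for pc in pcs]

def walk_stack_lazy(pcs):
    """Record only raw PCs; capture is just a copy."""
    return list(pcs)

# Capture many identical stacks, as an allocation-heavy run would:
stacks = [walk_stack_lazy([0x1000, 0x2000]) for _ in range(1000)]

# Defer symbolization to report time, caching by PC so each unique
# address is resolved exactly once:
cache = {}
def describe_cached(pc):
    if pc not in cache:
        cache[pc] = describe_code_address(pc)
    return cache[pc]

report = [[describe_cached(pc) for pc in s] for s in stacks[:3]]
print(report[0])  # → ['main', 'malloc_hook']
```

With the lazy scheme, 1000 captures cost two slow lookups in total instead of 2000, which is the shape of win deferral would buy here.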
Well, let me approach this from a different angle. Are the MH, A, and Lk values valuable enough to spend 70-90 minutes generating them? Do you know if the leak regressions in
 http://graphs.mozilla.org/graph.html#tests=[{%22test%22:%2228%22,%22branch%22:%221%22,%22machine%22:%22107%22}]&sel=1262036479,1268200095
have been noticed and filed?
I think they probably haven't been filed because catlee's scripts aren't pointing them out, and people have stopped looking at the graphs and are relying on the scripts.
Also, that graphs.m.o URL you give doesn't actually load (I've been waiting since before I wrote the previous comment), so I can't see what the regressions are.
To fix this, we should either:
 * find somebody who knows how to use a Windows profiling tool, or
 * add an option to make trace-malloc not generate stacks, and add that option to the windows build configs

The second is pretty easy, and should make the performance problem pretty much go away.
Filed bug 551530 on the largest of the regressions in the graph you pointed to, and bug 551531 on the lack of regression emails.
This isn't a huge win, but it's a small perf win I noticed. There's only one caller of calltree(), and that caller acquires the lock right after calling calltree; calltree also acquires the lock at the beginning of the function and releases it at the end, so we may as well change calltree to require its caller to be holding the lock.
Attachment #466372 - Flags: review?(benjamin)
This is the main performance win here; we'll need to change the relevant mozconfigs to benefit.

This adds a configure option to disable the stack-walking part of trace-malloc.  I stubbed things out so that everything that would have a stack has a bogus single-frame stack, to avoid having to hack tmreader to deal with empty stacks.
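The stubbing idea — substituting a constant one-frame placeholder stack when capture is disabled, so consumers like tmreader never see an empty stack — might be sketched like this in Python. The frame contents and function names are invented for illustration; the real change is a configure option in C code.

```python
BOGUS_FRAME = ("noname", 0x0)  # illustrative placeholder frame

def walk_real_stack():
    # Stand-in for the expensive Windows stack walk.
    return [("main", 0x1000), ("malloc_hook", 0x2000)]

def capture_stack(disabled):
    if disabled:
        # Constant single-frame stack: costs nothing to produce, and
        # downstream log readers still get a non-empty stack.
        return [BOGUS_FRAME]
    return walk_real_stack()

print(capture_stack(disabled=True))  # → [('noname', 0)]
```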
Attachment #466374 - Flags: review?(benjamin)
Attachment #466372 - Flags: review?(benjamin) → review+
Attachment #466683 - Flags: review?(benjamin) → review+
http://hg.mozilla.org/mozilla-central/rev/ea54a11bd987
http://hg.mozilla.org/mozilla-central/rev/60944ddadc17

Now I should investigate whether the scripts in which we should set
NS_TRACE_MALLOC_DISABLE_STACKS=1
are in the tree (I think they are...).
Great to see progress on this bug. Any news on comment #11?
Blocks: 593545
OK, build/leaktest.py.in uses automation.py, so I think I just want to patch automation.py.  That will affect unit tests too, which is good.
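The shape of that automation.py change — plumbing the environment variable into the launched browser's environment so both leaktest.py and the unit-test harness pick it up — could look roughly like this. The helper name build_test_env is invented for the sketch; this is not the actual automation.py patch.

```python
import os

def build_test_env(disable_tm_stacks=True):
    """Copy the parent environment and set NS_TRACE_MALLOC_DISABLE_STACKS
    for the child process (the browser under test)."""
    env = dict(os.environ)
    if disable_tm_stacks:
        env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1"
    return env

env = build_test_env()
# e.g. subprocess.call([firefox, "-trace-malloc", "malloc.log"], env=env)
print(env["NS_TRACE_MALLOC_DISABLE_STACKS"])  # → 1
```

Setting it in one shared place means every harness that launches the browser through automation.py benefits, rather than patching each test script separately.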
Attachment #474059 - Flags: review?(ted.mielczarek) → review+
So it turns out that while this works fine on all platforms other than windows, it causes a crash on shutdown on the Windows debug leak test machines.  (And since we don't have Windows debug unit tests on try, I can't tell if it does anything to them; if we did, they might provide more useful information that would let me figure out what the problem is.)

So this is on hold for now.
Filed bug 595730 for the test coverage regression on try.
Depends on: 595730
I got a unit test run with all the unit tests orange as well; it requires --shutdown-leaks=<file> to cause the orange.  None of the orange runs had a minidump (they seem to be broken on Windows debug these days; there's a bug somewhere), but perhaps that'll be enough information to figure out what's up.
Attached file stack of crash
I can reproduce the crash in a Windows VM, but I haven't found anything enlightening.
This fixes a single missing line from patch 2 that caused an uninitialized variable and led to the orange when I set the env var (presumably because we'd randomly miss both mallocs and frees, and thus have dangling pointers).

With this, attachment 474059 [details] [diff] [review] is ready to land, and should make Windows debug builds quite a bit faster.
Attachment #477246 - Flags: review?(benjamin)
Attachment #477246 - Flags: review?(benjamin) → review+
http://hg.mozilla.org/mozilla-central/rev/0778c7931695
http://hg.mozilla.org/mozilla-central/rev/d20b0bf1cf3a
Assignee: nobody → dbaron
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
I reverted the change for Mac only:
http://hg.mozilla.org/mozilla-central/rev/6a3efef10e69
since it caused the Mac OS X 10.5 32-bit trace-malloc leak test to hang; the 10.6 64-bit run of the same test was fine.

(My guess is it might be related to bug 478195.)