Closed Bug 411171 Opened 17 years ago Closed 16 years ago

Thunderbird Mac tinderbox crashing in dump_syms

Categories

(Toolkit :: Crash Reporting, defect)

x86
macOS
defect
Not set
normal

RESOLVED INCOMPLETE

People

(Reporter: philor, Assigned: gozer)

Details

(Whiteboard: [needs info re 3.0a build process])

Apparently for quite a while now, the Thunderbird Mac tinderbox has been crashing while trying to dump symbols - Ted says "2007100103 is the last one to have symbols for thunderbird-bin" and logs back into December show the same

dump_syms(25544) malloc: *** vm_allocate(size=1406185472) failed (error code=3)
dump_syms(25544) malloc: *** error: can't allocate region
dump_syms(25544) malloc: *** set a breakpoint in szone_error to debug
2008-01-06 03:50:04.817 dump_syms[25544] *** Uncaught exception: <NSInvalidArgumentException> *** NSCopyMemoryPages(0x2008000, 0x0, 1124945920) failed
dump_syms(25546) malloc: *** vm_allocate(size=1406185472) failed (error code=3)
dump_syms(25546) malloc: *** error: can't allocate region
dump_syms(25546) malloc: *** set a breakpoint in szone_error to debug
2008-01-06 03:50:05.336 dump_syms[25546] *** Uncaught exception: <NSInvalidArgumentException> *** NSCopyMemoryPages(0x2008000, 0x0, 1124945920) failed
cf is going to grab thunderbird-bin from this box for examination.
Oops, I might have accidentally fixed this. I was looking at the log for the (clobbered) first build after I checked in bug 414515, http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1202072340.1202075003.20596.gz&fulltext=1 and noticed it survived dump_syms, while the previous nightly didn't.
And the next real nightly worked, too, so apparently whatever broke it (probably me, 20071001 being when I turned on SVG in Thunderbird), switching from --enable-optimize="-O2 -g" to export C(XX)FLAGS="-g -gfull" did fix it.
Going to resolve this for now, if it pops back up we'll reopen. Without being able to reproduce it locally it's hard to fix.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → WORKSFORME
This seems to have returned:

buildid: 2008043003
Uploading nightly release build
make -C /builds/tinderbox/Tb-Trunk/Darwin_8.8.4_Depend/mozilla/../build/universal/ppc buildsymbols
echo building symbol store
building symbol store
mkdir -p ./dist/crashreporter-symbols/2008043003
/usr/bin/python /builds/tinderbox/Tb-Trunk/Darwin_8.8.4_Depend/mozilla/toolkit/crashreporter/tools/symbolstore.py    \
  -a "ppc i386" --vcs-info -s /builds/tinderbox/Tb-Trunk/Darwin_8.8.4_Depend/mozilla ./dist/host/bin/dump_syms     \
  ./dist/crashreporter-symbols/2008043003                    \
  ./dist/universal >                                    \
  ./dist/crashreporter-symbols/2008043003/thunderbird-3.0a1pre-Darwin-2008043003-symbols.txt
dump_syms(15531) malloc: *** vm_allocate(size=1331736576) failed (error code=3)
dump_syms(15531) malloc: *** error: can't allocate region
dump_syms(15531) malloc: *** set a breakpoint in szone_error to debug
2008-04-30 03:48:36.662 dump_syms[15531] *** Uncaught exception: <NSInvalidArgumentException> *** NSCopyMemoryPages(0x2008000, 0x0, 1065385984) failed
dump_syms(15533) malloc: *** vm_allocate(size=1331736576) failed (error code=3)
dump_syms(15533) malloc: *** error: can't allocate region
dump_syms(15533) malloc: *** set a breakpoint in szone_error to debug
2008-04-30 03:48:37.245 dump_syms[15533] *** Uncaught exception: <NSInvalidArgumentException> *** NSCopyMemoryPages(0x2008000, 0x0, 1065385984) failed

Marking as blocking-3.0a1 because we want to be able to get useful crash-data out of 3.0a1.
Flags: blocking-thunderbird3.0a1+
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
If someone can reproduce this locally, we can probably get a fix (or at least a workaround).
nthomas says that the symbols for thunderbird-bin disappeared between 2008041303 and 2008041403.

A snippet from IRC about reproducing locally:

[09:53am] dmose: ted: on that dump_syms crashiness; is reproducing locally like just a matter of building with an identical mozconfig as the tbox?  or is there more to it?
[09:53am] ted: i only have leopard here, so you probably need to be on tiger, maybe with the same version of xcode
[09:53am] ted: i couldn't repro on leopard
[09:54am] dmose: ted: interesting.  but just building with the same mozconfig should be enough?  or need to invoke via tinderbox clients scripts?
[09:54am] ted: same mozconfig should be enough, hopefully
[09:55am] dmose: i guess we'll find out!
[09:55am] ted: yeah
[09:55am] ted: tinderbox doesn't call it in any particularly fancy way
[09:55am] ted: it just builds, then runs |make buildsymbols| in the objdir
[09:56am] dmose: ok, good to know
[09:56am] dmose: the tinderbox is PPC or intel?
[09:56am] ted: intel
[09:56am] ted: (pretty sure)
<http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-04-13+01%3A00%3A00&maxdate=2008-04-14+04%3A00%3A00&cvsroot=%2Fcvsroot> are the checkins around that time (with a little slop on the edges for good measure).  At first glance, nothing jumps out at me as an obvious candidate for having caused this...  I wonder if there is any chance that something in that machine's configuration changed during that time frame.
Given the fact that it showed up earlier and disappeared, it's probably a bug in dump_syms that's triggered by a particular set of compiler output. While there may be some patches that we could back out to get it to go away, there's no guarantee it wouldn't resurface.
We keep having a similar problem with the SeaMonkey Mac tinderbox, see bug 395664
Can't see anything wrong with this machine - it has at least half of its 4GB of RAM free as I look at it now - and I can't find any record of any config changes.

If dump_syms is asking for more than a GB of memory in a single allocation then that's always going to be a big ask. :-)
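
For scale, the sizes in the vm_allocate failures quoted in this bug can be converted to GiB. The snippet below is only an arithmetic sketch; the byte counts are copied from the logs above, and the names `FAILED_ALLOCATIONS` and `to_gib` are made up for illustration.

```python
# Sizes (in bytes) copied from the vm_allocate failures quoted in this bug.
FAILED_ALLOCATIONS = [1406185472, 1331736576, 1331904512]

GIB = 2 ** 30  # bytes per gibibyte


def to_gib(size_bytes):
    """Convert a byte count to GiB, rounded to two decimals."""
    return round(size_bytes / GIB, 2)


sizes_gib = [to_gib(n) for n in FAILED_ALLOCATIONS]
for n, g in zip(FAILED_ALLOCATIONS, sizes_gib):
    print(f"{n} bytes = {g} GiB")
```

Each failed request is a single allocation of roughly 1.2 to 1.3 GiB.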
smichaud was gracious enough to try this on his Tiger machine, and the dump_syms worked fine there, both in the i386 and ppc directories.  So we still need to figure out how to reproduce...
Re-assigning to rick in the hopes that he has or can get access to the actual machine where this is happening and can reproduce it / catch it in the debugger there...
Assignee: nobody → rick.tessner
Status: REOPENED → NEW
Did a build on bm-xserve07 in ~cltbld/rick-411171 and could not reproduce.  I then realized that I'd only built i386 and the dump_syms works fine for that.

However, the nightly builds are a universal build.  I am currently retrying the build as a universal (i386 and ppc) and we'll see whether I can reproduce the problem.
I would have thought that using the mozconfig from that Tinderbox would have forced a universal build (it did for me on my local Leopard machine), where I also couldn't reproduce.
It does if you have the mozilla/build/macosx/universal/mozconfig checked out as well.  That file is sourced ("dotted") by the mozconfig.  Since I started with a clean check-out, I did not have the universal/mozconfig checked out.

Once I had that checked out as well, the universal build proceeded.

That still did not reproduce the problem, though; i.e. the build and the buildsymbols step were both completely successful.

Below are the steps I did to do the build directly on bm-xserve07 (as the user cltbld):

mkdir ~/rick-411171
cd ~/rick-411171
cp /builds/tinderbox/Tb-Trunk/mozconfig ./
export MOZCONFIG=$PWD/mozconfig

cvs -d :ext:tbirdbld@cvs.mozilla.org:/cvsroot co mozilla/client.mk mozilla/build/macosx/universal/mozconfig
cd mozilla
/usr/bin/make -f client.mk MOZ_OBJDIR=../build/universal checkout
mkdir -p -m 0777 ../build/universal
/usr/bin/make -f client.mk MOZ_OBJDIR=../build/universal CONFIGURE_ENV_ARGS='CC=cc CXX=c++' build_all_depend

And once that completed, I ran the buildsymbols step with:

make -C /Users/cltbld/rick-411171/build/universal/ppc buildsymbols

And that did not come up with the dump_syms error at all.  The log of this can be seen on bm-xserve07:~cltbld/rick-411171/screenlog.0

If we can't get this crash to repro (and hence fix) soon, I'm tempted to not consider it a blocker for 3.0a1.  Not having crash data for mac is problematic, but not as problematic as not having any feedback.
The only other thing I could suggest would be to stop the nightly tinderbox after it's built a nightly, and run dump_syms manually to reproduce the problem.
One other random thing to try: after your build finishes, but before you buildsymbols, run and kill dist/Thunderbird.app, to imitate having run MozillaAliveTest. (Okay, more than one, since you could also run the make package equivalent, and the regxpcom test, and codesighs, but I'm more suspicious of having run the build.)
(In reply to comment #0)
> Apparently for quite a while now, the Thunderbird Mac tinderbox has been
> crashing while trying to dump symbols - Ted says "2007100103 is the last one to
> have symbols for thunderbird-bin" and logs back into December show the same
> 
> dump_syms(25544) malloc: *** vm_allocate(size=1406185472) failed (error code=3)
> dump_syms(25544) malloc: *** error: can't allocate region
> dump_syms(25544) malloc: *** set a breakpoint in szone_error to debug
> 2008-01-06 03:50:04.817 dump_syms[25544] *** Uncaught exception:

Is any change needed in monitoring symbol uploads?
ref: bug 401808 nagios monitoring for breakpad symbol upload
Wayne: it's sort of tricky, we could probably monitor specific important files like thunderbird-bin/libxul etc, but that doesn't really guarantee that everything was good. I guess ideally we should make the buildsymbols step just fail if dump_syms crashes anywhere.
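
A minimal sketch of that "fail if anything crashes" idea, assuming the symbol dumper is launched as a child process. This is not the actual symbolstore.py code; `run_or_fail`, `dump_all`, and `EXAMPLE_COMMANDS` are hypothetical names, and the example command is a stand-in for a real dump_syms invocation.

```python
import subprocess
import sys


def run_or_fail(cmd):
    """Run cmd and return its stdout; raise if it did not exit cleanly."""
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    if result.returncode != 0:
        # On POSIX a negative returncode means the child was killed by a signal.
        raise RuntimeError(f"{cmd[0]} failed with status {result.returncode}")
    return result.stdout


def dump_all(commands):
    """Run each command in turn; the first failure aborts the whole step."""
    return [run_or_fail(cmd) for cmd in commands]


# Stand-in for a list of per-file dump_syms command lines (hypothetical).
EXAMPLE_COMMANDS = [[sys.executable, "-c", "print('MODULE example')"]]
```

A caller that lets the `RuntimeError` propagate (or exits non-zero on it) would make the surrounding make target fail instead of silently producing an incomplete symbol store.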
want a new bug on that, or should I reopen bug 401808?
Alrighty then, the nightly build that I ran explicitly on bm-xserve07 did die with the dump_syms error:

./dist/crashreporter-symbols/2008050210/thunderbird-3.0a1pre-Darwin-2008050210-symbols.txt
dump_syms(5496) malloc: *** vm_allocate(size=1331904512) failed (error code=3)
dump_syms(5496) malloc: *** error: can't allocate region

I grabbed the directory /builds/tinderbox/Tb-Trunk and tar'd it into ~cltbld/rick-411171/Tb-Trunk

I then ran, while cd'd to this *copy* of the Tb-Trunk directory,

make -C /Users/cltbld/rick-411171/Tb-Trunk/Darwin_8.8.4_Depend/build/universal/ppc buildsymbols

and it ran just fine.  I'm at a loss as to what this could be at this point.  That malloc is trying to allocate 1.3 GB.  Could we be so borderline on available memory that we fail while running inside the perl build-seamonkey.pl script, but pass when just running the make?

Any ideas out there?

Is it possible that make saw that the file already existed since it had been run once, decided that that made the dependency up-to-date, and didn't try to run dump_syms again?

I assume the original crash was while generating PPC symbols?
buildsymbols doesn't have any dependencies, so that's not it. Also, buildsymbols runs dump_syms once per-arch per-file on a universal build.
While we'd very much like to see this fixed for 3.0a1, we're not going to block on it.

Rick, it might be worth playing around with the suggestions Phil had in comment 20.

Explicitly bumping the OS virtual memory / swap size might also be worth looking into.
Flags: blocking-thunderbird3.0a2+
Flags: blocking-thunderbird3.0a1-
Flags: blocking-thunderbird3.0a1+
I've been searching around trying to figure out how to create more swap space on Mac OS X.  Articles that I've read seem to indicate that the OS itself takes care of creating swap files in /var/vm.

To test this, I put together a little C program that just keeps grabbing 1/2 GB of memory and sleeps for 2 seconds.  Once it gets up to 2.5 GB total allocation, I get the nice error:

a.out(15125) malloc: *** vm_allocate(size=536870912) failed (error code=3)
a.out(15125) malloc: *** error: can't allocate region
a.out(15125) malloc: *** set a breakpoint in szone_error to debug

which seems to indicate that swap is not created automatically by the OS.
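
The C program itself was not attached to this bug. The following is a rough Python sketch of the same experiment, not the original code: `grab_memory` and its parameters are hypothetical, and whether an allocation failure surfaces as `MemoryError` depends on the platform's memory policy.

```python
import time


def grab_memory(chunk_bytes, max_chunks, pause_seconds=0.0):
    """Allocate chunk_bytes at a time until max_chunks is reached or an
    allocation fails. Returns the number of bytes successfully held."""
    held = []  # keep references so nothing is freed between steps
    try:
        for _ in range(max_chunks):
            held.append(bytearray(chunk_bytes))  # zero-filled buffer
            print(f"holding {len(held) * chunk_bytes} bytes")
            time.sleep(pause_seconds)
    except MemoryError:
        print(f"allocation failed after {len(held)} chunks")
    return len(held) * chunk_bytes
```

Calling `grab_memory(512 * 2**20, 8, 2.0)` would mirror the half-gigabyte steps and 2 second sleeps described above; it really does consume that much memory, so it should only be run on a machine where that is acceptable.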

On bm-xserve07, a |df -h /var/vm| shows that it's on the root partition and that there's plenty (21G) of space available.

So, does anyone have any idea on how to create swap on OSX?  Or is there some limit on OSX about how much swap can be created?

I'm almost tempted just to reboot the box and force a nightly build.  (I'm wondering if there might be memory leaks somewhere that have led to a memory shortage ... it has been up for about 125 days at this point)
Rebooted bm-xserve07 ahead of today's nightly to test the memory shortage hypothesis. It'll do some hourly builds before hitting the nightly.
top is logging into ~/nthomas/top.log on a 5 second interval, for the two hours from 2:48 PDT. Hopefully there will be some clues there; alternatively we could sample with a much smaller interval given some trigger for the symbol collection.
rick, nick -- at some point it'd be good to get an update on this bug, as it's marked blocking-tb3a2, and if anyone has made progress on figuring out what happened, that'd be good to know.
Whiteboard: [status unknown]
I'm not aware of any progress on this.
Blocks: 439142
Assignee: rick.tessner → gozer
I've been able to get a new osx buildbot running (still in testing/debugging mode) and added the buildsymbols step to it. So far, it has successfully completed that step every single time.

Regarding the swapping, on OS X, you pretty much have as much swap space as free disk space in /, unless otherwise modified in /etc/rc (read dynamic_pager(8) for all the details). You might also have per-user limits (see ulimit -a)
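
As a small sketch of checking those per-process limits from a script rather than with `ulimit -a`, Python's standard `resource` module can report them. The helper names `describe` and `memory_limits` are made up here, and which limits exist varies by platform, so missing ones are skipped.

```python
import resource


def describe(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def memory_limits():
    """Return {name: (soft, hard)} for the memory-related limits defined here."""
    names = ["RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_RSS", "RLIMIT_STACK"]
    limits = {}
    for name in names:
        which = getattr(resource, name, None)
        if which is None:
            continue  # this platform does not define that limit
        soft, hard = resource.getrlimit(which)
        limits[name] = (describe(soft), describe(hard))
    return limits


for name, (soft, hard) in memory_limits().items():
    print(f"{name}: soft={soft} hard={hard}")
```

Running this as the build user (for example, from inside the build script's environment) would show whether that environment has tighter limits than an interactive shell.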

Not sure if this qualifies as progress, but I can report being unable to reproduce this failure.
Presumably, if we build 3.0a2 on gozer's buildbot and then use that machine instead of our current tinderbox going forward, we could declare victory here.  Any chance of one or both of those things happening?
Whiteboard: [status unknown] → [needs info re 3.0a build process]
The impression I got from joduinn is that bhearsum is planning to do 3.0a2 builds on brand new VMs.  This may just go away w/ the new VMs.
I can clearly reproduce such a crash with the steps mentioned in bug 444211 comment 3. It's an official nightly build, not a debug build. So if I can help, please let me know.
Blocks: 445090
No longer blocks: 445090
Note that this is marked as blocking bug 439142, which users are likely to see fairly frequently, i.e. it's a repeatable crashing bug based on messages sent to you.
For the record, this was also a problem for the 3.0a2 release builds that we did with tinderbox, but isn't for 3.1b1pre nightlies under buildbot. I'll manually correct that for the release, but this is probably WONTFIX now that Thunderbird 3.x development moved away from tinderbox.
Would be nice to figure out the underlying cause here, but given how hard this is to reproduce, I don't think it's worth the effort right now.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → INCOMPLETE
So if this is not going to be fixed, is there another way to fix Bug 439142 - a bad, repeatable, crashing bug, which is still marked dependent on this?
See bug 439142 comment 10 - we should have symbols now (ie by moving away from the system that was failing to generate them).
No longer blocks: 439142