Closed Bug 532501 Opened 15 years ago Closed 15 years ago

pm01 died

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Details

Attachments

(1 file)

production-master died right at 1:25pm PST. The last bit of twistd.log looks like:
2009-12-02 13:25:10-0800 [-] Starting factory <HTTPClientFactory: http://hg.mozilla.org/l10n-central/ja/pushlog?fromchange=f03196c701a94fda1d8baba67838fc9467971d49>
2009-12-02 13:25:10-0800 [Broker,client] callback called again: (None,) {}
2009-12-02 13:25:10-0800 [Broker,7389,10.2.71.254] <RemoteShellCommand '['python', 'reftest/runreftest.py', '--appname=firefox/firefox.exe', '--utility-path=bin', '--symbols-path=symbols', '--extra-profile-file=jsreftest/tests/user.js', 'jsreftest/tests/jstests.list']'> rc=0
2009-12-02 13:25:10-0800 [Broker,7593,10.2.90.21] <RemoteShellCommand '['bash', '-c', 'if [ ! -d tools ]; then hg clone http://hg.mozilla.org/build/tools tools;fi']'> rc=0
2009-12-02 13:25:10-0800 [Broker,7327,10.2.71.140] <RemoteShellCommand '['wget', '-O', 'malloc.log.old', 'http://stage.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx//malloc.log']'> rc=0
2009-12-02 13:25:11-0800 [Broker,7522,10.2.71.244] <RemoteShellCommand '['python', 'mochitest/runtests.py', '--appname=firefox/firefox.exe', '--utility-path=bin', '--extra-profile-file=bin/plugins', '--certificate-path=certs', '--autorun', '--close-when-done', '--console-level=INFO', '--symbols-path=symbols', '--total-chunks=5', '--this-chunk=4', '--chunk-by-dir=4']'> rc=0
2009-12-02 13:25:11-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0x510fadac>

/var/log/messages has nothing interesting in it.

We're working on getting it back up now.
It was brought back up a while back, but has fallen over again:

2009-12-02 15:46:38-0800 [Broker,69,10.2.71.118] <RemoteShellCommand '['python', 'mochitest/runtests.py', '--appname=firefox/firefox.exe', '--utility-path=bin', '--extra-profile-file=bin/plugins', '--certificate-path=certs', '--autorun', '--close-when-done', '--console-level=INFO', '--symbols-path=symbols', '--total-chunks=5', '--this-chunk=4', '--chunk-by-dir=4']'> rc=0
2009-12-02 15:46:38-0800 [-] closing log <buildbot.status.builder.LogFile instance at 0xb2b3d2cc>
Curious that the exact same command had just returned both times. We should try to get the contents of that log, if possible.
I looked around the directories on pm and found a couple of builds with huge log files whose timestamps match the log entries directly before the crashes:
-rw------- 1 cltbld cltbld 8605696 Dec  2 13:25 649-log-mochitest-plain-4-stdio
-rw------- 1 cltbld cltbld 16271726 Dec  2 15:46 578-log-mochitest-plain-4-stdio
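For anyone repeating this search, here is a minimal sketch of how it could be scripted. The function name, the 120-second window and the 5 MB size threshold are all hypothetical choices for illustration, not something that was actually run on pm.

```python
import os


def large_logs_near(root, crash_time, window_secs=120, min_bytes=5 * 1024 * 1024):
    """Return (path, size, mtime) for big files last modified close to crash_time.

    crash_time is a Unix timestamp; window_secs and min_bytes are
    illustrative defaults (assumptions), not values taken from this bug.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_size >= min_bytes and abs(st.st_mtime - crash_time) <= window_secs:
                matches.append((path, st.st_size, st.st_mtime))
    # Largest files first, since oversized logs are the suspects here.
    return sorted(matches, key=lambda m: m[1], reverse=True)
```

Calling it with the master's build directory and the timestamp of the last twistd.log line should list candidates like the two stdio logs above.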
FTR, the master was back up at 16:25. 

If it fails again we should definitely run fsck on the /builds partition (which requires renaming {moz2-master,1.9-unittes}/buildbot.tac so they don't start on boot).
I'd still like to dig into this more. I'll grab this bug for now.
Assignee: nobody → bhearsum
Severity: blocker → major
I don't think I'll get a chance to dig into this any more.
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Died again today around 16:27 PST, although without the same symptoms. Log at p-m:/tmp/twistd.log.9. It was not an oomkill; the master was perhaps compressing a jsreftest log at the time, but that's not clear.
Attached file twistd.log excerpt
And again at 14:00 today. I've attached the twisted log because it's full of "WEIRD" messages for moz2-linux-slave23, which ran out of disk space while trying to create a places nightly build. There is an exception too. I'm pretty sure I stopped buildbot on the slave, cleaned up the disk properly and then restarted the slave, so the state in the master was broken.
And again at about 23:16 PST on the 23rd. No WEIRDs this time, no oomkill, twistd.log just stops.
We need to change how buildbot is launched so we can get stdout/stderr and change the coresize limit so we can have a chance of getting a core file.
pm02 died today at 00:33. As on pm01, the log just stops, this time after a win32 debug build did an alive_test_2 (which I think just launches the app normally). The last log is copied to /tmp/twistd.log.
(In reply to comment #10)
> We need to change how buildbot is launched so we can get stdout/stderr and
> change the coresize limit so we can have a chance of getting a core file.

Any ideas about how to achieve this?
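One possible approach, as a rough sketch only: wrap the launch in a small Python script that raises the soft core-size limit in the child and sends stdout/stderr to a file. The helper names and the output path are hypothetical, and this assumes the master is started in the foreground; a daemonizing twistd would detach from these file descriptors.

```python
import resource
import subprocess


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit (runs in the child before exec)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def launch_with_capture(argv, cwd, out_path):
    """Start argv with stdout and stderr appended to out_path and core dumps enabled.

    argv, cwd and out_path are supplied by the caller; nothing here is
    specific to buildbot, and the names are illustrative assumptions.
    """
    with open(out_path, "ab") as out:
        # The child inherits its own copy of the file descriptor, so it is
        # safe for the parent to close the file once Popen returns.
        return subprocess.Popen(
            argv,
            cwd=cwd,
            stdout=out,
            stderr=subprocess.STDOUT,
            preexec_fn=allow_core_dumps,
        )
```

If the hard limit itself is 0, this changes nothing and the limit would have to be raised by root first. The argv would be whatever command currently starts the master, run without daemonizing.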
It died again around 3:30am PST. It was down for 2 hours before we noticed. I adjusted the nrpe.cfg to throw CRITICAL errors rather than WARNINGS.
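For reference, Nagios-style checks signal their state through the plugin exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). Below is a minimal sketch of a check that reports CRITICAL when the master process is gone. The function name and the idea of reading a pid file are assumptions for illustration; this is not the check actually configured in nrpe.cfg.

```python
import os

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes


def check_pidfile(pidfile):
    """Return (exit_code, message) describing whether the pid in pidfile is alive."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError) as exc:
        return CRITICAL, "CRITICAL: cannot read pid from %s (%s)" % (pidfile, exc)
    if pid <= 0:
        return CRITICAL, "CRITICAL: invalid pid %d in %s" % (pid, pidfile)
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return CRITICAL, "CRITICAL: no process with pid %d" % pid
    except PermissionError:
        pass  # the process exists but belongs to another user
    return OK, "OK: process %d is running" % pid
```

A wrapper script would print the message and pass the code to `sys.exit()`, so a dead master pages immediately instead of raising a warning.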
One of the crashes left behind a core file, which indicated a crash in listiter_next (listobject.c). This led me to http://bugs.python.org/issue5328. The proposed solution is to upgrade glibc on the machine.

Using the scripts provided, I managed to reproduce the crash from http://bugs.python.org/issue4732 on staging master 3 times in less than an hour. This crash is similar, but not identical, to our crash and to the crash from issue 5328.

I upgraded glibc on staging master to glibc-2.5-42.el5_4.3 via yum, and restarted the test script.  Several hours later, still no crash.
Product: mozilla.org → Release Engineering