qm-mini-vista02, qm-pmac05, qm-mini-xp04, qm-pxp-fast02, qm-mini-vista04, qm-mini-ubuntu02, qm-pmac02, qm-mini-ubuntu05, qm-mini-ubuntu04, qm-pmac04 Talos machines report:

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]

buildbot perfmaster reports messages:

<snip>
2008/04/16 14:52 PDT Buildslave qm-pxp-fast01 detached from WINNT 5.1 mini talos trunk fast
2008/04/16 14:52 PDT BotPerspective.detached(qm-pxp-fast01)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista03
2008/04/16 14:52 PDT Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista03)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista01
2008/04/16 14:52 PDT Buildslave qm-mini-vista01 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista01)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista02
2008/04/16 14:52 PDT Buildslave qm-mini-vista02 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista02)
2008/04/16 14:52 PDT <Build WINNT 6.0 talos trunk>.lostRemote
2008/04/16 14:52 PDT stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb6c10a6c>
2008/04/16 14:52 PDT addCompleteLog(interrupt)
</snip>

Looks to be a network hiccup of some sort, should fix itself in the next cycle now that connectivity is re-established.
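For anyone wanting to confirm from the master log that the slaves dropped together rather than one by one, a rough sketch along these lines might help. It assumes the log lines look like the "Buildslave ... detached from ..." lines quoted above; the function name `detached_by_minute` is made up for illustration and is not part of buildbot.

```python
import re
from collections import defaultdict

# Matches master log lines of the form quoted above, e.g.
#   2008/04/16 14:52 PDT Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
# The exact layout is an assumption based only on that excerpt.
DETACHED = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}) \S+ "
    r"Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$"
)


def detached_by_minute(lines):
    """Group detached slave names by the timestamp on their log line."""
    groups = defaultdict(list)
    for line in lines:
        match = DETACHED.match(line.strip())
        if match:
            groups[match.group("stamp")].append(match.group("slave"))
    return dict(groups)
```

Many slaves sharing a single timestamp, as in the excerpt, would point at the master or the network rather than at any individual slave.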
Also took out Mozilla2 talos boxes qm-plinux-trunk01,02,03 and qm-pxp-trunk02.
MacOSX Darwin 8.8.4 qm-xserve01 also shows the same connectivity error and is burning.
All machines *except* qm-xserve01 came back online, reconnected, and went back to green again, without any manual intervention. qm-xserve01 was still showing offline.

Logging into qm-xserve01, I confirmed that the buildbot slave was not running. Nothing interesting in the logs. Unclear from the command shell history whether this slave was manually stopped or had crashed out.

I restarted the slave, watched it start a job, and crash out again. Nothing interesting in the logs.

Cleaned out the log files and changed buildbot.tac from "umask=None" to "umask=002". I restarted the slave, watched it start a job, and crash out again. Found the following at the end of the build log:

...
tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===
Checking Dependencies...
make: *** [build-plugin] Bus error
make: *** [tools] Error 2
make: *** [tools_tier_gecko] Error 2
make: *** [tier_toolkit] Error 2
make: *** [default] Error 2
make: *** [build] Error 2
program finished with exit code 2

"Bus error" looks scary. We'll try rebooting.
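As a side note on reading the log above: with recursive make, the innermost failing target is printed first and the parent targets then report "Error 2" as the failure propagates upward. A small sketch for pulling out that first failure, assuming the "make: *** [target] reason" layout seen in the excerpt (the helper name `first_make_failure` is hypothetical):

```python
import re

# Matches make failure lines as they appear in the excerpt above, e.g.
#   make: *** [build-plugin] Bus error
#   make: *** [tools] Error 2
# An optional "[N]" recursion depth after "make" is tolerated as well.
MAKE_ERROR = re.compile(
    r"^make(?:\[\d+\])?: \*\*\* \[(?P<target>[^\]]+)\] (?P<reason>.+)$"
)


def first_make_failure(log_text):
    """Return (target, reason) for the first make failure line, or None."""
    for line in log_text.splitlines():
        match = MAKE_ERROR.match(line.strip())
        if match:
            return match.group("target"), match.group("reason")
    return None
```

Applied to the log above, this would single out the build-plugin target and its "Bus error" rather than the chain of "Error 2" lines that follow it.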
Changing component as it's not just Talos machines.
Killed any remaining build threads, stopped slave, started slave and tried building again. No "bus error" this time, but failed with the "lost connection" problem again:

tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===
Checking Dependencies...
** BUILD SUCCEEDED **
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/layout/tools/reftest => ../../../dist/bin/chrome/reftest.jar
+++ updating chrome ../../../dist/bin/chrome/reftest.manifest
+++ overriding content/quit.js content/reftest.js
updating: content/quit.js (stored 0%)
updating: content/reftest.js (stored 0%)
tools_tier_toolkit
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/toolkit/components/alerts => ../../../dist/bin/chrome/toolkit.jar
+++ overriding content/global/alerts/alert.xul content/global/alerts/alert.js
updating: content/global/alerts/alert.xul (stored 0%)
updating: content/global/alerts/alert.js (stored 0%)
tier_app: extensions browser
export_tier_app
Creating ../../dist/include/browsercomps
Creating ../../../../dist/include/microsummaries
Creating ../../../../dist/include/migration
Creating ../../../dist/include/browsersearch
Creating ../../../dist/include/sessionstore
Creating ../../../../dist/include/shellservice
Creating ../../../../dist/include/browser-feeds
Creating ../../../../dist/include/browserplaces
Creating ../../../dist/include/fuel
libs_tier_app
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter/locales => ../../../dist/bin/chrome/en-US.jar
+++ updating chrome ../../../dist/bin/chrome/en-US.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter => ../../dist/bin/chrome/reporter.jar
+++ updating chrome ../../dist/bin/chrome/reporter.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/browser/base => ../../dist/bin/chrome/browser.jar
+++ updating chrome ../../dist/bin/chrome/browser.manifest
+++ overriding content/browser/aboutDialog.xul content/browser/aboutDialog.js content/browser/aboutRobots.xhtml content/browser/browser.css content/browser/browser.js content/browser/browser.xul content/browser/credits.xhtml content/browser/EULA.js content/browser/EULA.xhtml content/browser/EULA.xul content/browser/metaData.js content/browser/metaData.xul content/browser/pageinfo/pageInfo.xul content/browser/pageinfo/pageInfo.js content/browser/pageinfo/pageInfo.css content/browser/pageinfo/feeds.js content/browser/pageinfo/feeds.xml content/browser/pageinfo/permissions.js content/browser/pageinfo/security.js content/browser/openLocation.js content/browser/openLocation.xul content/browser/pageReport.js content/browser/pageReport.xul content/browser/pageReportFirstTime.xul content/browser/safeMode.js content/browser/safeMode.xul content/browser/sanitize.js content/browser/sanitize.xul content/browser/tabbrowser.css content/browser/tabbrowser.xml content/browser/urlbarBindings.xml content/browser/utilityOverlay.js content/browser/web-panels.js content/browser/web-panels.xul content/browser/baseMenuOverlay.xul content/browser/nsContextMenu.js content/browser/hiddenWindow.xul content/browser/macBrowserOverlay.xul content/browser/downloadManagerOverlay.xul content/browser/extensionsManagerOverlay.xul content/browser/jsConsoleOverlay.xul content/browser/softwareUpdateOverlay.xul content/browser/customizeToolbarSheet.js content/browser/viewSourceOverlay.xul content/browser/license.html

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
After a reboot, restarted buildbot slave. This time failed out with:

checkout start: Wed Apr 16 18:46:31 PDT 2008
cvs -d :ext:firstname.lastname@example.org:/cvsroot -q -z 3 co mozilla/client.mk mozilla/browser/config/mozconfig mozilla/browser/config/version.txt mozilla/build/unix/uniq.pl mozilla/calendar/sunbird/config/version.txt mozilla/mail/config/version.txt mozilla/suite/config/version.txt
dm-cvs01.mozilla.org: Connection refused
cvs [checkout aborted]: end of file from server (consult above messages if any)
make: *** [checkout] Error 1

(cc-ing mrz, as he was also working on the machine this afternoon)
Bumping to blocker, since we've been down for ~5 hours now.
As I mentioned to John, the only thing we did was to take an unsuccessful clone with SuperDuper! and then two additional clones with Disk Utility and CCC. None of which would have resulted in "dm-cvs01.mozilla.org: Connection refused".
This is holding the tree closed. Is it still being investigated?
No, this is not a server-ops bug - it's rel-eng, where it was assigned.
So, what's the process of getting somebody to fix the remaining tier 1 unit test machine and get it back running again? As gavin said, this is keeping the tree closed. :(
Per comments 0-6, John has been working on it. Not sure what his current status is.
(In reply to comment #10)
> No, this is not a server-ops bug - it's rel-eng, where it was assigned.

Just trying to understand here - is it a rel-eng bug because it was determined to be beyond server-ops' purview and escalated, or is it a rel-eng bug because server-ops no longer handle tier 1 support for these machines?
Opened by rel-eng, handled all day by them, something we were never involved in and nothing we can fix (especially if rebooting didn't fix it and John hasn't fixed it yet). We still handle tier 1, but they have already escalated to their team by opening the bug and assigning it to them. If it had been opened on server-ops, we would have escalated it...
Justin, mrz: Everything I've done so far is already in this bug (comment#1-comment#6). I don't know what more I can do about this. It's throwing bus errors (see comment#3) and after reboot has been hitting cvs connection errors (see comment#6). I can leave the bug assigned to me if you like, rather than tossing it back and forth, but I could really do with some help here.

1) Is there any way the cloning could have changed something on this machine?
2) Any idea what caused all these machines to go offline at the same time this afternoon?
I've tried rebooting again and still hit the same cvs connection refused problem. Even running cvs on command line hits the same problem. Opened serverops bug#429453 for the cvs problem.
Nothing we did today would have any effect on the network or bus - all we did was read the hard drive.
Per bug https://bugzilla.mozilla.org/show_bug.cgi?id=429453 this was a .profile misconfiguration, not a network issue. Is this fixed now and (hopefully) the tree re-opened?
The CVS_RSH fix from bug 429453 got qm-xserve01 green again, and the tree reopened about 1 am. (There was unrelated bustage following that from checkins).
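For future reference, a quick way to sanity-check this kind of problem on a slave is to look at which transport program cvs would pick up for an :ext: CVSROOT. The sketch below is only illustrative: the function name `ext_transport` and the example CVSROOT are made up, and the "rsh" fallback is an assumption based on how older CVS releases document CVS_RSH. This bug does not say exactly what was wrong in .profile, so treat it as a diagnostic aid, not a description of the actual fix.

```python
import os


def ext_transport(cvsroot, env, default="rsh"):
    """Return the program cvs would be expected to run for an :ext: CVSROOT.

    Returns None when the CVSROOT does not use the :ext: access method.
    The "rsh" default is an assumption about older CVS releases; check the
    documentation for the cvs version actually installed on the machine.
    """
    if not cvsroot.startswith(":ext:"):
        return None
    # An unset or empty CVS_RSH falls back to the assumed default.
    return env.get("CVS_RSH") or default


if __name__ == "__main__":
    # Hypothetical CVSROOT, for illustration only.
    root = ":ext:user@cvs.example.org:/cvsroot"
    print("cvs would use:", ext_transport(root, os.environ))
```

If this reports rsh on a machine that is expected to reach the server over ssh, a "Connection refused" like the one in the checkout log above would be a plausible symptom.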