Closed
Bug 429406
Opened 16 years ago
Closed 16 years ago
talos/unit test machines burning due to loss of connectivity with master
Categories
(Release Engineering :: General, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: anodelman, Assigned: joduinn)
qm-mini-vista02, qm-pmac05, qm-mini-xp04, qm-pxp-fast02, qm-mini-vista04, qm-mini-ubuntu02, qm-pmac02, qm-mini-ubuntu05, qm-mini-ubuntu04, qm-pmac04

Talos machines report:

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]

buildbot perfmaster reports messages:

<snip>
2008/04/16 14:52 PDT Buildslave qm-pxp-fast01 detached from WINNT 5.1 mini talos trunk fast
2008/04/16 14:52 PDT BotPerspective.detached(qm-pxp-fast01)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista03
2008/04/16 14:52 PDT Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista03)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista01
2008/04/16 14:52 PDT Buildslave qm-mini-vista01 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista01)
2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista02
2008/04/16 14:52 PDT Buildslave qm-mini-vista02 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT BotPerspective.detached(qm-mini-vista02)
2008/04/16 14:52 PDT <Build WINNT 6.0 talos trunk>.lostRemote
2008/04/16 14:52 PDT stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb6c10a6c>
2008/04/16 14:52 PDT addCompleteLog(interrupt)
</snip>

Looks to be a network hiccup of some sort, should fix itself in the next cycle now that connectivity is re-established.
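For anyone triaging a similar mass disconnect, here is a rough sketch of how the "detached" lines in a master log like the one above could be grouped by builder to see which slaves dropped. The function name and the regular expression are illustrative assumptions based only on the line format quoted in this bug; they are not part of buildbot.

```python
import re
from collections import defaultdict

# Matches lines shaped like the excerpt above, e.g.
# "2008/04/16 14:52 PDT Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk"
# (pattern is an assumption derived from the quoted log, not a buildbot API).
DETACHED = re.compile(
    r"^\d{4}/\d{2}/\d{2} \d{2}:\d{2} \w+ "
    r"Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$"
)

def detached_by_builder(lines):
    """Group detached slave names by the builder they detached from."""
    result = defaultdict(list)
    for line in lines:
        match = DETACHED.match(line.strip())
        if match:
            result[match.group("builder")].append(match.group("slave"))
    return dict(result)

sample = [
    "2008/04/16 14:52 PDT Buildslave qm-pxp-fast01 detached from WINNT 5.1 mini talos trunk fast",
    "2008/04/16 14:52 PDT BotPerspective.detached(qm-pxp-fast01)",
    "2008/04/16 14:52 PDT <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista03",
    "2008/04/16 14:52 PDT Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk",
]

for builder, slaves in detached_by_builder(sample).items():
    print(builder, "->", ", ".join(slaves))
```

Only the "Buildslave ... detached from ..." lines are counted; the BotPerspective and Builder lines repeat the same event and are skipped to avoid double counting.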
Assignee
Updated•16 years ago
Priority: -- → P1
Reporter
Comment 1•16 years ago
Also took out Mozilla2 talos boxes qm-plinux-trunk01,02,03 and qm-pxp-trunk02.
Reporter
Comment 2•16 years ago
MacOSX Darwin 8.8.4 qm-xserve01 also shows the same connectivity error and is burning.
Reporter
Updated•16 years ago
Summary: talos machines burning due to loss of connectivity with master → talos/unit test machines burning due to loss of connectivity with master
Assignee
Comment 3•16 years ago
All machines *except* qm-xserve01 came back online, reconnected, and went back to green again, without any manual intervention. qm-xserve01 was still showing offline.

Logging into qm-xserve01, I confirmed that the buildbot slave was not running. Nothing interesting in logs. Unclear from command shell history if this slave was manually stopped or had crashed out. Nothing interesting in logs.

I restarted slave, watched it start a job, and crash out again. Nothing interesting in logs.

Cleaned out log files and changed buildbot.tac from "umask=None" to "umask=002". I restarted slave, watched it start a job, and crash out again. Found the following at the end of the build log:

...
tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===
Checking Dependencies...
make[5]: *** [build-plugin] Bus error
make[4]: *** [tools] Error 2
make[3]: *** [tools_tier_gecko] Error 2
make[2]: *** [tier_toolkit] Error 2
make[1]: *** [default] Error 2
make: *** [build] Error 2
program finished with exit code 2

"Bus error" looks scary. We'll try rebooting.
Assignee
Comment 4•16 years ago
Changing component as it's not just talos machines.
Component: Release Engineering: Talos → Release Engineering: Maintenance
Assignee
Comment 5•16 years ago
Killed any remaining build threads, stopped slave, started slave and tried building again. No "bus error" this time, but failed with the "lost connection" problem again:

tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===
Checking Dependencies...
** BUILD SUCCEEDED **
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/layout/tools/reftest => ../../../dist/bin/chrome/reftest.jar
+++ updating chrome ../../../dist/bin/chrome/reftest.manifest
+++ overriding content/quit.js content/reftest.js
updating: content/quit.js (stored 0%)
updating: content/reftest.js (stored 0%)
tools_tier_toolkit
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/toolkit/components/alerts => ../../../dist/bin/chrome/toolkit.jar
+++ overriding content/global/alerts/alert.xul content/global/alerts/alert.js
updating: content/global/alerts/alert.xul (stored 0%)
updating: content/global/alerts/alert.js (stored 0%)
tier_app: extensions browser
export_tier_app
Creating ../../dist/include/browsercomps
Creating ../../../../dist/include/microsummaries
Creating ../../../../dist/include/migration
Creating ../../../dist/include/browsersearch
Creating ../../../dist/include/sessionstore
Creating ../../../../dist/include/shellservice
Creating ../../../../dist/include/browser-feeds
Creating ../../../../dist/include/browserplaces
Creating ../../../dist/include/fuel
libs_tier_app
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter/locales => ../../../dist/bin/chrome/en-US.jar
+++ updating chrome ../../../dist/bin/chrome/en-US.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter => ../../dist/bin/chrome/reporter.jar
+++ updating chrome ../../dist/bin/chrome/reporter.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/browser/base => ../../dist/bin/chrome/browser.jar
+++ updating chrome ../../dist/bin/chrome/browser.manifest
+++ overriding content/browser/aboutDialog.xul content/browser/aboutDialog.js content/browser/aboutRobots.xhtml content/browser/browser.css content/browser/browser.js content/browser/browser.xul content/browser/credits.xhtml content/browser/EULA.js content/browser/EULA.xhtml content/browser/EULA.xul content/browser/metaData.js content/browser/metaData.xul content/browser/pageinfo/pageInfo.xul content/browser/pageinfo/pageInfo.js content/browser/pageinfo/pageInfo.css content/browser/pageinfo/feeds.js content/browser/pageinfo/feeds.xml content/browser/pageinfo/permissions.js content/browser/pageinfo/security.js content/browser/openLocation.js content/browser/openLocation.xul content/browser/pageReport.js content/browser/pageReport.xul content/browser/pageReportFirstTime.xul content/browser/safeMode.js content/browser/safeMode.xul content/browser/sanitize.js content/browser/sanitize.xul content/browser/tabbrowser.css content/browser/tabbrowser.xml content/browser/urlbarBindings.xml content/browser/utilityOverlay.js content/browser/web-panels.js content/browser/web-panels.xul content/browser/baseMenuOverlay.xul content/browser/nsContextMenu.js content/browser/hiddenWindow.xul content/browser/macBrowserOverlay.xul content/browser/downloadManagerOverlay.xul content/browser/extensionsManagerOverlay.xul content/browser/jsConsoleOverlay.xul content/browser/softwareUpdateOverlay.xul content/browser/customizeToolbarSheet.js content/browser/viewSourceOverlay.xul content/browser/license.html

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
Assignee
Comment 6•16 years ago
After a reboot, restarted buildbot slave. This time failed out with:

checkout start: Wed Apr 16 18:46:31 PDT 2008
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3 co mozilla/client.mk mozilla/browser/config/mozconfig mozilla/browser/config/version.txt mozilla/build/unix/uniq.pl mozilla/calendar/sunbird/config/version.txt mozilla/mail/config/version.txt mozilla/suite/config/version.txt
dm-cvs01.mozilla.org: Connection refused
cvs [checkout aborted]: end of file from server (consult above messages if any)
make: *** [checkout] Error 1

(cc-ing mrz, as he was also working on the machine this afternoon)
Comment 7•16 years ago
Bumping to blocker, since we've been down for ~5 hours now.
Severity: normal → blocker
Comment 8•16 years ago
As I mentioned to John, the only thing we did was to take an unsuccessful clone with SuperDuper! and then two additional clones with Disk Utility and CCC. None of which would have resulted in "dm-cvs01.mozilla.org: Connection refused".
Comment 9•16 years ago
This is holding the tree closed. Is it still being investigated?
Updated•16 years ago
Assignee: anodelman → server-ops
Comment 10•16 years ago
No, this is not a server-ops bug - it's rel-eng, where it was assigned.
Assignee: server-ops → anodelman
Comment 11•16 years ago
So, what's the process of getting somebody to fix the remaining tier 1 unit test machine and get it back running again? As gavin said, this is keeping the tree closed. :(
Comment 12•16 years ago
Per comments 0-6, John has been working on it. Not sure what his current status is.
Comment 13•16 years ago
(In reply to comment #10)
> No, this is not a server-ops bug - it's rel-eng, where it was assigned.

Just trying to understand here - is it a rel-eng bug because it was determined to be beyond server-ops' purview and escalated, or is it a rel-eng bug because server-ops no longer handle tier 1 support for these machines?
Comment 14•16 years ago
Opened by rel-eng, handled all day by them, something we were never involved in and nothing we can fix (especially if rebooting didn't fix it or john hasn't fixed it yet). We still handle tier 1, but they have already escalated to their team by opening the bug and assigning it to them. If opened on server-ops, we would have escalated it...
Assignee
Comment 15•16 years ago
Justin, mrz: Everything I've done so far is already in this bug (comment#1-comment#6). Don't know anything more I can do about this. It's throwing bus errors (see comment#3) and after reboot has been hitting cvs permission errors (see comment#6).

I can leave the bug assigned to me if you like, rather than tossing it back and forth, but I could really do with some help here.

1) Is there any way the cloning could have changed something on this machine?
2) Any idea what caused all these machines to go offline at the same time this afternoon?
Assignee: anodelman → joduinn
Assignee
Comment 16•16 years ago
I've tried rebooting again and still hit the same cvs connection refused problem. Even running cvs on command line hits the same problem. Opened serverops bug#429453 for the cvs problem.
Comment 17•16 years ago
Nothing we did today would have any effect on the network or bus - all we did was read the hard drive.
Comment 18•16 years ago
Per bug https://bugzilla.mozilla.org/show_bug.cgi?id=429453 this was a .profile misconfiguration, and no issue with the network. Is this fixed now and (hopefully) tree re-opened?
Comment 19•16 years ago
The CVS_RSH fix from bug 429453 got qm-xserve01 green again, and the tree reopened about 1 am. (There was unrelated bustage following that from checkins).
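This bug does not record the exact .profile change made in bug 429453. As a hedged illustration only: CVS consults the CVS_RSH environment variable to pick the remote-shell program for :ext: repositories, and a stock CVS build typically falls back to rsh when it is unset, which would be consistent with the "Connection refused" in comment 6. The sketch below assumes that is what happened; the helper name cvs_ext_transport is hypothetical and not part of CVS or buildbot.

```python
import os

def cvs_ext_transport(cvsroot, env=None):
    """Guess which remote-shell program CVS would use for a CVSROOT.

    Returns None when the CVSROOT does not use the :ext: access method.
    Assumption: when CVS_RSH is unset or empty, CVS falls back to rsh.
    """
    env = os.environ if env is None else env
    if not cvsroot.startswith(":ext:"):
        return None
    return env.get("CVS_RSH") or "rsh"

# CVSROOT taken from the checkout command quoted in comment 6.
root = ":ext:unittest@cvs.mozilla.org:/cvsroot"

print(cvs_ext_transport(root, env={}))                  # environment without CVS_RSH
print(cvs_ext_transport(root, env={"CVS_RSH": "ssh"}))  # environment with CVS_RSH=ssh
```

Running a check like this in the same environment the buildbot slave starts from (rather than an interactive login shell) is the point of the exercise, since the two can load different profile files.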
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering