Closed Bug 429406 Opened 16 years ago Closed 16 years ago

talos/unit test machines burning due to loss of connectivity with master

Categories

(Release Engineering :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: anodelman, Assigned: joduinn)

qm-mini-vista02, qm-pmac05, qm-mini-xp04, qm-pxp-fast02, qm-mini-vista04, qm-mini-ubuntu02, qm-pmac02, qm-mini-ubuntu05, qm-mini-ubuntu04, qm-pmac04

Talos machines report:
remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]

buildbot perfmaster reports messages:
<snip>
2008/04/16 14:52 PDT  Buildslave qm-pxp-fast01 detached from WINNT 5.1 mini talos trunk fast
2008/04/16 14:52 PDT  BotPerspective.detached(qm-pxp-fast01)
2008/04/16 14:52 PDT  <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista03
2008/04/16 14:52 PDT  Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT  BotPerspective.detached(qm-mini-vista03)
2008/04/16 14:52 PDT  <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista01
2008/04/16 14:52 PDT  Buildslave qm-mini-vista01 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT  BotPerspective.detached(qm-mini-vista01)
2008/04/16 14:52 PDT  <Builder 'WINNT 6.0 talos trunk' at -1450101012>.detached qm-mini-vista02
2008/04/16 14:52 PDT  Buildslave qm-mini-vista02 detached from WINNT 6.0 talos trunk
2008/04/16 14:52 PDT  BotPerspective.detached(qm-mini-vista02)
2008/04/16 14:52 PDT  <Build WINNT 6.0 talos trunk>.lostRemote
2008/04/16 14:52 PDT   stopping currentStep <perfrunner.MozillaRunPerfTests instance at 0xb6c10a6c>
2008/04/16 14:52 PDT  addCompleteLog(interrupt)
</snip>

Looks to be a network hiccup of some sort; it should fix itself in the next cycle now that connectivity is re-established.
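To check whether the detaches really cluster in a single moment (which would point at one network event rather than individual slave failures), the master log can be grouped by timestamp. This is a minimal sketch, assuming the log lines keep the "Buildslave NAME detached from BUILDER" shape shown in the snippet above; the function name `detached_by_minute` and the regular expression are illustrative and not part of buildbot.

```python
import re
from collections import defaultdict

# Matches master log lines such as:
#   2008/04/16 14:52 PDT  Buildslave qm-mini-vista02 detached from WINNT 6.0 talos trunk
# The exact log format is an assumption based on the snippet in this bug.
DETACH_RE = re.compile(
    r"^(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}) \w+\s+"
    r"Buildslave (?P<slave>\S+) detached from (?P<builder>.+)$"
)


def detached_by_minute(lines):
    """Map each minute-resolution timestamp to the slaves that detached then."""
    groups = defaultdict(list)
    for line in lines:
        match = DETACH_RE.match(line.strip())
        if match:
            groups[match.group("stamp")].append(match.group("slave"))
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "2008/04/16 14:52 PDT  Buildslave qm-pxp-fast01 detached from WINNT 5.1 mini talos trunk fast",
        "2008/04/16 14:52 PDT  BotPerspective.detached(qm-pxp-fast01)",
        "2008/04/16 14:52 PDT  Buildslave qm-mini-vista03 detached from WINNT 6.0 talos trunk",
    ]
    for stamp, slaves in detached_by_minute(sample).items():
        print(stamp, len(slaves), ", ".join(slaves))
```

Many slaves sharing one timestamp would be consistent with the "network hiccup" reading; detaches spread over time would suggest something else.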
Priority: -- → P1
Also took out Mozilla2 talos boxes qm-plinux-trunk01,02,03 and qm-pxp-trunk02.
MacOSX Darwin 8.8.4 qm-xserve01 also shows the same connectivity error and is burning.
Summary: talos machines burning due to loss of connectivity with master → talos/unit test machines burning due to loss of connectivity with master
All machines *except* qm-xserve01 came back online, reconnected, and went back to green again, without any manual intervention.


qm-xserve01 was still showing offline. Logging into qm-xserve01, I confirmed that the buildbot slave was not running. Unclear from command shell history whether this slave was manually stopped or had crashed out. Nothing interesting in logs.

I restarted slave, watched it start a job, and crash out again. Nothing interesting in logs. Cleaned out log files and changed buildbot.tac from "umask=None" to "umask=002". I restarted slave, watched it start a job, and crash out again. Found the following at the end of the build log:
...
tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===
Checking Dependencies...
make[5]: *** [build-plugin] Bus error
make[4]: *** [tools] Error 2
make[3]: *** [tools_tier_gecko] Error 2
make[2]: *** [tier_toolkit] Error 2
make[1]: *** [default] Error 2
make: *** [build] Error 2
program finished with exit code 2


"Bus error" looks scary. We'll try rebooting.
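In a nested make failure like the one above, the outer "Error 2" lines only propagate the failure upward; the innermost make level carries the actual cause. A minimal sketch for pulling that line out of a build log follows, assuming GNU make's "make[N]: *** [target] reason" message format as it appears in this log. The helper name `deepest_make_error` is hypothetical.

```python
import re

# Matches GNU make failure lines such as:
#   make[5]: *** [build-plugin] Bus error
#   make: *** [build] Error 2
# A missing [N] is treated as depth 0 (the top-level make).
MAKE_ERR_RE = re.compile(
    r"^make(?:\[(?P<depth>\d+)\])?: \*\*\* \[(?P<target>[^\]]+)\] (?P<reason>.+)$"
)


def deepest_make_error(lines):
    """Return (depth, target, reason) for the most deeply nested make error,
    or None if no make error line is present."""
    best = None
    for line in lines:
        match = MAKE_ERR_RE.match(line.strip())
        if not match:
            continue
        depth = int(match.group("depth") or 0)
        if best is None or depth > best[0]:
            best = (depth, match.group("target"), match.group("reason"))
    return best


if __name__ == "__main__":
    log_tail = [
        "make[5]: *** [build-plugin] Bus error",
        "make[4]: *** [tools] Error 2",
        "make: *** [build] Error 2",
    ]
    print(deepest_make_error(log_tail))
```

Applied to the log tail in this comment, the deepest entry is the build-plugin target, which is where the bus error was raised.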
Changing component as it's not just talos machines.
Component: Release Engineering: Talos → Release Engineering: Maintenance
Killed any remaining build threads, stopped slave, started slave and tried building again. No "bus error" this time, but failed with the "lost connection" problem again:

tools_tier_gecko
deprecated option: -buildstyle is no longer supported in xcodebuild. Use -configuration instead.
=== BUILDING NATIVE TARGET Default Plugin WITH CONFIGURATION Deployment ===

Checking Dependencies...
** BUILD SUCCEEDED **
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/layout/tools/reftest  => ../../../dist/bin/chrome/reftest.jar
+++ updating chrome ../../../dist/bin/chrome/reftest.manifest
+++ overriding content/quit.js content/reftest.js 
updating: content/quit.js (stored 0%)
updating: content/reftest.js (stored 0%)
tools_tier_toolkit
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/toolkit/components/alerts  => ../../../dist/bin/chrome/toolkit.jar
+++ overriding content/global/alerts/alert.xul content/global/alerts/alert.js 
updating: content/global/alerts/alert.xul (stored 0%)
updating: content/global/alerts/alert.js (stored 0%)
tier_app:  extensions browser
export_tier_app
Creating ../../dist/include/browsercomps
Creating ../../../../dist/include/microsummaries
Creating ../../../../dist/include/migration
Creating ../../../dist/include/browsersearch
Creating ../../../dist/include/sessionstore
Creating ../../../../dist/include/shellservice
Creating ../../../../dist/include/browser-feeds
Creating ../../../../dist/include/browserplaces
Creating ../../../dist/include/fuel
libs_tier_app
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter/locales  => ../../../dist/bin/chrome/en-US.jar
+++ updating chrome ../../../dist/bin/chrome/en-US.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/extensions/reporter  => ../../dist/bin/chrome/reporter.jar
+++ updating chrome ../../dist/bin/chrome/reporter.manifest
+++ making chrome /builds/slave/trunk_osx/mozilla/objdir/browser/base  => ../../dist/bin/chrome/browser.jar
+++ updating chrome ../../dist/bin/chrome/browser.manifest
+++ overriding content/browser/aboutDialog.xul content/browser/aboutDialog.js content/browser/aboutRobots.xhtml content/browser/browser.css content/browser/browser.js content/browser/browser.xul content/browser/credits.xhtml content/browser/EULA.js content/browser/EULA.xhtml content/browser/EULA.xul content/browser/metaData.js content/browser/metaData.xul content/browser/pageinfo/pageInfo.xul content/browser/pageinfo/pageInfo.js content/browser/pageinfo/pageInfo.css content/browser/pageinfo/feeds.js content/browser/pageinfo/feeds.xml content/browser/pageinfo/permissions.js content/browser/pageinfo/security.js content/browser/openLocation.js content/browser/openLocation.xul content/browser/pageReport.js content/browser/pageReport.xul content/browser/pageReportFirstTime.xul content/browser/safeMode.js content/browser/safeMode.xul content/browser/sanitize.js content/browser/sanitize.xul content/browser/tabbrowser.css content/browser/tabbrowser.xml content/browser/urlbarBindings.xml content/browser/utilityOverlay.js content/browser/web-panels.js content/browser/web-panels.xul content/browser/baseMenuOverlay.xul content/browser/nsContextMenu.js content/browser/hiddenWindow.xul content/browser/macBrowserOverlay.xul content/browser/downloadManagerOverlay.xul content/browser/extensionsManagerOverlay.xul content/browser/jsConsoleOverlay.xul content/browser/softwareUpdateOverlay.xul content/browser/customizeToolbarSheet.js content/browser/viewSourceOverlay.xul content/browser/license.html 

remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
After a reboot, restarted buildbot slave. This time failed out with:

checkout start: Wed Apr 16 18:46:31 PDT 2008
cvs -d :ext:unittest@cvs.mozilla.org:/cvsroot -q -z 3  co    mozilla/client.mk mozilla/browser/config/mozconfig mozilla/browser/config/version.txt mozilla/build/unix/uniq.pl mozilla/calendar/sunbird/config/version.txt mozilla/mail/config/version.txt mozilla/suite/config/version.txt
dm-cvs01.mozilla.org: Connection refused
cvs [checkout aborted]: end of file from server (consult above messages if any)
make: *** [checkout] Error 1

(cc-ing mrz, as he was also working on the machine this afternoon)
Bumping to blocker, since we've been down for ~5 hours now.
Severity: normal → blocker
As I mentioned to John, the only thing we did was to take an unsuccessful clone with SuperDuper! and then two additional clones with Disk Utility and CCC.  

None of which would have resulted in "dm-cvs01.mozilla.org: Connection refused".
This is holding the tree closed, is it still being investigated?
Assignee: anodelman → server-ops
No, this is not a server-ops bug - it's rel-eng, where it was assigned.
Assignee: server-ops → anodelman
So, what's the process of getting somebody to fix the remaining tier 1 unit test machine and get it back running again? As gavin said, this is keeping the tree closed. :(
Per comments 0-6, John has been working on it.  Not sure what his current status is.
(In reply to comment #10)
> No, this is not a server-ops bug - it's rel-eng, where it was assigned.

Just trying to understand here - is it a rel-eng bug because it was determined to be beyond server-ops' purview and escalated, or is it a rel-eng bug because server-ops no longer handle tier 1 support for these machines?
Opened by rel-eng, handled all day by them, something we were never involved in and nothing we can fix (especially if rebooting didn't fix it or John hasn't fixed it yet).

We still handle tier 1, but they already have escalated to their team by opening the bug and assigning it to them.  If opened on server-ops, we would have escalated it...
Justin, mrz: 

Everything I've done so far is already in this bug (comment#1-comment#6). Don't know anything more I can do about this. It's throwing bus errors (see comment#3) and after reboot has been hitting cvs "Connection refused" errors (see comment#6).

I can leave the bug assigned to me if you like, rather than tossing it back and forth, but I could really do with some help here.

1) Is there any way the cloning could have changed something on this machine? 

2) Any idea what caused all these machines to go offline at the same time this afternoon?

Assignee: anodelman → joduinn
I've tried rebooting again and still hit the same cvs connection refused problem. Even running cvs on command line hits the same problem. 

Opened serverops bug#429453 for the cvs problem.
Nothing we did today would have any effect on the network or bus - all we did was read the hard drive.
Per bug https://bugzilla.mozilla.org/show_bug.cgi?id=429453 this was a .profile misconfiguration, and no issue with the network.  Is this fixed now and (hopefully) tree re-opened?
The CVS_RSH fix from bug 429453 got qm-xserve01 green again, and the tree reopened about 1 am. (There was unrelated bustage following that from checkins).
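For anyone hitting the same symptom: with an `:ext:` CVSROOT, cvs runs the program named in the `CVS_RSH` environment variable to reach the server, so a login profile that fails to export it can produce "Connection refused" even when the network is fine. The sketch below is a pre-flight check a slave could run before checkout. It assumes a cvs build that falls back to `rsh` when `CVS_RSH` is unset (true of many older builds, but not confirmed for this machine), and the function names are hypothetical. The actual `.profile` change is recorded in bug 429453, not here.

```python
import os


def cvs_remote_shell(env=None, default="rsh"):
    """Return the program cvs is expected to use for :ext: access.

    `default` is an assumption about the local cvs build.
    """
    env = os.environ if env is None else env
    return env.get("CVS_RSH") or default


def ext_transport_problem(cvsroot, env=None):
    """Return a warning string if an :ext: CVSROOT would not go over ssh,
    otherwise None."""
    if not cvsroot.startswith(":ext:"):
        return None
    shell = os.path.basename(cvs_remote_shell(env))
    if shell != "ssh":
        return "CVS_RSH resolves to %r; expected ssh for %s" % (shell, cvsroot)
    return None


if __name__ == "__main__":
    root = ":ext:unittest@cvs.mozilla.org:/cvsroot"
    print(ext_transport_problem(root) or "CVS_RSH looks fine")
```

Running this under the same account and shell startup files as the buildbot slave matters, since the environment seen by an interactive login can differ from the one the slave inherits.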

Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: mozilla.org → Release Engineering