Bug 693470
Opened 13 years ago
Closed 13 years ago
talos-r3-leopard-033 has busted DNS
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: coop)
References
Details
(Whiteboard: [buildduty][badslave])
https://build.mozilla.org/buildapi/recent/talos-r3-leopard-033 says it's doing what it can to get through the backlog, by taking between 2 seconds and 3 minutes to break jobs with an "abort: error: Temporary failure in name resolution" while cloning build tools. Nice as that is for clearing out the pending jobs, there's the risk that people on try will see what happened and retrigger twice instead of just once, and then we go backward instead of forward.
Reporter | ||
Comment 1•13 years ago
Not so funny when it picks up my retrigger of a job it broke, and breaks it again, though :|
Severity: normal → blocker
Comment 2•13 years ago
I've tried a graceful shutdown on the buildbot master; the machine is refusing ssh and vnc.
Comment 3•13 years ago
Seems to have stopped taking work now. Disabled in slave-alloc and added to bug 687837 for a reboot. We should take a look when it comes back to see if it's OK.
Comment 4•13 years ago
Not at all clear what went wrong here. The last green build was a Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-2/5, which initiated a reboot at 14:58:28. The log looks fine on the buildbot master, but in the system.log we have:

Oct 10 14:58:28 talos-r3-leopard-033 reboot[313]: rebooted by cltbld
Oct 10 14:58:28 talos-r3-leopard-033 reboot[313]: SHUTDOWN_TIME: 1318283908 780407
Oct 10 14:58:28 talos-r3-leopard-033 mDNSResponder mDNSResponder-176.3 (Jun 17 2009 18:57:49)[16]: stopping
Oct 10 14:58:29 talos-r3-leopard-033 /System/Library/CoreServices/Finder.app/Contents/MacOS/Finder[124]: dnssd_clientstub read_all(10) failed 0/28 0

then nothing until it actually reboots at 16:34 by bug 687837:

Oct 10 16:34:51 localhost kernel[0]: npvhash=4095
Oct 10 16:34:41 localhost com.apple.launchctl.System[2]: fsck_hfs: Volume is journaled. No checking performed.

The dnssd_clientstub error appears to be 'normal', appearing on more reboots than not.

On the next job, a Rev3 MacOSX Leopard 10.5.8 try debug test mochitest-other starting at 15:14, there is an error cloning build/tools:

abort: error: Temporary failure in name resolution

Subsequently it fails to clean up space and reboot, because the scripts for that are in the missing clone. By the next job, a Rev3 MacOSX Leopard 10.5.8 try talos svg also at 15:14, the response to nohup rm -vrf * was

nohup: can't migrate to background session: Inappropriate ioctl for device

followed by another DNS failure, and this one when asked to reboot:

sudo: uid 502 does not exist in the passwd file!

Since the reboot the box seems responsive and vaguely normal.
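For illustration only, here is a minimal sketch of the kind of pre-flight check that would stop a slave in this state from burning through jobs: try to resolve the hg host a few times before cloning, and refuse the job if resolution keeps failing. The function name `can_resolve`, its parameters, and the retry policy are hypothetical and are not part of the build tools or buildbot.

```python
import socket
import time


def can_resolve(host, attempts=3, delay=1.0,
                resolver=socket.getaddrinfo, sleep=time.sleep):
    """Return True if `host` resolves within `attempts` tries.

    Hypothetical helper: `resolver` and `sleep` are injectable so the
    retry logic can be exercised without touching the network.
    """
    for attempt in range(attempts):
        try:
            # getaddrinfo raises socket.gaierror on resolution failure,
            # including "Temporary failure in name resolution".
            resolver(host, None)
            return True
        except socket.gaierror:
            # Wait before retrying, but not after the final attempt.
            if attempt < attempts - 1:
                sleep(delay)
    return False
```

A job wrapper could call `can_resolve("hg.mozilla.org")` (the host name is an assumption here) before cloning build/tools and stop taking work on `False`, rather than failing the job after the clone breaks.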
Comment 5•13 years ago
We also had issues with talos-r4-snow-008 today. The earliest failure was

hdiutil: detach: could not eject /dev/disk1s2: Inappropriate ioctl for device
2011-10-10 18:29:37.689 diskimages-helper[303:171b] -processKernelRequest: will sleep received

when unpacking an Aurora dmg, followed by a failure to reboot. Aki rebooted it, and it seems to be behaving on the first job since. He speculates that 'they both lost console. not sure if it's for the same reasons'.

talos-r3-leopard-033 is still disabled.
Comment 6•13 years ago
Nick, what would you like to do with this slave? Shall we put it back in the pool and watch?
Comment 7•13 years ago
jhford, could this have been the screen resolution forcing that got removed to fix up the new snow leopard slaves?
Comment 8•13 years ago
(In reply to Nick Thomas [:nthomas] from comment #7)
> jhford, could this have been the screen resolution forcing that got removed
> to fix up the new snow leopard slaves ?
Assignee: nobody → jhford
Comment 9•13 years ago
At this point, I haven't touched anything that is deployed to the leopard slaves. The entirety of the r4 manifest changes are in talos_osx_rev4.pp. These errors look like buildbot was started improperly, and don't look like they are related to screen resolution.
Assignee: jhford → nobody
Updated•13 years ago
Assignee: nobody → bhearsum
Comment 10•13 years ago
Didn't get to this last week, back to the pool with 'ye.
Assignee: bhearsum → nobody
Assignee | ||
Updated•13 years ago
Assignee: nobody → coop
Assignee | ||
Comment 11•13 years ago
I've re-enabled this slave in slavealloc.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering