Closed Bug 693470 Opened 13 years ago Closed 13 years ago

talos-r3-leopard-033 has busted DNS

Categories

(Release Engineering :: General, defect, P3)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: coop)

References

Details

(Whiteboard: [buildduty][badslave])

https://build.mozilla.org/buildapi/recent/talos-r3-leopard-033 says it's doing what it can to get through the backlog, by taking between 2 seconds and 3 minutes to break jobs with an "abort: error: Temporary failure in name resolution" while cloning build tools. Nice as that is for clearing out the pending jobs, there's the risk that people on try will see what happened and retrigger twice instead of just once, and then we go backward instead of forward.
Not so funny when it picks up my retrigger of a job it broke, and breaks it again, though :|
Severity: normal → blocker
I've tried a graceful shutdown on the buildbot master; the machine is refusing ssh and vnc.
Seems to have stopped taking work now. Disabled in slave-alloc and added to bug 687837 for a reboot. We should take a look when it comes back to see if it's OK.
Severity: blocker → normal
Depends on: 687837
Priority: -- → P3
Whiteboard: [buildduty][badslave]
Not at all clear what went wrong here. The last green build was a
  Rev3 MacOSX Leopard 10.5.8 mozilla-central opt test mochitests-2/5 
which initiated a reboot at 14:58:28. The log looks fine on the buildbot master, but in the system.log we have:
  Oct 10 14:58:28 talos-r3-leopard-033 reboot[313]: rebooted by cltbld
  Oct 10 14:58:28 talos-r3-leopard-033 reboot[313]: SHUTDOWN_TIME: 1318283908 780407
  Oct 10 14:58:28 talos-r3-leopard-033 mDNSResponder mDNSResponder-176.3 (Jun 17 2009 18:57:49)[16]: stopping
  Oct 10 14:58:29 talos-r3-leopard-033 /System/Library/CoreServices/Finder.app/Contents/MacOS/Finder[124]: dnssd_clientstub read_all(10) failed 0/28 0 
then nothing until it actually reboots at 16:34 via bug 687837:
  Oct 10 16:34:51 localhost kernel[0]: npvhash=4095
  Oct 10 16:34:41 localhost com.apple.launchctl.System[2]: fsck_hfs: Volume is journaled.  No checking performed.

The dnssd_clientstub error appears to be 'normal', appearing on more reboots than not.
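
One way to spot a silent window like the one above in system.log is to parse the timestamps and look for the largest gap between entries. This is only a minimal sketch, assuming classic syslog lines that begin with "Mon DD HH:MM:SS"; the function names and the default year are illustrative (syslog does not record the year), and entries are sorted first because, as in the excerpt, boot-time lines can appear out of order.

```python
from datetime import datetime


def parse_ts(line, year=2011):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line (year is assumed)."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def largest_gap(lines, year=2011):
    """Return (gap, start, end) for the longest silence, or None if under 2 entries."""
    times = sorted(parse_ts(l, year) for l in lines if l.strip())
    gaps = [(later - earlier, earlier, later)
            for earlier, later in zip(times, times[1:])]
    return max(gaps) if gaps else None
```

Fed the six log lines quoted above, this should report the window between the dnssd_clientstub line at 14:58:29 and the first post-reboot entry at 16:34:41.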

On the next job, a 
  Rev3 MacOSX Leopard 10.5.8 try debug test mochitest-other
starting at 15:14, there is an error cloning build/tools
  abort: error: Temporary failure in name resolution
Subsequently it fails to clean up space and reboot, because the scripts for that are in the missing clone.
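
Because the cleanup and reboot scripts live in the clone that failed, a slave in this state keeps accepting and breaking jobs. A hedged sketch of a pre-flight check that could run before the clone is below; it is not what the slaves actually do. The `can_resolve` name, the injectable `resolver` parameter, and whichever hostname is passed in are assumptions for illustration. Only `socket.getaddrinfo` and `socket.gaierror` are real standard-library names.

```python
import socket


def can_resolve(host, port=443, resolver=socket.getaddrinfo):
    """Return True if `host` resolves, False on a resolver error.

    "Temporary failure in name resolution" is the message for EAI_AGAIN,
    which Python surfaces as socket.gaierror.
    """
    try:
        resolver(host, port)
        return True
    except socket.gaierror:
        return False
```

A job wrapper could call this with the hostname of the repository server and refuse work (or request a reboot) when it returns False, rather than failing partway through the clone.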

By the next job, a
  Rev3 MacOSX Leopard 10.5.8 try talos svg
also at 15:14, the response to
  nohup rm -vrf *
was 
  nohup: can't migrate to background session: Inappropriate ioctl for device
followed by another DNS failure, and this one when asked to reboot:
  sudo: uid 502 does not exist in the passwd file!
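
The sudo message suggests that user lookups were failing along with DNS. A minimal sketch of checking for that condition from Python follows, assuming the standard `pwd` module reflects the same lookup path sudo uses on this box (not verified here); `uid_is_known` and its `lookup` parameter are hypothetical names.

```python
import pwd


def uid_is_known(uid, lookup=pwd.getpwuid):
    """Return True if the uid maps to a passwd entry, False otherwise.

    pwd.getpwuid raises KeyError when no entry is found for the uid.
    """
    try:
        lookup(uid)
        return True
    except KeyError:
        return False
```

Something like `uid_is_known(os.getuid())` returning False for the build user would be a sign the machine needs a reboot before it takes another job.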

Since the reboot the box seems responsive and vaguely normal.
We also had issues with talos-r4-snow-008 today. The earliest failure was 
  hdiutil: detach: could not eject /dev/disk1s2: Inappropriate ioctl for device
  2011-10-10 18:29:37.689 diskimages-helper[303:171b] -processKernelRequest: will sleep received
when unpacking an Aurora dmg, followed by a failure to reboot. 

Aki rebooted it, and it seems to be behaving on the first job since. He speculates that 'they both lost console. not sure if it's for the same reasons'.

talos-r3-leopard-033 is still disabled.
Nick, what would you like to do with this slave?
Shall we put it back in the pool and watch?
jhford, could this have been the screen resolution forcing that got removed to fix up the new snow leopard slaves ?
(In reply to Nick Thomas [:nthomas] from comment #7)
> jhford, could this have been the screen resolution forcing that got removed
> to fix up the new snow leopard slaves ?
Assignee: nobody → jhford
At this point, I haven't touched anything that is deployed to the leopard slaves.  The entirety of the r4 manifest changes are in talos_osx_rev4.pp

These errors look like buildbot was started improperly, and don't look like they are related to screen resolution.
Assignee: jhford → nobody
Assignee: nobody → bhearsum
Didn't get to this last week, back to the pool with 'ye.
Assignee: bhearsum → nobody
Assignee: nobody → coop
I've re-enabled this slave in slavealloc.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering