Closed Bug 896948 (buildbot-master79) - opened 11 years ago, closed 11 years ago

buildbot-master79 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: rail)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

First nagios alert:

Tue 02:20:58 PDT [472] buildbot-master79.srv.releng.usw2.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Disabled in slavealloc to get slaves on the other five mac test masters.
* Confirmed unreachable via ssh
* No events found in the console for usw2
* AWS instance monitoring detected a reachability failure at 2013-07-23 02:15 GMT-0700
* The console says this under the Status tab:

We are actively trying to resolve this issue. You can try one of the following options:
    Stop and start the instance (if EBS-backed AMI).
    Terminate the instance and launch a replacement (if instance-store backed AMI).
    Wait for us to resolve the issue.
Attempted to Stop the instance using the console; it went into the stopping state but made no further progress after a few minutes, so I did a Force Stop by using Stop again. It's still in the 'stopping' state.
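
For the record, the same Force Stop can be issued through the EC2 API rather than the console. A minimal sketch using today's boto3 (the instance id is a placeholder; the region follows from the usw2 hostname):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    # Force=True is the API-side equivalent of the console's Force Stop
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"], Force=True)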

Over to rail as the resident expert on these VMs. The obvious courses of action at this point are opening a support ticket to try to recover the existing VM, trying the API to see if it's more responsive or detailed than the console (see the sketch below), or terminating the instance and recreating it.
Assignee: nobody → rail
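
As a sketch of what "trying the API" might look like with boto3 (the instance id is again a placeholder), the status check can report more detail than the console shows:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],
        IncludeAllInstances=True,  # report status even while stopped/stopping
    )
    for s in resp["InstanceStatuses"]:
        print(s["InstanceState"]["Name"],
              s["SystemStatus"]["Status"],
              s["InstanceStatus"]["Status"])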
The list of 29 builds that buildapi said were running on this master have already gone away, presumably to be rerun on other masters.
It's back now. I'm going to investigate this issue a bit before enabling it in slavealloc.

For build/test machines we check for so-called "impaired" status (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html), and it's quite common (5-10 times a day).

We should probably report impaired-status events for other types of instances as well.
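
Such a check can be run fleet-wide using the instance-status filters; a minimal boto3 sketch (the filter name comes from the EC2 docs linked above):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    # Only instances whose instance-level status check is failing
    resp = ec2.describe_instance_status(
        Filters=[{"Name": "instance-status.status", "Values": ["impaired"]}]
    )
    for s in resp["InstanceStatuses"]:
        print("impaired:", s["InstanceId"], s["AvailabilityZone"])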

FTR, it would also be useful to get the console output (Get System Log) before you stop.
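
That's also available via the API; a boto3 sketch (instance id is a placeholder, and note the API returns the output base64-encoded):

    import base64

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    out = ec2.get_console_output(InstanceId="i-0123456789abcdef0")
    # "Output" is base64-encoded; decode it before reading
    print(base64.b64decode(out.get("Output", "")).decode("utf-8", "replace"))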
Nothing interesting in the logs:


Jul 23 01:48:25 buildbot-master79 puppet-agent[8709]: Finished catalog run in 59.09 seconds
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPREQUEST on eth0 to 10.132.49.1 port 67 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPACK from 10.132.49.1 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: bound to 10.132.49.117 -- renewal in 1738 seconds.
Jul 23 05:02:43 buildbot-master79 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Jul 23 05:02:43 buildbot-master79 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="863" x-info="http://www.rsyslog.com"] (re)start
Jul 23 05:02:43 buildbot-master79 kernel: Initializing cgroup subsys cpuset


I enabled the master and am watching the master logs.
The master looks good so far. I filed bug 897002 to be more proactive here.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I did have a look at the console log, and it was all the usual stuff you get in dmesg at boot time, then the login prompt, then a couple of lines which weren't obviously errors. I should have saved those last two, as they've been garbage collected from my brain now.
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard