Closed
Bug 896948
(buildbot-master79)
Opened 12 years ago
Closed 12 years ago
buildbot-master79 problem tracking
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: rail)
Details
(Whiteboard: [buildduty][buildslaves][capacity])
First nagios alert:
Tue 02:20:58 PDT [472] buildbot-master79.srv.releng.usw2.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
Disabled the master in slavealloc so that slaves move to the other five Mac test masters.
Comment 1 (Reporter) • 12 years ago
* confirmed unreachable via ssh
* No events found in console for usw2
* AWS instance monitoring detected failure of reachability at 2013-07-23 02:15 GMT-0700
* The console shows this message under the Status tab:
We are actively trying to resolve this issue. You can try one of the following options:
Stop and start the instance (if EBS-backed AMI).
Terminate the instance and launch a replacement (if instance-store backed AMI).
Wait for us to resolve the issue.
Comment 2 (Reporter) • 12 years ago
Attempted to stop the instance from the console. It went into the 'stopping' state but made no further progress after a few minutes, so I forced the stop by choosing Stop again. It is still in the 'stopping' state.
Over to rail as the resident expert on these VMs. The obvious courses at this point are opening a support ticket to try to recover the existing VM, trying the API to see if it's more responsive or detailed than the console, or terminating the instance and recreating.
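For the "try the API" option, a minimal sketch of forcing the stop and polling the instance state is shown here. It assumes a present-day boto3 EC2 client, which is not necessarily the tooling used at the time; the function name `force_stop_and_wait` and its parameters are hypothetical.

```python
import time


def force_stop_and_wait(ec2, instance_id, timeout=300, poll=15, sleep=time.sleep):
    """Force-stop an instance and poll until it is 'stopped' or the timeout passes.

    Assumption: `ec2` is a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-west-2").
    Returns the last state name that EC2 reported.
    """
    # Force=True asks EC2 to stop without waiting for a clean OS shutdown.
    ec2.stop_instances(InstanceIds=[instance_id], Force=True)
    waited = 0
    while True:
        resp = ec2.describe_instances(InstanceIds=[instance_id])
        state = resp["Reservations"][0]["Instances"][0]["State"]["Name"]
        if state == "stopped" or waited >= timeout:
            return state
        sleep(poll)
        waited += poll
```

If the call returns 'stopping' after the timeout, the instance is stuck in the same way the console showed, and a support ticket or a terminate-and-recreate remains the fallback.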
Assignee: nobody → rail
Comment 3 (Reporter) • 12 years ago
The 29 builds that buildapi listed as running on this master have already gone away, presumably to be rerun on other masters.
Comment 4 (Assignee) • 12 years ago
It's back now. I'm going to investigate this issue a bit before re-enabling the master in slavealloc.
For build/test machines we check for the so-called "impaired" status (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html), which is quite common (5-10 times a day).
We should probably also report impaired status events for other types of instances.
FTR, it would also be useful to get the console output (Get System Log) before you stop the instance.
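The two checks described above can be sketched as follows: list instances whose status checks report "impaired", and save the console output before stopping anything. This assumes a present-day boto3 EC2 client rather than the tooling actually in use here; the helper names `impaired_instances` and `save_console_output` are hypothetical.

```python
def impaired_instances(ec2, instance_ids):
    """Return the IDs whose system or instance status check is 'impaired'.

    Assumption: `ec2` is a boto3 EC2 client. Pagination is ignored, which is
    only reasonable for a short, explicit list of instance IDs.
    """
    resp = ec2.describe_instance_status(
        InstanceIds=list(instance_ids),
        IncludeAllInstances=True,  # also report instances that are not running
    )
    impaired = []
    for status in resp.get("InstanceStatuses", []):
        system = status.get("SystemStatus", {}).get("Status")
        instance = status.get("InstanceStatus", {}).get("Status")
        if "impaired" in (system, instance):
            impaired.append(status["InstanceId"])
    return impaired


def save_console_output(ec2, instance_id, path):
    """Write the console output (Get System Log) to `path`; return its length.

    Assumption: boto3 hands back the output as already-decoded text. The
    'Output' key can be missing when no output is available yet.
    """
    resp = ec2.get_console_output(InstanceId=instance_id)
    text = resp.get("Output", "")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text)
    return len(text)
```

Calling `save_console_output` before any stop request keeps a copy of the last console lines, which may no longer be retrievable once the instance has been restarted.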
Comment 5 (Assignee) • 12 years ago
Nothing interesting in the logs:
Jul 23 01:48:25 buildbot-master79 puppet-agent[8709]: Finished catalog run in 59.09 seconds
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPREQUEST on eth0 to 10.132.49.1 port 67 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPACK from 10.132.49.1 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: bound to 10.132.49.117 -- renewal in 1738 seconds.
Jul 23 05:02:43 buildbot-master79 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Jul 23 05:02:43 buildbot-master79 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="863" x-info="http://www.rsyslog.com"] (re)start
Jul 23 05:02:43 buildbot-master79 kernel: Initializing cgroup subsys cpuset
I enabled the master and am watching the master logs.
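The excerpt above goes quiet between 02:07:21 and 05:02:43, when rsyslogd restarts. A minimal sketch for spotting such silent periods in a syslog file follows; the function name `find_log_gaps`, the one-hour threshold, and the assumed year are illustrative choices, since syslog timestamps carry no year.

```python
from datetime import datetime, timedelta


def find_log_gaps(lines, year=2013, threshold=timedelta(hours=1)):
    """Return (previous, next, delta) for consecutive entries at least `threshold` apart.

    Assumes classic syslog timestamps ("Jul 23 01:48:25") in the first 15
    characters of each line and an English locale for the month name.
    """
    gaps = []
    prev = None
    for line in lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev is not None and stamp - prev >= threshold:
            gaps.append((prev, stamp, stamp - prev))
        prev = stamp
    return gaps
```

Run over the seven lines quoted above, this reports a single gap of 2:55:22, between the last DHCP renewal and the rsyslogd restart.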
Comment 6 (Assignee) • 12 years ago
The master looks good so far. I filed bug 897002 to be more proactive here.
Updated (Assignee) • 12 years ago
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 7 (Reporter) • 12 years ago
I did have a look at the console log and it was all the usual stuff you get in dmesg at boot time, then the login prompt, then a couple of lines which weren't obviously errors. Should have saved those last two as they've been garbage collected from my brain now.
Updated • 12 years ago
Product: mozilla.org → Release Engineering
Updated • 8 years ago
Product: Release Engineering → Infrastructure & Operations
Updated • 6 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard