Closed Bug 896948 (buildbot-master79) - opened 11 years ago, closed 11 years ago

buildbot-master79 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: rail)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

First nagios alert:

Tue 02:20:58 PDT [472] buildbot-master79.srv.releng.usw2.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%

Disabled in slavealloc to get slaves on the other five mac test masters.
* Confirmed unreachable via ssh
* No events found in the console for usw2
* AWS instance monitoring detected a reachability failure at 2013-07-23 02:15 GMT-0700
* The console says this under the Status tab:

We are actively trying to resolve this issue. You can try one of the following options:
    Stop and start the instance (if EBS-backed AMI).
    Terminate the instance and launch a replacement (if instance-store backed AMI).
    Wait for us to resolve the issue.
Attempted to Stop the instance using the console; it went into the stopping state but made no further progress after a few minutes, so I did a Force Stop by using Stop again. It's still in the 'stopping' state.
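
For the record, the same Force Stop can be issued through the EC2 API rather than the console. A minimal sketch using today's boto3 (the instance id is a placeholder; the region follows from the usw2 hostname):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    # Force=True is the API-side equivalent of the console's Force Stop
    ec2.stop_instances(InstanceIds=["i-0123456789abcdef0"], Force=True)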

Over to rail as the resident expert on these VMs. The obvious courses of action at this point are opening a support ticket to try to recover the existing VM, trying the API to see if it's more responsive or detailed than the console (see the sketch below), or terminating the instance and recreating it.
Assignee: nobody → rail
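
As a sketch of what "trying the API" might look like with boto3 (the instance id is again a placeholder), the status check can report more detail than the console shows:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    resp = ec2.describe_instance_status(
        InstanceIds=["i-0123456789abcdef0"],
        IncludeAllInstances=True,  # report status even while stopped/stopping
    )
    for s in resp["InstanceStatuses"]:
        print(s["InstanceState"]["Name"],
              s["SystemStatus"]["Status"],
              s["InstanceStatus"]["Status"])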
The list of 29 builds that buildapi said were running on this master have already gone away, presumably to be rerun on other masters.
It's back now. I'm going to investigate this issue a bit before enabling it in slavealloc.

For build/test machines we check for so-called "impaired" status (http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-system-instance-status-check.html), and it's quite common (5-10 times a day).

We should probably report impaired-status events for other types of instances as well.
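
Such a check can be run fleet-wide using the instance-status filters; a minimal boto3 sketch (the filter name comes from the EC2 docs linked above):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    # Only instances whose instance-level status check is failing
    resp = ec2.describe_instance_status(
        Filters=[{"Name": "instance-status.status", "Values": ["impaired"]}]
    )
    for s in resp["InstanceStatuses"]:
        print("impaired:", s["InstanceId"], s["AvailabilityZone"])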

FTR, it would also be useful to get the console output (Get System Log) before you stop.
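
That's also available via the API; a boto3 sketch (instance id is a placeholder, and note the API returns the output base64-encoded):

    import base64

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-2")
    out = ec2.get_console_output(InstanceId="i-0123456789abcdef0")
    # "Output" is base64-encoded; decode it before reading
    print(base64.b64decode(out.get("Output", "")).decode("utf-8", "replace"))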
Nothing interesting in the logs:


Jul 23 01:48:25 buildbot-master79 puppet-agent[8709]: Finished catalog run in 59.09 seconds
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPREQUEST on eth0 to 10.132.49.1 port 67 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: DHCPACK from 10.132.49.1 (xid=0x2db4efe1)
Jul 23 02:07:21 buildbot-master79 dhclient[775]: bound to 10.132.49.117 -- renewal in 1738 seconds.
Jul 23 05:02:43 buildbot-master79 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Jul 23 05:02:43 buildbot-master79 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="863" x-info="http://www.rsyslog.com"] (re)start
Jul 23 05:02:43 buildbot-master79 kernel: Initializing cgroup subsys cpuset


I enabled the master and am watching the master logs.
The master looks good so far. I filed bug 897002 to be more proactive here.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
I did have a look at the console log, and it was all the usual stuff you get in dmesg at boot time, then the login prompt, then a couple of lines which weren't obviously errors. I should have saved those last two, as they've been garbage collected from my brain now.
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard