Closed Bug 644991 Opened 13 years ago Closed 13 years ago

disable masters on buildbot-master1 due to slow drive

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

Details

(Whiteboard: [buildmasters][buildduty])

Attachments

(1 file)

Depending on how jittery the drive is feeling, I get hdparm results ranging from 70-95MB/s on buildbot-master1. The machine has been up for 97 days, and has plenty of these messages in dmesg:
ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x8)
ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x2)

will reboot and see what the machine feels like afterwards
Running a SMART self test causes a whole ton of those spurious interrupt messages
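One way to put a number on "a whole ton" is to count the spurious interrupt lines per ATA port before and after a self test. The following is only a minimal sketch, assuming the kernel messages keep the format quoted above; the function name `count_spurious_interrupts` is hypothetical and not part of any existing tool.

```python
import re
from collections import Counter

# Matches kernel lines of the form quoted in this bug, e.g.:
#   ata1: spurious interrupt (irq_stat 0x8 active_tag -84148995 sactive 0x8)
# An optional leading timestamp such as "[12345.678] " is tolerated.
SPURIOUS_RE = re.compile(r"\b(ata\d+): spurious interrupt \(irq_stat (0x[0-9a-fA-F]+)")


def count_spurious_interrupts(dmesg_text):
    """Return a Counter mapping ATA port name to its number of spurious interrupt lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = SPURIOUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the captured output of `dmesg` once before and once after the SMART self test, then comparing the two counts, would show how much the test contributes.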
Before we can take this out of service, we need to move the slaves off of it. Moving to RelEng; throw it back when the slaves have been moved.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Priority: -- → P3
Whiteboard: [buildmasters][slaveduty]
Blocks: 651542
The slaves are off the master.

I wasn't aware that we are lacking so many masters.

I have bumped the severity to critical as it affects our current capacity.
Assignee: nobody → server-ops-releng
Severity: normal → critical
Component: Release Engineering → Server Operations: RelEng
Priority: P3 → --
QA Contact: release → zandr
Whiteboard: [buildmasters][slaveduty] → [buildmasters][slaveduty][buildduty]
Reducing severity, as there is not much more that you can do besides getting a healthy master.
Severity: critical → normal
Sorry for the noise. There are still some slaves connected.
Assignee: server-ops-releng → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Found in triage: 

1) Before we can hand this over to IT, we have to power off all the master instances running on this box.

2) Per triage right now, the try and build master instances are still running. The test master instance is already off. 

3) Because of load on other buildbot masters, we cannot power off this master just yet, no matter how sick it is. Once the other new masters (04,06) are fully online, and sharing load, we can revisit.

Pushing this bug to catlee, who is working on setting up 04,06, and therefore will know when it's safe to power off buildbot-master1.
Assignee: nobody → catlee
Depends on: 656084
Any update on this? Would like to get this shut down so we can send it out for repair.
zandr, we are waiting to disable the masters on this host until we have finished setting up more masters in SJC (bug 656413).

Adding dependency.
Depends on: 656413
Summary: buildbot-master1 has slow drive → disable masters on buildbot-master1 due to slow drive
Whiteboard: [buildmasters][slaveduty][buildduty] → [buildmasters][slaveduty][buildduty] waiting on setup of other masters
Whiteboard: [buildmasters][slaveduty][buildduty] waiting on setup of other masters → [buildmasters][buildduty][waiting on setup of other masters]
This is ready to go back to IX.
No longer depends on: 656413
Whiteboard: [buildmasters][buildduty][waiting on setup of other masters] → [buildmasters][buildduty]
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Attachment #536417 - Flags: review?(dustin)
In which bug is this host tracked to be sent back to IX?
Comment on attachment 536417 [details] [diff] [review]
Remove buildbot-master1 from masters json

Remove them from slavealloc, too?
Attachment #536417 - Flags: review?(dustin) → review+
(In reply to comment #12)
> Comment on attachment 536417 [details] [diff] [review]
> Remove buildbot-master1 from masters json
> 
> Remove them from slavealloc, too?

Yeah, might as well. Is that doable easily?
Attachment #536417 - Flags: checked-in+
Via mysql, yes - I took care of it.
Product: mozilla.org → Release Engineering