Closed Bug 774986 Opened 12 years ago Closed 12 years ago

Active (Connected to buildbot) tegra count at 50% normal.

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Hardware: x86_64
OS: Windows 7
Type: task
Priority: Not set
Severity: blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: Callek)

References

Details

I'm filing this as a post-mortem bug, and in relops, as I *suspect* this is a good place for it to live/get tracked. We can move it back over the wall if it bothers anyone.

So on Monday afternoon (~1p PT 07-16) I noticed that we had only 62 tegras actively connected to buildbot per our dashboard (http://mobile-dashboard.pub.build.mozilla.org/), with ~110 tegras showing "Tegra online but buildslave is not".

A cursory glance at a few of the clientproxy/twistd logs on two separate foopies (23/24) showed that we were hitting Bug 774984 in most cases (there were a few minor cases of things like verify flapping, but they were not the real issue).

Instead of spending MANY hours triaging each individual tegra, I logged into all the foopies, 2 at a time, in reverse order (from foopy24) and brought them up.

* screen -x
* ran |./check.sh| (from /builds)
* for all tegras with an INACTIVE or OFFLINE entry in the buildbot column:
** for i in ### ### ###; do ./stop_cp.sh tegra-$i; ./start_cp.sh tegra-$i; done
* wait 1-2 minutes (once last step is done)
* re-ran |./check.sh|
* for a few stragglers (per foopy) that still didn't seem to be back up, ran stop_cp, then did a PDU powercycle, then start_cp manually
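
The restart loop in the steps above can be sketched as a small shell function. This is only an illustration, not what was actually run on the foopies: `restart_tegras` and the `DRY_RUN` switch are hypothetical names added for this example, and it assumes it is run from /builds, where `stop_cp.sh` and `start_cp.sh` live.

```shell
#!/bin/sh
# Sketch only: restart clientproxy for each tegra ID passed as an argument.
# stop_cp.sh / start_cp.sh are the scripts named in the steps above;
# restart_tegras and DRY_RUN are hypothetical, introduced for illustration.

restart_tegras() {
    for i in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: print the commands instead of touching the device.
            echo "./stop_cp.sh tegra-$i"
            echo "./start_cp.sh tegra-$i"
        else
            ./stop_cp.sh "tegra-$i"
            ./start_cp.sh "tegra-$i"
        fi
    done
}
```

Setting `DRY_RUN=1` before calling `restart_tegras` with the IDs read off `./check.sh` prints the stop/start pairs without running them, which is a cheap sanity check before doing it for real. Picking the IDs out of the `./check.sh` output is left manual here, since its output format is not shown in this bug.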

When all was said and done (~6p PT) we were back up to 159 tegras active/online and connected to buildbot. So crisis averted, but Bug 774984 just got bumped on my personal priority list (even if the only priority is getting briarpatch to find some way to handle the issue).
FWIW, as of right now (just over a day later) we're back down to 109 tegras active/online and connected to buildbot.

Lots of theories on the cause, but no facts at my fingertips; so I will refrain from vocalizing them.

[moving severity to blocker, while I resolve, since it *was* a blocker yesterday]
Severity: normal → blocker
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Moving to our side of the fence until there's an action for IT.

Callek: you mentioned something about a network blip taking these offline, or am I misremembering?
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
(In reply to Chris Cooper [:coop] from comment #2)
> Callek: you mentioned something about a network blip taking these offline,
> or am I misremembering?

That is/was my largest theory, but I have no evidence to support it yet. So before I blame[d] any specific team (other than myself) and that gets immortalized in a bug, I want[ed] to find proof/evidence to support the theory.
Blocks: 777273
This might be what I am seeing in bug 777273.
Blocks: 782627
Product: mozilla.org → Release Engineering
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard