Closed Bug 1234261 (t-w732-ix-195) Opened 9 years ago Closed 8 years ago

t-w732-ix-195 problem tracking

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P3)

x86
Windows 7

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [buildduty][buildslaves][capacity])

On a rampage of failing the test runs when it manages to stay connected through a whole one, which it rarely does. Disabled.
Re-imaged and enabled in slavealloc
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Heh. Didn't realize it had been that long since Q and grenade started running AWS slaves with various names that lied and pretended to be this slave.

If something fails and claims to be this slave, you can bet it isn't actually this slave, which you can usually (at least so far) determine by looking in the log's spew of env vars for a computername like T-W7-AWS-BASE.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
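For anyone triaging later, here's a minimal sketch of the log check described above. The log path and the exact formatting of the env-var dump are assumptions; the idea is just to pull the COMPUTERNAME line out of the spew and see whether it looks like an AWS impostor:

    import re
    import sys

    # Hypothetical log file name; point this at the actual job log.
    LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "live_backing.log"

    # The env-var dump is assumed to contain lines of the form KEY=value.
    pattern = re.compile(r"^\s*COMPUTERNAME=(?P<name>\S+)",
                         re.IGNORECASE | re.MULTILINE)

    with open(LOG_PATH, "r", errors="replace") as f:
        match = pattern.search(f.read())

    if match:
        name = match.group("name")
        if name.upper().startswith("T-W7-AWS"):
            print("Impostor: log reports COMPUTERNAME=%s, not the hardware slave" % name)
        else:
            print("Log reports COMPUTERNAME=%s" % name)
    else:
        print("No COMPUTERNAME line found in the env-var dump")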
Or, as an alternate fun possibility, the reimage done back when 195 was both hardware and a lying AWS instance might have reimaged the hardware slave to think its computername was T-W7-AWS-BASE, though probably not.
:philor, should we try to re-image the slave again?
Flags: needinfo?(philringnalda)
Maybe?

The only information I have is that while t-w732-ix-195 is enabled in slavealloc, something with the slavename t-w732-ix-195 and the env var "computername" t-w7-aws-base takes jobs and fails talos jobs by trying to tell graphserver that its name is t-w7-aws-base. Since I disabled t-w732-ix-195 in slavealloc, I haven't seen another instance of that.

I can imagine several possibilities:
- part of the setup to have t-w7-aws-* lie and claim to be t-w732-ix-195 broke reimaging of the actual t-w732-ix-195, that has since been reverted, and another reimage would fix it;
- it hasn't been reverted, and another reimage won't fix it;
- rather than the actual t-w732-ix-195 failing jobs, it was a t-w7-aws instance which is obeying the disabling of t-w732-ix-195 in slavealloc;
- it was a t-w7-aws instance, but rather than obeying the disabling it just happened to be terminated around the time I disabled t-w732-ix-195.
Flags: needinfo?(philringnalda) → needinfo?(rthijssen)
I have looked in the EC2 instance list and seen that there is an instance sharing the name "t-w732-ix-195". The instance ID is i-8454df32, and it has a moz-owner tag of q@mozilla.com. The instance state was 'stopped' at the time I checked, but I would guess that if it were started, it would create the sort of problems described above. I believe the instance is probably being used to create the base image that we will later use to spawn golden and spot images. It would probably benefit from being renamed to a name that doesn't exist in slave-alloc or buildbot-configs, but as it isn't my instance, I don't want to make that change myself, in case there are circumstances or reasons I haven't considered for keeping the name as is.
Flags: needinfo?(rthijssen)
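For future reference, a minimal sketch of that EC2 lookup (assumes boto3 with credentials and region configured for the relevant account; the tag keys Name and moz-owner match what was observed above):

    import boto3

    ec2 = boto3.client("ec2")

    # Find any EC2 instance whose Name tag collides with the hardware slave's name.
    resp = ec2.describe_instances(
        Filters=[{"Name": "tag:Name", "Values": ["t-w732-ix-195"]}]
    )

    for reservation in resp["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
            print(instance["InstanceId"],
                  instance["State"]["Name"],
                  tags.get("moz-owner"))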
@Q: would it be possible to change the name of the instance to one that doesn't exist in slavealloc or buildbot-configs?

Thanks.
Flags: needinfo?(q)
Alin, I wiped out these instances and made sure that hostname is out of the testing loop. I am making sure that the real 195 is working today.
Flags: needinfo?(q)
195 is taking jobs back in scl3 and things look good so far.
Mostly green jobs at the moment. Marking as resolved for now.
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard