Bug 1234261 (t-w732-ix-195)

t-w732-ix-195 problem tracking

Status: RESOLVED FIXED
Priority: P3
Severity: normal
Opened: 3 years ago
Last updated: 6 months ago

People: (Reporter: philor, Unassigned)

Whiteboard: [buildduty][buildslaves][capacity]

(Reporter)

Description

3 years ago
On a rampage of failing the test runs when it manages to stay connected through a whole one, which it rarely does. Disabled.

Comment 1

3 years ago
Re-imaged and enabled in slavealloc.
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Reporter)

Comment 2

3 years ago
Heh. Didn't realize it had been that long since Q and grenade started running AWS slaves with various names which lied and pretended that they were this slave.

If something fails and claims to be this slave, you can bet it isn't actually this slave, which you can usually, at least so far, determine by looking in the log for the spew of env vars for a computername like T-W7-AWS-BASE.
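
Something like this quick check (an untested sketch; it assumes the log dumps env vars as KEY=VALUE lines and that the log path is passed on the command line, both of which are assumptions on my part) would flag the impostor runs:

import re
import sys

# Scan a test log for the env-var dump and report the COMPUTERNAME value,
# so a run claiming to be t-w732-ix-195 can be checked against what the
# machine actually calls itself (e.g. T-W7-AWS-BASE).
def find_computername(log_path):
    pattern = re.compile(r"^\s*COMPUTERNAME=(\S+)", re.IGNORECASE)
    with open(log_path, errors="replace") as log:
        for line in log:
            match = pattern.match(line)
            if match:
                return match.group(1)
    return None

if __name__ == "__main__":
    name = find_computername(sys.argv[1])
    if name and name.upper() != "T-W732-IX-195":
        print("impostor: log reports COMPUTERNAME=%s" % name)
    else:
        print("COMPUTERNAME=%s" % name)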
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 3

3 years ago
Or, as an alternate fun possibility, the reimage back when 195 was both hardware and a lying AWS instance might have reimaged the hardware slave to think that its computername was T-W7-AWS-BASE, though probably not.

Comment 4

3 years ago
:philor, should we try to re-image the slave again?
Flags: needinfo?(philringnalda)
(Reporter)

Comment 5

3 years ago
Maybe?

The only information I have is that while t-w732-ix-195 is enabled in slavealloc, something with the slavename t-w732-ix-195 and the env var "computername" t-w7-aws-base takes jobs and fails talos jobs by trying to tell graphserver that its name is t-w7-aws-base, and since I disabled t-w732-ix-195 in slavealloc I haven't seen another instance of that.

I can imagine that part of the setup to have t-w7-aws-* lie and claim to be t-w732-ix-195 resulted in a reimage of the actual t-w732-ix-195 being broken and that that has been reverted and another reimage would fix it; I can imagine that it hasn't been reverted and another reimage won't fix it; I can imagine that rather than it being the actual t-w732-ix-195 which was failing jobs it was a t-w7-aws instance which is obeying the disabling of t-w732-ix-195 in slavealloc; I can imagine that it was a t-w7-aws instance but rather than obeying the disabling it just happened to have been terminated around the time I disabled t-w732-ix-195.
Flags: needinfo?(philringnalda) → needinfo?(rthijssen)

Comment 6

3 years ago
I have looked in the EC2 instance list and seen that there is an instance sharing the name "t-w732-ix-195". The instance ID is i-8454df32 and it has a moz-owner tag of q@mozilla.com. The instance state was 'stopped' at the time I checked, but I would guess that if it were started, it would create the sort of problems described above.

I believe the instance is probably being used to create the base image that we will later use to spawn golden and spot images. It would probably benefit from having its name changed to one that doesn't exist in slavealloc or buildbot-configs, but as it isn't my instance, I don't want to make that change, in case there are circumstances or reasons I haven't considered for the name to remain as is.
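
For reference, roughly how that lookup can be scripted with boto3 (an untested sketch; the region and credentials are assumptions, not something taken from this bug):

import boto3

# Look up any EC2 instance whose Name tag collides with the hardware
# slave's name, and show its id, state, and moz-owner tag.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.describe_instances(
    Filters=[{"Name": "tag:Name", "Values": ["t-w732-ix-195"]}]
)

for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
        print(instance["InstanceId"],
              instance["State"]["Name"],
              tags.get("moz-owner", "<no moz-owner tag>"))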
Flags: needinfo?(rthijssen)

Comment 7

3 years ago
@Q: would it be possible to change the name of the instance to one that doesn't exist in slavealloc or buildbot-configs?

Thanks.
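
Something along these lines (again an untested boto3 sketch; the replacement Name value below is only a placeholder, not a suggestion for the actual new name) would do the retag:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# i-8454df32 is the instance id from comment 6; "some-noncolliding-name"
# stands in for whatever name isn't in slavealloc or buildbot-configs.
ec2.create_tags(
    Resources=["i-8454df32"],
    Tags=[{"Key": "Name", "Value": "some-noncolliding-name"}],
)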
Flags: needinfo?(q)

Comment 8

3 years ago
Alin, I wiped out these instances and made sure that hostname is out of the testing loop. I am making sure that the real 195 is working today.
Flags: needinfo?(q)

Comment 9

3 years ago
195 is taking jobs back in scl3 and things look good so far.
Mostly green jobs at the moment. Marking as resolved for now.
Status: REOPENED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED

Updated

6 months ago
Product: Release Engineering → Infrastructure & Operations