Closed Bug 1445899 Opened 6 years ago Closed 6 years ago

Quarantine on gecko-t-osx-1010

Categories

(Infrastructure & Operations :: RelOps: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zfay, Unassigned)


Quarantined 2 OS X workers because their last 10+ jobs all ended as exceptions:

t-yosemite-r7-0208
t-yosemite-r7-189
Added these to quarantine as well. I also checked the logs, and most of them failed numerous tasks with a "max runtime exceeded" type of error:

t-yosemite-r7-159
t-yosemite-r7-171
t-yosemite-r7-177
t-yosemite-r7-180
t-yosemite-r7-181
t-yosemite-r7-195
t-yosemite-r7-196
t-yosemite-r7-208

t-yosemite-r7-0160
Depends on: 1445580
buildduty, can we quarantine the entire MDC2 set of gecko-t-osx-1010 TC generic-workers? These hosts were recently moved from SCL3 to MDC2. They should not be taking tasks until MDC2 is deemed 'production ready'. In fact, they are burning jobs in bug 1445580.

These hosts are:
t-yosemite-r7-(157-236).test.releng.mdc2.mozilla.com
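For anyone who needs the individual hostnames, the range above can be expanded with a short script. This is only a minimal sketch: the function name `expand_hosts` is hypothetical, and it assumes the range is inclusive on both ends and that the hostnames use the unpadded number exactly as written above.

```python
from typing import List


def expand_hosts(prefix: str, first: int, last: int, domain: str) -> List[str]:
    """Expand an inclusive numeric range into fully qualified hostnames.

    Assumption: hostnames use the plain (not zero-padded) number.
    """
    return [f"{prefix}{n}.{domain}" for n in range(first, last + 1)]


# Range and domain taken from the comment above.
hosts = expand_hosts("t-yosemite-r7-", 157, 236, "test.releng.mdc2.mozilla.com")

for host in hosts:
    print(host)
```

The resulting list could then be fed to whatever tool is used to quarantine workers one at a time.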
I was already working on it; I added these workers to quarantine:
t-yosemite-r7-169
t-yosemite-r7-229
t-yosemite-r7-190
t-yosemite-r7-0236
t-yosemite-r7-0234
t-yosemite-r7-0231
t-yosemite-r7-0200
t-yosemite-r7-0196
t-yosemite-r7-0195
t-yosemite-r7-0194
t-yosemite-r7-0193
t-yosemite-r7-0192
t-yosemite-r7-0186
t-yosemite-r7-0185
t-yosemite-r7-0184
t-yosemite-r7-0182
t-yosemite-r7-0179
t-yosemite-r7-0170
t-yosemite-r7-0168
t-yosemite-r7-0167
t-yosemite-r7-0166
t-yosemite-r7-0165
t-yosemite-r7-230
t-yosemite-r7-197
t-yosemite-r7-190

Something that caught my attention is the fact that you're saying that hosts were moved from scl3 to mdc2, but workers
t-yosemite-r7-0166
t-yosemite-r7-0192
t-yosemite-r7-0195
are currently in scl3.

I don't know whether that helps you. These 3 workers failed for the same reason as the others, and unfortunately I don't have a quick method at hand to quarantine all of the workers from 157 to 236.

I will keep watching the workers and checking the logs to see whether the problem persists, and will update the list of quarantined workers.
Also, add these to the list of quarantined workers:
t-yosemite-r7-235
t-yosemite-r7-233
t-yosemite-r7-231
t-yosemite-r7-228
t-yosemite-r7-221
t-yosemite-r7-220
t-yosemite-r7-215
t-yosemite-r7-214
t-yosemite-r7-209
t-yosemite-r7-206
t-yosemite-r7-204
t-yosemite-r7-202
t-yosemite-r7-201
t-yosemite-r7-198
t-yosemite-r7-194
t-yosemite-r7-187
t-yosemite-r7-186
t-yosemite-r7-184
t-yosemite-r7-183
t-yosemite-r7-179
t-yosemite-r7-176
t-yosemite-r7-172
t-yosemite-r7-170
t-yosemite-r7-168
t-yosemite-r7-167
t-yosemite-r7-164
t-yosemite-r7-163
t-yosemite-r7-162
t-yosemite-r7-161
t-yosemite-r7-158
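Since the lists in this bug mix zero-padded and unpadded worker IDs (for example `t-yosemite-r7-0208` and `t-yosemite-r7-208`), a small script can help check which machines in the 157-236 range are still missing from the quarantine list. This is a rough sketch, not an existing tool: the helper names are hypothetical, and it assumes that a padded and an unpadded ID with the same number refer to the same machine, which this bug does not confirm.

```python
import re
from typing import Iterable, List

# Assumption: "t-yosemite-r7-0208" and "t-yosemite-r7-208" name the same machine.
WORKER_RE = re.compile(r"^t-yosemite-r7-0*(\d+)$")


def worker_number(worker_id: str) -> int:
    """Return the numeric part of a worker ID, ignoring leading zeros."""
    match = WORKER_RE.match(worker_id.strip())
    if match is None:
        raise ValueError(f"unexpected worker ID: {worker_id!r}")
    return int(match.group(1))


def missing_from_range(quarantined: Iterable[str], first: int, last: int) -> List[int]:
    """List the numbers in [first, last] that have no quarantined worker."""
    done = {worker_number(w) for w in quarantined}
    return [n for n in range(first, last + 1) if n not in done]


# A few IDs copied from the comments in this bug, as a sample input.
sample = [
    "t-yosemite-r7-0208",
    "t-yosemite-r7-189",
    "t-yosemite-r7-208",
    "t-yosemite-r7-0160",
]
remaining = missing_from_range(sample, 157, 236)
print(remaining)
```

Pasting the full quarantine lists into `sample` would show at a glance which workers in the range still need attention.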
Quarantined all t-yosemite-r7 from 157-236.
(In reply to Bogdan Crisan [:bogdancrisan] from comment #3)

> Something that caught my attention is the fact that you're saying that hosts
> were moved from scl3 to mdc2, but workers
> t-yosemite-r7-0166
> t-yosemite-r7-0192
> t-yosemite-r7-0195
> are currently in scl3.
> 
> I don't know whether that helps you. These 3 workers failed for the same
> reason as the others, and unfortunately I don't have a quick method at hand
> to quarantine all of the workers from 157 to 236.
> 
> I will keep watching the workers and checking the logs to see whether the
> problem persists, and will update the list of quarantined workers.

I suspect those workers 'think' they are in SCL3 but are actually in MDC2 and just hadn't been renamed and reimaged. When they were physically racked, cabled and powered on, they picked up right where they left off and started taking jobs right away. (Not good)

Since 03/15, they should all have been renamed and reimaged (by :van). Therefore, if you are still seeing jobs from that range on workers that report being in SCL3, the reimage may not have completed for that subset of hosts. Please file a bug if that is the case. Thanks!
Quarantined t-yosemite-r7-189. Its last 10+ jobs ended as exceptions, and other jobs failed with "max run time exceeded".
Quarantined (again) all t-yosemite-r7 workers in mdc2.
(In reply to Jake Watkins [:dividehex] from comment #2)
> buildduty, can we quarantine the entire MDC2 set of gecko-t-osx-1010 TC
> generic-workers? These hosts were recently moved from SCL3 to MDC2. They
> should not be taking tasks until MDC2 is deemed 'production ready'. In
> fact, they are burning jobs in bug 1445580.
> 
> These hosts are:
> t-yosemite-r7-(157-236).test.releng.mdc2.mozilla.com

I'd like to start taking these hosts out of quarantine now that pypi is accessible in MDC2 (see bug 1446176). Can we enable a handful of these hosts and keep a close eye on them for a few days? If they complete tasks without issue, we can then enable the rest of them.

Let's start by enabling t-yosemite-r7-(157-170).test.releng.mdc2
They appear to have been taken out of quarantine prematurely; they are all enabled.
Also, the problem this bug was created for has been solved (the machines in mdc2 don't burn tests anymore), so I am closing this bug for now.
If any builds/tests fail with "max runtime exceeded", please reopen this bug and say which worker(s) are burning tests.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED