Bug 1458774: Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 and t-yosemite-r7-240
Status: RESOLVED FIXED
Opened: 7 years ago · Closed: 7 years ago
Categories: Infrastructure & Operations :: RelOps: General (task)
Reporter: philor · Assignee: Unassigned
Description • Reporter • 7 years ago
It fails to find vcversioner, like https://treeherder.mozilla.org/logviewer.html#?job_id=176604968&repo=mozilla-beta, on pretty much every job it runs (I presume the talos jobs don't require it, and that's why they pass and everything else fails).
Comment 1 • Reporter • 7 years ago
Does quarantining not actually work? Pretty much every time I turn around, this thing is back out of quarantine and just as broken as always.
Updated • Reporter • 7 years ago
Blocks: 1459391
Summary: Quarantine on t-yosemite-r7-241 → Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415
Comment 2 • 7 years ago
Rechecked the workers: t-yosemite-r7-415 still had the quarantine on, but t-yosemite-r7-241 didn't, so I quarantined it again.
:philor, are you okay with my reimaging t-yosemite-r7-241 and then removing the quarantine on both of these?
t-yosemite-r7-415 was reimaged on May 4th, so I think it is ready to be put back into service.
For the failure on t-yosemite-r7-241:
"Could not find suitable distribution for Requirement.parse('vcversioner')"
It looks related to pip's problem finding six in bug 1449350. I think it is a problem with the package(s) and not this hardware. I'm not sure why it only appeared on one machine, t-yosemite-r7-241; if it could be triggered by the environment, then I'd like to reimage #241 also.
Flags: needinfo?(philringnalda)
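For context on the error text: "Could not find suitable distribution for Requirement.parse('vcversioner')" is what setuptools' easy_install machinery prints while resolving a package's setup_requires at build time, a code path that fetches from the index on its own rather than through pip. A minimal sketch of the kind of setup.py that triggers it; the package name is made up, and the vcversioner keyword is as shown in that project's documentation:

# Sketch of a setup.py whose build pulls in vcversioner via setup_requires.
# Resolving setup_requires goes through setuptools' easy_install code path,
# which fails with "Could not find suitable distribution for
# Requirement.parse('vcversioner')" when no configured index serves it.
from setuptools import setup

setup(
    name="example-package",          # hypothetical package name
    setup_requires=["vcversioner"],  # resolved by easy_install at build time
    vcversioner={},                  # vcversioner's setup() keyword, per its docs
)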
Comment 4 • Reporter • 7 years ago
The order of events for 415 goes the other way: it wasn't broken and then fixed by a reimage on May 4th; it was reimaged on May 4th and broken as a result. A ridiculous amount of our infra prints datetimes without admitting what timezone it is using, but I think the first time it failed was probably at 22:22:31 Pacific on May 4th. At any rate, it started failing somewhere between last Friday afternoon and late evening, and it continued failing throughout the weekend, up until Tuesday morning when I saw we'd filed a broken machine as an intermittent failure. "Fixed by a reimage May 4th" is not possible.
I don't know much about pip dependencies, and even less about pypi.pvt.b.m.o, but I'd interpret the results we're getting, given that there have been a bunch of bugs-and-pushes lately about not falling back from pypi.pvt to pypi.python.org, as "we have a dependency on vcversioner, but we don't actually have it on pypi.pvt, so working machines are reusing a cached copy from back when they could download it from pypi.p.o, and newly reimaged machines are broken because they can't get it there." (Though I should note that my first guesses on things where I start with a disclaimer about not knowing much about the details tend to be wildly wrong at least as often as they are close to right.)
Flags: needinfo?(philringnalda)
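If the cached-copy theory holds, it should be directly checkable: a still-working machine would have a vcversioner artifact somewhere in its local pip cache, while a freshly reimaged one wouldn't. A rough sketch; the cache location is the macOS default and an assumption about how these workers are set up:

import fnmatch
import os

# Look for any cached vcversioner artifact under the default macOS pip
# cache (path is an assumption). Note pip's HTTP cache stores entries
# under opaque hashed names, so only cached wheels are matchable by name.
cache = os.path.expanduser("~/Library/Caches/pip")
for dirpath, _dirnames, filenames in os.walk(cache):
    for name in fnmatch.filter(filenames, "*vcversioner*"):
        print(os.path.join(dirpath, name))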
Comment 5 • 7 years ago
(In reply to Phil Ringnalda (:philor) from comment #4)
> The order of events for 415 goes the other way: it wasn't broken and then
> fixed by a reimage on May 4th; it was reimaged on May 4th and broken as a
> result. A ridiculous amount of our infra prints datetimes without admitting
> what timezone it is using, but I think the first time it failed was probably
> at 22:22:31 Pacific on May 4th. At any rate, it started failing somewhere
> between last Friday afternoon and late evening, and it continued failing
> throughout the weekend, up until Tuesday morning when I saw we'd filed a
> broken machine as an intermittent failure. "Fixed by a reimage May 4th" is
> not possible.
Could you share some links (if you have them on-hand; don't worry if not) for the failures on t-yosemite-r7-415? I don't know how to find those except by looking in papertrail or checking taskcluster jobs one by one. The other problem seen on the #415 machine, for bug 1405083, which prompted the reimaging, was a (likely keychain-related) timeout.
> I don't know much about pip dependencies, and even less about
> pypi.pvt.b.m.o, but I'd interpret the results we're getting, given that
> there have been a bunch of bugs-and-pushes lately about not falling back
> from pypi.pvt to pypi.python.org, as "we have a dependency on vcversioner,
> but we don't actually have it on pypi.pvt, so working machines are reusing
> a cached copy from back when they could download it from pypi.p.o, and
> newly reimaged machines are broken because they can't get it there."
> (Though I should note that my first guesses on things where I start with a
> disclaimer about not knowing much about the details tend to be wildly wrong
> at least as often as they are close to right.)
pypi.p.o/caching would explain why we started seeing this failure recently and for multiple packages, and why I can manually reproduce it on the macs I've spot-checked (other macs in mdc1/mdc2: t-yosemite-r7-{271,471}). The pip install works from the two linux (moonshot) machines I tested (one newly built today, one 4 months old), but the packages may be different, and we have a different openssl (so it may not be failing over to pypi.pvt for the packages).
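One way to test the missing-from-the-mirror theory directly is to ask each index for its PEP 503 "simple" page for vcversioner. A sketch; the private-mirror URL below is only an assumed expansion of pypi.pvt.b.m.o, so substitute the real one:

try:
    from urllib2 import urlopen, HTTPError  # Python 2, as on these workers
except ImportError:
    from urllib.request import urlopen
    from urllib.error import HTTPError

# "pypi.pvt.build.mozilla.org" is an assumed hostname, not confirmed here.
# An unreachable host raises URLError, which this sketch doesn't catch.
for base in ("https://pypi.python.org/simple",
             "https://pypi.pvt.build.mozilla.org/simple"):
    url = "%s/vcversioner/" % base
    try:
        urlopen(url)
        print("%s: found" % url)
    except HTTPError as err:
        print("%s: HTTP %d (not served here)" % (url, err.code))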
Comment 6 • Reporter • 7 years ago
The non-deprecated source for links is https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-04&endday=2018-05-09&tree=trunk&bug=1459391 but it's annoying by not having machine names (bug 1445527) and by not realizing that trees other than trunk exist (bug 1449513), so I still prefer https://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=all&startday=2018-05-04&endday=2018-05-09&bugid=1459391 plus a sort on the Test machine column.
Comment 7 • Reporter • 7 years ago
Mmm, and Orange Factor says t-yosemite-r7-240 is freshly broken, too. Quarantined.
Summary: Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 → Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 and t-yosemite-r7-240
Comment 8 • 7 years ago
(In reply to Phil Ringnalda (:philor) from comment #6)
> The non-deprecated source for links is
> https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-04&endday=2018-05-09&tree=trunk&bug=1459391
> but it's annoying by not having machine names (bug 1445527) and by not
> realizing that trees other than trunk exist (bug 1449513), so I still prefer
> https://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=all&startday=2018-05-04&endday=2018-05-09&bugid=1459391
> plus a sort on the Test machine column.
Thank you!
Comment 9 • 7 years ago
Rechecked the workers: all of them were out of quarantine again. Rechecked the live logs of the failed tasks to confirm the problem persists, and it does.
Quarantined all 3 again.
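Since the quarantines keep dropping off, for reference this is roughly how one is applied through the TaskCluster Queue API's quarantineWorker method. A sketch using the Python taskcluster client; the provisioner, worker type, and worker group values are placeholders, not the real ones for this pool:

import datetime
import taskcluster

# Credentials and root URL are read from TASKCLUSTER_* environment variables.
queue = taskcluster.Queue(taskcluster.optionsFromEnvironment())
until = (datetime.datetime.utcnow() + datetime.timedelta(days=365)).strftime(
    "%Y-%m-%dT%H:%M:%SZ")
for worker_id in ("t-yosemite-r7-240", "t-yosemite-r7-241", "t-yosemite-r7-415"):
    queue.quarantineWorker(
        "releng-hardware",   # placeholder provisionerId
        "gecko-t-osx-1010",  # placeholder workerType
        "mdc1",              # placeholder workerGroup
        worker_id,
        {"quarantineUntil": until},
    )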
Comment 10 • Reporter • 7 years ago
Fixed by bug 1459391
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED