Bug 1458774 (Closed), Opened 7 years ago, Closed 7 years ago

Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 and t-yosemite-r7-240

Categories

(Infrastructure & Operations :: RelOps: General, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References


Details

It fails to find vcversioner, as in https://treeherder.mozilla.org/logviewer.html#?job_id=176604968&repo=mozilla-beta, on pretty much every job it runs (I presume the talos jobs don't require it, which is why they pass while everything else fails).
Does quarantining not actually work? Pretty much every time I turn around, this machine is back out of quarantine and just as broken as ever.
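Since the failing step is a pip dependency resolution, one cheap first check is whether the package index actually serves a vcversioner entry on its PEP 503 "simple" page. A hedged sketch in Python; the internal-mirror hostname is my guess at expanding pypi.pvt.b.m.o, not something confirmed in this bug:

import urllib.request

# Index URLs to probe. The second hostname is an assumed expansion of
# "pypi.pvt.b.m.o" and may not be the real internal mirror address.
INDEXES = [
    "https://pypi.python.org/simple/vcversioner/",
    "https://pypi.pvt.build.mozilla.org/simple/vcversioner/",
]

for url in INDEXES:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            print(url, "-> HTTP", resp.status)
    except Exception as exc:
        # A 404 here would mean the index has no vcversioner entry at all,
        # matching the "could not find suitable distribution" failures.
        print(url, "->", exc)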
Blocks: 1459391
Summary: Quarantine on t-yosemite-r7-241 → Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415
Rechecked the workers: t-yosemite-r7-415 had the quarantine on, but t-yosemite-r7-241 didn't, so I quarantined it again.
:philor, are you okay with my reimaging t-yosemite-r7-241 and then removing the quarantine on both of these? t-yosemite-r7-415 was reimaged on May 4th, so I think it is ready to be put back into service.

For the failure on t-yosemite-r7-241, "Could not find suitable distribution for Requirement.parse('vcversioner')" looks related to pip's problem finding six in bug 1449350. I think it is a problem with the package(s) and not with this hardware. I'm not sure why it only appeared on one machine, t-yosemite-r7-241; if it could be triggered by the environment, then I'd like to reimage #241 as well.
Flags: needinfo?(philringnalda)
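For context on the error text: the "Requirement.parse('vcversioner')" phrasing comes from setuptools/pkg_resources, which is part of why this looks like dependency resolution rather than hardware. A minimal sketch (assuming a machine without vcversioner installed) of how such a requirement fails to resolve:

import pkg_resources

# Parse the same requirement string that appears in the job logs and try
# to resolve it against the installed distributions; on a machine without
# vcversioner this raises DistributionNotFound.
req = pkg_resources.Requirement.parse("vcversioner")
try:
    pkg_resources.working_set.resolve([req])
except pkg_resources.DistributionNotFound as exc:
    print("unresolvable:", exc)

This is illustrative only, not the actual harness code from the failing jobs.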
The order of events for 415 goes the other way: it wasn't broken and then fixed by a reimage on May 4th; it was reimaged on May 4th and broken as a result. A ridiculous amount of our infra prints datetimes without admitting what timezone it is using, but I think the first time it failed was probably at 22:22:31 Pacific on May 4th. At any rate, it started failing somewhere between last Friday afternoon and late evening, and it continued failing throughout the weekend, up until Tuesday morning when I saw we'd filed a broken machine as an intermittent failure. "Fixed by a reimage May 4th" is not possible.

I don't know much about pip dependencies, and even less about pypi.pvt.b.m.o, but given that there have been a bunch of bugs-and-pushes lately about not falling back from pypi.pvt to pypi.python.org, I'd interpret the results we're getting as: we have a dependency on vcversioner, but we don't actually have it on pypi.pvt, so working machines are reusing a cached copy from back when they could download it from pypi.p.o, and newly reimaged machines are broken because they can't get it there. (Though I should note that my first guess on things where I start with a disclaimer about not knowing much tends to be wildly wrong at least as often as it is close to right.)
Flags: needinfo?(philringnalda)
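If the "stale cached copy" theory is right, a vcversioner artifact should still be sitting in pip's wheel cache on the machines that keep passing. A sketch under the assumption that these workers use pip's default macOS cache location:

import os

# Default pip cache root on macOS (an assumption about these workers).
# Built wheels keep their real filenames under hashed subdirectories,
# so a simple name scan is enough to spot a cached vcversioner.
cache_root = os.path.expanduser("~/Library/Caches/pip")
for dirpath, _dirnames, filenames in os.walk(cache_root):
    for name in filenames:
        if "vcversioner" in name.lower():
            print(os.path.join(dirpath, name))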
(In reply to Phil Ringnalda (:philor) from comment #4)
> The order of events for 415 goes the other way: it wasn't broken and then
> fixed by a reimage on May 4th; it was reimaged on May 4th and broken as a
> result. [...] "Fixed by a reimage May 4th" is not possible.

Could you share some links (if you have them on hand; don't worry if not) for the failures on t-yosemite-r7-415? I don't know how to find those except by looking in papertrail or at taskcluster jobs one by one. The other problem seen on the #415 machine, a (likely keychain-related) timeout for bug 1405083, is what prompted the reimaging.

> I don't know much about pip dependencies, and even less about
> pypi.pvt.b.m.o, but [...] I'd interpret the results we're getting as: we
> have a dependency on vcversioner, but we don't actually have it on
> pypi.pvt, so working machines are reusing a cached copy from back when
> they could download it from pypi.p.o, and newly reimaged machines are
> broken because they can't get it there.

pypi.p.o/caching would explain why we started seeing this failure recently and for multiple packages. It would also explain why I can manually reproduce it on the macs I've spot-checked (other macs in mdc1/mdc2: t-yosemite-r7-{271,471}). The pip install works from the two Linux (moonshot) machines I tested (one newly built today, one 4 months old), but the packages may be different, and we have a different openssl (so it may not be failing over to pypi.pvt for the packages).
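The openssl difference is easy to compare from the Python that actually runs pip: if the macs link an OpenSSL too old for the TLS version the public index requires, the fallback to pypi.p.o would fail there while the moonshots keep working. A quick hedged check (the version strings in the comments below are examples, not readings from these machines):

import ssl
import sys

# Report the interpreter and the OpenSSL it links against; stock macOS
# system Pythons of this era often report OpenSSL 0.9.8, which cannot
# negotiate TLS 1.2.
print(sys.version)
print(ssl.OPENSSL_VERSION)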
The non-deprecated source for links is https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-04&endday=2018-05-09&tree=trunk&bug=1459391 but it's annoying in that it has no machine names (bug 1445527) and doesn't realize that trees other than trunk exist (bug 1449513), so I still prefer https://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=all&startday=2018-05-04&endday=2018-05-09&bugid=1459391 plus a sort on the Test machine column.
Mmm, and Orange Factor says t-yosemite-r7-240 is freshly broken, too. Quarantined.
Summary: Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 → Quarantine on t-yosemite-r7-241 and t-yosemite-r7-415 and t-yosemite-r7-240
(In reply to Phil Ringnalda (:philor) from comment #6)
> The non-deprecated source for links is
> https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2018-05-04&endday=2018-05-09&tree=trunk&bug=1459391
> [...] so I still prefer
> https://brasstacks.mozilla.com/orangefactor/?display=Bug&tree=all&startday=2018-05-04&endday=2018-05-09&bugid=1459391
> plus a sort on the Test machine column.

Thank you!
Rechecked the workers. All of them were out of quarantine; I rechecked the live logs to see if the problem persists in the failed tasks, and it does. Quarantined all 3 again.
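Since the quarantines keep not sticking, a way to spot-check the current state is the Taskcluster queue's getWorker endpoint, which reports quarantineUntil. A hedged sketch only; the provisioner, worker-type, and worker-group values below are assumptions, not taken from this bug:

import json
import urllib.request

BASE = "https://queue.taskcluster.net/v1"
PROVISIONER = "releng-hardware"   # assumption
WORKER_TYPE = "gecko-t-osx-1010"  # assumption
WORKER_GROUP = "mdc1"             # assumption

for worker_id in ("t-yosemite-r7-240", "t-yosemite-r7-241", "t-yosemite-r7-415"):
    url = ("%s/provisioners/%s/worker-types/%s/workers/%s/%s"
           % (BASE, PROVISIONER, WORKER_TYPE, WORKER_GROUP, worker_id))
    with urllib.request.urlopen(url, timeout=10) as resp:
        info = json.load(resp)
    # quarantineUntil is absent (or in the past) for unquarantined workers.
    print(worker_id, info.get("quarantineUntil", "not quarantined"))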
Fixed by bug 1459391
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED