Closed Bug 622308 Opened 14 years ago Closed 14 years ago

Frequent Talos hangs

Categories

(Release Engineering :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

Details

Starting with http://hg.mozilla.org/mozilla-central/rev/39db16b78175 (a Windows-only change, though the failures were on Linux), we've been having Talos hangs, mostly on Linux, on both mozilla-central and TraceMonkey: 2 Linux and 1 Linux64 on that push; 1 Linux and 2 Linux64 on the next push; 1 Linux64 on the push after that, which was a comment change to trigger another build; and 1 Linux and 1 Windows on the only TraceMonkey push of the day.

joduinn mentioned that some slaves had come back from staging today, which got me looking at recent builds for the slaves involved:

talos-r3-fed64-023 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-023
talos-r3-fed-012 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-012
talos-r3-fed64-027 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-027
talos-r3-fed64-053 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-053
talos-r3-fed-038 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-038
talos-r3-fed-024 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-024
talos-r3-fed64-039 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-039
talos-r3-fed-044 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-044
talos-r3-w7-007 - https://build.mozilla.org/buildapi/recent/talos-r3-w7-007

Every one of them had a big gap before today. For some the gap only goes back to the 17th or 25th, but for several it goes back to October or November, so I suspect these are the slaves that just came back, and that they weren't as healthy as they seemed while they were hanging out in staging.
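For anyone who wants to repeat that check across more slaves, here is a minimal sketch of the gap comparison. It assumes the build start times for each slave have already been collected by hand from the recent-builds pages. The function names, the 7-day threshold, and the sample data are all hypothetical, and nothing here calls buildapi itself.

```python
from datetime import datetime, timedelta


def longest_idle_gap(start_times):
    """Return (gap, resumed_at) for the longest gap between consecutive builds.

    Returns None when there are fewer than two builds to compare.
    """
    ordered = sorted(start_times)
    if len(ordered) < 2:
        return None
    return max(
        ((later - earlier, later) for earlier, later in zip(ordered, ordered[1:])),
        key=lambda pair: pair[0],
    )


def recently_returned(builds_by_slave, min_gap=timedelta(days=7)):
    """Map each slave whose longest idle gap is at least min_gap to that gap.

    min_gap is an arbitrary illustrative threshold, not a known policy.
    """
    flagged = {}
    for slave, start_times in builds_by_slave.items():
        result = longest_idle_gap(start_times)
        if result is not None and result[0] >= min_gap:
            flagged[slave] = result
    return flagged


# Hypothetical data for illustration only; not copied from buildapi.
example = {
    "example-slave-a": [
        datetime(2010, 11, 2, 9),
        datetime(2010, 12, 31, 14),
        datetime(2010, 12, 31, 16),
    ],
    "example-slave-b": [
        datetime(2010, 12, 30, 8),
        datetime(2010, 12, 31, 9),
        datetime(2010, 12, 31, 15),
    ],
}

for slave, (gap, resumed_at) in recently_returned(example).items():
    print(f"{slave}: idle for {gap.days} days, resumed at {resumed_at}")
```

A slave flagged this way is only a candidate for "just came back from staging". The gap says nothing about why it was idle.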
None of the affected slaves has yet repeated exactly the same test on the same branch (the closest is https://build.mozilla.org/buildapi/recent/talos-r3-fed-024 doing svg on whatever shadow-central is), but at least 8 of them have gone on to complete another Talos run successfully on the same branch. I don't have a theory for what sort of leftover something could cause such a thing, but it's possible that they only fail on their first Talos run after coming back to production.
Six pushes without seeing any more, so perhaps we're out of the woods (or perhaps on Monday morning we'll have 30 pushes going at once before anyone notices that the first 10 are failing, who knows?).
Severity: blocker → normal
I can't imagine anyone being able to do anything useful at this point with "a bunch of slaves that just came back to production failed once on their first Talos run 5 days ago, and then were fine."
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering