Starting with http://hg.mozilla.org/mozilla-central/rev/39db16b78175 (which was Windows-only but failed on Linux) we've been having Talos hangs, mostly on Linux, and on both mozilla-central and TraceMonkey: 2 Linux and 1 Linux64 on that push, 1 Linux and 2 Linux64 on the next push, 1 Linux64 on the next push which was a comment change to trigger another build, 1 Linux and 1 Windows on the only TraceMonkey push of the day.

joduinn mentioned that some slaves had come back from staging today, which got me looking at recent builds for the slaves involved:

talos-r3-fed64-023 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-023
talos-r3-fed-012 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-012
talos-r3-fed64-027 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-027
talos-r3-fed64-053 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-053
talos-r3-fed-038 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-038
talos-r3-fed-024 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-024
talos-r3-fed64-039 - https://build.mozilla.org/buildapi/recent/talos-r3-fed64-039
talos-r3-fed-044 - https://build.mozilla.org/buildapi/recent/talos-r3-fed-044
talos-r3-w7-007 - https://build.mozilla.org/buildapi/recent/talos-r3-w7-007

Every one of them had a big gap before today, some only since the 17th or 25th, but several since October or November, so I suspect these are those slaves, and that they weren't as healthy as they seemed while they were hanging out in staging.
Uh oh, https://build.mozilla.org/buildapi/recent/talos-r3-fed64-040 doesn't fit with the slave pattern, though http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1293842455.1293849865.20936.gz fits with the hang pattern.
Though none of the affected slaves has yet done exactly the same test on the same branch (the closest is https://build.mozilla.org/buildapi/recent/talos-r3-fed-024 doing svg on whatever shadow-central is), at least 8 of them have gone on to successfully do another talos run on the same branch, so although I don't have a theory for what sort of leftover something could cause such a thing, it's possible that they are only failing on their first talos run after coming back to production.
Six pushes without seeing any more; perhaps we're out of the woods (or perhaps Monday morning we'll have 30 pushes going at once before anyone notices that the first 10 are failing, who knows?).
Severity: blocker → normal
I can't imagine anyone being able to do anything useful at this point with "a bunch of slaves that just came back to production failed once on their first Talos run 5 days ago, and then were fine."
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering