Closed Bug 1317189 Opened 3 years ago Closed 2 years ago
talos --rebuild option stopped working
59 bytes, text/x-review-board-request
For a couple of months now I have been testing the performance of my branch using the following talos run:
./mach try -b o -p linux64,macosx64,win64 -u none[x64,10.10,Windows\ 8] -t other[x64,10.10,Windows\ 8],other-e10s[x64,10.10,Windows\ 8] --rebuild 20
Historically, this always worked. I got builds like:
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab90334d93d8
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=79facb824200
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=248c297a129b
but over the last two days the builds get 20 rebuilds for Linux, but just one for Mac and Windows:
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=6544a957e60e64fa97a11e293e17af02c1d1fd22
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=7abee73aa6672ef7528ed4d6345138a50239c74c
or, like here, 20 builds for Windows, 20 builds for Mac e10s, but only 1 for non-e10s Mac:
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0e752b48e4c499d61f49065ce2c585ec4735d1f
Armen, possibly related to bug 1316976?
adding jobs to that build doesn't work either. I tried to add more talos-other and it never happened.
I did respin talos-other today, and it turned out to kick 1+20 new ones, though there are still pending ones on https://firstname.lastname@example.org
More examples:
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=6544a957e60e64fa97a11e293e17af02c1d1fd22 - Windows stuck, macOS e10s stuck
- https://treeherder.mozilla.org/#/jobs?repo=try&revision=13cbd8a4e42c81516f7a2a3c2887865ad0b1a925 - Windows and Linux stuck, macOS done
etc. Can we get some help with this? I'm running a lot of perf tests right now and this bug is making it really hard to work.
More recent updates: Linux stuck, Windows and OSX done: https://treeherder.mozilla.org/#/jobs?repo=try&revision=584c35d5187b However, it was able to rebuild on Linux a week before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6c7d834929a76ff701671a9d0474290d188f1132
:bstack, would you be able to help us figure out why this wouldn't be working on linux (i.e. taskcluster) ?
Sorry I've let this languish all day today. Had some other stuff I needed to look into first. Afaict, this isn't related to our recent work in triggering talos from treeherder. This would most likely be an in-tree taskgraph generation issue. I'll look into this a bit and defer to someone more wise in the ways of in-tree stuff if I can't find anything awry.
Assignee: nobody → bstack
Status: NEW → ASSIGNED
I'm at a bit of a loss. I don't think I really have the context here to figure out what's going on. wlach, is this related to the work you're doing now?
No, this isn't really related to anything I'm doing. I don't really see why this would be taskcluster related, at least not fully, as apparently the problem goes back 3 months (long before we used buildbotbridge to schedule the linux talos jobs). If the problems were linux-specific and were more recent, :wcosta would be the person I'd ping (he was doing most of the work for linux talos and BBB). :catlee, do you know who might be able to debug this? They would need to know about buildbot and how try syntax translates into talos jobs being scheduled.
Flags: needinfo?(wlachance) → needinfo?(catlee)
Assignee: bstack → nobody
Status: ASSIGNED → NEW
I think --rebuild support is something that trigger-bot handles. Chris, can you help out here? http://chmanchester.github.io/blog/2015/07/15/automatic-triggering-on-try-server/
Flags: needinfo?(catlee) → needinfo?(cmanchester)
It's pretty unclear to me what the issue is here, or which jobs it impacts, so I pushed to try with `--rebuild` for Linux and OS X: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55dfc7a5b6514581601e5472ea73f880a822cdc3 https://treeherder.mozilla.org/#/jobs?repo=try&revision=a45930154c6a16225bbbe23e9dd7ec7c882f2de9 This is working as expected for buildbot jobs, which are triggered by trigger-bot, and taskcluster jobs, which are triggered by a different mechanism. People reporting this issue refer to jobs being "stuck" -- perhaps this refers to some re-triggered jobs being in pending for an apparently unreasonable amount of time?
I think the symptom is more like this: when pushing Talos with --rebuild for multiple platforms at once, tests might get stuck. Pushing with a single platform seems fine.
I can't seem to reproduce this, trying again in https://treeherder.mozilla.org/#/jobs?repo=try&revision=123655847133d9f2b757770ad6729effb0753f26
AFAICT this blocks us from evaluating stylo changes on Linux. For example, in a recent try push we saw retriggers for win and mac, but not linux. This seems to imply the feature is broken on taskcluster but not buildbot. chmanchester or wlach, can you take a closer look at this? https://treeherder.mozilla.org/#/jobs?repo=try&revision=22028266be5e4485a959d44b1619c7e3d3f80dfa
I'm sorry, don't think I can help (this is even less my area now than it was a few months ago). If Chris doesn't know what's up, I would escalate to :garndt and/or :jmaher.
I think I figured this out. It's the difference between "--rebuild" and "--rebuild-talos", the former works fine on TC, the latter as implemented in bug 1333167 does not seem to work, but I think I see the issue.
Assignee: nobody → cmanchester
Actually, based on the links in comment 0, this bug was filed about "--rebuild", where the issue still doesn't reproduce. I'll repurpose it to fix "--rebuild-talos" unless there are any objections.
Comment on attachment 8866058 [details] Bug 1317189 - Fix --rebuild-talos for TC try jobs by checking the correct attribute. https://reviewboard.mozilla.org/r/137654/#review141070
Attachment #8866058 - Flags: review?(wcosta) → review+
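For readers unfamiliar with this class of bug, here is a minimal sketch of how a rebuild count can silently fail to apply when the selection logic checks the wrong task attribute. This is not the actual patch: the `Task` class, the `apply_talos_rebuild` helper, and the attribute names `kind` and `suite` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a task in a generated task graph."""
    label: str
    attributes: dict = field(default_factory=dict)
    rebuild: int = 0


def apply_talos_rebuild(tasks, rebuild_count, attribute_name):
    """Set the rebuild count on every task whose given attribute is 'talos'."""
    for task in tasks:
        if task.attributes.get(attribute_name) == "talos":
            task.rebuild = rebuild_count
    return tasks


def make_tasks():
    # Hypothetical attributes: 'kind' says "test" for every test task,
    # while 'suite' is what actually identifies a talos job.
    return [
        Task("linux64-talos-other", {"kind": "test", "suite": "talos"}),
        Task("linux64-mochitest", {"kind": "test", "suite": "mochitest"}),
    ]


# Checking an attribute that never equals 'talos' matches nothing,
# so no task is rebuilt and no error is raised.
wrong = apply_talos_rebuild(make_tasks(), 20, "kind")

# Checking the attribute that identifies the suite applies the count
# to the talos task only.
right = apply_talos_rebuild(make_tasks(), 20, "suite")
```

In this sketch the wrong check fails quietly rather than loudly, which would be consistent with the symptom reported here: jobs run once and the extra rebuilds never appear.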
possibly bug 1352202 is a dup?
Pushed by email@example.com: https://hg.mozilla.org/integration/autoland/rev/d27e83aae737 Fix --rebuild-talos for TC try jobs by checking the correct attribute. r=wcosta