Closed Bug 1277666 Opened 4 years ago Closed 4 years ago
taskcluster and mozharness disagree on rank
For buildbot jobs, mozharness uses the pushdate as its rank (eg: an epoch time like 1464872845), while taskcluster appears to use the pushlog_id (eg: a value like 30306). This means a buildbot job will always win out in the index, even over a newer taskcluster job. If we turn off a buildbot job, presumably the last-built job on that platform will always be the "latest" in the index. I think we'll probably need to change taskcluster's rank to match buildbot's.
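To make the mismatch concrete, here is a minimal sketch of the behaviour described above. `ToyIndex` and its methods are hypothetical names for illustration, not the Taskcluster index API, and the replacement rule (highest rank wins, ties go to the newer insert) is an assumption based on this report. The two rank values are the examples quoted in this bug.

```python
# Toy model of one index namespace. ToyIndex is a hypothetical stand-in,
# not the real Taskcluster index service.
class ToyIndex:
    def __init__(self):
        self.entries = {}  # namespace -> (rank, task_label)

    def insert(self, namespace, rank, task_label):
        current = self.entries.get(namespace)
        # Assumption: a task replaces the stored one only if its rank is
        # at least as high as the stored rank.
        if current is None or rank >= current[0]:
            self.entries[namespace] = (rank, task_label)

    def latest(self, namespace):
        return self.entries[namespace][1]


index = ToyIndex()
route = "gecko.v2.mozilla-central.latest.firefox.linux-debug"

# Buildbot job ranked by pushdate (epoch seconds).
index.insert(route, 1464872845, "older buildbot build")
# Newer Taskcluster task ranked by pushlog_id.
index.insert(route, 30306, "newer taskcluster build")

# The buildbot build is still reported as "latest".
print(index.latest(route))
```

Because any epoch-based rank is several orders of magnitude larger than a pushlog_id, the second insert can never displace the first under this rule.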
I'm not sure who to check with here, but it looks like pmoore might have reviewed the initial implementation.
Hey Mike, Do you have an example index route that both buildbot (via mozharness) and taskcluster publish to? I think if we can engineer the buildbot tasks and taskcluster tasks to use different index routes, we can avoid the rank competition between the two systems. This also then would be more explicit (so you would know from the route if you are looking at the result of a buildbot job or a taskcluster task). However, I see https://docs.taskcluster.net/manual/devel/namespaces#indexes suggests not using 'buildbot' in the route - so ni'ing dustin on this as I'm not sure why we share the same namespace for both systems.
Flags: needinfo?(pmoore) → needinfo?(dustin)
So looking into it more, I see that https://docs.taskcluster.net/manual/devel/namespaces#indexes says:

> gecko.v1 - another "old" index for builds; do not use
> gecko.v2.<tree>.revision.<revision>.<platform>.<build>, gecko.v2.<tree>.latest.<platform>.<build> - Index for Gecko build jobs, either by revision or for the latest job with the given platform and build. These are the responsibility of the release engineering team.

My concerns are:
1) We say that 'gecko.v1' should not be used, however, many in-tree configs are still using it. If we say we shouldn't use it, we should migrate away from it, IMHO.
2) the gecko.v2 routes do not include the build system used to publish them (e.g. buildbot or taskcluster) so indeed these two systems are colliding, hence this bug.

Imagine you have a tool which queries the latest build for a given criteria, from the index. If we have BB and TC jobs running in parallel, you could get a TC or BB build one time, and a different type another time, just depending on whether TC or BB won the latest race (assuming both jobs are running). This seems dangerous and unpredictable. IMO it would be much better and cleaner if BB and TC published to different routes, so that you can consistently get either BB builds or TC builds from the index from a given route.

Before I raise a bug to change this though, I'd like to understand what the motivation was for merging them, as maybe I am missing information.
That document was prepared based on my understanding of things -- I could be wrong. I do think it's very Mozillian to have three slightly different ways of indexing tasks that almost nobody understands, but that doesn't mean it's a good idea :) Long-term, I don't think we want to index builds from different build systems separately, and I think v2 is the long-term path. So maybe we should have everything feed there, with buildbot also feeding to the buildbot route and tc feeding to some new tc route?
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> Do you have an example index route that both buildbot (via mozharness) and
> taskcluster publish to?

As an example, see this route used by Taskcluster:
https://tools.taskcluster.net/index/#gecko.v2.mozilla-central.latest.firefox/gecko.v2.mozilla-central.latest.firefox.linux32-dbg
Currently I see a rank of 30309.

And compare to this route used by Buildbot:
https://tools.taskcluster.net/index/#gecko.v2.mozilla-central.latest.firefox/gecko.v2.mozilla-central.latest.firefox.linux-debug
Currently I see a rank of 1464948040.

In bug 1276352 I'd like to rename the Taskcluster routes so that they use the naming convention in buildbot. So Taskcluster will write to firefox.linux-debug. However, given that the buildbot tasks have a much higher rank, we won't actually see the newer Taskcluster results there.

> I think if we can engineer the buildbot tasks and taskcluster tasks to use
> different index routes, we can avoid the rank competition between the two
> systems. This also then would be more explicit (so you would know from the
> route if you are looking at the result of a buildbot job or a taskcluster
> task).

We don't want buildbot and taskcluster to use different routes. As an example use-case, artifact builds will look in the index for a 'linux-debug' build. We don't want to have to update clients like this just because a build moved from buildbot to taskcluster.
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> 1) We say that 'gecko.v1' should not be used, however, many in-tree configs
> are still using it. If we say we shouldn't use it, we should migrate away
> from it, IMHO.

I don't believe it is actually used anywhere - I just filed bug 1277881 to get rid of it. (By used, I mean any client actually trying to read from gecko.v1.X. The tasks still publish to it, but we should be able to just delete those routes from the task definitions). Unfortunately buildbot.X still has some users - artifact builds (bug 1250700) and mozregression (no bug yet I think?). So we can't remove those until the clients are updated.

> 2) the gecko.v2 routes do not include the build system used to publish them
> (e.g. buildbot or taskcluster) so indeed these two systems are colliding,
> hence this bug.

This is intentional. We originally tried to go with the model of buildbot publishing to specific routes (hence buildbot.X) and Taskcluster publishing to other routes (gecko.v1.X), but since we are constantly changing where things are built, it becomes hard/impossible for a user to know what route to use to get a build (eg: "give me the latest linux 64-bit debug build"). If you're using buildbot.whatever.linux64-debug, and we turn off that buildbot job because it was moved to Taskcluster, suddenly you stop getting the latest build. So to work around this, Taskcluster jobs started publishing to both gecko.v1.X *and* buildbot.X, which made it even more confusing.

gecko.v2 is an attempt to unify them in a way that we can get a singular view of everything that can be built, regardless of the underlying system while the transition is underway. This way clients (like mozregression and artifact builds), can use the index and not constantly be breaking as builds move to Taskcluster. But that means we do need consistency between them, including naming (bug 1276352) and ranking (this bug).

> Imagine you have a tool which queries the latest build for a given criteria,
> from the index. If we have BB and TC jobs running in parallel, you could get
> a TC or BB build one time, and a different type another time, just depending
> on whether TC or BB won the latest race (assuming both jobs are running).
> This seems dangerous and unpredictable. IMO it would be much better and
> cleaner if BB and TC published to different routes, so that you can
> consistently get either BB builds or TC builds from the index from a given
> route.

This is a great point to raise. I think this needs to be part of promoting a Taskcluster build from tier-2 to tier-1 (ie: when does it overtake the buildbot job as the canonical source?). This can be accomplished just by setting the rank I believe (tier-2 always has rank 0?)

> Before I raise a bug to change this though, I'd like to understand what the motivation was for
> merging them, as maybe I am missing information.

Does that help? Or did I make things more confusing? :)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> Long-term, I don't think we want to index builds from different build
> systems separately, and I think v2 is the long-term path. So maybe we
> should have everything feed there, with buildbot also feeding to the
> buildbot route and tc feeding to some new tc route?

They do both publish there, and they use the same routes file (testing/taskcluster/routes.json), so the route formatting should be identical for both. However the parameters that get substituted in for things like "build_name" come from different sources (the mozharness/buildbot configuration vs. task definitions). So unfortunately we have to make sure we maintain parity there while converting things.
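The parity problem can be sketched as follows. The template string and the `project`/`product` parameter names are assumptions modelled on the gecko.v2 "latest" routes quoted earlier in this bug; the actual contents of testing/taskcluster/routes.json may differ. Only `build_name` and the two build names are taken from the comments above.

```python
# Hypothetical template, modelled on the gecko.v2 "latest" routes in this bug.
ROUTE_TEMPLATE = "gecko.v2.{project}.latest.{product}.{build_name}"


def format_route(params):
    """Substitute the given parameters into the shared route template."""
    return ROUTE_TEMPLATE.format(**params)


# Parameters as they might come from the mozharness/buildbot configuration.
buildbot_params = {
    "project": "mozilla-central",
    "product": "firefox",
    "build_name": "linux-debug",
}

# Parameters as they might come from a Taskcluster task definition.
taskcluster_params = {
    "project": "mozilla-central",
    "product": "firefox",
    "build_name": "linux32-dbg",
}

buildbot_route = format_route(buildbot_params)
taskcluster_route = format_route(taskcluster_params)

print(buildbot_route)
print(taskcluster_route)
# Same template, different inputs: the two systems index to different routes.
print(buildbot_route == taskcluster_route)
```

Sharing the template is not enough on its own; the substituted values also have to agree, which is what the renaming in bug 1276352 is about.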
Wow, thanks for the details Mike! Definitely parity of terms is important, and I think you're working on that in other bugs. So it's just rank to fix up here. I like the idea of having tier-2 builds use a lower rank than tier-1.
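One way the tier idea could look, purely as a sketch: `rank_for` is a hypothetical helper, and "tier-2 always has rank 0" is the unconfirmed guess from the comment above, not established behaviour. The pushdate values are the epoch ranks quoted in this bug.

```python
def rank_for(tier, pushdate):
    """Hypothetical rule: tier-1 builds rank by pushdate, lower tiers rank 0."""
    return pushdate if tier == 1 else 0


# Existing tier-1 buildbot build.
buildbot_rank = rank_for(tier=1, pushdate=1464872845)

# A newer Taskcluster build, before and after promotion to tier-1.
tc_tier2_rank = rank_for(tier=2, pushdate=1464948040)
tc_tier1_rank = rank_for(tier=1, pushdate=1464948040)

# While tier-2, the Taskcluster build cannot displace the buildbot build.
print(tc_tier2_rank < buildbot_rank)
# Once tier-1, the newer Taskcluster build outranks it.
print(tc_tier1_rank > buildbot_rank)
```

Under this rule, promoting a build to tier-1 is what makes it the canonical source for its route; no route change is needed.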
Thanks guys! And sorry for being opinionated, after reading your feedback I realise it does make sense. I guess it is a balancing act between a) surprising consumers that a build unexpectedly comes from a different source (like they start getting TC builds instead of BB ones without expecting it), or, b) surprising consumers that their scripts to get latest builds suddenly stop fetching new builds, because e.g. we migrated from buildbot to taskcluster (and they don't really care). Whichever way we go, some consumers are going to be surprised, so there's no magic bullet that keeps everybody happy. At least we're not indexing buildbot builds to a taskcluster namespace, or taskcluster builds to a buildbot namespace - now that would be a nasty hack! :-) In this case, I'm happy for the taskcluster rank to be adapted to be epoch based. @Mike can you send an email to email@example.com with details of your intentions? Thanks! Pete
Buildbot builds use the epoch time for rank, so Taskcluster needs to use this as well. Using the pushlog_id instead means that a Buildbot build will always persist in the "latest" index since the highest value wins. Review commit: https://reviewboard.mozilla.org/r/59388/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/59388/
Attachment #8763057 - Flags: review?(dustin)
Comment on attachment 8763057 [details] Bug 1277666 - Use epoch time for rank; https://reviewboard.mozilla.org/r/59388/#review56744
Attachment #8763057 - Flags: review?(dustin) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/mozilla-inbound/rev/48117b1f5f86 Use epoch time for rank; r=dustin
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED