Closed Bug 1301495 Opened 4 years ago Closed 4 years ago

Taskcluster l10n indexing should match mozharness' l10n indexing

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mshal, Assigned: Callek)

Attachments

(2 files)

l10n jobs are currently indexed as:

 "{index}.gecko.v2.{project}.revision.{head_rev}.{build_product}-l10n.{build_name}-{build_type}.{locale}",
 "{index}.gecko.v2.{project}.pushdate.{year}.{month}.{day}.{pushdate}.{build_product}-l10n.{build_name}-{build_type}.{locale}",
 "{index}.gecko.v2.{project}.latest.{build_product}-l10n.{build_name}-{build_type}.{locale}"

Based on some of the work in bug 1286075 and some things Callek is working on (which bug?), it looks like these routes will be changing to use the regular build routes, like:

index.gecko.v2.{project}.latest.{product}.{job-name-gecko-v2}

Where product is just "firefox" and job-name-gecko-v2 is "linux64-l10n-opt".

Fully expanded, here's a comparison of the "latest" routes:

gecko.v2.mozilla-central.latest.firefox-l10n.linux64-opt.ar
gecko.v2.mozilla-central.latest.firefox-l10n.linux64-opt.ast
gecko.v2.mozilla-central.latest.firefox-l10n.linux64-opt.cs

vs:

gecko.v2.mozilla-central.latest.firefox.linux64-l10n-opt
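
To make the difference concrete, here is a minimal sketch that expands both "latest" templates. The templates are taken from the comparison above (without the leading index prefix); the function names are hypothetical, and the hyphenated placeholder job-name-gecko-v2 is renamed to job_name only so that it works with Python's str.format.

```python
# Route templates from the comparison above. The placeholder "job_name" stands
# in for "job-name-gecko-v2"; that renaming is an assumption of this sketch.
OLD_LATEST = "gecko.v2.{project}.latest.{build_product}-l10n.{build_name}-{build_type}.{locale}"
NEW_LATEST = "gecko.v2.{project}.latest.{product}.{job_name}"


def old_latest_routes(project, build_product, build_name, build_type, locales):
    """Return one route per locale, as in the current l10n indexing."""
    return [
        OLD_LATEST.format(
            project=project,
            build_product=build_product,
            build_name=build_name,
            build_type=build_type,
            locale=locale,
        )
        for locale in locales
    ]


def new_latest_route(project, product, job_name):
    """Return the single route shared by the whole job, as in the build routes."""
    return NEW_LATEST.format(project=project, product=product, job_name=job_name)


old = old_latest_routes("mozilla-central", "firefox", "linux64", "opt", ["ar", "ast", "cs"])
new = new_latest_route("mozilla-central", "firefox", "linux64-l10n-opt")
```

The old format yields one namespace per locale, while the new format collapses every locale onto one namespace, which is where the problems below come from.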

I believe this will present a few problems:

1) Changing the routes to a new format means we'll have to backfill the old jobs with the new format in order to make the history usable. (There's an example in braindump/taskcluster/route-backfill that may help here)

2) l10n jobs in buildbot land are "chunked", so building say 30 locales in 6 chunks means we build 5 locales per job (numbers are made up). There are a couple of options here:

 a) In mozharness, after each locale we create a dummy task, and upload the build artifacts with its own unique index. In Taskcluster it sounds like that won't work, since the task is created by Taskcluster itself.

 b) Use the chunk instead of the locale name in the route, which has the downside that nobody knows which chunk to look at to get a particular locale, and the locale-to-chunk mapping may change over time.

 c) Use a single route, which is the option used in bug 1286075. This has the downside that the first (or last) chunk wins, so only one set of locales is actually visible in the index.

 d) Remove chunking in l10n land and just build one locale per build. The downside here is that we would lose the performance wins that chunking gives us. (Are those performance wins still valid in TC land? Or would we benefit now from more parallelization anyway?)

 e) dustin suggested in IRC that we may be able to just index the same chunk at multiple routes, so .ar, .ast, etc all point to a single task, which has all of the artifacts for that chunk. However, we'd need to know which locales are in the chunk ahead of time in Taskcluster in order to generate the list of routes correctly.
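
As a rough illustration of option e), the sketch below splits a locale list into chunks and then generates one "latest" route per locale for a given chunk, so every locale in the chunk points at the same task. The chunking function is a hypothetical stand-in (not the algorithm mozharness or Taskcluster actually uses), and the product and platform parts of the route are hard-coded from the example routes above.

```python
def chunkify(locales, total_chunks):
    """Split locales into total_chunks contiguous chunks of near-equal size.

    Hypothetical helper: earlier chunks receive the leftover locales.
    """
    size, extra = divmod(len(locales), total_chunks)
    chunks, start = [], 0
    for i in range(total_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(locales[start:end])
        start = end
    return chunks


def routes_for_chunk(project, locales_in_chunk):
    """Return one per-locale route for every locale built by this chunk."""
    template = "index.gecko.v2.{project}.latest.firefox-l10n.linux64-opt.{locale}"
    return [template.format(project=project, locale=locale) for locale in locales_in_chunk]


chunks = chunkify(["ar", "ast", "cs", "de", "fr"], 2)
second_chunk_routes = routes_for_chunk("mozilla-central", chunks[1])
```

This only works if the locale-to-chunk split is computed where the task definition is generated, which is exactly the constraint described in e).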

Any other options I'm missing?

If we're going to have to change the routes for sure, I'd recommend roping in the l10n team to see if they have other requirements they'd like to levy on the index and make sure they are handled in the new world order.
f) Make the chunk task have child tasks, where each child has its own set of routes and uploads and indexes one locale's artifacts. I think this is possible in Taskcluster?
Regarding e)
You don't have to know the index namespaces for a task upfront. You only need to know them upfront if you want to index the task by adding a route in task.routes.

However, you could also just add the scopes index:insert-task:...* to task.scopes and use the authenticating proxy from within the task to explicitly insert the task in the index.

This changes semantics a tiny bit:
Before: a task in the index had always completed successfully.
After: a task in the index could have failed, but a task is always added to the index if it was successful.

Note: to avoid possible race conditions one might want to upload artifacts from within the task before inserting it into the index. That way even if the task crashes after index insertion it'll still have the required artifact.
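
A minimal sketch of that ordering follows. Everything concrete in it is an assumption made for illustration: the proxy base URL, the HTTP method, and the payload fields are how I would expect an index insert to look, not something confirmed in this bug, and upload_artifacts and send are hypothetical callables supplied by the caller. No network request is made by the sketch itself.

```python
import json

# Assumed base URL of the index as seen through the in-task authenticating proxy.
PROXY_INDEX_URL = "http://taskcluster/index/v1/task/"


def build_insert_request(namespace, task_id, expires, rank=0):
    """Build (method, url, body) for inserting task_id at namespace.

    The payload field names are assumptions for this sketch.
    """
    body = {"taskId": task_id, "rank": rank, "data": {}, "expires": expires}
    return "PUT", PROXY_INDEX_URL + namespace, json.dumps(body, sort_keys=True)


def publish_locale(namespace, task_id, expires, upload_artifacts, send):
    """Upload the artifacts first, then insert the task into the index.

    If the task dies after the insert, the indexed task still has its artifact.
    """
    upload_artifacts()
    method, url, body = build_insert_request(namespace, task_id, expires)
    return send(method, url, body)
```

The task would also need a scope matching the namespace (the index:insert-task:...* scope mentioned above) for the insert to be accepted.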
How are the repacks going to be signed? Is there going to be a single signing task that takes all of the packages from a chunk and signs them all? Or is there going to be a signing task per package? If the latter, that would essentially mean we're doing option f), and can just index the signed package.
(In reply to Michael Shal [:mshal] from comment #3)
> How are the repacks going to be signed? Is there going to be a single
> signing task that takes all of the packages from a chunk and signs them all?
> Or is there going to be a signing task per package? If the latter, that
> would essentially mean we're doing option f), and can just index the signed
> package.

To this question, all repacks in a chunk are signed in one signing task, different chunks get signed in different signing tasks.

To elaborate, Mike and I met a few weeks ago and came up with the following rough plan:

Problem: We had (some) buildbot l10n jobs reporting with a shorter-than-40-char sha on routes like: `{index}.gecko.v2.{project}.revision.{head_rev}.{build_product}-l10n.{build_name}-{build_type}.{locale}`
Solution: This was deemed not a big deal for BB, but it has been fixed since that meeting.

Problem: L10n repacks (currently only nightly in BB world) don't use the nightly route namespace
Solution: Callek to look into changing/adjusting that.

Problem: (in Taskcluster) investigate adding l10n routes per locale to each chunk of repacks
Solution: Callek will get this working for Taskcluster, even if we need the TC team to bump the number of routes allowed on a single task.

Problem: investigate making the routes for binaries the *signed* package routes, rather than unsigned
Solution: None yet. This would be the ideal, but it is currently not blocking.
Depends on: 1323792
Depends on: 1323796
:mshal, any concerns here?
Flags: needinfo?(mshal)
This is to facilitate that n-i of mshal (and dustin's review). It is a diff of my date tree with (and without) this patch applied.
The results look pretty good to me! Thanks for taking this on Callek.

So the linux64-nightly-opt is a linux64 nightly build in Taskcluster that's something new on the date branch, right? And the diff for the nightly portion just shows that we're fixing those routes to match the existing nightly organization? If so, one thing I noticed is that the macosx64 nightly routes show up as "macosx64-nightly-opt", but this should be "macosx64-opt".

Were you able to run a test with a chunk size of 1? I'd be curious to see some up-to-date numbers on what the chunking is actually buying us, in terms of total compute time and chunk turnaround time.
Flags: needinfo?(mshal)
(In reply to Michael Shal [:mshal] from comment #8)
> The results look pretty good to me! Thanks for taking this on Callek.
> 
> So the linux64-nightly-opt is a linux64 nightly build in Taskcluster that's
> something new on the date branch, right?

Yes, BUT we're aiming for tier 1 on m-c within the next 1-2 weeks, and the majority of this code has already landed on m-c (it's just not running, because it's currently triggered by a hook).

> And the diff for the nightly
> portion just shows that we're fixing those routes to match the existing
> nightly organization?

Yes, for both l10n (in BB only used in nightlies) and the actual nightly builds.

> If so, one thing I noticed is that the macosx64
> nightly routes show up as "macosx64-nightly-opt", but this should be
> "macosx64-opt".

I'll fix that, thanks (mac is not yet as close to ready as linux/android)

> Were you able to run a test with a chunk size of 1? I'd be curious to see
> some up-to-date numbers on what the chunking is actually buying us, in terms
> of total compute time and chunk turnaround time.

The main point here is to (try to) get chunk sizes similar to or the same as mozilla-central's. The benefit of this code is that it's SUPER easy to change the chunking algorithm to use different chunk sizes, and doing so is merely a mozilla-central commit (so it can be measured on try).

That said, the way the routes used for l10n presently work means this won't be usable with a chunk size of 1: given the number of locales, we couldn't place enough routes on a single l10n task.

We could figure out a way to work around it (a separate task that collates X repacks into a single chunk, rather than all of them into one chunk, and carries the index routes, so we can have the necessary indexes), but that means duplicated storage etc. It's doable, but probably more human work than it is worth.
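
The route pressure described here can be estimated with a small sketch. It assumes each locale needs one route per template listed in the original report (revision, pushdate, latest), and it takes the per-task limit as a parameter, max_routes, because the actual limit is not stated in this bug. The function names are hypothetical.

```python
# The three route templates listed in the original report.
ROUTE_KINDS = ("revision", "pushdate", "latest")


def routes_per_task(locales_in_chunk, route_kinds=ROUTE_KINDS):
    """Number of index routes one chunk task needs: one per locale per template."""
    return len(locales_in_chunk) * len(route_kinds)


def chunks_over_limit(chunks, max_routes):
    """Return the 1-based numbers of chunks that need more than max_routes routes.

    max_routes is a hypothetical per-task limit supplied by the caller.
    """
    return [
        number
        for number, chunk in enumerate(chunks, start=1)
        if routes_per_task(chunk) > max_routes
    ]


example_chunks = [["ar", "ast", "cs"], ["de"]]
too_big = chunks_over_limit(example_chunks, max_routes=6)
```

With a single chunk holding every locale, the route count grows with the full locale list, which is the case Callek describes as unusable.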
(In reply to Justin Wood (:Callek) from comment #9)
> The main point here is to (try) and get similar/same chunking sizes as
> mozilla-central. The benefit with this code is its SUPER easy to change
> chunking algorithm to be different chunk sizes, and it is merely a
> mozilla-central commit (so can be measured on try).

Sounds good!

> That said, the way the routes used for l10n work presently, means this won't
> be useable with a chunk size of 1 (since then we can't place enough routes
> on an l10n task) due to the amount of locales... 

Doh, I misspoke. I meant one locale per chunk (essentially, no chunking), instead of 1 chunk with all the locales. I was mostly curious whether the overhead per chunk is still large enough that chunking is worthwhile, or whether we could just abandon chunking (and then, by extension, have simpler index routing).
Comment on attachment 8825252 [details]
Bug 1301495 - Taskcluster l10n indexing should match mozharness' l10n indexing.

https://reviewboard.mozilla.org/r/103426/#review104170

Ugh, sorry, this has apparently been sitting in mozreview with an un-submitted r+ for a week now with

> I only had a quick look, but didn't see anything to alarm me.  Mike, please do take a few extra minutes looking, to make up for my hurried review?

which Mike has certainly done.
Attachment #8825252 - Flags: review?(dustin) → review+
Pushed by Callek@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/7a27e1371c3e
Taskcluster l10n indexing should match mozharness' l10n indexing. r=dustin
Backed out in https://hg.mozilla.org/integration/autoland/rev/80eac484366ad881c6a10bf81e8d9b8f7a676c75 at callek's request for breaking decision tasks.
Flags: needinfo?(bugspam.Callek)
Pushed by Callek@gmail.com:
https://hg.mozilla.org/mozilla-central/rev/b3774461acc6
Taskcluster l10n indexing should match mozharness' l10n indexing. r=dustin a=RyanVM
(In reply to Pulsebot from comment #13)
> Pushed by Callek@gmail.com:
> https://hg.mozilla.org/integration/autoland/rev/7a27e1371c3e
> Taskcluster l10n indexing should match mozharness' l10n indexing. r=dustin

Manually landed to central with RyanVM's a+:

https://hg.mozilla.org/mozilla-central/rev/b3774461acc6bee2216c5f57e167f9e5795fb09d

The issue was that I forgot I had crafted this patch against date (i.e. it had OSX nightly support in it, but that's not on central).
Flags: needinfo?(bugspam.Callek)
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Component: Index → Services