Closed Bug 608004 Opened 11 years ago Closed 11 years ago

l10n repack jobs should be able to do multiple repacks per job

Categories

(Release Engineering :: General, defect, P4)

All
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [l10n])

Attachments

(1 file, 3 obsolete files)

We currently fire 1 job for every locale, which means we pay the setup/teardown cost per locale, even though a single slave ends up handling many jobs for a set of repacks. We should be able to do N repacks in a job to save (setup/teardown cost) * (N-1). This is especially helpful in releases, where l10n repacks affect end-to-end time non-trivially.
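A back-of-the-envelope sketch of the claimed savings; the ~30-second configure overhead and 74-locale count are figures from comment 4 later in this bug, used here as assumptions for illustration only:

```python
# Rough cost model for batching N repacks into one job. The numbers are
# assumptions taken from comment 4 in this bug: ~30 s of per-job
# setup overhead (configure etc.) and 74 locales on Windows.
OVERHEAD_S = 30
N_LOCALES = 74

one_job_per_locale = OVERHEAD_S * N_LOCALES   # overhead paid N times
one_job_for_all = OVERHEAD_S                  # overhead paid once
saved_s = one_job_per_locale - one_job_for_all

# Matches the formula above: (setup/teardown cost) * (N - 1)
assert saved_s == OVERHEAD_S * (N_LOCALES - 1)
print("saved ~%.1f minutes" % (saved_s / 60.0))  # ~36.5 minutes
```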
Depends on: 608005
We also have to make sure that we can report the logs of each locale individually as we post to each sub-tinderbox page ("Mozilla-l10n-$locale"). I guess we also have to give each locale its own exit status rather than sharing the same job exit status (as some locales can be red, others orange, and others green).

Axel, anything else for us to watch for?
Whiteboard: [l10n]
Do people still use that? I thought everyone was looking at the dashboard these days. I'd argue that we shouldn't make design decisions based on Tinderbox support, too. Status for new things should come through Buildbot JSON.
According to my coconuts (sorry, should put that online, but busy with l10n-changesets), l10n-repacks spend their time in three steps, generally:

configure
installers
partial mar

configure is not really sweet, but not too bad, either. installers and partial take the biggest part, varying by platform.

Sounds like a premature optimization.

And yes, if we tune our infrastructure to reporting, it should be forwards, not backwards.

PS: coconuts is a django app on top of the status db dumps that catlee put up.
(In reply to comment #3)
> According to my coconuts (sorry, should put that online, but busy with
> l10n-changesets), l10n-repacks spend their time in three steps, generally:
> 
> configure
> installers
> partial mar
> 
> configure is not really sweet, but not too bad, either. installers and partial
> take the biggest part, varying by platform.

configure alone adds about 30 minutes on the end2end time on Windows. (~30s * 74 locales) -- when we're talking about an overall end2end time of 6 hours, that's pretty significant.

> Sounds like a premature optimization.

Optimization is only part of this. The other parts are:
- Moving logic out of BuildFactories
- Enabling one-build-per-builder releases, which helps a ton in the scheduling area.

> And yes, if we tune our infrastructure to reporting, it should be forwards, not
> backwards.

Sorry, I don't quite follow. The specific thing I'm concerned about is: Is it a requirement that l10n builds send mail to the per-locale Tinderbox trees?
If we run configure via

make -f client.mk /Users/axelhecht/src/central/mozilla-central/../fx-debug/Makefile 

that should stop being an issue. Requires mozconfigs to be used, though.

I don't see any value moving logic or one-build-per-builder, from the given context.

Regarding reporting, anything that makes reporting harder is basically a no-go. That's true in this very abstract form, independent of whether it affects existing or non-existing reports. Just generally, reporting is much more important than scheduling.
(In reply to comment #5)
> Regarding reporting, anything that makes reporting harder is basically a
> no-go. That's true in this very abstract form, independent of whether it
> affects existing or non-existing reports. Just generally, reporting is
> much more important than scheduling.

I don't disagree with that general principle. Can you please answer my question explicitly though? I'm having trouble interpreting your answer.
Yes, people still look at the individual tinderbox pages, mostly because the top-level all-locales page drops builds, and usually has way more builds than one can look at.

Tinderbox is not something that I call reporting, though. As such, "works on tinderbox" isn't an argument beyond "doesn't regress the cruft we have".
(In reply to comment #7)
> Yes, people still look at the individual tinderbox pages, mostly because the
> top-level all-locales page drops builds, and usually has way more builds than
> one can look at.

OK. I'll make sure that whatever does or doesn't happen here doesn't regress that.

(In reply to comment #5)
> If we run configure via
> 
> make -f client.mk
> /Users/axelhecht/src/central/mozilla-central/../fx-debug/Makefile 
> 
> that should stop being an issue. Requires mozconfigs to be used, though.

What Makefile is fx-debug/Makefile? http://hg.mozilla.org/mozilla-central/file/39a979e26931/Makefile.in ?
That's the generated Makefile in my local objdir. Using my quick test, it doesn't matter if it's an absolute or relative path in that command either, luckily.
(In reply to comment #9)
> That's the generated Makefile in my local objdir. Using my quick test, it
> doesn't matter if it's an absolute or relative path in that command either,
> luckily.

Ok, cool. Thanks for that info
If things get too complicated, would it help to have a TBPL page for the Mozilla-l10n tbox and drop the per-locale tbox pages?

Would this page still work http://l10n.mozilla.org/~axel/nightlies/?
Please, let's not extend this bug to changing l10n reporting unless it ends up overly difficult to maintain.
Depends on: 609901
Attached patch l10n repack logic, in a script! (obsolete) — Splinter Review
This is a somewhat rough version, but I'd love some feedback on it. Notable things:
- Some stuff is just a copy of existing logic in buildbotcustom (eg, postUploadCmdPrefix)
- I've tried to separate the l10n logic in such a way that bits that can be shared with nightly l10n builds are not in release-specific spots. These are in build/l10n.py
- The way I've written these, we'll want to start using a separate mozconfig for l10n builds. This is something we've wanted to do for awhile (bug 482447) anyways.
- There's some messy stuff with paths in here due to MSYS -> env var -> MSYS crap. (Basically, path constructed with path.join(), stuffed into an env var, and read by another msys program do not work.) I'm OK with what I ended up with here, but frustration is a bias -- let me know if you'd like to see it cleaner.
- I'm planning to write a wrapper similar to that in bug 508896 that will be invoked from Buildbot. I thought about reading json directly here but in my brain, it seems to partly defeat the purpose of moving things out of Buildbot if the main scripts themselves depend on buildbot-configs stuff.
- This supports multiple repacks without doing any of the setup/teardown twice.

I tested on mac (10.5), linux (32-bit), and windows
Attachment #489009 - Flags: review?(catlee)
Blocks: 482447
Comment on attachment 489009 [details] [diff] [review]
l10n repack logic, in a script!

This is obscuring a ton of things that are worth understanding when looking at repack builds.

I don't think we should go down this route, and I feel very strongly about that.
Attachment #489009 - Flags: review-
(In reply to comment #14)
> Comment on attachment 489009 [details] [diff] [review]
> l10n repack logic, in a script!
> 
> This is obscuring a ton of things that are worth understanding when
> looking at repack builds.
> 
> I don't think we should go down this route, and I feel very strongly about that.

If you're going to object so strongly, please be more specific. The ultimate goal here is getting the repack logic out of Buildbot and making it more parallelizable. Do you have an alternate implementation suggestion?
More parallelizable by making it less modular? That cannot work.

Also, the idea of moving logic out of one piece of software into another is not a goal in itself. Right now, you can at least point people to "check what our builds are doing", and this bug is taking that away.
If the script has a for loop (per locale), we should be able to report/notify in that for loop.

I only see goodness coming from this.
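As a sketch of that idea (all names here are hypothetical, not taken from the actual patch): the job loops over locales, catches failures per locale, and reports each status from inside the loop, so one red locale doesn't paint the whole job red:

```python
def repack_locale(locale):
    # Stand-in for the real repack work; raises on failure.
    if locale == "bad":
        raise RuntimeError("compare-locales failed")

def run_job(locales, report=print):
    """Repack each locale, reporting a per-locale status from inside
    the loop rather than a single shared job exit status."""
    statuses = {}
    for locale in locales:
        try:
            repack_locale(locale)
            statuses[locale] = "success"
        except Exception as e:
            statuses[locale] = "failed: %s" % e
        # Per-locale notification, e.g. to "Mozilla-l10n-$locale"
        report("Mozilla-l10n-%s: %s" % (locale, statuses[locale]))
    return statuses
```

With this shape, `run_job(["de", "bad", "fr"])` reports success for de and fr and a failure only for the broken locale.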
I consider the logic move to be bad for Mozilla, really. Right now, at least for those who have first-class access to the data, you can just check which step failed with what result and decide if it's a locale problem or an infrastructure problem. Like the ssh upload stuff: just some easy db queries, and you get exactly when an upload step failed on which slave.
What's preventing a slave side script from populating such a db?
I don't see why buildbot is a prereq for this.
And there's various other things we can do to make querying easier -- such as post processing logs. Optimizing for reporting is the wrong thing to do IMHO. We should optimize for efficiency, clarity, maintainability, and other things. Reporting issues can be fixed with tools.
A) reporting is not a post-processing step, but a first class requirement. I know that none of our build infrastructure does that, but maybe that's part of the reason why we're not shipping? Or why I am running my own build system?
B) efficiency, clarity and whatnot are nice to haves, not goals. Nor are they bound to what you're doing.
C) Good coding style has nothing to do with spreading the code across more files with lengthier command lines that no one but one releng system is ever going to use.

You're really just ripping out features that, once you add them back, will get you into a worse place than before.
Mozharness should be able to trim the command line args needed to as few as zero.

Reporting can be done even more efficiently in a slave side script than in buildbot afaict.

I think you're arguing that if we do this, we need to do it well, and I agree.
I'm arguing that if you want to do something well, focus on the things that matter instead of those that don't.

Create real mozconfigs, use failsafe mercurial steps, generate reports on infrastructure problems, refactor the build factories to not do something different all around. Make configure use the build system to be quick.
Comment on attachment 489009 [details] [diff] [review]
l10n repack logic, in a script!

This has changed a fair bit since I posted it, I'll attach a new one for feedback later.
Attachment #489009 - Attachment is obsolete: true
Attachment #489009 - Flags: review?(catlee)
Attached patch updated scripts (obsolete) — Splinter Review
This is a much much more tested version, and includes a thin .sh wrapper. Despite our conversation yesterday I ended up reading release configs directly in create-release-repacks.py. The only alternative I could find that was reconfigless release friendly was parsing the release/branch config in the .sh wrapper, which is worse IMHO.

I expanded the config reading functions in release.info to do some validation.

I've tested from within Buildbot across win32, mac (ppc+i386 universal), and 32-bit linux. Buildbotcustom changes to enable that are incoming.
Attachment #489656 - Flags: review?(catlee)
Attached patch buildbotcustom changes (obsolete) — Splinter Review
All my testing was done with "force build". I still need to make sure I can get proper properties/sourcestamp propagation through the Schedulers (specifically, sourcestamp.revision and the release_config property).
Attachment #489657 - Flags: feedback?(catlee)
Comment on attachment 489656 [details] [diff] [review]
updated scripts

Nothing in this patch changes my mind.

r-.

I don't accept this change in infrastructure, frankly, as it's regressing existing functionality and it's not providing promise for the future.

I'm also not open to arguments like "reporting can be done...", as years of experience have shown that nothing's going to happen. I'm not going to hide my frustration here.

That said, if you're interested in constructive feedback on the problems you want to solve, I'm happy to discuss those in detail, grouped as it makes sense.
Attachment #489656 - Flags: review-
Blocks: 608008
New in this version:
- purge_builds.py support, with smarter defaults in that script
- use mercurial/update calls everywhere to ensure we always get the most recent _RELEASE tag. will also get us using share when possible, once that patch lands
- use l10n mozconfigs from $platform/$branch/release/l10n-mozconfig
- inherit merge option for release config


Axel, I appreciate that you care about our l10n builds but frankly, your unrequested review is not going to stop us from moving forward here. I am more than happy to work with you and do our best to accommodate your reporting requirements but especially for release builds (which are my current target) they are not the primary design consideration.
Attachment #489656 - Attachment is obsolete: true
Attachment #489657 - Attachment is obsolete: true
Attachment #490155 - Flags: review?(catlee)
Attachment #489657 - Flags: feedback?(catlee)
Attachment #489656 - Flags: review?(catlee)
I do expect that the trouble for release factories will be significantly eased by bug 525438, fwiw.

Anyway, I'll make this bug part of a post to .governance.
CCing LegNeato on this one. Transparency of release builds is on his list, IIRC.
Comment on attachment 490155 [details] [diff] [review]
scripts updated with purge_builds and a few other things

>+def readConfig(configfile, keys=[], required=[]):
>     c = {}
>     execfile(configfile, c)
>-    return c['releaseConfig']
>+    for k in keys:
>+        c = c[k]
>+    items = c.keys()
>+    err = False
>+    for key in required:
>+        if key not in items:
>+            err = True
>+            log.error("Required item `%s' missing from %s" % (key, c))
>+    if err:
>+        raise
>+    return c

You need to raise some kind of exception here.

>+    build.misc.cleanupObjdir(sourceRepoName, objdir, appName)
>+    l10nRepackPrep(sourceRepo, sourceRepoName, revision, objdir,
>+                   mozconfigPath, l10nRepoDir, makeDirs, localeSrcDir, env)
>+    input_env = downloadReleaseBuilds(stageServer, product, brand, version,
>+                                      buildNumber, platform)
>+    env.update(input_env)
>+
>+    err = False
>+    for l in locales:
>+        try:
>+            repackLocale(l, l10nRepoDir, l10nBaseRepo, revision,
>+                         localeSrcDir, l10nIni, compareLocalesRepo, env, merge)
>+        except Exception, e:
>+            err = True
>+            log.error("Error creating locale '%s': %s", l, e)
>+            pass
>+
>+    if err:
>+        raise

Here too.  The 'pass' above isn't required either.

>+    l10nRepoDir = path.split(releaseConfig["l10nRepoClonePath"])[-1]
>+    stageSshKey = path.join("~", ".ssh", branchConfig["stage_ssh_key"])

Does this "~" need to be expanded with os.path.expanduser?

r+ otherwise.
Attachment #490155 - Flags: review?(catlee) → review+
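For reference, the bare `raise` statements the review flags could be replaced along these lines. This is a sketch only: `ConfigError` is an assumed name, and the validation is reduced to a plain dict check rather than the patch's `execfile`-based config loading.

```python
class ConfigError(Exception):
    """Assumed exception type; the actual patch may use something else."""

def validate_config(config, required):
    # Collect every missing key, then raise a concrete exception instead
    # of a bare `raise` (which only re-raises inside an `except` block).
    missing = [key for key in required if key not in config]
    if missing:
        raise ConfigError("Required items missing: %s" % ", ".join(missing))
    return config
```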
Whoops, thanks for catching those empty raises.

We actually *must* preserve the ~ rather than expanding it, because it's being set as an environment variable. A path which has been fully expanded by a Windows app (python), set as env var, then read in again by a Windows app (python), then passed to MSYS' ssh will not work, because of quoting or slashes (can't remember which).
Comment on attachment 490155 [details] [diff] [review]
scripts updated with purge_builds and a few other things

changeset:   1271:9cba7a9fd13e
Attachment #490155 - Flags: checked-in+
Active in staging now; will be turned on in production with the first 0.8.0 releases there.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 608005
No longer depends on: 608005, 609901
Depends on: 804090
Product: mozilla.org → Release Engineering