Closed Bug 962863 Opened 10 years ago Closed 8 years ago

cut over l10n repos to the new vcs-sync system

Categories

(Release Engineering :: General, defect, P3)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: mozilla, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1898] )

Attachments

(6 files, 2 obsolete files)

The config file is already landed: http://hg.mozilla.org/build/mozharness/file/1d65ffa9b514/configs/vcs_sync/l10n.py though the GECKO_BRANCHES dict is now out of date: http://hg.mozilla.org/build/mozharness/file/1d65ffa9b514/configs/vcs_sync/l10n.py#l6

I think l10n is ready to go, BUT we have a mapper dependency.
Once bug 962853 is done, we can cut over the l10n repos.  These are definitely in use in production, so this bug is more sensitive than bug 866260.

It's probably best to:

* run in staging
* populate mapper's l10n project with the staging runs
* verify staging looks good
* cut over
* when we cut over, change the b2g_build.py l10n mapper call to call the new mapper url:  http://hg.mozilla.org/build/mozharness/diff/1d65ffa9b514/scripts/b2g_build.py#l1.133
* verify
Blocks: 962870
Blocks: 967282
No longer blocks: 967282
Depends on: 967282
Get the GECKO_VERSIONS in the vcs-sync l10n config file matching the instructions in https://wiki.mozilla.org/ReleaseEngineering/Merge_Duty/Steps#Restart_Gecko_l10n_vcs-sync_.28even_numbered_central_gecko_version.29 .  This config file isn't live yet, but will be at some point.
Attachment #8371233 - Flags: review?(hwine)
Comment on attachment 8371233 [details] [diff] [review]
vcs-sync_l10n-versions

Review of attachment 8371233 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8371233 - Flags: review?(hwine) → review+
Attached patch gaia_1.3_l10nSplinter Review
Add the gaia 1.3 repos as well.
Attachment #8371969 - Flags: review?(hwine)
Comment on attachment 8371969 [details] [diff] [review]
gaia_1.3_l10n

Review of attachment 8371969 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8371969 - Flags: review?(hwine) → review+
in production
Aki: should this be assigned to Pete, and should it be a Q2 goal?
Flags: needinfo?(aki)
Assignee: nobody → pmoore
Priority: -- → P3
Flags: needinfo?(aki)
Working on this now too...
Awesome.  L10n is less sensitive than gecko.git but is still partner-facing, so please check in with me and/or hwine before cutting over.  Pretty sure you may have questions before then anyway :)
(In reply to Aki Sasaki [:aki] from comment #10)
> Pretty sure you may have questions before then anyway :)

Thanks Aki for your support!

BTW, I am pushing here, while I am testing in staging:
https://github.com/petemoore/build-mozharness/compare/bug962853...bug962863
Discovered first issue!

Esperanto is listed as a supported locale for gaia 1.3 here:
https://raw.githubusercontent.com/mozilla-b2g/gaia/v1.3/locales/languages_all.json

But is missing here:
https://hg.mozilla.org/releases/gaia-l10n/v1_3/eo

I'll make sure vcs sync can handle this, but I should probably raise a separate bug about it - do you know which product / component I should raise that in, Aki?

Thanks,
Pete
Flags: needinfo?(aki)
Hm.  Short answer, switch to languages_dev.json for v1.3.

Long answer:

* if the repo is missing on git.m.o, a push will auto-create, so no worries.  However, this is hg.m.o
* I pointed it at languages_all.json instead of languages_dev.json because Axel was wanting to get all the locales synced, and all the locales enabled on Flame & Dolphin (& localizer desktop builds).
** However, Flame isn't building on 1.3; neither is Dolphin or the localizer desktop builds

So we can do one or more of the following:

1) Switch gaia 1.3 to use languages_dev.json instead of languages_all.json
2) ask
 * Look at the languages_all blame list to see who inserted esperanto without the corresponding repo, and ask them about it
 * Send an email out to Axel and possibly dev-gaia or dev-b2g to ask about esperanto and why it exists in languages_all.json on v1.3 when there's no hg repo
Depending on the answer, we can create the repo or remove it from languages_all.json

I'd lean towards (1) for now.  If you want to pursue the discrepancy, feel free to also do (2), with the caveat that we're going to EOL v1.3 by September 1 and no builds are referencing v1.3's languages_all.json.

Pretty sure we're covered for v1.4, v2.0, and v2.1 (master) languages_all.json, since we have builds that use that file for their locales list, so aiui they'd die if there was nothing to clone.
Flags: needinfo?(aki)
Aki,

Thanks for this! I've done 1) and am retesting. I think I'll also have a stab at 2) also for the purpose of introducing myself, and becoming familiar with the people/processes - first I'll get 1) working though.

Appreciated your detailed answer. :)

Thanks,
Pete
I'm still hacking away here: https://github.com/petemoore/build-mozharness/compare/bug962853...bug962863

There are some workarounds in place while I get the code working, that will need to be cleaned up.

The existing implementation wasn't publishing to remote repos, so I'm just working on this part at the moment, to get the remote targets generating from the locales. I think I still have a few more days work to do here, before I'll have it review-ready, but making good progress. :)

Just wanted to keep you up-to-date...

Pete
Hi Aki,

So I'd just finished preparing the patches, when I by chance discovered a commit which didn't have a git note - which led me to discover an issue with the git notes when we push multiple source repos to the same target repo. Long story short - I still have an open issue to solve, but I don't consider it a blocker, as nothing is relying on the git notes. It is going to be a bit of work to fix (merging the git notes is not so trivial) so I'll post these now, and follow up with a seperate bug to fix the missing git notes.

So other than this annoying glitch, the mozharness changes are ready. I have split the change into two separate mozharness patches for clarity. The first is for the l10n changes related to the requirements of this bug. The second is just a minor cleanup of the build-repos configs.

l10n patch
==========

This should be mostly self explanatory. Some code has been written to generate configs at runtime, including all the required remote targets. Since mozharness configs that are passed in should not be updated (indeed, they are read-only) I added a class variable for the remote targets (self.remote_targets), which is set to the generated l10n remote targets, if this is an l10n conversion, otherwise if it is not explicitly set, it defaults to the remote targets that are passed in, in the config file.

I've refactored some of the code where there was duplication, in both the vcs sync conversion code, and also the configs.

The remote targets and the gaia and gecko configs contain placeholders for the locales. As the locales are read, the configs get updated with new remote targets and conversion dictionaries added as required.

I also changed the email truncation from 10K to 100K when data from the log files is embedded into the email. This is because 10K doesn't allow for much content, and 100K is still quite manageable. The emails can be in total just over 200K in size, and that does not cause any problems with emails being blocked by the SMTP server, as far as I am aware.

build-repos patch
=================

This minor patch is created to achieve two things:
  1) To change our configs so that we have one staging config per vcs sync prod config (i.e. we can easily switch between running l10n and build-repos in staging, by having a separate config file for each, rather than just having a single "staging" config).
  2) A minor clean up of formatting so that build-repos.py config is identical to staging_build-repos.py except for the value of the credentials being different between these two files. This makes it easier to diff between them, to see what the true difference is between staging and production.

Looking forward to your feedback. Many thanks in advance!

Kind regards,
Pete
Attached patch l10n patch (obsolete) — Splinter Review
Attachment #8451180 - Flags: review?(aki)
Attached patch build-repos patch (obsolete) — Splinter Review
Attachment #8451182 - Flags: review?(aki)
(In reply to Pete Moore [:pete][:pmoore] from comment #16)
> Hi Aki,
> 
> So I'd just finished preparing the patches, when I by chance discovered a
> commit which didn't have a git note - which led me to discover an issue with
> the git notes when we push multiple source repos to the same target repo.
> Long story short - I still have an open issue to solve, but I don't consider
> it a blocker, as nothing is relying on the git notes. It is going to be a
> bit of work to fix (merging the git notes is not so trivial) so I'll post
> these now, and follow up with a seperate bug to fix the missing git notes.

Created bug 1034725 for this. Like I say, it is not a dependency and not a blocker, as nothing is currently relying on the git notes - this is just a "nice to have" at the moment.
Comment on attachment 8451180 [details] [diff] [review]
l10n patch

r-ing because I'd like to see responses (at the very least) to the concerns below, if not code changes.

>diff --git a/configs/vcs_sync/l10n.py b/configs/vcs_sync/l10n.py
<snip>
>+        "generate_git_notes": True, # False by default

Don't you want to set this to False since these aren't handled properly yet?

>+GAIA_CONFIG = {}
>+# for branch_name in ('master', 'v1_0_1', 'v1-train', 'v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
>+for branch_name in ('v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
>+    branch_name_with_dots=branch_name.replace('_', '.')
>+    GAIA_CONFIG[branch_name] = {
>+        'locales_file_url': 'https://raw.github.com/mozilla-b2g/gaia/' + branch_name_with_dots + '/locales/languages_all.json',
>+        'hg_url': 'https://hg.mozilla.org/releases/gaia-l10n/' + branch_name + '/%(locale)s',
>+        'git_branch_name': branch_name_with_dots,
>+        'targets': [{
>+            "target_dest": "gitmo-gaia-l10n-%(locale)s",
>+        }, {
>+            'target_dest': 'releases-l10n-%(locale)s-gaia/.git',
>+            'vcs': 'git',
>+            'test_push': True,
>+        }],

The main purpose for a test_push is to make sure the push is valid before pushing over the network.  Let's switch the order of these two.

>+        "generate_git_notes": True, # False by default

Same here.
We could have a GENERATE_GIT_NOTES flag set at the top, so we can turn these all on with a one-line patch once we're ready, but I think if we're having issues with git notes we shouldn't be turning them on for important production repos.  (build/* mirrors in git isn't critical for production afaik)

>+# now fix up nonconformant entries
>+GAIA_CONFIG['master']['hg_url'] = 'https://hg.mozilla.org/gaia-l10n/%(locale)s'
>+# GAIA_CONFIG['v1-train']['hg_url'] = 'https://hg.mozilla.org/releases/gaia-l10n/v1_1/%(locale)s'
>+# GAIA_CONFIG['v1-train']['git_branch_name'] = 'v1.1'
>+GAIA_CONFIG['v1_3']['locales_file_url'] = 'https://raw.github.com/mozilla-b2g/gaia/v1.3/locales/languages_dev.json'
>+# MERGE DAY remove the line below on 2014-07-21 (so it picks up default value from for-loop above)
>+GAIA_CONFIG['v2_0']['hg_url'] = 'https://hg.mozilla.org/gaia-l10n/%(locale)s'
also these:
>+# now fix up nonconformant entries
>+GECKO_CONFIG_TEMPLATE['mozilla-central']['locales_file_url'] = 'https://hg.mozilla.org/mozilla-central/raw-file/default/b2g/locales/all-locales'
>+GECKO_CONFIG_TEMPLATE['mozilla-central']['hg_url'] = 'https://hg.mozilla.org/l10n-central/%(locale)s'

I actually vastly prefer verbose and explicit, because it's very clear what you're editing when it's a critical production config file.  In a for-loop-with-edits, it requires you to read the whole thing and the edits before you understand what's going on in a specific place.  Explicit > implicit.  We don't edit this file much (once every 6-12 weeks), so optimizing writes over reads is questionable at best.  I'd rather optimize for debugging what's going on if something goes wrong, when you have much less time and headspace than when you're adding a new version to the file.  There's a reason I chose to define it the way I did.

>     "find_links": [
>-        "http://pypi.pvt.build.mozilla.org/pub",
>-        "http://pypi.pub.build.mozilla.org/pub",
>+        "http://pypi.pub.build.mozilla.org/pub"
>     ],

Is there a reason you removed the pvt link? Afaik we should use it for internal infra.

>     "upload_config": [{
>-        "ssh_key": "~/.ssh/id_rsa",
>+        "ssh_key": "~/.ssh/vcs-sync_rsa",
>         "ssh_user": "asasaki",
>-        "remote_host": "github-sync2",
>-        "remote_path": "/home/asasaki/upload/l10n",
>+        "remote_host": "people.mozilla.org",
>+        "remote_path": "/home/asasaki/public_html/vcs2vcs/l10n",

I see http://people.mozilla.org/~asasaki/vcs2vcs/l10n/logs/ but it looks empty?

>+    "default_actions": [
>+        'list-repos',
>+        'create-virtualenv',
>+        'update-stage-mirror',
>+        'update-work-mirror',
>+        'create-git-notes',

Same here.

>diff --git a/configs/vcs_sync/staging_l10n.py b/configs/vcs_sync/staging_l10n.py
>+for repo in ('mozilla-release', 'mozilla-beta', 'mozilla-aurora', 'mozilla-central'):
>+    GECKO_CONFIG_TEMPLATE[repo] = {
>+        'locales_file_url': 'https://hg.mozilla.org/releases/%s/raw-file/default/b2g/locales/all-locales' % repo,
>+        'hg_url': 'https://hg.mozilla.org/releases/l10n/mozilla-release/%(locale)s',
>+        'targets': [{
>+            "target_dest": "gitmo-gecko-l10n-%(locale)s",
>+        }, {
>+            'target_dest': 'releases-l10n-%(locale)s-gecko/.git',
>+            'vcs': 'git',
>+            'test_push': True,
>+        }],
>+        'tag_config': {
>+            'tag_regexes': [
>+                '^B2G_',
>+            ],
>+        },
>+        'mapper': {
>+            "url": "https://api-pub-build.allizom.org/mapper",
>+            "project": "gitmo-gecko-l10n",
>+        },
>+        "generate_git_notes": True, # False by default
>+    }

This looks largely the same except for the mapper url.
Do we need to be able to specify different mapper urls for different repos, or is one global mapper url fine?  If the latter, you can just override that config item and not worry about redefining the whole GECKO_CONFIG_TEMPLATE (e.g., |script -c production_config -c staging_config ...|; the staging_config can just contain the needs-overriding options).  However, it's fine to have a duplicate, especially if there's a risk of staging contaminating production if we're not careful.  With staging configs it's always a balancing act between not contaminating production and not falling so out of date with the production configs that you're testing an obsolete configuration.  Minimizing the touch points if you do a straight |diff| is one thing that can help.

>+for branch_name in ('v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
>+    branch_name_with_dots=branch_name.replace('_', '.')
>+    GAIA_CONFIG[branch_name] = {
>+        'locales_file_url': 'https://raw.github.com/mozilla-b2g/gaia/' + branch_name_with_dots + '/locales/languages_all.json',
>+        'hg_url': 'https://hg.mozilla.org/releases/gaia-l10n/' + branch_name + '/%(locale)s',
>+        'git_branch_name': branch_name_with_dots,
>+        'targets': [{
>+            "target_dest": "gitmo-gaia-l10n-%(locale)s",
>+        }, {
>+            'target_dest': 'releases-l10n-%(locale)s-gaia/.git',
>+            'vcs': 'git',
>+            'test_push': True,
>+        }],
>+        'tag_config': {
>+            'tag_regexes': [
>+            '^B2G_',
>+        ]},
>+        'mapper': {
>+            "url": "https://api-pub-build.allizom.org/mapper",
>+            "project": "gitmo-gaia-l10n",
>+        },
>+        "generate_git_notes": True, # False by default
>+    }

Same here.

>diff --git a/mozharness/base/vcs/vcssync.py b/mozharness/base/vcs/vcssync.py
>index 2f19f68..98e0ef7 100644
>--- a/mozharness/base/vcs/vcssync.py
>+++ b/mozharness/base/vcs/vcssync.py
>@@ -35,53 +35,53 @@ class VCSSyncScript(VCSScript):
>         if fatal:
>             subject = "[vcs2vcs] Failed conversion for %s" % job_name
>             text = ''
>-            if len(message) > 10240:
>-                text += '*** Message below has been truncated: it was %s characters, and has been reduced to 10240 characters:\n\n' % len(message)
>-            text += message[0:10240] + '\n\n' # limit message to 10KB in size (large emails fail to send)
>+            if len(message) > 102400:
>+                text += '*** Message below has been truncated: it was %s characters, and has been reduced to 102400 characters:\n\n' % len(message)
>+            text += message[0:102400] + '\n\n' # limit message to 100KB in size (large emails fail to send)

It might be good to set this number in a config option, default to 102400 and allow for configuration in individual config files.

>diff --git a/scripts/vcs-sync/vcs_sync.py b/scripts/vcs-sync/vcs_sync.py
>+    def _process_locale(self, locale, type, config, l10n_remote_targets, name, l10n_repos):
>+        """ This contains the common processing that we do on both gecko_config
>+            and gaia_config for a given locale.
>+            """
>+        replace_dict = {'locale': locale}
>+        new_targets = deepcopy(config.get('targets', {}))
>+        for target in new_targets:
>+            dest = target['target_dest']
>+            if dest.find('%(locale)s') >= 0:

|if '%(locale)s' in dest:| ?

>+                new_dest = dest % replace_dict
>+                target['target_dest'] = new_dest
>+                remote_target = l10n_remote_targets.get(new_dest)
>+                if remote_target is None:  # generate target if not seen before
>+                    possible_remote_target = l10n_remote_targets.get(dest)
>+                    if possible_remote_target is not None:  # might be local target

Some of this makes me a little uneasy... the comments make it seem like guesswork.  The fact that we can use --list-repos to show what the end result is definitely helps.

>+                        remote_target = deepcopy(possible_remote_target)
>+                        remote_repo = remote_target.get('repo')
>+                        if remote_repo.find('%(locale)s') >= 0:

if '%(locale)s' in remote_repo:

>@@ -331,31 +319,35 @@ intree=1
>                 args=(hg + ['clone', '--noupdate', repo_config['repo'],
>                       source_dest], ),
>                 kwargs={
>                     'output_timeout': 15 * 60,
>                     'cwd': dirs['abs_work_dir'],
>                     'error_list': HgErrorList,
>                 },
>             ):
>                 if retry:
>                     return self._update_stage_repo(
>                         repo_config, retry=False, clobber=True)
>                 else:
>                     # Don't leave a failed clone behind
>                     self.rmtree(source_dest)
>                     self._update_repo_previous_status(repo_name, successful_flag=False, write_update=True)
>-                    self.fatal("Can't clone %s!" % repo_config['repo'])
>+                    self.add_failure(
>+                        repo_name,
>+                        message="Can't clone %s!" % repo_config['repo'],
>+                        level=ERROR,
>+                    )

This is probably a good change, but have you verified that all downstream actions check for failure before continuing?
The query_failure() in update_work_mirror() comes too late now, f/e; it expects the update_stage_mirror() call to exit successfully.  Every action that requires a cloned stage mirror (all of them) should now check for failure before continuing.

>-                self.fatal("Can't pull %s!" % repo_config['repo'])
>+                self.add_failure(
>+                    repo_name,
>+                    message="Can't pull %s!" % repo_config['repo'],
>+                    level=ERROR,
>+                )

Here too.

>@@ -441,31 +437,33 @@ intree=1
>             else:
>                 target_name = target_config['target_dest']
>-                remote_config = self.config.get('remote_targets', {}).get(target_name, target_config)
>+                if not self.remote_targets:
>+                    self.remote_targets = self.config.get('remote_targets', {})

I think you need to initialize self.remote_targets = None at the beginning (either above __init__ or inside __init__), and change this to |if self.remote_targets is None:|.  |if not self.remote_targets:| will resolve to True for self.remote_targets == {}, so you'll continue to assign self.remote_targets = {} over and over.

>+                    if tag_list is not None:
>+                        for tag_line in tag_list.splitlines():
>+                            if not tag_line:
>                                 continue
>+                            tag_parts = tag_line.split()
>+                            if not tag_parts:
>+                                self.warning("Bogus tag_line? %s" % str(tag_line))
>+                                continue

Hm, there's a possibility we'll lose this warning.  How common do you think it will be?  If we self.error() it'll show up in the email, at least.  self.error() ?

>                 self.mkdir_p(os.path.dirname(dest))
>                 self.run_command(hg + ['clone', '--noupdate', source, dest],
>                                  error_list=HgErrorList,
>-                                 halt_on_failure=True)
>-                self.write_hggit_hgrc(dest)
>-                self.init_git_repo('%s/.git' % dest, additional_args=['--bare'])
>-                self.run_command(
>+                                 halt_on_failure=False)
>+                if os.path.exists(dest):
>+                    self.write_hggit_hgrc(dest)
>+                    self.init_git_repo('%s/.git' % dest, additional_args=['--bare'])
>+                    self.run_command(
>                     git + ['--git-dir', '%s/.git' % dest, 'config', 'gc.auto', '0'],
>-                )
>+                    )
>+                else:
>+                    self.add_failure(
>+                        repo_name,
>+                        message="Failed to clone %s!" % source,
>+                        level=ERROR,
>+                    )
>+                    continue

I very much like fatal()ing here because this is something I want to have dealt with immediately ("Failed conversion" email) and it doesn't happen very often.  Are there significant benefits to changing this behavior?
Attachment #8451180 - Flags: review?(aki) → review-
Comment on attachment 8451182 [details] [diff] [review]
build-repos patch

Making the staging and non-staging configs easily diffable ++
Attachment #8451182 - Flags: review?(aki) → review+
Thanks Aki. I'll go through all your review comments very carefully.

Please note I'm on build duty this week, but will try to slip it in between problems (although as you heard in our meeting today - we have real problems with pending queues at the moment, so build duty is fairly busy).

Another question I wanted to call out explicitly, can you confirm that I don't need to worry about gaia branches 'v1_0_1' and 'v1-train' for b2g l10n?

Thanks,
Pete
Flags: needinfo?(aki)
Status: NEW → ASSIGNED
(In reply to Pete Moore [:pete][:pmoore] from comment #22)
> Another question I wanted to call out explicitly, can you confirm that I
> don't need to worry about gaia branches 'v1_0_1' and 'v1-train' for b2g l10n?

I think for the gaia repos, we'll keep the existing gaia l10n repos on git.m.o and push the new vcs-sync conversions on top, so we'll have existing branches for pre-v1.2 that we can ignore.  So for your question, you don't need to worry about those gaia branches.

For gecko, I think we've borked the merge day stuff enough that we need to actually remove the git.m.o gecko l10n repos and push new.


I also missed this piece: when you land, we need to get these sections of the merge day docs updated (if we end up changing the config file format):

https://wiki.mozilla.org/ReleaseEngineering/Merge_Duty/Steps#Add_new_Gaia_and_l10n_vcs-sync
https://wiki.mozilla.org/ReleaseEngineering/Merge_Duty/Steps#Update_l10n_vcs-sync
Flags: needinfo?(aki)
(In reply to Aki Sasaki [:aki] from comment #23)
> For gecko, I think we've borked the merge day stuff enough that we need to
> actually remove the git.m.o gecko l10n repos and push new.

I'm not sure this is worth the effort until the L10N story solidifies for gecko. If the current situation was significant, we would have had a bug filed by now.
It's going to create different branch names on the repos.  I think we should do it.
Having said that, we can try overlaying and see if we actually hit non-ffwd issues.  Then the issue would be stub v1.3/v1.2 branches on a few repos that may or may not be lying as to what they are.  The real issue is if the currently existing master/mozilla-beta branches aren't populated from the right place, which will probably incur non-ffwd pushes when we cut over.
(In reply to Aki Sasaki [:aki] from comment #20)
> Comment on attachment 8451180 [details] [diff] [review]
> l10n patch
> 
> r-ing because I'd like to see responses (at the very least) to the concerns
> below, if not code changes.
> 
> >diff --git a/configs/vcs_sync/l10n.py b/configs/vcs_sync/l10n.py
> <snip>
> >+        "generate_git_notes": True, # False by default
> 
> Don't you want to set this to False since these aren't handled properly yet?
> 

Yes - I discovered bug 1034725 (20 minutes) after submitting this review request, so didn't realise at the time. Will change it to False. Nice catch.

> >+GAIA_CONFIG = {}
> >+# for branch_name in ('master', 'v1_0_1', 'v1-train', 'v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
> >+for branch_name in ('v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
> >+    branch_name_with_dots=branch_name.replace('_', '.')
> >+    GAIA_CONFIG[branch_name] = {
> >+        'locales_file_url': 'https://raw.github.com/mozilla-b2g/gaia/' + branch_name_with_dots + '/locales/languages_all.json',
> >+        'hg_url': 'https://hg.mozilla.org/releases/gaia-l10n/' + branch_name + '/%(locale)s',
> >+        'git_branch_name': branch_name_with_dots,
> >+        'targets': [{
> >+            "target_dest": "gitmo-gaia-l10n-%(locale)s",
> >+        }, {
> >+            'target_dest': 'releases-l10n-%(locale)s-gaia/.git',
> >+            'vcs': 'git',
> >+            'test_push': True,
> >+        }],
> 
> The main purpose for a test_push is to make sure the push is valid before
> pushing over the network.  Let's switch the order of these two.
> 
> >+        "generate_git_notes": True, # False by default
> 
> Same here.

Nice spot! Will do.

> We could have a GENERATE_GIT_NOTES flag set at the top, so we can turn these
> all on with a one-line patch once we're ready, but I think if we're having
> issues with git notes we shouldn't be turning them on for important
> production repos.  (build/* mirrors in git isn't critical for production
> afaik)

It works for repos where each target repo has only one source repo - so for build/* repos, that is fine. It is only when multiple source repos get merged into the same target repo, so I think having individual flags per config should allow us to handle it on a case-by-case basis, still leaving the build repos git notes working (if that is ok with you). Otherwise I can introduce a top-level flag. 

> 
> >+# now fix up nonconformant entries
> >+GAIA_CONFIG['master']['hg_url'] = 'https://hg.mozilla.org/gaia-l10n/%(locale)s'
> >+# GAIA_CONFIG['v1-train']['hg_url'] = 'https://hg.mozilla.org/releases/gaia-l10n/v1_1/%(locale)s'
> >+# GAIA_CONFIG['v1-train']['git_branch_name'] = 'v1.1'
> >+GAIA_CONFIG['v1_3']['locales_file_url'] = 'https://raw.github.com/mozilla-b2g/gaia/v1.3/locales/languages_dev.json'
> >+# MERGE DAY remove the line below on 2014-07-21 (so it picks up default value from for-loop above)
> >+GAIA_CONFIG['v2_0']['hg_url'] = 'https://hg.mozilla.org/gaia-l10n/%(locale)s'
> also these:
> >+# now fix up nonconformant entries
> >+GECKO_CONFIG_TEMPLATE['mozilla-central']['locales_file_url'] = 'https://hg.mozilla.org/mozilla-central/raw-file/default/b2g/locales/all-locales'
> >+GECKO_CONFIG_TEMPLATE['mozilla-central']['hg_url'] = 'https://hg.mozilla.org/l10n-central/%(locale)s'
> 
> I actually vastly prefer verbose and explicit, because it's very clear what
> you're editing when it's a critical production config file.  In a
> for-loop-with-edits, it requires you to read the whole thing and the edits
> before you understand what's going on in a specific place.  Explicit >
> implicit.  We don't edit this file much (once every 6-12 weeks), so
> optimizing writes over reads is questionable at best.  I'd rather optimize
> for debugging what's going on if something goes wrong, when you have much
> less time and headspace than when you're adding a new version to the file. 
> There's a reason I chose to define it the way I did.

ok, i'll expland it again.

> >     "find_links": [
> >-        "http://pypi.pvt.build.mozilla.org/pub",
> >-        "http://pypi.pub.build.mozilla.org/pub",
> >+        "http://pypi.pub.build.mozilla.org/pub"
> >     ],
> 
> Is there a reason you removed the pvt link? Afaik we should use it for
> internal infra.
> 

From Dustin (:dustin):

17:00:48 pmoore: they're the same vhost, just different IPs
17:00:55 pmoore: since everything else uses both, you should probably use both as well
17:01:03 but at this point, the reason for using both is kind of moot

So I'll add back in.

> >     "upload_config": [{
> >-        "ssh_key": "~/.ssh/id_rsa",
> >+        "ssh_key": "~/.ssh/vcs-sync_rsa",
> >         "ssh_user": "asasaki",
> >-        "remote_host": "github-sync2",
> >-        "remote_path": "/home/asasaki/upload/l10n",
> >+        "remote_host": "people.mozilla.org",
> >+        "remote_path": "/home/asasaki/public_html/vcs2vcs/l10n",
> 
> I see http://people.mozilla.org/~asasaki/vcs2vcs/l10n/logs/ but it looks
> empty?
> 

Yes, I just created it in preparation - no production runs yet, so it is empty.

> >+    "default_actions": [
> >+        'list-repos',
> >+        'create-virtualenv',
> >+        'update-stage-mirror',
> >+        'update-work-mirror',
> >+        'create-git-notes',
> 
> Same here.

Will remove create-git-notes.

> 
> >diff --git a/configs/vcs_sync/staging_l10n.py b/configs/vcs_sync/staging_l10n.py
> >+for repo in ('mozilla-release', 'mozilla-beta', 'mozilla-aurora', 'mozilla-central'):
> >+    GECKO_CONFIG_TEMPLATE[repo] = {
> >+        'locales_file_url': 'https://hg.mozilla.org/releases/%s/raw-file/default/b2g/locales/all-locales' % repo,
> >+        'hg_url': 'https://hg.mozilla.org/releases/l10n/mozilla-release/%(locale)s',
> >+        'targets': [{
> >+            "target_dest": "gitmo-gecko-l10n-%(locale)s",
> >+        }, {
> >+            'target_dest': 'releases-l10n-%(locale)s-gecko/.git',
> >+            'vcs': 'git',
> >+            'test_push': True,
> >+        }],
> >+        'tag_config': {
> >+            'tag_regexes': [
> >+                '^B2G_',
> >+            ],
> >+        },
> >+        'mapper': {
> >+            "url": "https://api-pub-build.allizom.org/mapper",
> >+            "project": "gitmo-gecko-l10n",
> >+        },
> >+        "generate_git_notes": True, # False by default
> >+    }
> 
> This looks largely the same except for the mapper url.
> Do we need to be able to specify different mapper urls for different repos,
> or is one global mapper url fine?  If the latter, you can just override that
> config item and not worry about redefining the whole GECKO_CONFIG_TEMPLATE
> (e.g., |script -c production_config -c staging_config ...|; the
> staging_config can just contain the needs-overriding options).  However,
> it's fine to have a duplicate, especially if there's a risk of staging
> contaminating production if we're not careful.  With staging configs it's
> always a balancing act between not contaminating production and not falling
> so out of date with the production configs that you're testing an obsolete
> configuration.  Minimizing the touch points if you do a straight |diff| is
> one thing that can help.

Yes, I agree that there could be a risk of pollution, or forgetting to override new options as they get added, if we inherit production config and only overwrite parts. Maybe if production inherited from staging it would be safer, but then again, we could end up updating staging config and forgetting production pointed to it - so both ways are potentially dangerous. If you're ok with it then, I think I'll leave them as separate files, but try to keep them ordered and indented identically, so that a diff shows a genuine diff, and not any other noise.

> 
> >+for branch_name in ('v1_2', 'v1_3', 'v1_4', 'v2_0', 'master'):
> >+    branch_name_with_dots=branch_name.replace('_', '.')
> >+    GAIA_CONFIG[branch_name] = {
> >+        'locales_file_url': 'https://raw.github.com/mozilla-b2g/gaia/' + branch_name_with_dots + '/locales/languages_all.json',
> >+        'hg_url': 'https://hg.mozilla.org/releases/gaia-l10n/' + branch_name + '/%(locale)s',
> >+        'git_branch_name': branch_name_with_dots,
> >+        'targets': [{
> >+            "target_dest": "gitmo-gaia-l10n-%(locale)s",
> >+        }, {
> >+            'target_dest': 'releases-l10n-%(locale)s-gaia/.git',
> >+            'vcs': 'git',
> >+            'test_push': True,
> >+        }],
> >+        'tag_config': {
> >+            'tag_regexes': [
> >+            '^B2G_',
> >+        ]},
> >+        'mapper': {
> >+            "url": "https://api-pub-build.allizom.org/mapper",
> >+            "project": "gitmo-gaia-l10n",
> >+        },
> >+        "generate_git_notes": True, # False by default
> >+    }
> 
> Same here.

OK will turn of generate_git_notes there too, and leave mapper in there.

> 
> >diff --git a/mozharness/base/vcs/vcssync.py b/mozharness/base/vcs/vcssync.py
> >index 2f19f68..98e0ef7 100644
> >--- a/mozharness/base/vcs/vcssync.py
> >+++ b/mozharness/base/vcs/vcssync.py
> >@@ -35,53 +35,53 @@ class VCSSyncScript(VCSScript):
> >         if fatal:
> >             subject = "[vcs2vcs] Failed conversion for %s" % job_name
> >             text = ''
> >-            if len(message) > 10240:
> >-                text += '*** Message below has been truncated: it was %s characters, and has been reduced to 10240 characters:\n\n' % len(message)
> >-            text += message[0:10240] + '\n\n' # limit message to 10KB in size (large emails fail to send)
> >+            if len(message) > 102400:
> >+                text += '*** Message below has been truncated: it was %s characters, and has been reduced to 102400 characters:\n\n' % len(message)
> >+            text += message[0:102400] + '\n\n' # limit message to 100KB in size (large emails fail to send)
> 
> It might be good to set this number in a config option, default to 102400
> and allow for configuration in individual config files.

OK, I'll add this.

> 
> >diff --git a/scripts/vcs-sync/vcs_sync.py b/scripts/vcs-sync/vcs_sync.py
> >+    def _process_locale(self, locale, type, config, l10n_remote_targets, name, l10n_repos):
> >+        """ This contains the common processing that we do on both gecko_config
> >+            and gaia_config for a given locale.
> >+            """
> >+        replace_dict = {'locale': locale}
> >+        new_targets = deepcopy(config.get('targets', {}))
> >+        for target in new_targets:
> >+            dest = target['target_dest']
> >+            if dest.find('%(locale)s') >= 0:
> 
> |if '%(locale)s' in dest:| ?

ok

> 
> >+                new_dest = dest % replace_dict
> >+                target['target_dest'] = new_dest
> >+                remote_target = l10n_remote_targets.get(new_dest)
> >+                if remote_target is None:  # generate target if not seen before
> >+                    possible_remote_target = l10n_remote_targets.get(dest)
> >+                    if possible_remote_target is not None:  # might be local target
> 
> Some of this makes me a little uneasy... the comments make it seem like
> guesswork.  The fact that we can use --list-repos to show what the end
> result is definitely helps.

So we build the actual targets from the template as we go, e.g. in the config we get passed we have:

"target_dest": "gitmo-gaia-l10n-%(locale)s" (local test repo)
'target_dest': 'releases-l10n-%(locale)s-gaia/.git' (remote actual repo)

So a target_dest is either the name of a remote_target (which has its own config) or is the name of a local directory (this is not something I've added, but was already like this). We test if it is local or remote, based on this. In either case, we have to switch out the template name (including '%(locale)s') with the actual name including the locale - and we generate this part of the config on-the-fly - i.e. substitute in the %(locale)s with the real value, and if it is a name we haven't seen before, we generate it, and add it to the l10n config. I'll add some comments to explain this. Basically, we can't update the self.config on-the-fly since it is read only (understandably) so we have to build our own l10n config on the fly, as we "discover" locales that need to be processed. This is why we switch out the placeholder %(locale)s in all the places it exists, with the actual locale, and add it if we haven't already got it (i.e. something like a "lazy generator").

> 
> >+                        remote_target = deepcopy(possible_remote_target)
> >+                        remote_repo = remote_target.get('repo')
> >+                        if remote_repo.find('%(locale)s') >= 0:
> 
> if '%(locale)s' in remote_repo:

ok

> 
> >@@ -331,31 +319,35 @@ intree=1
> >                 args=(hg + ['clone', '--noupdate', repo_config['repo'],
> >                       source_dest], ),
> >                 kwargs={
> >                     'output_timeout': 15 * 60,
> >                     'cwd': dirs['abs_work_dir'],
> >                     'error_list': HgErrorList,
> >                 },
> >             ):
> >                 if retry:
> >                     return self._update_stage_repo(
> >                         repo_config, retry=False, clobber=True)
> >                 else:
> >                     # Don't leave a failed clone behind
> >                     self.rmtree(source_dest)
> >                     self._update_repo_previous_status(repo_name, successful_flag=False, write_update=True)
> >-                    self.fatal("Can't clone %s!" % repo_config['repo'])
> >+                    self.add_failure(
> >+                        repo_name,
> >+                        message="Can't clone %s!" % repo_config['repo'],
> >+                        level=ERROR,
> >+                    )
> 
> This is probably a good change, but have you verified that all downstream
> actions check for failure before continuing?
> The query_failure() in update_work_mirror() comes too late now, f/e; it
> expects the update_stage_mirror() call to exit successfully.  Every action
> that requires a cloned stage mirror (all of them) should now check for
> failure before continuing.
> 

I will check this.

> >-                self.fatal("Can't pull %s!" % repo_config['repo'])
> >+                self.add_failure(
> >+                    repo_name,
> >+                    message="Can't pull %s!" % repo_config['repo'],
> >+                    level=ERROR,
> >+                )
> 
> Here too.

ok - will do the same.

> 
> >@@ -441,31 +437,33 @@ intree=1
> >             else:
> >                 target_name = target_config['target_dest']
> >-                remote_config = self.config.get('remote_targets', {}).get(target_name, target_config)
> >+                if not self.remote_targets:
> >+                    self.remote_targets = self.config.get('remote_targets', {})
> 
> I think you need to initialize self.remote_targets = None at the beginning
> (either above __init__ or inside __init__), and change this to |if
> self.remote_targets is None:|.  |if not self.remote_targets:| will resolve
> to True for self.remote_targets == {}, so you'll continue to assign
> self.remote_targets = {} over and over.
> 
> >+                    if tag_list is not None:
> >+                        for tag_line in tag_list.splitlines():
> >+                            if not tag_line:
> >                                 continue
> >+                            tag_parts = tag_line.split()
> >+                            if not tag_parts:
> >+                                self.warning("Bogus tag_line? %s" % str(tag_line))
> >+                                continue
> 
> Hm, there's a possibility we'll lose this warning.  How common do you think
> it will be?  If we self.error() it'll show up in the email, at least. 
> self.error() ?

ok i'll change from self.warning() to self.error(). Not sure how common it will be but can let you know after testing.

> 
> >                 self.mkdir_p(os.path.dirname(dest))
> >                 self.run_command(hg + ['clone', '--noupdate', source, dest],
> >                                  error_list=HgErrorList,
> >-                                 halt_on_failure=True)
> >-                self.write_hggit_hgrc(dest)
> >-                self.init_git_repo('%s/.git' % dest, additional_args=['--bare'])
> >-                self.run_command(
> >+                                 halt_on_failure=False)
> >+                if os.path.exists(dest):
> >+                    self.write_hggit_hgrc(dest)
> >+                    self.init_git_repo('%s/.git' % dest, additional_args=['--bare'])
> >+                    self.run_command(
> >                     git + ['--git-dir', '%s/.git' % dest, 'config', 'gc.auto', '0'],
> >-                )
> >+                    )
> >+                else:
> >+                    self.add_failure(
> >+                        repo_name,
> >+                        message="Failed to clone %s!" % source,
> >+                        level=ERROR,
> >+                    )
> >+                    continue
> 
> I very much like fatal()ing here because this is something I want to have
> dealt with immediately ("Failed conversion" email) and it doesn't happen
> very often.  Are there significant benefits to changing this behavior?

I think I did this because sometimes there is an issue with a single locale which causes the whole process to stop, so other locales don't get mirrored. For example a locale is listed that doesn't really exist. We should still get an email to warn of the problem, and I can change this maybe to be more obvious that it is not just an intermittent problem that can be ignored, but a serious flaw that needs to be addressed. I have a separate bug to refine the email notifications where I could work on this. I think the important thing is that if one locale fails the others still should run, and I can address the "visibility" of the failure in the email management. What do you think?

Thanks,
Pete
Just saw I missed this one:

> I think you need to initialize self.remote_targets = None at the beginning
> (either above __init__ or inside __init__), and change this to |if
> self.remote_targets is None:|.  |if not self.remote_targets:| will resolve
> to True for self.remote_targets == {}, so you'll continue to assign
> self.remote_targets = {} over and over.

Thanks Aki! You highlighted a novice python misunderstanding I had - I had wrongly assumed that if self.remote_targets had not been defined, I could use if not self.remote_targets - but I see now after testing this actually causes a runtime error - I will fix up. I hadn't picked it up in testing because I hadn't tested e.g. build/* repos again with all the code changes I made in this bug. So an additional thing I will take from this, is to test both staging_build-repos.py config and staging_l10n.py configs before submitting my next patch review - I believe I would have picked up this problem if I had additionally retested the build/* repos since it would not have had self.remote_targets already defined.
(In reply to Pete Moore [:pete][:pmoore] from comment #28)
> Just saw I missed this one:
> 
> > I think you need to initialize self.remote_targets = None at the beginning
> > (either above __init__ or inside __init__), and change this to |if
> > self.remote_targets is None:|.  |if not self.remote_targets:| will resolve
> > to True for self.remote_targets == {}, so you'll continue to assign
> > self.remote_targets = {} over and over.
> 
> Thanks Aki! You highlighted a novice python misunderstanding I had - I had
> wrongly assumed that if self.remote_targets had not been defined, I could
> use if not self.remote_targets - but I see now after testing this actually
> causes a runtime error - I will fix up.

Yeah, if you want to reference self.VARIABLE without knowing if it exists you probably want hasattr
https://duckduckgo.com/?q=python+hasattr
http://stackoverflow.com/questions/903130/hasattr-vs-try-except-block-to-deal-with-non-existent-attributes

> I hadn't picked it up in testing
> because I hadn't tested e.g. build/* repos again with all the code changes I
> made in this bug. So an additional thing I will take from this, is to test
> both staging_build-repos.py config and staging_l10n.py configs before
> submitting my next patch review - I believe I would have picked up this
> problem if I had additionally retested the build/* repos since it would not
> have had self.remote_targets already defined.

Hopefully these are quick tests! Not trying to make you do a lot of work over.
(In reply to Pete Moore [:pete][:pmoore] from comment #27)
> > We could have a GENERATE_GIT_NOTES flag set at the top, so we can turn these
> > all on with a one-line patch once we're ready, but I think if we're having
> > issues with git notes we shouldn't be turning them on for important
> > production repos.  (build/* mirrors in git isn't critical for production
> > afaik)
>
> It works for repos where each target repo has only one source repo - so for
> build/* repos, that is fine. It is only when multiple source repos get
> merged into the same target repo, so I think having individual flags per
> config should allow us to handle it on a case-by-case basis, still leaving
> the build repos git notes working (if that is ok with you). Otherwise I can
> introduce a top-level flag.

Sure.  If you foresee needing to enable/disable per repo, it makes sense to specify each one.

> > I see http://people.mozilla.org/~asasaki/vcs2vcs/l10n/logs/ but it looks
> > empty?
> >
>
> Yes, I just created it in preparation - no production runs yet, so it is
> empty.

Cool.  I had some thoughts about this, maybe we can talk a little about it soon.

> > This looks largely the same except for the mapper url.
> > Do we need to be able to specify different mapper urls for different repos,
> > or is one global mapper url fine?  If the latter, you can just override that
> > config item and not worry about redefining the whole GECKO_CONFIG_TEMPLATE
> > (e.g., |script -c production_config -c staging_config ...|; the
> > staging_config can just contain the needs-overriding options).  However,
> > it's fine to have a duplicate, especially if there's a risk of staging
> > contaminating production if we're not careful.  With staging configs it's
> > always a balancing act between not contaminating production and not falling
> > so out of date with the production configs that you're testing an obsolete
> > configuration.  Minimizing the touch points if you do a straight |diff| is
> > one thing that can help.
>
> Yes, I agree that there could be a risk of pollution, or forgetting to
> override new options as they get added, if we inherit production config and
> only overwrite parts. Maybe if production inherited from staging it would be
> safer, but then again, we could end up updating staging config and
> forgetting production pointed to it - so both ways are potentially
> dangerous. If you're ok with it then, I think I'll leave them as separate
> files, but try to keep them ordered and indented identically, so that a diff
> shows a genuine diff, and not any other noise.

I think this is a good option.

> > It might be good to set this number in a config option, default to 102400
> > and allow for configuration in individual config files.
>
> OK, I'll add this.

Scope creep: if we have that 102400 (or whatever number) in configs, it might be best to have the entire message size reduced to that number.
If that's hard to do because of headers or whatever, ignore me.  If it's not, from a user perspective that seems more intuitive.

> > >+                new_dest = dest % replace_dict
> > >+                target['target_dest'] = new_dest
> > >+                remote_target = l10n_remote_targets.get(new_dest)
> > >+                if remote_target is None:  # generate target if not seen before
> > >+                    possible_remote_target = l10n_remote_targets.get(dest)
> > >+                    if possible_remote_target is not None:  # might be local target
> >
> > Some of this makes me a little uneasy... the comments make it seem like
> > guesswork.  The fact that we can use --list-repos to show what the end
> > result is definitely helps.
>
> So we build the actual targets from the template as we go, e.g. in the
> config we get passed we have:
>
> "target_dest": "gitmo-gaia-l10n-%(locale)s" (local test repo)
> 'target_dest': 'releases-l10n-%(locale)s-gaia/.git' (remote actual repo)
>
> So a target_dest is either the name of a remote_target (which has its own
> config) or is the name of a local directory (this is not something I've
> added, but was already like this). We test if it is local or remote, based
> on this. In either case, we have to switch out the template name (including
> '%(locale)s') with the actual name including the locale - and we generate
> this part of the config on-the-fly - i.e. substitute in the %(locale)s with
> the real value, and if it is a name we haven't seen before, we generate it,
> and add it to the l10n config. I'll add some comments to explain this.
> Basically, we can't update the self.config on-the-fly since it is read only
> (understandably) so we have to build our own l10n config on the fly, as we
> "discover" locales that need to be processed. This is why we switch out the
> placeholder %(locale)s in all the places it exists, with the actual locale,
> and add it if we haven't already got it (i.e. something like a "lazy
> generator").

Right.  I meant the "might be" in the comment made me uneasy, since it sounded like you were completely guessing.
Sounds like you know what you're doing, though.

> > >+                    self.add_failure(
> > >+                        repo_name,
> > >+                        message="Can't clone %s!" % repo_config['repo'],
> > >+                        level=ERROR,
> > >+                    )
> >
> > This is probably a good change, but have you verified that all downstream
> > actions check for failure before continuing?
> > The query_failure() in update_work_mirror() comes too late now, f/e; it
> > expects the update_stage_mirror() call to exit successfully.  Every action
> > that requires a cloned stage mirror (all of them) should now check for
> > failure before continuing.
> >
>
> I will check this.

Thanks.  Sorry for the extra work.

> > I very much like fatal()ing here because this is something I want to have
> > dealt with immediately ("Failed conversion" email) and it doesn't happen
> > very often.  Are there significant benefits to changing this behavior?
>
> I think I did this because sometimes there is an issue with a single locale
> which causes the whole process to stop, so other locales don't get mirrored.
> For example a locale is listed that doesn't really exist. We should still
> get an email to warn of the problem, and I can change this maybe to be more
> obvious that it is not just an intermittent problem that can be ignored, but
> a serious flaw that needs to be addressed. I have a separate bug to refine
> the email notifications where I could work on this. I think the important
> thing is that if one locale fails the others still should run, and I can
> address the "visibility" of the failure in the email management. What do you
> think?

I think my main concern is visibility of failures, and if you're both

a) not continuing any steps for that failed repo/locale, and
b) making sure that failure is visibly non-intermittent and needing prompt attention,

then I agree that this is a change for the better.

Thanks Pete!
Hope to have this out in the next couple of hours...
First set of changes is done - everything except checking the self.add_failure() vs self.fatal() execution paths - to make sure that downstream actions can handle earlier actions failing.

I'm pretty sure these are working, because I introduced them when there were problems with individual locales missing on certain branches - but I still want to double check, of course, and also make sure the email messages give us clear information about the problems.

Changes so far:
https://github.com/petemoore/build-mozharness/commit/997ce23f6617b376e5f88389d1686494dd8d800b
You might be able to test by giving it an arbitrary list of locales, at least one of which is a bogus locale name.  The shorter the list, the faster the test.
This patch is a complete patch of all changes. I'll attach an interdiff too...

Thanks,
Pete
Attachment #8451180 - Attachment is obsolete: true
Attachment #8451182 - Attachment is obsolete: true
Attachment #8459896 - Flags: review?(aki)
Some words of explanation:

I created self.query_all_non_failed_repos() to return the same as self.query_all_repos() but filtering out the ones that have failures recorded against them. Then in all actions that loop through the repos (except --list-repos) I now only loop through the non-failing ones. This is an extra safety measure, so that if a repo fails on an earlier action, it is not processed in a later action.

I also changed the logging in some places to be more explicit - for example if a git push fails, I log what the url is that we are pushing to, not just the repo name. There are a couple of tweaks like this.

I'm also landing a patch for bug 939817 to tweak email notifications so that any unexpected problems or failures of any kind result in a failure email.

Thanks,
Pete
Here I've attached an email with failures from my dev test environment (which includes the latest patch in this bug, plus the patch for bug 939817).

In my dev environment, I have *intentionally* not created a repo here:
https://github.com/petermoore/l10n-en-GB-gaia to test what happens when a target repo is missing.

The four other problems reported in the email are genuine: *at the time of writing* the following two repos do not have a default revision (because there have been no pushes to them - they are completely empty):

  * https://hg.mozilla.org/gaia-l10n/af
  * https://hg.mozilla.org/gaia-l10n/zu

These are both referenced in both 2.0 and gaia master, which results in the 4 additional failures described in the email.
(In reply to Pete Moore [:pete][:pmoore] from comment #36)
> Created attachment 8459909 [details]
> Example failure email (from my dev testing environment)
> 
> Here I've attached an email with failures from my dev test environment
> (which includes the latest patch in this bug, plus the patch for bug 939817).
> 
> In my dev environment, I have *intentionally* not created a repo here:
> https://github.com/petermoore/l10n-en-GB-gaia to test what happens when a
> target repo is missing.
> 
> The four other problems reported in the email are genuine: *at the time of
> writing* the following two repos do not have a default revision (because
> there have been no pushes to them - they are completely empty):
> 
>   * https://hg.mozilla.org/gaia-l10n/af
>   * https://hg.mozilla.org/gaia-l10n/zu
> 
> These are both referenced in both 2.0 and gaia master, which results in the
> 4 additional failures described in the email.

Yeah.  I had to add IGNORE_LOCALES to my braindump script for this: http://hg.mozilla.org/build/braindump/rev/7908431e686d

I think if it continues, we may have to remove the locales from languages_all and let people know.
Comment on attachment 8459896 [details] [diff] [review]
bug962863_mozharness_v2.patch

Thanks Pete!  I didn't intend for you to pull an insane all nighter or anything, but I'm not going to waste it.

* This patch doesn't apply to tip-of-mozharness, but it worked if i

hg up -r 042ee7510b2f
hg import --no-commit PATCH
hg up -r default

which means http://hg.mozilla.org/build/mozharness/rev/84b3c50018f7 conflicts but merges ok.

* I like the email.

>diff --git a/configs/vcs_sync/l10n.py b/configs/vcs_sync/l10n.py
<snip>

I think this was the file with conflicts -- please be careful with the merge.

>     "upload_config": [{
>-        "ssh_key": "~/.ssh/id_rsa",
>+        "ssh_key": "~/.ssh/vcs-sync_rsa",
>         "ssh_user": "asasaki",
>-        "remote_host": "github-sync2",
>-        "remote_path": "/home/asasaki/upload/l10n",
>+        "remote_host": "people.mozilla.org",
>+        "remote_path": "/home/asasaki/public_html/vcs2vcs/l10n",
>     }],

Works for now, but may stop working after Friday.

>diff --git a/configs/vcs_sync/staging_build-repos.py b/configs/vcs_sync/staging_build-repos.py

|diff configs/vcs_sync/{staging_,}build-repos.py| and |diff configs/vcs_sync/{staging_,}l10n.py| both look sane; thanks Pete.

>diff --git a/mozharness/base/vcs/vcssync.py b/mozharness/base/vcs/vcssync.py
<snip>
>+        max_log_sample_size = c.get('email_max_log_sample_size', 102400) # default 100Kb

Awesome, ty

>diff --git a/scripts/vcs-sync/vcs_sync.py b/scripts/vcs-sync/vcs_sync.py
>old mode 100644
>new mode 100755

I approve :)

>+        [['--max-log-sample-size', ],
>+        {"action": "store",
>+         "dest": "email_max_log_sample_size",
>+         "type": "int",
>+         "help": "Specify the maximum number of characters from a log file to be "
>+                 "embedded in the email body, per embedding (not total - note we "
>+                 "embed two separate log samples into the email - so maximum "
>+                 "email body size can end up a little over 2x this amount).",
>+         }]

You can specify your default here, via

    "default": 102400,

Pretty much the same thing.  Both work.  Difference of syntax and location only.

>+        self.remote_targets = None

You're learning well!

>+    def _process_locale(self, locale, type, config, l10n_remote_targets, name, l10n_repos):

The interdiff was VERY useful here, thank you.

>     def query_all_non_failed_repos(self):
>+        all_repos = self.query_all_repos()
>+        return [repo for repo in all_repos if repo.get('repo_name') not in self.failures]

That's a mouthful :)

Can you add a docstring here?  Should be simple; the method name is largely self explanatory.
Attachment #8459896 - Flags: review?(aki) → review+
Thanks so much Aki for your quick turnaround on this one!

Final changes being tested in staging. If this run is successful, I will push to hg default. I'm doing a full clean run (blown away previous working dir) - could take about an hour.

Final changes here (new docstring, merge conflict resolved, merged l10n.py changes also applied to staging_l10n.py, and new location for default email_max_log_sample_size setting).

https://github.com/petemoore/build-mozharness/commit/ea8c2dcd2f101b1a356541c5f5794dcfa7a323b8

I'll merge to production branch tomorrow morning, in case there is fallout with e.g. the build/* repos or gecko-dev (beagle) etc. Then I'll keep my eye on it.
(In reply to Pete Moore [:pete][:pmoore] from comment #27)

> > >     "find_links": [
> > >-        "http://pypi.pvt.build.mozilla.org/pub",
> > >-        "http://pypi.pub.build.mozilla.org/pub",
> > >+        "http://pypi.pub.build.mozilla.org/pub"
> > >     ],
> > 
> > Is there a reason you removed the pvt link? Afaik we should use it for
> > internal infra.
> > 
> 
> From Dustin (:dustin):
> 
> 17:00:48 pmoore: they're the same vhost, just different IPs
> 17:00:55 pmoore: since everything else uses both, you should probably use
> both as well
> 17:01:03 but at this point, the reason for using both is kind of moot
> 
> So I'll add back in.

Mystery resolved - I now get:

Error running install of package, /home/pmoore/vcs_sync/build/venv/bin/pip install --download-cache /home/pmoore/vcs_sync/build/venv/cache --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub bottle==0.11.6!


15:51:07     INFO -  Downloading/unpacking bottle==0.11.6
15:51:07     INFO -    HTTP error 403 while getting http://pypi.pvt.build.mozilla.org/pub/bottle-0.11.6.tar.gz (from http://pypi.pvt.build.mozilla.org/pub/)
15:51:07     INFO -    Could not install requirement bottle==0.11.6 because of error HTTP Error 403: Forbidden
15:51:07     INFO -  Could not install requirement bottle==0.11.6 because of HTTP error HTTP Error 403: Forbidden for URL http://pypi.pvt.build.mozilla.org/pub/bottle-0.11.6.tar.gz (from http://pypi.pvt.build.mozilla.org/pub/)
15:51:07     INFO -  Storing complete log in /home/pmoore/.pip/pip.log
15:51:07    ERROR - Return code: 1
15:51:07    FATAL - Error running install of package, /home/pmoore/vcs_sync/build/venv/bin/pip install --download-cache /home/pmoore/vcs_sync/build/venv/cache --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub bottle==0.11.6!
15:51:07    FATAL - Running post_fatal callback...
15:51:07     INFO - Job took 4 seconds.
15:51:07     INFO - Getting output from command: ['egrep', '-C5', '^[0-9:]+ +(ERROR|CRITICAL|FATAL) -', '/home/pmoore/vcs_sync/logs/l10n_info.log']
15:51:07     INFO - Copy/paste: egrep -C5 "^[0-9:]+ +(ERROR|CRITICAL|FATAL) -" /home/pmoore/vcs_sync/logs/l10n_info.log

This must be the reason I removed it, but I had forgotten. Removing it again fixes this. Will patch up. Thankfully I was running a clean staging test with the previous virtualenv blown away, so spotted it in time. =)
(In reply to Hal Wine [:hwine] (use needinfo) from comment #24)
> (In reply to Aki Sasaki [:aki] from comment #23)
> > For gecko, I think we've borked the merge day stuff enough that we need to
> > actually remove the git.m.o gecko l10n repos and push new.
> 
> I'm not sure this is worth the effort until the L10N story solidifies for
> gecko. If the current situation was significant, we would have had a bug
> filed by now.

I'm leaning towards changing my mind on this one, or at least not pushing for it.
The gecko l10n repos have 'master' and sometimes 'mozilla-beta'.  The new l10n configs will introduce v2.0 type branches.  Those branches can likely coexist.  If we blow them away, we won't have any v1.0-v1.3? branches, and we might get questions.  (We might get questions anyway.)  Any discussion about whether the branch names should be mozilla-{central,aurora,beta} or vX.X should probably happen before Pete rolls this out ;-)
(In reply to Pete Moore [:pete][:pmoore] from comment #41)
> Error running install of package, /home/pmoore/vcs_sync/build/venv/bin/pip
> install --download-cache /home/pmoore/vcs_sync/build/venv/cache --no-index
> --find-links http://pypi.pvt.build.mozilla.org/pub --find-links
> http://pypi.pub.build.mozilla.org/pub bottle==0.11.6!
> 
> This must be the reason I removed it, but I had forgotten. Removing it again
> fixes this. Will patch up. Thankfully I was running a clean staging test
> with the previous virtualenv blown away, so spotted it in time. =)

Ah, I think that's gd[0-4] not having flows open to pvt.
If this ever gets moved to AWS, and if having both IPs available adds any sort of redundancy (sounds like maybe not in this configuration) we can add back in.
_generate_git_locale_manifest() queries mapper for l10n. http://hg.mozilla.org/build/mozharness/file/20d9dbaed82e/scripts/b2g_build.py#l466

_generate_locale_manifest() calls _generate_git_locale_manifest(). http://hg.mozilla.org/build/mozharness/file/20d9dbaed82e/scripts/b2g_build.py#l472

I don't think anything calls _generate_locale_manifest:

(mh)deathduck:/src/clean/mozharness [14:52:53] (default)
812$ grep -r generate_locale_manifest .
./scripts/b2g_build.py:    def _generate_locale_manifest(self, git_base_url="https://git.mozilla.org/release/"):
(mh)deathduck:/src/clean/mozharness [14:52:59] (default)
813$

So we need to fix that at some point, but the good thing is that we don't have anything relying on l10n mapper right now afaik.  Your rollout should be smoother.
(In reply to Aki Sasaki [:aki] from comment #44)
> (In reply to Pete Moore [:pete][:pmoore] from comment #41)
> > Error running install of package, /home/pmoore/vcs_sync/build/venv/bin/pip
> > install --download-cache /home/pmoore/vcs_sync/build/venv/cache --no-index
> > --find-links http://pypi.pvt.build.mozilla.org/pub --find-links
> > http://pypi.pub.build.mozilla.org/pub bottle==0.11.6!
> > 
> > This must be the reason I removed it, but I had forgotten. Removing it again
> > fixes this. Will patch up. Thankfully I was running a clean staging test
> > with the previous virtualenv blown away, so spotted it in time. =)
> 
> Ah, I think that's gd[0-4] not having flows open to pvt.
> If this ever gets moved to AWS, and if having both IPs available adds any
> sort of redundancy (sounds like maybe not in this configuration) we can add
> back in.

Dustin, what is your opinion? I think if we get HTTP 403's then it is able to hit http://pypi.pvt.build.mozilla.org/pub but is for some reason getting an authorization problem. Do I need to whitelist the source servers somehow? Or is https needed, or do we need to supply authorization credentials?

Thanks,
Pete
Flags: needinfo?(dustin)
I don't understand the full context here, and I don't see any hosts matching gd[0-4] in inventory, so I'm not sure where you're accessing this from.  But *.pvt.build.mozilla.org are only available from the releng network, so by the fact that you're getting 403's I suspect gd[0-4] are not on the releng network.  If that's the case, then you should not be using the *.pvt.* address.

HTTP 401 is authorization.  HTTP 403 is permission (as in, no amount of credentials will change it).
The full names are here: http://people.mozilla.org/~hwine/tmp/vcs2vcs/trouble_shooting.html?highlight=gd2#accessing-the-machines
e.g. gd4 	github-sync4.dmz.scl3.mozilla.com

I think the issue is the dmz can't hit pvt.
Agreed - DMZ's not in the releng network.
Thanks Dustin, Aki for the clarification.
I've written a test for comparing l10n tags and heads in staging vs production to sanity check the repos.

The report is here:
https://github.com/petemoore/myscrapbook/blob/master/compare_old_vs_new_l10n_vcs_sync.sh

I'm currently running a staging l10n vcs sync, and then will immediately run the report, to have prod and staging as similar as possible.

I'll update this bug with the output when it has run, which should show us any differences between the two.

That said, I would expect to see some differences due to e.g. bug 967282. Let's see what we get! :)
Flags: needinfo?(dustin)
On the whole it looks pretty good. In production we have a lot of refs/heads/default which we don't have in new vcs sync (because we only map default -> master, and don't have both default and master branches).

Once you pull those lines out:

cat staging_vs_prod_l10n_git.log | grep -v 'refs/heads/default'

then you don't have a lot of differences.

Other differences are where we are not pushing old branches, i.e. 1.2 is the oldest release we push for gaia. So anything before that would not get overwritten with the cutover, and those old branches would stay in place.

I do seem to be missing a couple of repos in staging - need to investigate why they don't pop up in the locale lists that I am using, e.g.:

https://github.com/petermoore/l10n-af-gaia

be gaia seems to be missing 1.4 branch in production

Another strange thing is there only seem to be 3 gecko repositories that got picked up in staging.

Do we need to update


GECKO_BRANCHES = {
    'v2.0': 'mozilla-beta',
    'v2.1': 'mozilla-central',
}

and add aurora and release back into here?


I'm posting this now, rather than completing a highly detailed deep dive myself, so that Aki at least has a chance to have a cursory glance over it before he leaves.

Thanks guys!
Pete
Attachment #8462851 - Flags: feedback?(hwine)
Attachment #8462851 - Flags: feedback?(escapewindow+mozbugs)
(In reply to Pete Moore [:pete][:pmoore] from comment #52)
> In production we have a lot of
> refs/heads/default which we don't have in new vcs sync (because we only map
> default -> master, and don't have both default and master branches).

Right, the default branch needs nuking.  You should coordinate with Hal on that.

> Other differences are where we are not pushing old branches, i.e. 1.2 is the
> oldest release we push for gaia. So anything before that would not get
> overwritten with the cutover, and those old branches would stay in place.

+1.

> I do seem to be missing a couple of repos in staging - need to investigate
> why they don't pop up in the locale lists that I am using, e.g.:
> 
> https://github.com/petermoore/l10n-af-gaia

af was just populated on the 23rd (af and zu were empty and you were having problems with them).
 
> be gaia seems to be missing 1.4 branch in production

Hm, that's in languages_all but not languages_dev: https://github.com/mozilla-b2g/gaia/blob/v1.4/locales/languages_dev.json

Seems ok to convert additional locales compared to legacy.

> Another strange thing is there only seem to be 3 gecko repositories that got
> picked up in staging.
> 
> Do we need to update
> 
> 
> GECKO_BRANCHES = {
>     'v2.0': 'mozilla-beta',
>     'v2.1': 'mozilla-central',
> }
> 
> and add aurora and release back into here?

Nope.  We're in the 6 weeks where there's no aurora, and there are no gecko release locales.
September 1, v2.1 will move to mozilla-aurora and central will become 2.2 unless they decide they want to skip forward versions again.
Comment on attachment 8462851 [details]
diffs between staging l10n git repos (new vcs sync) and production l10n git repos (legacy vcs sync)

Commented on the comment, not the attachment.
Attachment #8462851 - Flags: feedback?(escapewindow+mozbugs) → feedback+
Thanks so much Aki.
Attachment #8462851 - Attachment mime type: text/x-log → text/plain
Comment on attachment 8462851 [details]
diffs between staging l10n git repos (new vcs sync) and production l10n git repos (legacy vcs sync)

Hal and I went through this together when we met in California, and agreed that:

1) All of the differences were ok
2) I can deploy the new l10n vcs sync independently of shutting legacy l10n vcs sync off - so I will roll it out later today.

Pete
Attachment #8462851 - Flags: feedback?(hwine)
I've deployed under github-sync4.dmz.scl3.mozilla.com:/opt/vcs2vcs but not triggered yet, since I want to check whether I need to wait due to bug 1061861 - "Tracking bug for 13-oct-2014 migration work".

It is ready for a test run, but I'll wait until Rail comes online later today.
I've run manually a couple of times and ironed out the teething issues, and now placed in cron. I've also updated wiki: https://wiki.mozilla.org/ReleaseEngineering/VCSSync/HowTo#Machines

We should be able to disable legacy system in a day or two, when we are confident this is working fine.

Mappings are being published to mapper:
  * https://api.pub.build.mozilla.org/mapper/gitmo-gecko-l10n/mapfile/full
  * https://api.pub.build.mozilla.org/mapper/gitmo-gaia-l10n/mapfile/full

As a backup, the raw mapfiles are also being published to:
  * http://people.mozilla.org/~pmoore/vcs2vcs/l10n/
Hal, which consumers are still feeding from legacy vcs sync mappings?

Which steps are left for shutting off l10n gaia and gecko in legacy vcs sync?
Flags: needinfo?(hwine)
Assignee: pmoore → nobody
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1898]
(In reply to Pete Moore [:pete][:pmoore] from comment #59)
> Hal, which consumers are still feeding from legacy vcs sync mappings?

That's the joy of services -- you never know who is using them. :) 

Seeing comment 58, I realized there was a different understanding on what is being published to the current mapper. I added that information to the wiki (currently at https://wiki.mozilla.org/ReleaseEngineering/Applications/Mapper#Old_Mapper_Data_Feed).

As I see it, this bug only addresses 1 of 4 categories of mapper data provided by the old system. Whether those sources are currently used may be able to be determined from the mapper logs.

> Which steps are left for shutting off l10n gaia and gecko in legacy vcs sync?

Once the new system has taken up all the production load, it can be retired. From our last discussion, we agreed that the systems can run in parallel as the final go live step. (No errors should be reported, just one or the other system will end up having nothing to push.)

This can happen on a repository by repository basis, with the exception of repositories which contribute data to mapper. Unless someone codes a "support the other version" bridge, legacy vcs-sync & legacy mapper can't be retired until new mapper has enough data to supply the current needs.
Flags: needinfo?(hwine)
No longer blocks: merge-day-automation
This was needed for b2g, which is now tier3.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: