Closed Bug 1050109 Opened 10 years ago Closed 6 years ago

tracker - use local caches of mozharness and tools for build and test jobs

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Assigned: jlund)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2483] )

Attachments

(4 files, 5 obsolete files)

motivation for acting on this now is https://bugzilla.mozilla.org/show_bug.cgi?id=1040308#c23.

https://bugzilla.mozilla.org/show_bug.cgi?id=1036122#c0 already pointed out this issue. although to clarify, mozharness will clone/pull the tools repo. Buildbot is what clones/pulls mozharness.

hwine and others already pointed out we have not confirmed the smoking gun for 1040308 but addressing this bug can't hurt

When do we use mozharness/tools repos?
* on last count, mozharness is used for ~85% of all test+build jobs. For all those jobs buildbot will hit hg.m.o/build/mozharness every time, AFAIK
* most mozharness jobs use tools so the hg.m.o/build/tools will also get hit a lot
* runner, a new service that is currently live on every CentOS machine, hits build/{mozharness,tools} on every reboot and stores a local cache copy of the checkout.

what options do we have to be more efficient?

1) rely on runner to keep an up to date copy of tools/mozharness and do a clean cp of that into our job's work_dir
** issue here is that this is only live on CentOS machines. Remaining Linux and osx/win support is in the works but won't be resolved this week
** machines that use runner still, at least for now, reboot on every job, so hg.m.o will still get hit a lot for tools/mozharness here.
** this should be the fastest to implement. 'runner' would still be hg'n but at least we wouldn't be doubling up on runtime.
** unknown: how many of our total test+build jobs use centos machines.

2) use puppet to deploy tools/mozharness to all our slave machines and keep them up to date
** we will still need to find ways to be efficient about not putting load on hg.m.o while ensuring that we have the latest checkout

3) skip hg calls and only use http - download a tarball of the latest mozharness/tools checkouts
** we AFAIK don't actually need the history of these repos and can rely on the latest checkout
** we would still need a way to see what rev they were based off of for debugging
** it would be nice if this was stored in an S3 bucket that was always up to date with the latest checkout
** we could use timestamps to optimize when we need to download the tarball (e.g. -z arg for curl)
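The timestamp idea in the last bullet could be sketched like this — a conditional fetch in the spirit of curl's `-z` flag, using an If-Modified-Since header. This is only a sketch; the tarball URL is a made-up placeholder, not a real endpoint.

```python
# Sketch of option 3's conditional download: only fetch the tarball when
# the server copy is newer than the local one, like `curl -z <file>`.
# The mozharness tarball URL used in any call would be hypothetical.
import email.utils
import os
import urllib.error
import urllib.request


def conditional_request(url, local_mtime=None):
    """Build a request that asks the server to skip unchanged content."""
    req = urllib.request.Request(url)
    if local_mtime is not None:
        # RFC 1123 date, as expected by the If-Modified-Since header
        req.add_header("If-Modified-Since",
                       email.utils.formatdate(local_mtime, usegmt=True))
    return req


def fetch_if_newer(url, dest):
    """Download url to dest; return False if the local copy is current."""
    mtime = os.path.getmtime(dest) if os.path.exists(dest) else None
    try:
        with urllib.request.urlopen(conditional_request(url, mtime)) as resp:
            with open(dest, "wb") as f:
                f.write(resp.read())
        return True
    except urllib.error.HTTPError as e:
        if e.code == 304:  # Not Modified: local tarball is still current
            return False
        raise
```

A 304 response here costs the server almost nothing, which is the whole point versus an unconditional download on every job.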

option 3 might be doable and it could be used to stop hg hits coming from:
runner hg'n -> mozharness + tools
buildbot hg'n -> mozharness
mozharness hg'n -> tools

I'll be taking a look at this tomorrow. Looking for the fastest solution with the most reach. Creating a break-out Zimbra event to discuss with whoever is interested.

thoughts?
event created @ 930am PT in jlund's vidyo room. all welcome.
Other options might be:

1) If we clone/pull over http, would http caches help us (e.g. like proxxy)? Does returned content contain caching headers?
2) Would it make sense to have downstream hg mirrors e.g. per data center (or maybe the webheads are already located across data centers?) - so the central server is where changes are pushed, but then there is a waterfall of this propagating to downstream mirrors, and our infrastructure reads from the mirrors (something like dns propagation)
3) When hg publisher is working again, when new changesets come in, perhaps they can broadcast messages with the content to pulse, and buildbot/other infrastructure can subscribe to these messages, and update accordingly (i.e. everything gets pushed out, not pulled in).
4) repo hooks to trigger remote repos to update, rather than pulls happening when no changes have occurred
Here's the context I see:
 - anecdotal reports that the mozharness/tools repos get rm -rf'd more often than needed
 - same for other repos getting rm -rf'd
 - occasional notice that bundle downloads fail, and hgtool falls back to 'hg clone'
 - our data does not yet differentiate between minor update and full clone (stay tuned) (both use "getbundle")
 - lore suggests we have an uneven usage of hg tooling (hgtool, local caches, etc)

Here's what the data tells us so far:
 - approx 28% of releng http traffic from scl3 machines is pull/clone of mozharness/tools
 - ~80K/day of "getbundle" to mozharness
  - that sounds reasonable - if each repo is checked once per job
 - ~130MB/day of mozharness bytes transferred
  - that sounds excessive, as it is closer to the full clone amount
  - 40K of 41K for 10 hours Wed were for a "full clone" transfer of ~1.5MB

Here's my opinion:
 - we have zero (0) evidence that hg is stupid or faulty on current versions
 - we have some reports that we're not using hg as optimally as possible
 - we should first focus our efforts on using hg efficiently
 - we should only "work around" hg (or replicate functionality it should supply) as a last resort.
   - those workarounds are added technical debt
   - those workarounds will slow the velocity at which we can upgrade hg (we want to increase it)
(In reply to Jordan Lund (:jlund) from comment #0)
> 2) use puppet to deploy tools/mozharness to all our slave machines and keep
> them up to date
> [...]
> 3) skip hg calls and only use http - download a tarball of the latest
> mozharness/tools checkouts
> ** we AFAIK don't actually need the history of these repos and can rely on
> the latest checkout
> [...]

We discarded option 2 in favour of doing this kind of thing in runner. It's much easier to ensure this kind of state outside of puppet. runner will be just as efficient as puppet could be here, but with better error handling. We're not doing full clones per reboot; we're pulling in any missing changes.
You're right that we don't need history for these repos in general, so having static copies we deploy is also an option. Makes deployment of tools/mozharness changes a bit harder perhaps?
(In reply to Chris AtLee [:catlee] from comment #5)
> You're right that we don't need history for these repos in general, so
> having static copies we deploy is also an option. Makes deployment of
> tools/mozharness changes a bit harder perhaps?

Also - near-future versions of hg will support "shallow clones", which give us the best of both worlds. That's an example of something it would be nice to be able to integrate easily when the time comes.
we met this morning about this.

some notes:
- we should investigate how inefficient we are being so we know what solution to go with
- going with static tarballs of our tools/mozharness does come with its own issues WRT unpacking, verifying, release tagging, and grabbing the right branch.
- this may not be the cause of our woes but improving it won't hurt
- runner does keep a clean copy of tools/mozharness and does not hammer hg.m.o, it only pulls on new changesets (which will be only a few times a day for each)

I took a look at our current state of how things are done and it seems like we can make mountains of improvements with how we use hg:
- (buildbot step) for every mozharness job (85% of all our jobs) we 'rm -rf & clone mozharness' *every* time[1]
- (buildbot step) for every buildbot factory job not through mozharness we pretty much 'rm -rf & clone both mozharness tools' *every* time[2][3]
- (within mozharness run) for many of our mozharness tests + jobs, we rely on the tools repo. In almost every instance we use hg, not hgtool, to grab it, and AFAICT we end up having to do a clone as dest does not exist[4][5][6][7]
- it looks like we actually keep mozharness + tools repo clean throughout job runs so we could just call a script from a local cache without having to copy it over to our builderdir

based off this info it is not clear to me that implementing static tarballs will make much of a difference (positively or negatively) compared to improving our logic above.

so what can we do?
- runner is already live for all our build machines and it keeps a local cache of both these repos on the slaves
- we *should* be able to simply ensure runner checkouts are up to date since last reboot and then call the script(s) we need

implementing that logic for [1][2][3] alone even for just jobs run through centos would be a dramatic win IMO. It is also something that can be tackled quickly.
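The "use runner's checkout" idea for [1][2][3] can be sketched roughly as below: freshen the cached repo in place, then invoke the script straight out of the cache instead of rm/cloning into the builder dir. The cache path matches the runner checkout location seen in the logs later in this bug; the "production" default revision is an assumption, not confirmed behaviour.

```python
# Hedged sketch: update runner's cached checkout instead of doing a
# fresh clone, then run mozharness scripts out of the cache.
import subprocess

MOZHARNESS_CACHE = "/tools/checkouts/mozharness"


def update_cache(rev="production"):
    # pull only brings down missing changesets; no full clone traffic
    subprocess.check_call(["hg", "pull"], cwd=MOZHARNESS_CACHE)
    subprocess.check_call(["hg", "update", "-C", "-r", rev],
                          cwd=MOZHARNESS_CACHE)


def script_cmd(script, args=()):
    """Command line for running a mozharness script from the cache."""
    return ["python", "%s/scripts/%s" % (MOZHARNESS_CACHE, script)] + list(args)
```

The win is that `hg pull` on an up-to-date cache is a near no-op against hg.m.o, versus the full getbundle transfer a clone triggers.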

thoughts?

[1] http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#6127
[2] http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#1101
[3] http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#484
[4] http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/building/buildb2gbase.py#296
[5] http://mxr.mozilla.org/build/source/mozharness/scripts/spidermonkey_build.py#335
[6] http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/building/buildbase.py#1355
[7] http://mxr.mozilla.org/build/source/mozharness/mozharness/base/vcs/vcsbase.py#133 (used by unittests)
(In reply to Jordan Lund (:jlund) from comment #7)
> - (buildbot step) for every mozharness job (85% of all our jobs) we 'rm -rf
> & clone mozharness' *every* time[1]
> - (buildbot step) for every buildbot factory job not through mozharness we
> pretty much 'rm -rf & clone both mozharness tools' *every* time[2][3]

The suggestions in bug 851398 might be of use :-)
I've been playing around with this and so far it looks pretty promising. This definitely frees up hg.m.o and takes less time overall. I triggered a whole bunch of jobs to run overnight with it.

I'm going to call this v1 of a planned 4

v1 summary: every build job that passes through ScriptFactory and uses a linux slave will stop cloning mozharness every time and instead rely on runner's mozharness checkout. That is, unless we disable it at a branch level (e.g. ash)

one noteworthy side effect discovered: if the mozharness job is for default branch or a special tag, it will leave runner's mozharness at that tag rev until the slave reboots. I *think* this is OK. Once slaves stop rebooting, we can just add extra logic to runner so that it does 'hg up -C -r production' on mozharness after each job.

accompanying bbot-cfg patch incoming
Attachment #8469884 - Flags: feedback?(catlee)
here's v2 (interdiff from v1). this is not very tested but I should have some results to wake up to.

I got some new info: we rm/clone mozharness for nearly every desktop platform build job. the new info is, AFAICT, we actually don't even need mozharness for most of those jobs. What we do need mozharness for is our b2g desktop gecko platforms. Or put another way, only when one of these are truthy: multiLocale, gaiaLanguagesFile

so unless I'm horribly mistaken, we don't need mozharness for all our existing ff desktop builds (linux/mac/win + all their variants). I think this might have been a bug introduced by: http://hg.mozilla.org/build/buildbotcustom/rev/b6a88a0badd9 as MozharnessRepoPath always holds a value because its value originates from GlobalVars.

v2 summary: this does two things. 1) it fixes above so we only worry about mozharness if it's used and 2) if we are building with MozillaBuildFactory and it's on a linux slave, we use the cache of mozharness like we do for ScriptFactory
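The fix in 1) boils down to a guard along these lines — a sketch only, with argument names mirroring the factory kwargs from the patch (mozharnessRepoPath, multiLocale, gaiaLanguagesFile), not the actual patch code:

```python
# Sketch of the v2 guard: a desktop build factory only needs mozharness
# when it is doing multilocale/gaia-l10n work. Names mirror the factory
# kwargs; the function itself is illustrative.
def needs_mozharness(mozharnessRepoPath, multiLocale=False,
                     gaiaLanguagesFile=None):
    return bool(mozharnessRepoPath and (multiLocale or gaiaLanguagesFile))
```

With that guard, a plain ff desktop build (no multilocale, no gaia languages file) skips the rm/clone of mozharness entirely, even though MozharnessRepoPath always holds a value.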

v2 bbot-cfg patch incoming
Attachment #8469907 - Flags: feedback?(catlee)
(In reply to Ed Morley [:edmorley] from comment #8)
> (In reply to Jordan Lund (:jlund) from comment #7)
> > - (buildbot step) for every mozharness job (85% of all our jobs) we 'rm -rf
> > & clone mozharness' *every* time[1]
> > - (buildbot step) for every buildbot factory job not through mozharness we
> > pretty much 'rm -rf & clone both mozharness tools' *every* time[2][3]
> 
> The suggestions in bug 851398 might be of use :-)

possibly, thanks for the bug ref :)

I think, for now at least, I'm going to continue leveraging runner since it already keeps a local copy of the repos forever on the slaves.
(In reply to Jordan Lund (:jlund) from comment #13)
> I think, for now at least, I'm going to continue leveraging runner since it
> already keeps a local copy of the repos forever on the slaves.

They're complementary approaches - using runner's copy of mozharness does make sense, but I see in the attached patches the else case still |rm -rf|s. One thing at a time however :-)
Comment on attachment 8469907 [details] [diff] [review]
140808_hg_tools_and_mozharn_less-bbotcustom-v2-interdiff.patch

Review of attachment 8469907 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +1038,5 @@
>          if self.enable_ccache:
>              self.addStep(ShellCommand(command=['ccache', '-z'],
>                                        name="clear_ccache_stats", warnOnFailure=False,
>                                        flunkOnFailure=False, haltOnFailure=False, env=self.env))
> +        if mozharnessRepoPath and (multiLocale or gaiaLanguagesFile):

I'm wondering if this flag should be more explicit. Maybe we should just avoid passing in mozharnessRepoPath in cases where we don't need it?

@@ +1141,5 @@
> +                command=['hg', 'update', '-r', self.mozharnessTag],
> +                description=['updating', 'mozharness', 'to', self.mozharnessTag],
> +                workdir='mozharness',
> +                haltOnFailure=True
> +            ))

use hgtool here?
Attachment #8469907 - Flags: feedback?(catlee) → feedback+
Comment on attachment 8469884 [details] [diff] [review]
140808_hg_tools_and_mozharn_less-bbotcustom-v1.patch

Review of attachment 8469884 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +6165,5 @@
> +            ))
> +            if scriptName[0] == '/':
> +                script_path = scriptName
> +            else:
> +                script_path = 'scripts/%s' % scriptName

can this section be replaced with hgtool?
Attachment #8469884 - Flags: feedback?(catlee) → feedback+
(In reply to Chris AtLee [:catlee] from comment #16)
> can this section be replaced with hgtool?

yes, I've added tools_repo_cache for this block so we can avail of runner's tools hgtool. We should have hgtool everywhere bar win test slaves but I want to stay consistent and rely on runner as the canonical truth for cached repos.
(In reply to Chris AtLee [:catlee] from comment #15)
> > +        if mozharnessRepoPath and (multiLocale or gaiaLanguagesFile):
> 
> I'm wondering if this flag should be more explicit. Maybe we should just
> avoid passing in mozharnessRepoPath in cases where we don't need it?
> 

ya makes sense. I've added this logic to misc.py instead of here.

> @@ +1141,5 @@
> > +                command=['hg', 'update', '-r', self.mozharnessTag],
> > +                description=['updating', 'mozharness', 'to', self.mozharnessTag],
> > +                workdir='mozharness',
> > +                haltOnFailure=True
> > +            ))
> 
> use hgtool here?

like in comment 17 WRT ScriptFactory, I added tools_repo_cache to MBF. I am going to leave things as they were outside of runner since 1) hg.m.o's state has improved and 2) I'd rather put my effort into the mach_mozharness builds that will be replacing this.

I've kicked off a few tests on dev master. will upload patches based off results in the morning.

v3 is going to be dealing with tools repo as, thus far, these patches only fix where we rm/clone mozharness.
there have been some hiccups here when incorporating hgtool. I think I have them figured out and will post patch today.

One thing left to do is address this: it looks like our /tools/checkouts/* repos use a shared repo, so the .hg/sharedir was tripping up hgtool with b2g builds that do not set HG_SHARE_DIR in their env.

I'm tempted to just put that env var directly in the shellcmd call from bbotcustom rather than adding another item to our init list of factories and associated pf in bbot-cfgs. I guess long term it is better to have raw values in bbot-cfgs.
nearly there. Only one builder type is failing on dev-master: when I trigger a linux desktop gecko build, say 'b2g_mozilla-central_linux64_gecko build', it updates runner's mozharness checkout fine but is not happy when it needs to use it in the clone_gaia_l10n_repos step. I think it might be because this step runs through mock and /tools/checkouts/mozharness does not exist in the mock fs view. Will look tomorrow:

build:
http://dev-master1.srv.releng.scl3.mozilla.com:8037/builders/b2g_mozilla-central_linux64_gecko%20build/builds/0

snippet:
mock_mozilla -r mozilla-centos6-x86_64 --cwd /builds/slave/m-cen-l64_g-000000000000000000 --unpriv --shell '/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" LOCALES_FILE="locales/languages_dev.json" TOOLTOOL_HOME="/builds" SYMBOL_SERVER_HOST="dev-stage01.srv.releng.scl3.mozilla.com" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" MOZ_OBJDIR="obj-firefox" SYMBOL_SERVER_SSH_KEY="/home/cltbld/.ssh/ffxbld_dsa" LOCALE_BASEDIR="/builds/slave/m-cen-l64_g-000000000000000000/build-gaia-l10n" TINDERBOX_OUTPUT="1" CCACHE_COMPRESS="1" TOOLTOOL_CACHE="/builds/tooltool_cache" SYMBOL_SERVER_PATH="/mnt/netapp/breakpad/symbols_ffx/" PATH="/tools/python27-mercurial/bin:/tools/python27/bin:${PATH}:/tools/buildbot/bin" MOZ_CRASHREPORTER_NO_REPORT="1" SYMBOL_SERVER_USER="ffxbld" WGET_OPTS="-q -c" LC_ALL="C" CCACHE_UMASK="002" python /tools/checkouts/mozharness/scripts/b2g_desktop_multilocale.py --pull --gaia-languages-file '"'"'/builds/slave/m-cen-l64_g-000000000000000000/build/gaia/locales/languages_dev.json'"'"' --gaia-l10n-root https://hg.mozilla.org/gaia-l10n --gaia-l10n-base-dir '"'"'/builds/slave/m-cen-l64_g-000000000000000000/build-gaia-l10n'"'"' --config-file multi_locale/b2g_linux64.py --gecko-l10n-root https://hg.mozilla.org/l10n-central --gecko-languages-file build/b2g/locales/all-locales'
 in dir /builds/slave/m-cen-l64_g-000000000000000000 (timeout 1200 secs)
 watching logfiles {}
 argv: ['mock_mozilla', '-r', 'mozilla-centos6-x86_64', '--cwd', '/builds/slave/m-cen-l64_g-000000000000000000', '--unpriv', '--shell', '/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" LOCALES_FILE="locales/languages_dev.json" TOOLTOOL_HOME="/builds" SYMBOL_SERVER_HOST="dev-stage01.srv.releng.scl3.mozilla.com" CCACHE_DIR="/builds/ccache" POST_SYMBOL_UPLOAD_CMD="/usr/local/bin/post-symbol-upload.py" MOZ_AUTOMATION="1" MOZ_OBJDIR="obj-firefox" SYMBOL_SERVER_SSH_KEY="/home/cltbld/.ssh/ffxbld_dsa" LOCALE_BASEDIR="/builds/slave/m-cen-l64_g-000000000000000000/build-gaia-l10n" TINDERBOX_OUTPUT="1" CCACHE_COMPRESS="1" TOOLTOOL_CACHE="/builds/tooltool_cache" SYMBOL_SERVER_PATH="/mnt/netapp/breakpad/symbols_ffx/" PATH="/tools/python27-mercurial/bin:/tools/python27/bin:${PATH}:/tools/buildbot/bin" MOZ_CRASHREPORTER_NO_REPORT="1" SYMBOL_SERVER_USER="ffxbld" WGET_OPTS="-q -c" LC_ALL="C" CCACHE_UMASK="002" python /tools/checkouts/mozharness/scripts/b2g_desktop_multilocale.py --pull --gaia-languages-file \'/builds/slave/m-cen-l64_g-000000000000000000/build/gaia/locales/languages_dev.json\' --gaia-l10n-root https://hg.mozilla.org/gaia-l10n --gaia-l10n-base-dir \'/builds/slave/m-cen-l64_g-000000000000000000/build-gaia-l10n\' --config-file multi_locale/b2g_linux64.py --gecko-l10n-root https://hg.mozilla.org/l10n-central --gecko-languages-file build/b2g/locales/all-locales']
 environment: {
...
...
}
INFO: mock_mozilla.py version 1.0.3 starting...
State Changed: init plugins
INFO: selinux disabled
State Changed: start
State Changed: lock buildroot
State Changed: shell
python: can't open file '/tools/checkouts/mozharness/scripts/b2g_desktop_multilocale.py': [Errno 2] No such file or directory
State Changed: unlock buildroot
program finished with exit code 2
elapsedTime=0.276594
confirmed its not visible to mock:

[cltbld@dev-linux64-ec2-jlund2.dev.releng.use1.mozilla.com ~]$ mock_mozilla -r mozilla-centos6-x86_64 --cwd /builds/slave/m-cen-l64_g-000000000000000000 --unpriv --shell 'ls /tools/'
INFO: mock_mozilla.py version 1.0.3 starting...
State Changed: init plugins
INFO: selinux disabled
State Changed: start
State Changed: lock buildroot
State Changed: shell
gcc-4.7.2-0moz1  gcc-4.7.3-0moz1  git  python27  python27-mercurial
State Changed: unlock buildroot
[cltbld@dev-linux64-ec2-jlund2.dev.releng.use1.mozilla.com ~]$ ls /tools
buildbot                 buildbot-0.8.4-pre-moz3  checkouts  misc-python  python2   python27-mercurial   tooltool.py
buildbot-0.8.4-pre-moz2  buildbot-0.8.4-pre-moz4  git        python       python27  python27-virtualenv

so I guess we either:

1) change our mounts and associated mock configs to include /tools/checkouts/*
2) change the path where runner puts checkouts


for now I'll do up a patch that ignores desktop gecko builds so this isn't blocked
I can upload dump_master or builderlist diff if requested

this patch incorporates v1 and v2 with fixes from comment 18.
Attachment #8474997 - Flags: review?(catlee)
accompanying patch to buildbotcustom

this patch will change the following:

- on every branch bar ash, the below builds will use runner's checkout of mozharness as a cache, and buildbot will call scripts from that instead of a builderdir-local copy
* mozharness based jobs:
** all b2g builds (non desktop gecko builds)
** all linux slave based mozharness desktop builds (cedar)
** all hazard builds
* buildbot based jobs
** android nightlies

- stops clobbering + cloning mozharness for the below builders that do not even use mozharness
* all desktop builds (non gecko)
* all android builds aside from nightlies

- leaves gecko desktop builds as they were until https://bugzilla.mozilla.org/show_bug.cgi?id=1050109#c21 is solved. catlee do you have a preference on how that can be fixed? Or any guidance?

- uses runner's hgtool to update runner's mozharness checkout for builds on slaves that use runner. Also adds HG_SHARE_BASE_DIR to b2g builds; otherwise hgtool would try to rm/clone mozharness in /tools/checkouts/mozharness and runner would no longer benefit from the hg-share mozharness copy.
Attachment #8469884 - Attachment is obsolete: true
Attachment #8469885 - Attachment is obsolete: true
Attachment #8469907 - Attachment is obsolete: true
Attachment #8469908 - Attachment is obsolete: true
Attachment #8475005 - Flags: review?(catlee)
Comment on attachment 8474997 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbotcustom.patch

Review of attachment 8474997 [details] [diff] [review]:
-----------------------------------------------------------------

::: misc.py
@@ +731,5 @@
>      scriptRepo = config.get('mozharness_repo_url',
>                              '%s%s' % (config['hgurl'], config['mozharness_repo_path']))
> +    script_repo_cache = None
> +    if config.get('use_mozharness_repo_cache'):  # branch supports it
> +        script_repo_cache = mh_cfg.get('mozharness_repo_cache',

Is there a real need for a separate 'use_mozharness_repo_cache' value? Would the presence of 'mozharness_repo_cache' in the mozharness or platform config suffice?
Attachment #8474997 - Flags: review?(catlee) → review+
Attachment #8475005 - Flags: review?(catlee) → review+
(In reply to Jordan Lund (:jlund) from comment #21)
> so I guess we either:
> 
> 1) change our mounts and associated mock configs to include
> /tools/checkouts/*
> 2) change the path that runner puts checkouts

Probably best to modify the mock configs to bind mount these directories inside mock as well.
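The bind-mount change would land in the mock chroot configs (Python-syntax files that predefine a `config_opts` dict). A rough sketch, assuming mock's standard bind_mount plugin layout; `config_opts` is stubbed here so the snippet is self-contained, and the exact config file for mozilla-centos6-x86_64 is not shown in this bug:

```python
# config_opts is predefined in real mock config files; stubbed here so
# the snippet runs standalone.
config_opts = {'plugin_conf': {'bind_mount_opts': {'dirs': []}}}

# Expose runner's checkout dir inside the buildroot so
# /tools/checkouts/mozharness is visible under mock.
config_opts['plugin_conf']['bind_mount_enable'] = True
config_opts['plugin_conf']['bind_mount_opts']['dirs'].append(
    ('/tools/checkouts', '/tools/checkouts'))  # (host path, chroot path)
```

Bind-mounting keeps a single canonical checkout on the host rather than copying it into every chroot.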
(In reply to Chris AtLee [:catlee] from comment #24)
> Is there a real need for a separate 'use_mozharness_repo_cache' value? Would
> the presence of 'mozharness_repo_cache' in the mozharness or platform config
> suffice?

This is a pattern I've been using since mozharness desktop builds: http://mxr.mozilla.org/build/source/buildbotcustom/misc.py#1561

I've been trying to use what we call config in misc.py as 'what the branch has enabled' and the platform/pf as 'what the platform can do'.

e.g. if branch supports x and platform has y to do x

the plus side is this allows us to skip having to add another loop at the end of config.py where we replace platform vars for given branches. I think it's more explicit because the values never change with later mutation. so for ash's case, we don't need to loop through the platforms in ash and remove the already defined 'mozharness_repo_cache' item.

the downside is that we have two items in our config instead of 1. I'm open to iterating on this if you do not like the approach. I could name them the same in config and platform but I feel like they serve different purposes.
Comment on attachment 8475005 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbot-cfgs.patch

on default: https://hg.mozilla.org/build/buildbot-configs/rev/38180b78f70c
Attachment #8475005 - Flags: checked-in+
Comment on attachment 8474997 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbotcustom.patch

on default: https://hg.mozilla.org/build/buildbotcustom/rev/0a3906d2bffe
Attachment #8474997 - Flags: checked-in+
In production with reconfig on 2014-08-20 07:47 PT
Comment on attachment 8475005 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbot-cfgs.patch

and backed out.

there were a couple found issues:

1st issue ->

log:
https://tbpl.mozilla.org/php/getParsedLog.php?id=46368116&full=1&branch=mozilla-inbound

snippet:
CalledProcessError: Command '['hg', 'pull', '-b', u'integration/mozilla-inbound', 'https://hg.mozilla.org/build/mozharness']' returned non-zero exit status 255

I think this was the result of hgtool.py picking up a buildbot property that didn't exist in staging env but is there for production: 'repo_path'.

I suspect the fix is I need to tell hgtool to not use build props. It looks like hgtool did the right thing and just did a clone anyway. I think it used the shared dir too so runner's checkout is intact. I'll confirm that anyway


2nd:

I neglected to try cedar builders in staging for linux mozharn desktop builds.

buildbot steps:
http://buildbot-master77.srv.releng.use1.mozilla.com:8001/builders/Linux%20cedar%20build/builds/9
log: https://tbpl.mozilla.org/php/getParsedLog.php?id=46363553&tree=Cedar

snippet: key error toolsdir

again looks like something with buildbot props. I think for mozharn desktop builds, I don't define toolsdir, or else it is invalid. I'll have to look into how it should work and also what part of update_script_repo_cache was expecting it.

my initial guess is it is somewhere in self.env and shellcommand tries to interpolate it: http://hg.mozilla.org/build/buildbotcustom/rev/0a3906d2bffe#l2.116

it's not tools_repo_cache, as that should be runner's checkout path
Attachment #8475005 - Flags: checked-in+ → checked-in-
Attachment #8474997 - Flags: checked-in+ → checked-in-
status update: I am on PTO until next wed and then on Buildduty. I will attempt a patch on wed/thurs while on duty to keep this bug moving forward.
status update: I was pulled away from this. I have had some time to pick it up again. patch incoming
interdiffs:
    fix for issue 1 in comment 30: http://people.mozilla.org/~jlund/140916_bug_1050109_hg_tools_and_mozharn_less_fix_branch_err-bbotcustom.patch
    fix for issue 2 in comment 30: http://people.mozilla.org/~jlund/140916_bug_1050109_hg_tools_and_mozharn_less_fixes_toolsdir-bbotcustom.patch

explanation of 1:
it seems that hgtool.py looks for a PROPERTIES_FILE and will extract things like repo_path, auto-filling the branch from props. This does not work when cloning anything other than tree source.


explanation of 2
wrt 2, that one was a bit trickier. you can see at http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#6333 that we add a var to self.env (at config time) whose value needs toolsdir interpolated (see query_moz_sign_cmd). so for each build step in SigningScriptFactory, if you use self.env as the env (at runtime), the step has to run after the toolsdir property-setting step is done (line 6324), which wasn't the case for the hgtool step I was adding.

or so that is what I am grepping from: https://tbpl.mozilla.org/php/getParsedLog.php?id=46363553&full=1&branch=cedar

oddly enough I can't reproduce this on my dev master, but if I'm parsing the above log right, I think this patch will fix it.
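The ordering constraint above can be sketched in a few lines. This is a toy model, not the real buildbot classes: `render_env` stands in for WithProperties-style interpolation, and the `MOZ_SIGN_CMD` value is an assumed example of an env var that references `%(toolsdir)s`. It shows why a step using self.env fails if it runs before the toolsdir property is set.

```python
class Build:
    """Toy stand-in for a buildbot build carrying its properties dict."""
    def __init__(self):
        self.properties = {}

def render_env(env, build):
    # WithProperties-style interpolation: every "%(name)s" placeholder
    # must resolve against the build's current properties.
    return {k: v % build.properties for k, v in env.items()}

env = {"MOZ_SIGN_CMD": "python %(toolsdir)s/release/signing/signtool.py"}
build = Build()

# Step scheduled before the toolsdir-setting step: interpolation blows up.
try:
    render_env(env, build)
except KeyError as e:
    print("KeyError:", e)

# After the property-setting step has run, the same env renders fine.
build.properties["toolsdir"] = "/builds/slave/tools"
print(render_env(env, build)["MOZ_SIGN_CMD"])
```

Moving the hgtool step after the property-setting step (or not giving it self.env) is the equivalent of the second half of this sketch.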
Attachment #8474997 - Attachment is obsolete: true
Attachment #8491268 - Flags: review?(catlee)
Comment on attachment 8491268 [details] [diff] [review]
140916_bug_1050109_hg_tools_and_mozharn_less_fixes_toolsdir_and_branch_issue-bbotcustom.patch

Review of attachment 8491268 [details] [diff] [review]:
-----------------------------------------------------------------

Looks ok. I'm not sure I understand your solution to the 2nd issue here though.
Attachment #8491268 - Flags: review?(catlee) → review+
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2483]
Comment on attachment 8474997 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbotcustom.patch

r+ carried forward

on default: https://hg.mozilla.org/build/buildbot-configs/rev/b6b8b27e43ae
Attachment #8474997 - Flags: checked-in- → checked-in+
Comment on attachment 8491268 [details] [diff] [review]
140916_bug_1050109_hg_tools_and_mozharn_less_fixes_toolsdir_and_branch_issue-bbotcustom.patch

on default: https://hg.mozilla.org/build/buildbot-configs/rev/b6b8b27e43ae

wrt my solution to issue number two, IIRC I make sure that toolsdir has a buildbot property value before any build step that uses self.env as its env is called (since an item in self.env requires toolsdir and WithProperties was failing).
Attachment #8491268 - Flags: checked-in+
Attachment #8474997 - Flags: checked-in+ → checked-in-
Comment on attachment 8475005 [details] [diff] [review]
140818_bug_1050109_hg_tools_and_mozharn_less-bbot-cfgs.patch

I meant this patch is r+ and carried forward

on default: https://hg.mozilla.org/build/buildbot-configs/rev/b6b8b27e43ae

time for bed I think..
Attachment #8475005 - Flags: checked-in- → checked-in+
Checked in code deployed to production
I looked at various Ash jobs and I don't see this running there.

Is there somewhere where I can see this running?
(In reply to Armen Zambrano - Automation & Tools Engineer (:armenzg) from comment #40)
> I looked at various Ash jobs and I don't see this running there.
> 
> Is there somewhere where I can see this running?

ash is special because it uses ash-mozharness, so we can not use runner's mozharness checkout. to accommodate this, I added: http://mxr.mozilla.org/build/source/buildbot-configs/mozilla/project_branches.py#150

if you check any other branch that does the following on linux slaves:

* desktop mozharness builds
* b2g device/emulator builds (b2g_build.py)
* android nightlies IIRC
* hazard/spider builds
I still see build jobs cloning build/tools. e.g. https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux/1416004308/mozilla-central-linux-bm91-build1-build11.txt.gz

After gaia-central's excessive clones got fixed in bug 1096653, build/tools is our next biggest consumer of hg.mozilla.org, with typically > 1TB of cloned data per day. That's kind of extreme. See bug 1096337 for some raw data showing how much of an outlier build/tools is.

As someone responsible for hg.mozilla.org, fixing this bug is my #1 ask from release engineering.
Blocks: 1096337
Depends on: 1100574
(In reply to Gregory Szorc [:gps] from comment #42)
> I still see build jobs cloning build/tools. e.g.
> https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-linux/1416004308/mozilla-central-linux-bm91-build1-build11.txt.gz

I have only fixed the mozharness repo case for linux (in most places). the tools repo will get the local cache treatment next.

I have a few things on my plate right now but I will try to address this within the next week.
Comment on attachment 8526350 [details] [diff] [review]
[buildbotcustom] enable caching of tools and mozharness for test jobs

Not yet.
Attachment #8526350 - Flags: review?(jlund)
Attachment #8526349 - Flags: review?(jlund)
I think we should break up this bug.

It is hard to grasp which slaves already have local caches and which jobs actually need to avail themselves of those caches. I've started a spreadsheet to track all of this: https://docs.google.com/a/mozilla.com/spreadsheets/d/1BwVDjrwTUZRuYGEGGD3GbjQO065KaLI1n41gwzj3aIQ/edit?usp=sharing

The first tab highlights where we use mozharness + tools. The 3 remaining tabs track this bugs work in progress.
Summary: stop jobs from hg'n mozharness/tools repos → tracker - use local caches of mozharness and tools for build and test jobs
Depends on: 1103700
Depends on: 1103701
Depends on: 1103702
Component: General Automation → General
I think this got solved by moving stuff to Taskcluster, where we have VCS caching.
Feel free to reopen if I'm wrong.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED