Closed Bug 1097784 Opened 11 years ago Closed 11 years ago

Caching improvements for b2g_bumper

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: pmoore)

References

Details

Attachments

(3 files, 2 obsolete files)

Reworked comments 5,6,7 from bug 990173: We are not caching results to git ls-remote across branches, which would reduce our processing load by almost 80%, since there are five branches, and the majority of the revision lookups are identical. I created a very crude script (below) to run them in sequence. The first branch took approximately 20 mins to run. The subsequent branches sharing the cache took approximately 12 seconds each to run. b2g_bumper_combined.py: ======================= #!/usr/bin/env python import sys import copy from b2g_bumper import B2GBumper if __name__ == '__main__': git_ref_cache = {} b2g_bumper_script = sys.argv[0].replace('_combined', '') config_files = copy.copy(sys.argv[1:]) for config in config_files: sys.argv = [b2g_bumper_script, '-c', config, '-c', '/Users/pmoore/work/b2g_bumper/pete.py', '--checkout-manifests', '--massage-manifests'] bumper = B2GBumper() bumper._git_ref_cache = git_ref_cache print bumper._git_ref_cache bumper.run() print bumper._git_ref_cache This essentially cached the bumper._git_ref_cache between runs. Each time the cron runs, it will start with an empty cache, so the cache is just used once per branch, and then a new cache is created. This implementation sadly suffered from the double logging bug in mozharness, so I decided to implement differently. I've written code to clear/import/export the cache, so now it is possible to run against one branch, export the cache, and then run against a different branch, importing the same cache. In addition I have fixed some two caching bugs, which increase the usage of the cache: 1) Previously heads were not cached if not specified as refs/heads/<HEAD> due to a bug in the code 2) Previously several threads could perform the same lookup, as the code only checked if results already existed for a lookup, not whether a lookup had been started I have one more fix to test - I'd like to see if we gain performance improvements by running git ls-remote against a url without specifying a head/tag, so that future queries against the same repo do not require additional git ls-remote lookup calls to run. Will attach patches when completed testing.
Attached patch bug1097784_mozharness_v1.patch (obsolete) — Splinter Review
See e.g. https://travis-ci.org/petemoore/b2g-manifest/builds/40901415 In this travis run, I tested this --massage-manifests action using this patch. I'll attach a puppet patch too, to change the way we call b2g_bumper.py so that we run them in sequence, rather than in parallel, and share the cache between the runs of the different branches.
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
Attachment #8522306 - Flags: review?(catlee)
Not really sure how to test this, other than running in production!
Attachment #8522369 - Flags: review?(catlee)
Attached patch bug1097784_mozharness_v2.patch (obsolete) — Splinter Review
When writing the puppet change, I saw that we run each branch in different subdirectories, so I needed to add a global config setting for the location of the git ref cache, since they share this file, but have different working directories. Therefore this fix is just to enable a global location for the git ref cache, and to add it to the configs. Other than that, the patch is identical to the previous one. See e.g. https://travis-ci.org/petemoore/b2g-manifest where I have tested this mozharness patch.
Attachment #8522306 - Attachment is obsolete: true
Attachment #8522306 - Flags: review?(catlee)
Attachment #8522393 - Flags: review?(catlee)
Attachment #8522369 - Flags: review?(catlee) → review+
Comment on attachment 8522393 [details] [diff] [review] bug1097784_mozharness_v2.patch Review of attachment 8522393 [details] [diff] [review]: ----------------------------------------------------------------- Looks good! I think it needs a tiny bit more polish in a few places, and definitely fix up the logging call. ::: configs/b2g_bumper/master.py @@ +9,4 @@ > 'gecko_pull_url': 'https://hg.mozilla.org/integration/b2g-inbound', > 'gecko_push_url': 'ssh://hg.mozilla.org/integration/b2g-inbound', > 'gecko_local_dir': 'b2g-inbound', > + 'git_ref_cache_dir': '/builds/b2g_bumper', is there a reason this is a path to the directory rather than just to the file? it looks like you're just using a single cache file in the code, so better to be explicit here about which file to use? ::: scripts/b2g_bumper.py @@ +213,3 @@ > # are consistent between devices, as long as they use the same > + # remote/refname. > + if "%s:%s" % (remote_url, revision) in self._git_ref_cache: this combination of remote_url, revision is used quite a bit in this function. can you make a variable for this and re-use it below? e.g. cache_key = "%s:%s" % (remote_url, revision). you can also use that as the key into lookup_threads_by_parameters. @@ +220,5 @@ > p.setAttribute('revision', abs_revision) > continue > > + # maybe a thread is already in process, even if result is not yet in cache... > + if lookup_threads_by_parameters.get((remote_url, revision,)): could also be written as if cache_key in lookup_threads_by_parameters: ... @@ +221,5 @@ > continue > > + # maybe a thread is already in process, even if result is not yet in cache... > + if lookup_threads_by_parameters.get((remote_url, revision,)): > + self.info("Reusing results of existing running thread to lookup %s:%s") you're missing the values here to substitute in for %s:%s @@ +232,3 @@ > > # TODO: alert/notify on missing repositories > # TODO: Add external caching you've done this! @@ +604,5 @@ > + useful for sharing the cache across multiple branches (by exporting it in a previous > + b2g bumper run for a different branch). > + """ > + dirs = self.query_abs_dirs() > + git_ref_cache = os.path.join(dirs['git_ref_cache_dir'], 'git_ref_cache.json') I think it's better to have the filename be specified in the config, rather than just the cache dir. @@ +626,5 @@ > + def clear_git_ref_cache(self): > + """ Used to clear the temporary git ref cache, which should be done between > + runs of the same branch. Different branches may use the same cache without > + clearing it. > + """ It's not so much that you should run this between runs of the same branch, or not between runs of different branches. Rather, you want to clear the cache before you process a bunch of branches and/or devices so you're not using stale data. @@ +629,5 @@ > + clearing it. > + """ > + dirs = self.query_abs_dirs() > + git_ref_cache = os.path.join(dirs['git_ref_cache_dir'], 'git_ref_cache.json') > + self.rmtree(git_ref_cache) rmtree scares me - can we use a method to delete just a single file instead of something that can blow away directories?
Attachment #8522393 - Flags: review?(catlee) → review-
Comment on attachment 8522393 [details] [diff] [review] bug1097784_mozharness_v2.patch Review of attachment 8522393 [details] [diff] [review]: ----------------------------------------------------------------- ::: configs/b2g_bumper/master.py @@ +9,4 @@ > 'gecko_pull_url': 'https://hg.mozilla.org/integration/b2g-inbound', > 'gecko_push_url': 'ssh://hg.mozilla.org/integration/b2g-inbound', > 'gecko_local_dir': 'b2g-inbound', > + 'git_ref_cache_dir': '/builds/b2g_bumper', Agreed. Funnily enough, I started with a file, and then switched to a dir because the existing config query method was called self.query_abs_dirs() and is built into the parent classes too, and I didn't fancy reworking that - but in hindsight I could probably just set a member variable at instantiation time, and skip the method call to look it up, completely. ::: scripts/b2g_bumper.py @@ +213,3 @@ > # are consistent between devices, as long as they use the same > + # remote/refname. > + if "%s:%s" % (remote_url, revision) in self._git_ref_cache: Yes, good idea. Thanks! @@ +220,5 @@ > p.setAttribute('revision', abs_revision) > continue > > + # maybe a thread is already in process, even if result is not yet in cache... > + if lookup_threads_by_parameters.get((remote_url, revision,)): Will do. Makes sense to use the same key, quite right. Just to explain the reason I switched from a tuple to a string key: json does not allow a tuple as a key, so I made it a string so that we could export the cache as a json file. @@ +221,5 @@ > continue > > + # maybe a thread is already in process, even if result is not yet in cache... > + if lookup_threads_by_parameters.get((remote_url, revision,)): > + self.info("Reusing results of existing running thread to lookup %s:%s") Whooooooops @@ +232,3 @@ > > # TODO: alert/notify on missing repositories > # TODO: Add external caching Oh yes! Good spot! @@ +604,5 @@ > + useful for sharing the cache across multiple branches (by exporting it in a previous > + b2g bumper run for a different branch). > + """ > + dirs = self.query_abs_dirs() > + git_ref_cache = os.path.join(dirs['git_ref_cache_dir'], 'git_ref_cache.json') Agreed. Will do. @@ +626,5 @@ > + def clear_git_ref_cache(self): > + """ Used to clear the temporary git ref cache, which should be done between > + runs of the same branch. Different branches may use the same cache without > + clearing it. > + """ Totally agree. Will fix doc. @@ +629,5 @@ > + clearing it. > + """ > + dirs = self.query_abs_dirs() > + git_ref_cache = os.path.join(dirs['git_ref_cache_dir'], 'git_ref_cache.json') > + self.rmtree(git_ref_cache) Very good point. Will do. I think originally I looked for a mozharness method for a single file but couldn't find one, so maybe I'll add a native file deletion call, or add an rm method in mozharness that is like rmtree but doesn't delete directories.
OK, my experiment failed above! I wrote my responses inline in the review, and hit publish, but the generated comments are shown against the code, rather than against the comment i was responding to. Never mind. Hopefully it is clear anyway, and if not, the comments can be viewed together here: https://bugzilla.mozilla.org/page.cgi?id=splinter.html&bug=1097784&attachment=8522393
Updated, based on review feedback.
Attachment #8522393 - Attachment is obsolete: true
Attachment #8522933 - Flags: review?(catlee)
Comment on attachment 8522933 [details] [diff] [review] bug1097784_mozharness_v3.patch Review of attachment 8522933 [details] [diff] [review]: ----------------------------------------------------------------- r+ with one important correction (I think) ::: scripts/b2g_bumper.py @@ +238,2 @@ > (remote_url, revision)) > + lookup_threads_by_parameters[(remote_url, revision,)] = async_result this should be lookup_threads_by_parameters[cache_key] now
Attachment #8522933 - Flags: review?(catlee) → review+
(In reply to Chris AtLee [:catlee] from comment #8) > Comment on attachment 8522933 [details] [diff] [review] > bug1097784_mozharness_v3.patch > > Review of attachment 8522933 [details] [diff] [review]: > ----------------------------------------------------------------- > > r+ with one important correction (I think) > > ::: scripts/b2g_bumper.py > @@ +238,2 @@ > > (remote_url, revision)) > > + lookup_threads_by_parameters[(remote_url, revision,)] = async_result > > this should be lookup_threads_by_parameters[cache_key] now I'm both highly impressed and very grateful for this spot! Indeed it should be cache_key! Will fix and land. Thanks Catlee!
Just landing the mozharness part first, and watching logs. If all goes ok, will also land puppet change. The mozharness part should not show marked improvements, since it adds the cache code but does not use it. The puppet change should have the more significant impact, but I prefer to land separately so I can check they are ok one at a time.
OK it seems fine, I'm going to land the scary untested puppet change now, and watch logs even more closely :p ....
So this is all working now. Before I close - just one minor point. Now that we only have one process running at a time instead of 5, we could potentially increase the number of threads we are running to run the git ls-remote commands (e.g. multiply it by a factor of 5 - currently I believe we run 20 threads - we could make this e.g. 100). Catlee, what do you think about this? End-to-end it takes about 3 mins to run: 2014-11-17 10:55:02 pid-22539 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v1.4.py, log in /builds/b2g_bumper/v1.4.log 2014-11-17 10:55:25 pid-22539 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.0.py, log in /builds/b2g_bumper/v2.0.log 2014-11-17 10:56:00 pid-22539 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v1.3t.py, log in /builds/b2g_bumper/v1.3t.log 2014-11-17 10:56:12 pid-22539 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.1.py, log in /builds/b2g_bumper/v2.1.log 2014-11-17 10:56:33 pid-22539 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/master.py, log in /builds/b2g_bumper/master.log 2014-11-17 10:58:10 pid-22539 All branches succeeded, touching /builds/b2g_bumper/b2g_bumper.stamp so nagios knows we're running ok It might be nicer to leave it as it is, since it runs every 5 mins, so if it takes 3 minutes that is not too bad. It also might mean that git.m.o is not hammered so heavily as if we had 100 threads hit it at a time.
How fast does it run with 100 threads?
Surprisingly, no real change! E.g. below is with 100 threads, and it took 3m18s. In comment 15 with 20 threads it took 3m8s, so in this case was 10s slower. Each iteration varies, it seems not to be highly consistent with timings, but I've run it for a few iteration and don't see a lot of change. I double checked that my test change was not getting overwritten, e.g. by the mozharness update or by puppet, but I had disabled both, and it wasn't getting updated - so the reason for no change is not that my change got reverted. Maybe git.m.o serialises the requests it gets in, and is running at max power, so an increase in threads does not help. In any case, setting back to 20, and closing this bug. Pete 2014-11-18 02:20:03 pid-14787 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v1.4.py, log in /builds/b2g_bumper/v1.4.log 2014-11-18 02:20:28 pid-14787 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.0.py, log in /builds/b2g_bumper/v2.0.log 2014-11-18 02:21:04 pid-14787 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v1.3t.py, log in /builds/b2g_bumper/v1.3t.log 2014-11-18 02:21:18 pid-14787 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/v2.1.py, log in /builds/b2g_bumper/v2.1.log 2014-11-18 02:21:41 pid-14787 Running b2g bumper using /builds/b2g_bumper/mozharness/configs/b2g_bumper/master.py, log in /builds/b2g_bumper/master.log 2014-11-18 02:23:21 pid-14787 All branches succeeded, touching /builds/b2g_bumper/b2g_bumper.stamp so nagios knows we're running ok
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Ah, one good thing. When I ran with 100 threads, the nagios alert came back: nagios-releng 11:30:47 Tue 02:30:55 PST [4396] buildbot-master66.srv.releng.usw2.mozilla.com:load is WARNING: WARNING - load average: 12.94, 12.45, 10.51 (http://m.mozilla.org/load) This is like it was before when we had 5 processes each with 20 threads running. Even if we did not gain a lot of end-to-end wall time improvement with this change, we reduced the load by a factor of five, which stopped the nagios alerts. So we've lowered load on git.m.o even if we have not gained a lot of wall time improvements.
This is now needed since bug 1091066 landed.
Attachment #8524579 - Flags: review?(catlee)
Reopening due to b2g-manifest fix required.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8524579 - Flags: review?(catlee) → review+
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Component: Tools → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: