Closed Bug 862910 Opened 12 years ago Closed 10 years ago

cache MAR + installer downloads in update verify

Categories

(Release Engineering :: Release Automation: Other, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: pmoore)

References

Details

Attachments

(3 files, 2 obsolete files)

We download the same MAR many times during update verify. For example, I count 18 times that we downloaded a single locale's complete MAR in https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/21.0b3-candidates/build1/logs/release-mozilla-beta-linux_update_verify_1-bm58-build1-build0.txt.gz:
grep http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/21.0b3-candidates/build1/update/linux-i686/ta-LK/firefox-21.0b3.complete.mar release-mozilla-beta-linux_update_verify_1-bm58-build1-build0.txt | grep retry | wc -l

We should cache this, like we started doing for final verify in bug 628796, and we can greatly speed up update verify. We could improve cache hits further by adjusting the chunking algorithm to favour keeping locales across different versions in the same chunk.

This could be a huge speed up in testing time and also reduce the load on ftp by a lot.
ftp.m.o will love you for this.
Caching builds (tar.bz2, dmg, exe) would be relatively easy. To do that we need to:
* Make download_builds.sh save to a longer path - right now it only saves to files like "firefox-21.0b2.tar.bz2": https://github.com/mozilla/build-tools/blob/master/release/common/download_builds.sh#L27
* Have it check for an existing file and either skip the download or verify that it's the right size with a HEAD request if it exists.
* Adjust other logic (eg https://github.com/mozilla/build-tools/blob/master/release/updates/verify.sh#L143) to deal with the longer names.

Not sure about the MAR part yet...
Summary: cache MAR downloads in update verify → cache MAR + installer downloads in update verify
MAR downloads might actually be easier as it turns out. All of the MAR downloading and staging happens in https://github.com/mozilla/build-tools/blob/master/release/common/download_mars.sh. So, if those were made to download to a filename or path that included platform, locale, version, and patch type we could do similar checking for existence (and maybe verifying the size) as suggested for installers in the previous comment. If we have the file already, we could skip the actual download and adjust the cp that stages it at the end: https://github.com/mozilla/build-tools/blob/master/release/common/download_mars.sh#L79
Pete did great work with the final verify work, and since no good deed goes unpunished... ;)

No guarantee he'll be able to tackle this immediately though. He's gather a lot of state for the mobile cleanup (Q2 goal) right now.
Assignee: nobody → pmoore
Pete, not that I don't like what you did with final verify (I really really do), but can you try to keep to the original scope here? We'll probably rewrite update verify in Python if we want on-machine ||ization, so I don't want to sink more into the existing scripts than we have to.

If you have any questions about the current state of things that aren't answered in comment #2  or 3, just let me know!
Product: mozilla.org → Release Engineering
Possibly a dupe of bug 710461.
I've started working on a patch for this.

See https://github.com/petemoore/build-tools/tree/bug862910

Initial draft - not yet tested.
Status: NEW → ASSIGNED
Pete, where does this sit on your list of priorities ? I ask because update verify for each beta is driving a lot of traffic to our ftp cluster, and nagios alerts go off for the oncall guys as machines get overloaded. It's possible something is wrong their end (AFAIK we haven't changed anything on ours) but this bug is part of being a lot nicer to cluster.
Pete, please see my query in comment #9.
Flags: needinfo?(pmoore)
Hey Nick,

Sorry, I'm not sure how i missed the needinfo here - apologies for delay. I spoke to :coop about this and agreed that I'd give it my full attention next week - so it will be moving to the top of my list on Monday (buildduty today).

Pete
Flags: needinfo?(pmoore)
So since writing the above code (https://github.com/petemoore/build-tools/tree/bug862910) I have dived quite deep into the mechanics of this whole process.

To summarize my thinking on this a little bit, there are two high-level approaches we can try to gain performance improvements, without impacting test coverage:

1) caching downloads
2) parallelizing downloads

In addition, we *could* consider taking an md5/sha of the updater binary in each "complete" upgrade path (as opposed to the "partial" upgrade paths) - and if the md5 matches a successfully tested upgrade, we *could* choose to skip it. This is making an assumption that if the complete upgrade works for one release, it should work for another, assuming the updater used is the same updater (which is taken from the old version you are upgrading away from). Potentially this could make a massive time saving, if we end up only testing one complete upgrade path, and then just checking that the updater contained in all historic versions that we are testing upgrading from, is the same. Furthermore, if the md5/sha of each release is recorded somewhere, and we only download that md5/sha value, as opposed to downloading the full historic release, we could have massive savings in download time too.

OK, going back to the caching/parallelising topic:

1) Caching
It is possible to do this inline, in the existing code, by checking if a download has already been performed, and reusing the binary if it exists. Alternatively it is possible to avoid writing such custom code, by installing a http caching proxy on the slave, and route all web traffic (curl, wget) through that proxy (e.g. by exporting proxy environment variables) so that the cache is transparently used by the update verify tool (and any other tools we set to use it).

Both options have merit (inline code means less components to maintain - no caching proxy component to install, but more code to maintain). Both mechanisms need to be able to handle cleaning/emptying of the cache, potentially prioritizing some cached files over others, and making sure that the disk does not fill up.

My instinct is to go with the first option - to modify the bash scripts to natively cache, as this is least impactful when testing/debugging/rolling out. Probably the simplest solution to clearing the cache is to completely wipe the existing cache when the job starts, or when the job finishes (or both).

2) Parallelizing downloads
I was hoping to reuse the code in final_verification.sh to parallelize the downloads of the .mar files and the native binary installers (.dmg files, etc). However, after testing in msys, I discovered the final_verification code will not run under msys. There are two missing libraries: mktemp and curl. Both can no doubt be installed, e.g. mktemp can be downloaded from http://downloads.sourceforge.net/project/mingw/OldFiles/msys-mktemp/mktemp-1.5-MSYS.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fmingw%2Ffiles%2FOldFiles%2Fmsys-mktemp%2F&ts=1398342670&use_mirror=heanet - however that then requires a change to all windows builders, or we would need to check in mktemp into the tools repository, or update GPO, ... - I think this comes at too much cost for the benefit it offers.

If we do not install the missing dependencies, I see we have two obvious solutions to overcome this current inability to download in parallel using the existing final_verification.sh code:

1) Download in parallel using python
2) Take final_verification.sh code, but switch out curl to use wget on windows, and avoid mktemp altogether, if possible, or at least on windows.

I think option 2) above is the quickest / easiest way to get this bug implemented.

Summary
=======
Plan is to:

  *) download all .mar files and native installers (.tar.bz2, .dmg, ...) upfront into a local cache, before the update validation step even starts
  *) delete full cache when job completes
  *) delete full cache when job starts, in case job did not complete properly for some reason, and did not delete at end of previous job
  *) parallelize downloads, like it is done in final_verification, *but* do not use mktemp or curl - instead use wget, and another mechanism to generate temporary files if needed (not mktemp)
  *) discuss with team the possibility just to validate the md5/sha of the "upgrader" inside an old release, and assume if it is a known "good one" to assume a complete upgrade from this version is ok, rather than explicitly test it
  *) don't merge update verify and final verification code together for now, due to incompatibility (mktemp/curl currently not supported on our msys windows environments) - instead take the "bones" of the parallelized downloads part of final_verification, adjust to work under msys/darwin/linux, and use that
  *) test thoroughly that e.g. xargs is working correctly on msys with multiple concurrent bash invocations (xargs -P option)
Just read comment 5 - that makes it simple. :)

Will only do caching downloads, not parallelizing for now.
Attached patch Patch for tools repo (obsolete) — Splinter Review
Hi Nick,

This is tested locally on Darwin, and I even gave it a spin in msys - that said, I tested it only by creating a small .cfg file, and running verify.sh -c <config file> directly (as opposed to running a full staging release, or a full update verify, for example).

If you think I need to do more testing first, let me know. Is there an easy way to trigger a staging update verify? :/

As per Ben's request - this patch only deals with (per-job) caching, and not parallisation of downloads. Still many more optimisations that can be done, if we want to get the time down further. This is just a "quick win".

Thanks,
Pete
Attachment #8412107 - Flags: review?(nthomas)
Attachment #8412107 - Flags: feedback?(bhearsum)
Attached patch Patch for tools repo (obsolete) — Splinter Review
Nick is on PTO - so giving to Ben, and also I spotted a minor bug:

exit "${exit_code}"

should have been:

return "${exit_code}"

since the return code of a function is handled by "return"; "exit" would exit the entire bash script in the case of failure, rather than give a non-zero return code to the function call. Didn't pick this up in testing as I didn't try testing a bad url (whoops - should have done a negative test too...).
Attachment #8412107 - Attachment is obsolete: true
Attachment #8412107 - Flags: review?(nthomas)
Attachment #8412107 - Flags: feedback?(bhearsum)
Attachment #8412119 - Flags: review?(bhearsum)
Attachment #8412119 - Flags: feedback?(nthomas)
Previous patch was corrupt! Fixed.

BTW - how/where do we define what available disk space should be available for a job? This job will obviously need significantly more disk space than it used to, since it will cache all the downloads.

If this is unworkable, there is an option that we optimize the code so that rather than caching everything, we shuffle the order that the tests run in, so that we can then download a "set" at a time, process those, delete the cache, and then process the next "set" etc - this would use much less disk space, but might involve quite a rewrite.

It would go something along the lines of using the current code to generate a list of tests (rather than calling the tests directly in the loop) and at the end, we end up with a full list of tests to be performed. Then we would sort this list in such a way as to place all tests together that use the same binaries, and then run each "set" in turn.

Thumb in the air says if each chunk has 20 release versions, and we need complete install binary (40MB) plus complete mars (40MB) for around 100 locales, we are looking at 80GB, plus partials (say 5 releases x 20MB x 100 locales = 10GB) - I guess we are looking at around 100GB in total(?) - just a very rough "thumb in the air".

I see for example that w64-ix-slave153 currently has 146GB available. Haven't checked a mac or linux slave. If we are running on spot instances, I think this could be a blocker.
Attachment #8412119 - Attachment is obsolete: true
Attachment #8412119 - Flags: review?(bhearsum)
Attachment #8412119 - Flags: feedback?(nthomas)
Attachment #8412560 - Flags: review?(bhearsum)
Attachment #8412560 - Flags: feedback?(nthomas)
After writing my last comment, I realised it would make sense to get real numbers. It turns out the disk space usage is not so much, and the savings are quite huge.

I wrote this small script to calculate what saving will be based on an existing update verify (attachment). The results were:

pmoore@fred:~/tmp $ ./process_log.sh
Total download size when *not* using cache: 76.49 Gb
Total download size when using cache: 5.29 Gb
Total saving: 93.09%

The script basically parses this example real-world log file:
http://buildbot-master84.srv.releng.scl3.mozilla.com:8001/builders/release-mozilla-beta-win32_update_verify_3%2F6/builds/12/steps/run_script/logs/stdio/text

And from this it calculates the sum of the Content-Length's added together from each url downloaded versus the sum of the Content-Length of each unique url (to simulate what would happen with the cache enabled).

A 93% saving in downloads should hopefully help with performance - although of course a big part of the time is taken up performing the update, not just downloading binaries.
Comment on attachment 8412560 [details] [diff] [review]
Patch for tools repo

Review of attachment 8412560 [details] [diff] [review]:
-----------------------------------------------------------------

::: release/common/cached_download.sh
@@ +24,5 @@
> +
> +    if fgrep -x "${url}" "${cache_dir}/urls.list" >/dev/null; then
> +        echo "Retrieving '${url}' from cache..."
> +        local line_number="$(fgrep -nx  "${url}" "${cache_dir}/urls.list" | sed 's/:.*//')"
> +        cp "${cache_dir}/obj_$(printf "%05d\n" "${line_number}").cache" "${output_file}"

I would suggest using an associative array here instead of the flat file, but we'd have to escape urls like
  http://download.mozilla.org/?product=firefox-29.0b8-complete&os=win&lang=en-US
or bash will laugh. Did you consider hashing the url to generate the filename in cache ? It'd help readability I think.

Otherwise this looks pretty good.
Attachment #8412560 - Flags: feedback?(nthomas) → feedback+
re some of the earlier comments - techniques which concentrate the traffic to the start of the job would be dangerous to ftp.m.o, because there are 4 platforms * 6 jobs/platform which get triggered simultaneously. Granted it wouldn't be that much worse than now, but we want to give ftp.m.o a break because it's pretty busy serving files for tests.
Comment on attachment 8412560 [details] [diff] [review]
Patch for tools repo

Review of attachment 8412560 [details] [diff] [review]:
-----------------------------------------------------------------

I'm giving this an f+ for now. I don't understand bash well enough to review it without seeing it running. You also need to add the purge_builds call as mentioned below.

(In reply to Pete Moore [:pete][:pmoore] from comment #12)
>   *) download all .mar files and native installers (.tar.bz2, .dmg, ...)
> upfront into a local cache, before the update validation step even starts

+1 to what Nick says about not doing this all up-front. I lik eyour idea for doing "sets" of things below.

> If you think I need to do more testing first, let me know. Is there an easy
> way to trigger a staging update verify? :/

Yeah, this is pretty easy. Ping me on IRC and I'll give you a hand.

(In reply to Pete Moore [:pete][:pmoore] from comment #16)
> BTW - how/where do we define what available disk space should be available
> for a job? This job will obviously need significantly more disk space than
> it used to, since it will cache all the downloads.

We don't run purge_builds for update verify at all right now. That should be easy to add to https://github.com/mozilla/build-tools/blob/master/scripts/release/updates/chunked-verify.sh. https://github.com/mozilla/build-tools/blob/master/scripts/release/generate-sums.sh#L42 is a similar script that *does* call it.

> If this is unworkable, there is an option that we optimize the code so that
> rather than caching everything, we shuffle the order that the tests run in,
> so that we can then download a "set" at a time, process those, delete the
> cache, and then process the next "set" etc - this would use much less disk
> space, but might involve quite a rewrite.

This sounds like a great idea regardless of anything else.

> I see for example that w64-ix-slave153 currently has 146GB available.
> Haven't checked a mac or linux slave. If we are running on spot instances, I
> think this could be a blocker.

Linux and Mac will be much more pressed for space. You'll definitely need the purge_builds.py change regardless of how you implement this.
Attachment #8412560 - Flags: feedback+
Attachment #8412560 - Flags: review?(bhearsum)
I'm going to be on PTO, and have some other B2G work on my plate which is pretty pressing - is this also something we can give to an intern? The code is written, we just need some staging testing, and to add the purge settings to make sure the slaves have enough space. Please also note we will need to test on *all* platforms before going live.

Otherwise I can look at this at some point in July, hopefully, when I'm back from PTO and the vcs sync stuff is cut over.
So I'm on build duty this week, and will use this opportunity to have a look at this again. I'm going to set up a staging environment - grabbing all the slave classes I need for update_verify.

That said, it is Thursday evening for me here now, so by the end of tomorrow hopefully I have a staging environment, but I don't expect to have everything wrapped up by then, but I can hopefully at least test the bash code on all the OS flavours it needs to run on, to check it actually works.
So basically build duty was not quite as quiet as I had hoped it would be today. I established which types of machines I need to get as loaners, but the rest of the day was busy.

It seems we run update verify for linux32 on linux64 machines. This was a surprise, but I guess it works, it is ok. So I'll be getting three loaners to cover the testing:

bld-linux64-hp-xxxx (not using ix, as we are under capacity on ix at the moment)
bld-lion-r5-xxx
w64-ix-slavexxx
I ran this is staging, redoing 32.0 update verify. There is a set of green results at 
  http://dev-master1.build.mozilla.org:8710/one_line_per_build
so no obvious platform weirdness to fix up (only ran one linux32, but I don't think it matters).
That is awesome, thanks so much for this Nick!

The only missing part then is the purge builds settings, and then I believe this is good-to-go!

Pete
It would be good to get this landed soon. I ran a tweaked version of Pete's disk space estimation script across the most recent release, 24.0 esr, and beta. The highest usage it spat out was:
/pub/mozilla.org/firefox/candidates/32.0.2-candidates/build1/logs/release-mozilla-release-macosx64_update_verify_4-bm84-build1-build7.txt.gz
Total download size when *not* using cache: 50.05 Gb
Total download size when using cache: 14.07 Gb
Total saving: 71.90%

So I think it's safe to use 16GB here.

I'd like to land this tonight and re-run a 33.0b5 update verify job to be 100% sure it works before the next beta.
Attachment #8492385 - Flags: review?
Comment on attachment 8492385 [details] [diff] [review]
add purge builds to pete's patch

don't forget to hg add cached_download.sh from the previous patch
Attachment #8492385 - Flags: review? → review+
Comment on attachment 8492385 [details] [diff] [review]
add purge builds to pete's patch

https://hg.mozilla.org/build/tools/rev/9bf6fafd2dbf
Attachment #8492385 - Flags: checked-in+
Needed a minor bustage fix but it worked fine after that. Saved over an hour on the rerun of the slowest platform (mac). Leaving bug open because I think there's at least one follow-up to fix:
19:44 < catlee-away> [17:51:44] bhearsum|afk: I think you want rel-*:45d or something
Thanks guys for landing this on my day off! Awesome work. =)
Some numbers:
* Mac Beta, chunk 1/6. Before: 2h, After: 1h43min - savings of 17min
* Windows Beta, chunk 1/6. Before: 1h56min, After: 1h18min - savings of 38min
* Linux Beta, chunk 1/6. Before: 1h6min, After: 1h1min - savings of 5min

This is not bad, but I had it in my head that it would be a bit better this. I suppose this goes to show that we spend a lot of the time actually unpacking files and applying updates vs. spending it downloading.

It might be good to see how well we perform the next time we do a batch of releases together - I'm hoping that we'll reduce the load on FTP significantly.
In the case of Linux, it depends if you got an AWS slave or not, and if the tunnel is working nicely at the moment (seems OK now). We were expecting the biggest win there, so worth double checking if you got an scl3 machine for your example.
So, we can consider parallelising downloads for further performance improvements (which may or may not be marginal - depends on how much of the total time is taken up with downloading).

Or we could close this bug, and leave it as we have it. =)

Do we have an idea what proportion of the total time is taken up with downloading, and whether it is worth optimising?

Pete
Flags: needinfo?(nthomas)
Flags: needinfo?(bhearsum)
I only have anecdotal evidence from watching some manual tests go by, fwiw on in-house machines. Downloading wasn't a problem, the caching is a really really nice win. We now spend most time unpacking the installers (particularly on mac) and applying the update. We could look at parallelizing on each machine, or caching unpacked files (minding the disk space hit). Lets call this done, thanks Pete.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Flags: needinfo?(nthomas)
Flags: needinfo?(bhearsum)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: