862910 - cache MAR + installer downloads in update verify

Reporter

Description

•

12 years ago

We download the same MAR many times during update verify. For example, I count 18 times that we downloaded a single locale's complete MAR in https://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/21.0b3-candidates/build1/logs/release-mozilla-beta-linux_update_verify_1-bm58-build1-build0.txt.gz: grep http://stage.mozilla.org/pub/mozilla.org/firefox/nightly/21.0b3-candidates/build1/update/linux-i686/ta-LK/firefox-21.0b3.complete.mar release-mozilla-beta-linux_update_verify_1-bm58-build1-build0.txt | grep retry | wc -l We should cache this, like we started doing for final verify in bug 628796, and we can greatly speed up update verify. We could improve cache hits further by adjusting the chunking algorithm to favour keeping locales across different versions in the same chunk. This could be a huge speed up in testing time and also reduce the load on ftp by a lot.

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

12 years ago

ftp.m.o will love you for this.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 2

•

12 years ago

Caching builds (tar.bz2, dmg, exe) would be relatively easy. To do that we need to: * Make download_builds.sh save to a longer path - right now it only saves to files like "firefox-21.0b2.tar.bz2": https://github.com/mozilla/build-tools/blob/master/release/common/download_builds.sh#L27 * Have it check for an existing file and either skip the download or verify that it's the right size with a HEAD request if it exists. * Adjust other logic (eg https://github.com/mozilla/build-tools/blob/master/release/updates/verify.sh#L143) to deal with the longer names. Not sure about the MAR part yet...

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

12 years ago

Summary: cache MAR downloads in update verify → cache MAR + installer downloads in update verify

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 3

•

12 years ago

MAR downloads might actually be easier as it turns out. All of the MAR downloading and staging happens in https://github.com/mozilla/build-tools/blob/master/release/common/download_mars.sh. So, if those were made to download to a filename or path that included platform, locale, version, and patch type we could do similar checking for existence (and maybe verifying the size) as suggested for installers in the previous comment. If we have the file already, we could skip the actual download and adjust the cp that stages it at the end: https://github.com/mozilla/build-tools/blob/master/release/common/download_mars.sh#L79

Chris Cooper [:coop] (he/him)

Comment 4

•

12 years ago

Pete did great work with the final verify work, and since no good deed goes unpunished... ;) No guarantee he'll be able to tackle this immediately though. He's gather a lot of state for the mobile cleanup (Q2 goal) right now.

Assignee: nobody → pmoore

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 5

•

12 years ago

Pete, not that I don't like what you did with final verify (I really really do), but can you try to keep to the original scope here? We'll probably rewrite update verify in Python if we want on-machine ||ization, so I don't want to sink more into the existing scripts than we have to. If you have any questions about the current state of things that aren't answered in comment #2 or 3, just let me know!

Nobody; OK to take it and work on it

Updated

•

12 years ago

Product: mozilla.org → Release Engineering

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 6

•

11 years ago

Possibly a dupe of bug 710461.

Pete Moore [:pmoore][:pete]

Assignee

Comment 7

•

11 years ago

I've started working on a patch for this. See https://github.com/petemoore/build-tools/tree/bug862910 Initial draft - not yet tested.

Status: NEW → ASSIGNED

Nick Thomas [:nthomas] (UTC+12)

Comment 9

•

11 years ago

Pete, where does this sit on your list of priorities ? I ask because update verify for each beta is driving a lot of traffic to our ftp cluster, and nagios alerts go off for the oncall guys as machines get overloaded. It's possible something is wrong their end (AFAIK we haven't changed anything on ours) but this bug is part of being a lot nicer to cluster.

Nick Thomas [:nthomas] (UTC+12)

Comment 10

•

11 years ago

Pete, please see my query in comment #9.

Flags: needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Assignee

Comment 11

•

11 years ago

Hey Nick, Sorry, I'm not sure how i missed the needinfo here - apologies for delay. I spoke to :coop about this and agreed that I'd give it my full attention next week - so it will be moving to the top of my list on Monday (buildduty today). Pete

Flags: needinfo?(pmoore)

Pete Moore [:pmoore][:pete]

Assignee

Comment 12

•

11 years ago

So since writing the above code (https://github.com/petemoore/build-tools/tree/bug862910) I have dived quite deep into the mechanics of this whole process. To summarize my thinking on this a little bit, there are two high-level approaches we can try to gain performance improvements, without impacting test coverage: 1) caching downloads 2) parallelizing downloads In addition, we *could* consider taking an md5/sha of the updater binary in each "complete" upgrade path (as opposed to the "partial" upgrade paths) - and if the md5 matches a successfully tested upgrade, we *could* choose to skip it. This is making an assumption that if the complete upgrade works for one release, it should work for another, assuming the updater used is the same updater (which is taken from the old version you are upgrading away from). Potentially this could make a massive time saving, if we end up only testing one complete upgrade path, and then just checking that the updater contained in all historic versions that we are testing upgrading from, is the same. Furthermore, if the md5/sha of each release is recorded somewhere, and we only download that md5/sha value, as opposed to downloading the full historic release, we could have massive savings in download time too. OK, going back to the caching/parallelising topic: 1) Caching It is possible to do this inline, in the existing code, by checking if a download has already been performed, and reusing the binary if it exists. Alternatively it is possible to avoid writing such custom code, by installing a http caching proxy on the slave, and route all web traffic (curl, wget) through that proxy (e.g. by exporting proxy environment variables) so that the cache is transparently used by the update verify tool (and any other tools we set to use it). Both options have merit (inline code means less components to maintain - no caching proxy component to install, but more code to maintain). Both mechanisms need to be able to handle cleaning/emptying of the cache, potentially prioritizing some cached files over others, and making sure that the disk does not fill up. My instinct is to go with the first option - to modify the bash scripts to natively cache, as this is least impactful when testing/debugging/rolling out. Probably the simplest solution to clearing the cache is to completely wipe the existing cache when the job starts, or when the job finishes (or both). 2) Parallelizing downloads I was hoping to reuse the code in final_verification.sh to parallelize the downloads of the .mar files and the native binary installers (.dmg files, etc). However, after testing in msys, I discovered the final_verification code will not run under msys. There are two missing libraries: mktemp and curl. Both can no doubt be installed, e.g. mktemp can be downloaded from http://downloads.sourceforge.net/project/mingw/OldFiles/msys-mktemp/mktemp-1.5-MSYS.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fmingw%2Ffiles%2FOldFiles%2Fmsys-mktemp%2F&ts=1398342670&use_mirror=heanet - however that then requires a change to all windows builders, or we would need to check in mktemp into the tools repository, or update GPO, ... - I think this comes at too much cost for the benefit it offers. If we do not install the missing dependencies, I see we have two obvious solutions to overcome this current inability to download in parallel using the existing final_verification.sh code: 1) Download in parallel using python 2) Take final_verification.sh code, but switch out curl to use wget on windows, and avoid mktemp altogether, if possible, or at least on windows. I think option 2) above is the quickest / easiest way to get this bug implemented. Summary ======= Plan is to: *) download all .mar files and native installers (.tar.bz2, .dmg, ...) upfront into a local cache, before the update validation step even starts *) delete full cache when job completes *) delete full cache when job starts, in case job did not complete properly for some reason, and did not delete at end of previous job *) parallelize downloads, like it is done in final_verification, *but* do not use mktemp or curl - instead use wget, and another mechanism to generate temporary files if needed (not mktemp) *) discuss with team the possibility just to validate the md5/sha of the "upgrader" inside an old release, and assume if it is a known "good one" to assume a complete upgrade from this version is ok, rather than explicitly test it *) don't merge update verify and final verification code together for now, due to incompatibility (mktemp/curl currently not supported on our msys windows environments) - instead take the "bones" of the parallelized downloads part of final_verification, adjust to work under msys/darwin/linux, and use that *) test thoroughly that e.g. xargs is working correctly on msys with multiple concurrent bash invocations (xargs -P option)

Pete Moore [:pmoore][:pete]

Assignee

Comment 13

•

11 years ago

Just read comment 5 - that makes it simple. :) Will only do caching downloads, not parallelizing for now.

Pete Moore [:pmoore][:pete]

Assignee

Comment 14

•

11 years ago

Attached patch Patch for tools repo (obsolete) — Details — Splinter Review

Hi Nick, This is tested locally on Darwin, and I even gave it a spin in msys - that said, I tested it only by creating a small .cfg file, and running verify.sh -c <config file> directly (as opposed to running a full staging release, or a full update verify, for example). If you think I need to do more testing first, let me know. Is there an easy way to trigger a staging update verify? :/ As per Ben's request - this patch only deals with (per-job) caching, and not parallisation of downloads. Still many more optimisations that can be done, if we want to get the time down further. This is just a "quick win". Thanks, Pete

Attachment #8412107 - Flags: review?(nthomas)

Attachment #8412107 - Flags: feedback?(bhearsum)

Pete Moore [:pmoore][:pete]

Assignee

Comment 15

•

11 years ago

Attached patch Patch for tools repo (obsolete) — Details — Splinter Review

Nick is on PTO - so giving to Ben, and also I spotted a minor bug: exit "${exit_code}" should have been: return "${exit_code}" since the return code of a function is handled by "return"; "exit" would exit the entire bash script in the case of failure, rather than give a non-zero return code to the function call. Didn't pick this up in testing as I didn't try testing a bad url (whoops - should have done a negative test too...).

Attachment #8412107 - Attachment is obsolete: true

Attachment #8412107 - Flags: review?(nthomas)

Attachment #8412107 - Flags: feedback?(bhearsum)

Attachment #8412119 - Flags: review?(bhearsum)

Attachment #8412119 - Flags: feedback?(nthomas)

Pete Moore [:pmoore][:pete]

Assignee

Comment 16

•

11 years ago

Attached patch Patch for tools repo — Details — Splinter Review

Previous patch was corrupt! Fixed. BTW - how/where do we define what available disk space should be available for a job? This job will obviously need significantly more disk space than it used to, since it will cache all the downloads. If this is unworkable, there is an option that we optimize the code so that rather than caching everything, we shuffle the order that the tests run in, so that we can then download a "set" at a time, process those, delete the cache, and then process the next "set" etc - this would use much less disk space, but might involve quite a rewrite. It would go something along the lines of using the current code to generate a list of tests (rather than calling the tests directly in the loop) and at the end, we end up with a full list of tests to be performed. Then we would sort this list in such a way as to place all tests together that use the same binaries, and then run each "set" in turn. Thumb in the air says if each chunk has 20 release versions, and we need complete install binary (40MB) plus complete mars (40MB) for around 100 locales, we are looking at 80GB, plus partials (say 5 releases x 20MB x 100 locales = 10GB) - I guess we are looking at around 100GB in total(?) - just a very rough "thumb in the air". I see for example that w64-ix-slave153 currently has 146GB available. Haven't checked a mac or linux slave. If we are running on spot instances, I think this could be a blocker.

Attachment #8412119 - Attachment is obsolete: true

Attachment #8412119 - Flags: review?(bhearsum)

Attachment #8412119 - Flags: feedback?(nthomas)

Attachment #8412560 - Flags: review?(bhearsum)

Attachment #8412560 - Flags: feedback?(nthomas)

Pete Moore [:pmoore][:pete]

Assignee

Comment 17

•

11 years ago

Attached file Script to demonstrate download saving in real terms — Details

After writing my last comment, I realised it would make sense to get real numbers. It turns out the disk space usage is not so much, and the savings are quite huge. I wrote this small script to calculate what saving will be based on an existing update verify (attachment). The results were: pmoore@fred:~/tmp $ ./process_log.sh Total download size when *not* using cache: 76.49 Gb Total download size when using cache: 5.29 Gb Total saving: 93.09% The script basically parses this example real-world log file: http://buildbot-master84.srv.releng.scl3.mozilla.com:8001/builders/release-mozilla-beta-win32_update_verify_3%2F6/builds/12/steps/run_script/logs/stdio/text And from this it calculates the sum of the Content-Length's added together from each url downloaded versus the sum of the Content-Length of each unique url (to simulate what would happen with the cache enabled). A 93% saving in downloads should hopefully help with performance - although of course a big part of the time is taken up performing the update, not just downloading binaries.

Nick Thomas [:nthomas] (UTC+12)

Comment 18

•

11 years ago

Comment on attachment 8412560 [details] [diff] [review] Patch for tools repo Review of attachment 8412560 [details] [diff] [review]: ----------------------------------------------------------------- ::: release/common/cached_download.sh @@ +24,5 @@ > + > + if fgrep -x "${url}" "${cache_dir}/urls.list" >/dev/null; then > + echo "Retrieving '${url}' from cache..." > + local line_number="$(fgrep -nx "${url}" "${cache_dir}/urls.list" | sed 's/:.*//')" > + cp "${cache_dir}/obj_$(printf "%05d\n" "${line_number}").cache" "${output_file}" I would suggest using an associative array here instead of the flat file, but we'd have to escape urls like http://download.mozilla.org/?product=firefox-29.0b8-complete&os=win&lang=en-US or bash will laugh. Did you consider hashing the url to generate the filename in cache ? It'd help readability I think. Otherwise this looks pretty good.

Attachment #8412560 - Flags: feedback?(nthomas) → feedback+

Nick Thomas [:nthomas] (UTC+12)

Comment 19

•

11 years ago

re some of the earlier comments - techniques which concentrate the traffic to the start of the job would be dangerous to ftp.m.o, because there are 4 platforms * 6 jobs/platform which get triggered simultaneously. Granted it wouldn't be that much worse than now, but we want to give ftp.m.o a break because it's pretty busy serving files for tests.

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 20

•

11 years ago

Comment on attachment 8412560 [details] [diff] [review] Patch for tools repo Review of attachment 8412560 [details] [diff] [review]: ----------------------------------------------------------------- I'm giving this an f+ for now. I don't understand bash well enough to review it without seeing it running. You also need to add the purge_builds call as mentioned below. (In reply to Pete Moore [:pete][:pmoore] from comment #12) > *) download all .mar files and native installers (.tar.bz2, .dmg, ...) > upfront into a local cache, before the update validation step even starts +1 to what Nick says about not doing this all up-front. I lik eyour idea for doing "sets" of things below. > If you think I need to do more testing first, let me know. Is there an easy > way to trigger a staging update verify? :/ Yeah, this is pretty easy. Ping me on IRC and I'll give you a hand. (In reply to Pete Moore [:pete][:pmoore] from comment #16) > BTW - how/where do we define what available disk space should be available > for a job? This job will obviously need significantly more disk space than > it used to, since it will cache all the downloads. We don't run purge_builds for update verify at all right now. That should be easy to add to https://github.com/mozilla/build-tools/blob/master/scripts/release/updates/chunked-verify.sh. https://github.com/mozilla/build-tools/blob/master/scripts/release/generate-sums.sh#L42 is a similar script that *does* call it. > If this is unworkable, there is an option that we optimize the code so that > rather than caching everything, we shuffle the order that the tests run in, > so that we can then download a "set" at a time, process those, delete the > cache, and then process the next "set" etc - this would use much less disk > space, but might involve quite a rewrite. This sounds like a great idea regardless of anything else. > I see for example that w64-ix-slave153 currently has 146GB available. > Haven't checked a mac or linux slave. If we are running on spot instances, I > think this could be a blocker. Linux and Mac will be much more pressed for space. You'll definitely need the purge_builds.py change regardless of how you implement this.

Attachment #8412560 - Flags: feedback+

bhearsum@mozilla.com (:bhearsum)

Reporter

Updated

•

11 years ago

Attachment #8412560 - Flags: review?(bhearsum)

Pete Moore [:pmoore][:pete]

Assignee

Comment 21

•

11 years ago

I'm going to be on PTO, and have some other B2G work on my plate which is pretty pressing - is this also something we can give to an intern? The code is written, we just need some staging testing, and to add the purge settings to make sure the slaves have enough space. Please also note we will need to test on *all* platforms before going live. Otherwise I can look at this at some point in July, hopefully, when I'm back from PTO and the vcs sync stuff is cut over.

Pete Moore [:pmoore][:pete]

Assignee

Comment 22

•

11 years ago

So I'm on build duty this week, and will use this opportunity to have a look at this again. I'm going to set up a staging environment - grabbing all the slave classes I need for update_verify. That said, it is Thursday evening for me here now, so by the end of tomorrow hopefully I have a staging environment, but I don't expect to have everything wrapped up by then, but I can hopefully at least test the bash code on all the OS flavours it needs to run on, to check it actually works.

Pete Moore [:pmoore][:pete]

Assignee

Comment 23

•

11 years ago

So basically build duty was not quite as quiet as I had hoped it would be today. I established which types of machines I need to get as loaners, but the rest of the day was busy. It seems we run update verify for linux32 on linux64 machines. This was a surprise, but I guess it works, it is ok. So I'll be getting three loaners to cover the testing: bld-linux64-hp-xxxx (not using ix, as we are under capacity on ix at the moment) bld-lion-r5-xxx w64-ix-slavexxx

Nick Thomas [:nthomas] (UTC+12)

Comment 24

•

10 years ago

I ran this is staging, redoing 32.0 update verify. There is a set of green results at http://dev-master1.build.mozilla.org:8710/one_line_per_build so no obvious platform weirdness to fix up (only ran one linux32, but I don't think it matters).

Pete Moore [:pmoore][:pete]

Assignee

Comment 25

•

10 years ago

That is awesome, thanks so much for this Nick! The only missing part then is the purge builds settings, and then I believe this is good-to-go! Pete

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 26

•

10 years ago

Attached patch add purge builds to pete's patch — Details — Splinter Review

It would be good to get this landed soon. I ran a tweaked version of Pete's disk space estimation script across the most recent release, 24.0 esr, and beta. The highest usage it spat out was: /pub/mozilla.org/firefox/candidates/32.0.2-candidates/build1/logs/release-mozilla-release-macosx64_update_verify_4-bm84-build1-build7.txt.gz Total download size when *not* using cache: 50.05 Gb Total download size when using cache: 14.07 Gb Total saving: 71.90% So I think it's safe to use 16GB here. I'd like to land this tonight and re-run a 33.0b5 update verify job to be 100% sure it works before the next beta.

Attachment #8492385 - Flags: review?

Rail Aliiev [:rail]

Comment 27

•

10 years ago

Comment on attachment 8492385 [details] [diff] [review] add purge builds to pete's patch don't forget to hg add cached_download.sh from the previous patch

Attachment #8492385 - Flags: review? → review+

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 28

•

10 years ago

Comment on attachment 8492385 [details] [diff] [review] add purge builds to pete's patch https://hg.mozilla.org/build/tools/rev/9bf6fafd2dbf

Attachment #8492385 - Flags: checked-in+

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 29

•

10 years ago

Needed a minor bustage fix but it worked fine after that. Saved over an hour on the rerun of the slowest platform (mac). Leaving bug open because I think there's at least one follow-up to fix: 19:44 < catlee-away> [17:51:44] bhearsum|afk: I think you want rel-*:45d or something

Pete Moore [:pmoore][:pete]

Assignee

Comment 30

•

10 years ago

Thanks guys for landing this on my day off! Awesome work. =)

bhearsum@mozilla.com (:bhearsum)

Reporter

Comment 31

•

10 years ago

Some numbers: * Mac Beta, chunk 1/6. Before: 2h, After: 1h43min - savings of 17min * Windows Beta, chunk 1/6. Before: 1h56min, After: 1h18min - savings of 38min * Linux Beta, chunk 1/6. Before: 1h6min, After: 1h1min - savings of 5min This is not bad, but I had it in my head that it would be a bit better this. I suppose this goes to show that we spend a lot of the time actually unpacking files and applying updates vs. spending it downloading. It might be good to see how well we perform the next time we do a batch of releases together - I'm hoping that we'll reduce the load on FTP significantly.

Nick Thomas [:nthomas] (UTC+12)

Comment 32

•

10 years ago

In the case of Linux, it depends if you got an AWS slave or not, and if the tunnel is working nicely at the moment (seems OK now). We were expecting the biggest win there, so worth double checking if you got an scl3 machine for your example.

Pete Moore [:pmoore][:pete]

Assignee

Comment 33

•

10 years ago

So, we can consider parallelising downloads for further performance improvements (which may or may not be marginal - depends on how much of the total time is taken up with downloading). Or we could close this bug, and leave it as we have it. =) Do we have an idea what proportion of the total time is taken up with downloading, and whether it is worth optimising? Pete

Flags: needinfo?(nthomas)

Flags: needinfo?(bhearsum)

Nick Thomas [:nthomas] (UTC+12)

Comment 34

•

10 years ago

I only have anecdotal evidence from watching some manual tests go by, fwiw on in-house machines. Downloading wasn't a problem, the caching is a really really nice win. We now spend most time unpacking the installers (particularly on mac) and applying the update. We could look at parallelizing on each machine, or caching unpacked files (minding the disk space hit). Lets call this done, thanks Pete.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Flags: needinfo?(nthomas)

Flags: needinfo?(bhearsum)

Resolution: --- → FIXED

Patch for tools repo 11 years ago Pete Moore [:pmoore][:pete] 4.01 KB, patch		Details \| Diff \| Splinter Review
Patch for tools repo 11 years ago Pete Moore [:pmoore][:pete] 4.02 KB, patch		Details \| Diff \| Splinter Review
Patch for tools repo 11 years ago Pete Moore [:pmoore][:pete] 4.02 KB, patch	nthomas : feedback+ bhearsum : feedback+	Details \| Diff \| Splinter Review
Script to demonstrate download saving in real terms 11 years ago Pete Moore [:pmoore][:pete] 1.36 KB, text/plain		Details
add purge builds to pete's patch 10 years ago bhearsum@mozilla.com (:bhearsum) 2.75 KB, patch	rail : review+ bhearsum : checked-in+	Details \| Diff \| Splinter Review