Closed Bug 1059943 Opened 6 years ago Closed 5 years ago

Upload test package files to S3 keyed by filename+hash

Categories

(Testing :: General, defect)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ted, Unassigned)

References

Details

(Whiteboard: [leave open])

Attachments

(2 files, 1 obsolete file)

Right now we zip up all the test files + harnesses into one big zip and upload that to FTP next to the build. This sucks because the zip file is huge and we spend a lot of time uploading and downloading those bits.

Most of the files in the test package don't actually change between builds--we have a lot of tests and most of them get added once and then hardly ever change.

catlee and I talked about using blobber to store test package files, but that makes blobber a SPOF for test runs which is not great. We think we can wrap the useful functionality of blobber into a client-side library and then just upload directly to s3 using the filename + hash as the key. This would mean that each build only has to upload files that have never previously been uploaded, which ought to be a small, manageable number.

We'd want to upload each file from the test package to S3, then generate a manifest of relative_file_path -> URL for each file so that clients can fetch all the files they need.
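The key scheme and manifest described above could be sketched roughly as follows. This is an illustrative sketch only; the exact key format (basename + SHA-1 here) and bucket URL are assumptions, not what the bug settled on:

```python
import hashlib
import os


def object_key(path):
    """Content-addressed S3 key for a file: <basename>/<sha1-of-contents>.

    The "filename + hash" layout here is an assumption for illustration.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return "%s/%s" % (os.path.basename(path), h.hexdigest())


def build_manifest(root, bucket_url):
    """Map each relative file path under root to the URL of its S3 object."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            manifest[rel] = "%s/%s" % (bucket_url, object_key(full))
    return manifest
```

Because keys are derived from content, a file that hasn't changed between builds maps to an object that already exists and needs no re-upload.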
Preliminary investigation looks promising. I took the test packages from win{32,64}, linux{32,64} and mac from today's nightly builds and ran them through these scripts:
https://github.com/luser/test-package-analysis

The output was:
Total input files 284773
Total unique hashes 53868
Total original size 1841028511
Size saved by removing duplicates 1152708118
% of original after de-duplication: 37%

Which means that about 63% of the bytes of our test zip contents (uncompressed) are duplicated across platforms.
Using two inbound Linux64 builds that came one after the other:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1409332139/firefox-34.0a1.en-US.linux-x86_64.tests.zip
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1409336582/firefox-34.0a1.en-US.linux-x86_64.tests.zip

Total input files 113924
Total unique hashes 52363
Total original size 811478830
Size saved by removing duplicates 648517161
% of original after de-duplication: 20%
I still want to do this, but realized we need some cheap way to test if an object should be uploaded or not.

If a test package contains ~50k objects, we can't reasonably HEAD each object in S3 to determine whether it should be uploaded. It's too expensive in time, network traffic, and cost. Perhaps we can come up with cached manifests of objects that exist, or use some kind of hierarchical naming/hashing scheme to determine quickly which objects need uploading.
We could certainly use some sort of simple local cache; even just a JSON file would probably work. I'm not sure how we'd work the mechanics of it; maybe have a bootstrap JSON cache file that the machine can fetch before it starts uploading, and refresh the bootstrap file periodically by reading the uploaded JSON manifests?
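A minimal sketch of such a local cache, assuming a hypothetical file format (a flat JSON list of object keys) that the bug never actually pinned down:

```python
import json


def load_seen_objects(cache_path):
    """Load the set of object keys believed to already exist in S3.

    The cache file format (a flat JSON list of keys) is a hypothetical
    sketch, not a format specified anywhere in this bug.
    """
    try:
        with open(cache_path) as f:
            return set(json.load(f))
    except (IOError, ValueError):
        # Missing or corrupt cache: fall back to checking/uploading everything.
        return set()


def filter_uploads(candidate_keys, seen):
    """Return only the keys that still need uploading."""
    return [k for k in candidate_keys if k not in seen]
```

A stale cache is harmless here: the worst case is a redundant upload of an object that already exists, which is idempotent under content-addressed keys.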
Purely out of curiosity, have the download/upload times been tested talking directly to/from S3? Upload/download should be _very_ fast if correctly optimized (for EC2 talking to S3, anyway); from an earlier test I observed 1-5s uploads for the test.zips and 2-10s downloads.

I am hoping some combination of diffs + talking to s3 instead of ftp will yield some big wins...
Make S3 objects idempotent and have names derived from a hash of their content.

The filesystem is then your cache. Store fetched objects in

/s3cache/bucket1/A/B/XXXXXXXXX
/s3cache/bucket1/C/D/XXXXXXXXX

Assemble objects into a working directory using your manifest.

Congratulations, you just built the beginnings of a source control system :)
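The sharded cache layout above might look like this in code. The two-character shard width is an assumption; the `/s3cache/bucket1/A/B/XXXXXXXXX` example doesn't pin it down:

```python
import os


def cache_path(cache_root, bucket, digest):
    """Map a content hash to a sharded on-disk path, mirroring the
    /s3cache/<bucket>/A/B/<digest> layout described above.

    Sharding into prefix directories keeps any single directory from
    accumulating tens of thousands of entries.
    """
    return os.path.join(cache_root, bucket, digest[:2], digest[2:4], digest)
```

Since objects are immutable (the name is derived from the content), a cache hit never needs validation: if the file exists at that path, it is the right bytes.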
FWIW, s3 uploads have a high latency. They need to be parallelized if you want to upload more than one thing quickly.
Did you write code for this in sccache?

It's not too difficult to put a thread pool in front of boto. e.g. https://github.com/sstoiana/s3funnel/blob/master/s3funnel/__init__.py

I wonder how high you can make the concurrency before the GIL becomes a problem. I hate the GIL so much.
sccache uses separate processes for compilation already, so it's just benefiting from that.
I'm working on some code that uses the multiprocessing module to do the uploads in separate processes. It avoids the GIL issue. I'm also gzip compressing the content before uploading.
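A sketch of the multiprocessing-plus-gzip approach described above. The actual S3 PUT (via boto) is elided, since that part of the code isn't shown in the bug; this only illustrates the compression and process fan-out:

```python
import gzip
import io
import multiprocessing


def compress_for_upload(path):
    """Gzip one file's contents in a worker process.

    The actual S3 upload call is elided. mtime=0 keeps the gzip stream
    itself deterministic, so identical inputs produce identical objects.
    """
    with open(path, "rb") as f:
        data = f.read()
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(data)
    return path, buf.getvalue()


def compress_all(paths, workers=32):
    """Fan compression out over worker processes to sidestep the GIL."""
    pool = multiprocessing.Pool(workers)
    try:
        return dict(pool.map(compress_for_upload, paths))
    finally:
        pool.close()
        pool.join()
```

Separate processes avoid the GIL contention raised earlier in the thread, at the cost of serializing results back to the parent.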
I did some simple timings working with a recent win32 test package with 73,148 files. All tasks were done with 32 concurrent workers. The object manifest is a simple compressed list of all object hashes contained in the bucket. Clients try to load the object manifest first, so they can avoid HEAD requests for every object they have locally. Objects are gzip compressed on upload and decompressed after being downloaded.

Initial upload, nothing is cached on S3, no object manifest: 5m17s
Upload, everything cached on S3, no object manifest: 3m53s
Without the object manifest, the client has to do a lookup for each of the 73k files.

Upload, everything cached on S3, with complete object manifest: 0m05s

Download & populate working dir with nothing cached locally: 5m04s
Download & populate working dir with everything cached, but empty working dir: 0m30s
Download & populate working dir with everything cached and populated working dir: 0m04s

The worst case times here are an order of magnitude better on instances in EC2 vs. on my laptop; I was getting ~15 minutes for a full fresh upload from my system. The latency to S3 is obviously much better from EC2.

So this approach definitely looks promising. I'd like to try this out on some branch to see how the rate of files changing impacts performance.
One thing to watch out for is TCP connection non-reuse. It's very easy to establish a separate TCP connection for every HTTP request. You may want to verify your S3 activity all occurs on the same socket from the worker. I suspect it is since you are achieving 7 req/s/worker in your 5:17 figure. But you never know.
(In reply to Gregory Szorc [:gps] from comment #12)
> One thing to watch out for is TCP connection non-reuse. It's very easy to
> establish a separate TCP connection for every HTTP request.

It's even worse with HTTPS, which is more likely what uploads use.
Going even further down the path into inventing a new version control system, I think we can improve performance here by bundling smaller objects together into 'packs'. The idea is to collect objects together to make tarballs up to say 10MB large. These can then be uploaded and downloaded as a single S3 object and processed on the clients. The directory manifest would list both the object hash and pack hash.

We should also special-case 0-byte files. There are 1,271 0-byte files in my sample tests archive. There's no need to do any network activity at all for 0-byte files; their hash can be statically defined.

478 of these 0-byte files are .gitkeep files from web-platform.
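A sketch of both ideas: a statically defined hash for empty files, and a greedy grouping of small files into packs of bounded size. The greedy first-fit strategy and the (path, size) input shape are illustrative assumptions:

```python
import hashlib

# The SHA-1 of empty content is a constant; 0-byte files never need
# any network activity at all.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def plan_packs(files, max_pack=10 * 1024 * 1024):
    """Greedily group (path, size) pairs into packs of up to max_pack bytes.

    A simple first-fit sketch of the 'pack' idea above; a real tool
    would then tar up each group and upload it as a single S3 object.
    """
    packs, current, current_size = [], [], 0
    for path, size in files:
        if size == 0:
            continue  # covered by EMPTY_SHA1, skip entirely
        if current and current_size + size > max_pack:
            packs.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        packs.append(current)
    return packs
```

With 95% of files at 8k or less, packing turns tens of thousands of tiny PUT/GET round trips into a few hundred, which is where the latency savings come from.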
I think packs will make a huge difference. 99% of the files are 50k or less. 95% of the files are 8k or less.
(In reply to Mike Hommey [:glandium] from comment #13)
> (In reply to Gregory Szorc [:gps] from comment #12)
> > One thing to watch out for is TCP connection non-reuse. It's very easy to
> > establish a separate TCP connection for every HTTP request.
> 
> It's even worse with HTTPS, which is more likely what uploads use.

I wanted to verify this, so I ran tcpdump while the upload was running and indeed saw thousands of connections being made. It turns out the way I was distributing jobs to the workers was defeating the connection pooling built into boto. After reworking that part, I can re-use connections for longer (although Amazon seems to disconnect them after ~5s or some number of requests). As a result the wall clock time for uploading everything drops to ~2m38s.
I've got a machine set up watching for new tests.zip uploads and then processing them into this S3-backed object storage. So far I'm seeing about a 99% hit rate. The bulk of the cache misses are from .xpi and .jar files right now, and the differences there are in the timestamps of the files in the metadata rather than the contents themselves. Going to look at making those deterministic.
Yeah, I noticed that. I think we could simply swap out the zip command for a call to dozip.py, which uses the packaging code (which fakes all the timestamps):
http://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/dozip.py
Damn, I reviewed that and forgot about it. I was about to suggest writing something like it.
Seems to work!

https://hg.mozilla.org/try/rev/159ddfb1b70b

where else should I be making this change?

unzip -l ./xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/addons/upgradeable1x2-3_2.xpi
Archive:  ./xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/addons/upgradeable1x2-3_2.xpi
  Length      Date    Time    Name
---------  ---------- -----   ----
      614  01-01-2010 00:00   ../../../toolkit/mozapps/extensions/test/addons/upgradeable1x2-3_2/install.rdf
---------                     -------
      614                     1 file
Non-deterministic XPI generation is a known problem with the build system. Bug 988938 tracks fixing it.
That bug seems way over-scoped for the problem at hand. The simple s/zip/python dozip.py/ change in catlee's patch will deliver deterministic XPIs without all the build-fu that bug wants for fixing other problems.
I fully support a zip wrapper that uses mozpack for .zip generation.
Is that not dozip.py?
It is! I think we're just talking around each other.
Attachment #8535922 - Flags: review?(gps)
Comment on attachment 8535922 [details] [diff] [review]
use dozip.py to package test XPIs

Review of attachment 8535922 [details] [diff] [review]:
-----------------------------------------------------------------

We have a few other places where XPIs are created as part of the build. Not sure if they are in the tests archive, however.

Also, dozip.py is only marginally better than `zip`. There is still room to optimize it to not write the destination file if it hasn't changed (AFAICT it doesn't utilize a FileAvoidWrite). But this is all follow-up fodder.
Attachment #8535922 - Flags: review?(gps) → review+
Whiteboard: [leave open]
:Mossop, any ideas why these xpcshell tests fail when we force-set the date in the test XPIs?
Flags: needinfo?(dtownsend)
I saw this in the pushlog, and had a spare moment, so I'll offer some drive-by troubleshooting:

The log had errors along the lines of "Invalid XPI: [Exception... "Component returned failure code: 0x80520006 (NS_ERROR_FILE_TARGET_DOES_NOT_EXIST) ...", so I diffed one of the xpis that was failing. In particular, I was looking at test_bootstrap1_1.xpi.

Without the patch:
test_bootstrap1_1.xpi:
  /
    bootstrap.js
    install.rdf
    version.jsm
And services/sync/tests/unit/test_addons_engine.js passes

With the patch:
test_bootstrap1_1.xpi:
  /../../../toolkit/mozapps/extensions/test/addons/test_bootstrap1_1/
    bootstrap.js
    install.rdf
    version.jsm
And services/sync/tests/unit/test_addons_engine.js fails with the exception above.

It looks like dozip.py wants to store paths relative to dist/bin, and you need to set --base-dir=<something> to make it stop doing that. --base-dir=$$dir seems like it does the right thing. (It at least makes the new xpi match the old one, and the test passes on my machine.)
What he said: it looks like the directory entries in the zip files are incorrect. I never knew zip files could include .. parts in the directory; I wonder how safe our zip extraction code is in that case...
Flags: needinfo?(dtownsend)
With that change made, I still have some windows xpcshell test failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5991a247dccb&filter-searchStr=xpcshell
Depends on: 1126944
The issue I'm hitting right now is maintenance of the object manifest. It's important to have a list of which objects already exist in S3 so we avoid having to HEAD every single object.

After a few days of sucking in all test packages from production load, we have 1,385,797 objects in the bucket, and the manifest is 23MB compressed.
Can we stick it at a known URL and write a script to automate amalgamating the outputs of the build machines (by watching pulse or whatever) and updating that copy?

Also, I wonder if it wouldn't be worthwhile to let the build system indicate things that are going to change every build (like the C++ unit test binaries) so we don't bother listing them in that manifest.
It's already at a known URL, it's just much bigger than I expected it to be!

That being said, I think it's manageable with some caching. Even if it's updated every hour, downloading an extra 20-30MB per hour isn't much overhead.

Being able to distinguish the high-churn files from the others would be great. Perhaps we can sample the past N days of build manifests to figure out which objects to include in the global manifest?
So I diffed the XPIs created with dozip.py vs zip. The most interesting difference, and the one I suspect is causing us problems here, is that the files are marked as 'text' in zip's version, but marked as 'binary' in dozip.py's version. This corresponds to the 'internal_attr' field in JarCdirEntry.

There are various other differences as well. I'll attach a diff of zipinfo output.
Attached patch xpis.diffSplinter Review
(In reply to Chris AtLee [:catlee] from comment #37)
> So I diffed the XPIs created with dozip.py vs zip. The most interesting
> difference, and the one I suspect is causing us problems here, is that the
> files are marked as 'text' in zip's version, but marked as 'binary' in
> dozip.py's version. This corresponds to the 'internal_attr' field in
> JarCdirEntry.

I'm going to make another guess. The non-msdos file attributes you see are actually unix permissions, 0x81B6 is 100666, i.e. world readable/writable. The version made by dozip however has permissions of 0, if the file were extracted with those permissions nothing would be able to access it. I know the zip code in Firefox does obey permissions so maybe that is the problem.
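The attribute fields being discussed can be inspected with Python's zipfile module. A small sketch: the unix permission bits live in the high 16 bits of a zip entry's external attributes, and 0x81B6 is indeed octal 100666:

```python
import stat
import zipfile


def entry_unix_mode(info):
    """Extract the unix st_mode bits from a zip entry's external attributes.

    The high 16 bits of external_attr hold the unix mode; 0x81B6 there is
    octal 100666, i.e. a regular file readable and writable by everyone.
    """
    return (info.external_attr >> 16) & 0xFFFF
```

Comparing this value between the zip- and dozip.py-produced XPIs would confirm the permissions theory directly.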
I took a different approach here. Instead of figuring out how to get dozip.py to generate XPIs compatible with native zip's output, I wrote a tool to zero out the dates inside an existing zip file. This seems to have worked well on my system for generating consistent output.

There are a bunch of places in the build system where we're generating these XPIs. I'm wondering if it's better to run the stripping process in package-tests rather than piecemeal in the various modules that are creating them?
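The attached stripziptimes.py patches the zip's bytes in place; a simpler (if slower) sketch of the same idea rewrites the archive through Python's zipfile module, forcing every entry's timestamp to a fixed date. The 2010-01-01 date matches the unzip listing earlier in this bug; preserving external_attr keeps the permission bits discussed above intact:

```python
import zipfile

FIXED_DATE = (2010, 1, 1, 0, 0, 0)


def strip_zip_times(src, dst):
    """Rewrite a zip with every entry's timestamp forced to FIXED_DATE.

    This re-creates the archive rather than patching bytes in place like
    the attached stripziptimes.py, but the effect on timestamps is the same.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            new_info = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            # Preserve permissions and compression method; only dates change.
            new_info.external_attr = info.external_attr
            new_info.compress_type = info.compress_type
            zout.writestr(new_info, data)
```

Two runs over the same input then produce byte-identical output regardless of when the files were built, which is what the content-addressed cache needs.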
Attachment #8535922 - Attachment is obsolete: true
Attachment #8561388 - Flags: feedback?
Attachment #8561388 - Flags: feedback? → feedback?(ted)
Comment on attachment 8561388 [details] [diff] [review]
strip times from xpis

Review of attachment 8561388 [details] [diff] [review]:
-----------------------------------------------------------------

You're probably right that doing this in test-package-stage makes more sense than doing it every single place we write a zip file.

Overall I find this code hard to follow with all the bytearray manipulation it's doing. Maybe encapsulating a little bit more of the contents of main into a generator or something would make it easier to follow the flow without getting bogged down?

::: toolkit/mozapps/installer/stripziptimes.py
@@ +2,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# This script strips timestamps from zip files

gps would probably want you to put this under python/mozbuild/mozbuild/action.

@@ +58,5 @@
> +
> +
> +def main(args):
> +    parser = argparse.ArgumentParser()
> +    parser.add_argument("zip", help="Path to zip file to write")

Technically this is "to modify" since you read it and can write it.

@@ +60,5 @@
> +def main(args):
> +    parser = argparse.ArgumentParser()
> +    parser.add_argument("zip", help="Path to zip file to write")
> +    parser.add_argument("--inplace", dest="in_place", help="update file in-place", action="store_true", default=False)
> +    parser.add_argument("-o", "--output", dest="output", help="where to write output; default is stdout")

I would make the defaults here what you're using in the tree instead of passing --inplace everywhere. Keeping an option for testing is fine, but mostly YAGNI.

@@ +71,5 @@
> +
> +    # Make a copy of the zip's data so we can modify it in memory
> +    new_data = bytearray(jarfile._data.tobytes())
> +
> +    for e in range(cdir['cdir_entries']):

Some of these variable names are not very self-describing and it makes this code kind of hard to follow.
Attachment #8561388 - Flags: feedback?(ted) → feedback+
catlee and I talked about this in person today, and we think this approach just has too many problems to be usable in practice.

However! We discussed an alternate approach that I will file shortly that should be workable.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Oh, bug 917999 already exists we'll use that.