Closed Bug 1059943 Opened 6 years ago Closed 5 years ago

Upload test package files to S3 keyed by filename+hash

Categories

(Testing :: General, defect)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ted, Unassigned)

References

Details

(Whiteboard: [leave open])

Attachments

(2 files, 1 obsolete file)

Right now we zip up all the test files + harnesses into one big zip and upload that to FTP next to the build. This sucks because the zip file is huge and we spend a lot of time uploading and downloading those bits.

Most of the files in the test package don't actually change between builds--we have a lot of tests and most of them get added once and then hardly ever change.

catlee and I talked about using blobber to store test package files, but that makes blobber a SPOF for test runs which is not great. We think we can wrap the useful functionality of blobber into a client-side library and then just upload directly to s3 using the filename + hash as the key. This would mean that each build only has to upload files that have never previously been uploaded, which ought to be a small, manageable number.

We'd want to upload each file from the test package to S3, then generate a manifest of relative_file_path -> URL for each file so that clients can fetch all the files they need.
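The key scheme and manifest described above could be sketched roughly as follows. This is an illustrative sketch only; the exact key format (basename + SHA-1 here) and bucket URL are assumptions, not what the bug settled on:

```python
import hashlib
import os


def object_key(path):
    """Content-addressed S3 key for a file: <basename>/<sha1-of-contents>.

    The "filename + hash" layout here is an assumption for illustration.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return "%s/%s" % (os.path.basename(path), h.hexdigest())


def build_manifest(root, bucket_url):
    """Map each relative file path under root to the URL of its S3 object."""
    manifest = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            manifest[rel] = "%s/%s" % (bucket_url, object_key(full))
    return manifest
```

Because keys are derived from content, a file that hasn't changed between builds maps to an object that already exists and needs no re-upload.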
Preliminary investigation looks promising. I took the test packages from win{32,64}, linux{32,64} and mac from today's nightly builds and ran them through these scripts:
https://github.com/luser/test-package-analysis

The output was:
Total input files 284773
Total unique hashes 53868
Total original size 1841028511
Size saved by removing duplicates 1152708118
% of original after de-duplication: 37%

Which means that about 63% of the bytes of our test zip contents (uncompressed) are duplicated across platforms.
Using two inbound Linux64 builds that came one after the other:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1409332139/firefox-34.0a1.en-US.linux-x86_64.tests.zip
http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-linux64/1409336582/firefox-34.0a1.en-US.linux-x86_64.tests.zip

Total input files 113924
Total unique hashes 52363
Total original size 811478830
Size saved by removing duplicates 648517161
% of original after de-duplication: 20%
I still want to do this, but realized we need some cheap way to test if an object should be uploaded or not.

If a test package contains ~50k objects, we can't reasonably HEAD each object in S3 to determine whether it should be uploaded. It's too expensive in time, network traffic, and cost. Perhaps we can come up with cached manifests of objects that exist, or use some kind of hierarchical naming/hashing scheme to determine quickly which objects need uploading.
We could certainly use some sort of simple local cache; even just a JSON file would probably work. I'm not sure how we'd work the mechanics of it; maybe have a bootstrap JSON cache file that the machine can fetch before it starts uploading, and refresh the bootstrap file periodically by reading the uploaded JSON manifests?
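A minimal sketch of such a local cache, assuming a hypothetical file format (a flat JSON list of object keys) that the bug never actually pinned down:

```python
import json


def load_seen_objects(cache_path):
    """Load the set of object keys believed to already exist in S3.

    The cache file format (a flat JSON list of keys) is a hypothetical
    sketch, not a format specified anywhere in this bug.
    """
    try:
        with open(cache_path) as f:
            return set(json.load(f))
    except (IOError, ValueError):
        # Missing or corrupt cache: fall back to checking/uploading everything.
        return set()


def filter_uploads(candidate_keys, seen):
    """Return only the keys that still need uploading."""
    return [k for k in candidate_keys if k not in seen]
```

A stale cache is harmless here: the worst case is a redundant upload of an object that already exists, which is idempotent under content-addressed keys.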
Purely out of curiosity, have the download/upload times been tested talking directly to/from S3? Upload/download should be _very_ fast if correctly optimized (for EC2 talking to S3, anyway); from an earlier test I observed 1-5s uploads for the test.zips and 2-10s downloads.

I am hoping some combination of diffs + talking to s3 instead of ftp will yield some big wins...
Make S3 objects idempotent and have names derived from a hash of their content.

The filesystem is then your cache. Store fetched objects in

/s3cache/bucket1/A/B/XXXXXXXXX
/s3cache/bucket1/C/D/XXXXXXXXX

Assemble objects into a working directory using your manifest.

Congratulations, you just built the beginnings of a source control system :)
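The sharded cache layout above might look like this in code. The two-character shard width is an assumption; the `/s3cache/bucket1/A/B/XXXXXXXXX` example doesn't pin it down:

```python
import os


def cache_path(cache_root, bucket, digest):
    """Map a content hash to a sharded on-disk path, mirroring the
    /s3cache/<bucket>/A/B/<digest> layout described above.

    Sharding into prefix directories keeps any single directory from
    accumulating tens of thousands of entries.
    """
    return os.path.join(cache_root, bucket, digest[:2], digest[2:4], digest)
```

Since objects are immutable (the name is derived from the content), a cache hit never needs validation: if the file exists at that path, it is the right bytes.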
FWIW, s3 uploads have a high latency. They need to be parallelized if you want to upload more than one thing quickly.
Did you write code for this in sccache?

It's not too difficult to put a thread pool in front of boto. e.g. https://github.com/sstoiana/s3funnel/blob/master/s3funnel/__init__.py

I wonder how high you can make the concurrency before the GIL becomes a problem. I hate the GIL so much.
sccache uses separate processes for compilation already, so it's just benefiting from that.
I'm working on some code that uses the multiprocessing module to do the uploads in separate processes. It avoids the GIL issue. I'm also gzip compressing the content before uploading.
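A sketch of the multiprocessing-plus-gzip approach described above. The actual S3 PUT (via boto) is elided, since that part of the code isn't shown in the bug; this only illustrates the compression and process fan-out:

```python
import gzip
import io
import multiprocessing


def compress_for_upload(path):
    """Gzip one file's contents in a worker process.

    The actual S3 upload call is elided. mtime=0 keeps the gzip stream
    itself deterministic, so identical inputs produce identical objects.
    """
    with open(path, "rb") as f:
        data = f.read()
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as gz:
        gz.write(data)
    return path, buf.getvalue()


def compress_all(paths, workers=32):
    """Fan compression out over worker processes to sidestep the GIL."""
    pool = multiprocessing.Pool(workers)
    try:
        return dict(pool.map(compress_for_upload, paths))
    finally:
        pool.close()
        pool.join()
```

Separate processes avoid the GIL contention raised earlier in the thread, at the cost of serializing results back to the parent.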
I did some simple timings working with a recent win32 test package with 73,148 files. All tasks were done with 32 concurrent workers. The object manifest is a simple compressed list of all object hashes contained in the bucket. Clients try to load the object manifest first, so they can avoid HEAD requests for every object they have locally. Objects are gzip compressed on upload and decompressed after being downloaded.

Initial upload, nothing is cached on S3, no object manifest: 5m17s
Upload, everything cached on S3, no object manifest: 3m53s
Without the object manifest, the client has to do a lookup for each of the 73k files.

Upload, everything cached on S3, with complete object manifest: 0m05s

Download & populate working dir with nothing cached locally: 5m04s
Download & populate working dir with everything cached, but empty working dir: 0m30s
Download & populate working dir with everything cached and populated working dir: 0m04s

The worst case times here are an order of magnitude better on instances in EC2 vs. on my laptop; I was getting ~15 minutes for a full fresh upload from my system. The latency to S3 is obviously much better from EC2.

So this approach definitely looks promising. I'd like to try this out on some branch to see how the rate of files changing impacts performance.
One thing to watch out for is TCP connection non-reuse. It's very easy to establish a separate TCP connection for every HTTP request. You may want to verify your S3 activity all occurs on the same socket from the worker. I suspect it is since you are achieving 7 req/s/worker in your 5:17 figure. But you never know.
(In reply to Gregory Szorc [:gps] from comment #12)
> One thing to watch out for is TCP connection non-reuse. It's very easy to
> establish a separate TCP connection for every HTTP request.

It's even worse with HTTPS, which is more likely what uploads use.
Going even further down the path into inventing a new version control system, I think we can improve performance here by bundling smaller objects together into 'packs'. The idea is to collect objects together to make tarballs up to say 10MB large. These can then be uploaded and downloaded as a single S3 object and processed on the clients. The directory manifest would list both the object hash and pack hash.

We should also special-case 0-byte files. There are 1,271 0-byte files in my sample tests archive. There's no need to do any network activity at all for 0-byte files; their hash can be statically defined.

478 of these 0-byte files are .gitkeep files from web-platform.
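A sketch of both ideas: a statically defined hash for empty files, and a greedy grouping of small files into packs of bounded size. The greedy first-fit strategy and the (path, size) input shape are illustrative assumptions:

```python
import hashlib

# The SHA-1 of empty content is a constant; 0-byte files never need
# any network activity at all.
EMPTY_SHA1 = hashlib.sha1(b"").hexdigest()


def plan_packs(files, max_pack=10 * 1024 * 1024):
    """Greedily group (path, size) pairs into packs of up to max_pack bytes.

    A simple first-fit sketch of the 'pack' idea above; a real tool
    would then tar up each group and upload it as a single S3 object.
    """
    packs, current, current_size = [], [], 0
    for path, size in files:
        if size == 0:
            continue  # covered by EMPTY_SHA1, skip entirely
        if current and current_size + size > max_pack:
            packs.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        packs.append(current)
    return packs
```

With 95% of files at 8k or less, packing turns tens of thousands of tiny PUT/GET round trips into a few hundred, which is where the latency savings come from.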
I think packs will make a huge difference. 99% of the files are 50k or less. 95% of the files are 8k or less.
(In reply to Mike Hommey [:glandium] from comment #13)
> (In reply to Gregory Szorc [:gps] from comment #12)
> > One thing to watch out for is TCP connection non-reuse. It's very easy to
> > establish a separate TCP connection for every HTTP request.
> 
> It's even worse with HTTPS, which is more likely what uploads use.

I wanted to verify this, so I ran tcpdump while the upload was running and indeed saw thousands of connections being made. It turns out the way I was distributing jobs to the workers was defeating the connection pooling built into boto. After reworking that part, I can re-use connections for longer (although Amazon seems to disconnect them after ~5s or some number of requests). As a result the wall clock time for uploading everything drops to ~2m38s.
I've got a machine set up watching for new tests.zip uploads and then processing them into this S3-backed object storage. So far I'm seeing about a 99% hit rate. The bulk of the cache misses are from .xpi and .jar files right now, and the differences there are in the timestamps of the files in the metadata rather than the contents themselves. Going to look at making those deterministic.
Yeah, I noticed that. I think we could simply swap out the zip command for a call to dozip.py, which uses the packaging code (which fakes all the timestamps):
http://dxr.mozilla.org/mozilla-central/source/toolkit/mozapps/installer/dozip.py
Damn, I reviewed that and forgot about it. I was about to suggest writing something like it.
Seems to work!

https://hg.mozilla.org/try/rev/159ddfb1b70b

where else should I be making this change?

unzip -l ./xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/addons/upgradeable1x2-3_2.xpi
Archive:  ./xpcshell/tests/toolkit/mozapps/extensions/test/xpcshell/addons/upgradeable1x2-3_2.xpi
  Length      Date    Time    Name
---------  ---------- -----   ----
      614  01-01-2010 00:00   ../../../toolkit/mozapps/extensions/test/addons/upgradeable1x2-3_2/install.rdf
---------                     -------
      614                     1 file
Non-deterministic XPI generation is a known problem with the build system. Bug 988938 tracks fixing it.
That bug seems way over-scoped for the problem at hand. The simple s/zip/python dozip.py/ change in catlee's patch will deliver deterministic XPIs without all the build-fu that bug wants for fixing other problems.
I fully support a zip wrapper that uses mozpack for .zip generation.
Is that not dozip.py?
It is! I think we're just talking around each other.
Attachment #8535922 - Flags: review?(gps)
Comment on attachment 8535922 [details] [diff] [review]
use dozip.py to package test XPIs

Review of attachment 8535922 [details] [diff] [review]:
-----------------------------------------------------------------

We have a few other places where XPIs are created as part of the build. Not sure if they are in the tests archive, however.

Also, dozip.py is only marginally better than `zip`. There is still room to optimize it to not write the destination file if it hasn't changed (AFAICT it doesn't utilize a FileAvoidWrite). But this is all follow-up fodder.
Attachment #8535922 - Flags: review?(gps) → review+
Whiteboard: [leave open]
:Mossop, any ideas why these xpcshell tests fail when we force-set the date in the test XPIs?
Flags: needinfo?(dtownsend)
I saw this in the pushlog, and had a spare moment, so I'll offer some drive-by troubleshooting:

The log had errors along the lines of "Invalid XPI: [Exception... "Component returned failure code: 0x80520006 (NS_ERROR_FILE_TARGET_DOES_NOT_EXIST) ...", so I diffed one of the xpis that was failing. In particular, I was looking at test_bootstrap1_1.xpi.

Without the patch:
test_bootstrap1_1.xpi:
  /
    bootstrap.js
    install.rdf
    version.jsm
And services/sync/tests/unit/test_addons_engine.js passes

With the patch:
test_bootstrap1_1.xpi:
  /../../../toolkit/mozapps/extensions/test/addons/test_bootstrap1_1/
    bootstrap.js
    install.rdf
    version.jsm
And services/sync/tests/unit/test_addons_engine.js fails with the exception above.

It looks like dozip.py wants to store paths relative to dist/bin, and you need to set --base-dir=<something> to make it stop doing that. --base-dir=$$dir seems like it does the right thing. (It at least makes the new xpi match the old one, and the test passes on my machine.)
What he said: it looks like the directory entries in the zip files are incorrect. I never knew zip files could include .. parts in the directory; I wonder how safe our zip extraction code is in that case...
Flags: needinfo?(dtownsend)
With that change made, I still have some windows xpcshell test failures:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5991a247dccb&filter-searchStr=xpcshell
Depends on: 1126944
The issue I'm hitting right now is maintenance of the object manifest. It's important to have a list of which objects already exist in S3 so we avoid having to HEAD every single object.

After a few days of sucking in all test packages from production load, we have 1,385,797 objects in the bucket, and the manifest is 23MB compressed.
Can we stick it at a known URL and write a script to automate amalgamating the outputs of the build machines (by watching pulse or whatever) and updating that copy?

Also, I wonder if it wouldn't be worthwhile to let the build system indicate things that are going to change every build (like the C++ unit test binaries) so we don't bother listing them in that manifest.
It's already at a known URL, it's just much bigger than I expected it to be!

That being said, I think it's manageable with some caching. Even if it's updated every hour, downloading an extra 20-30MB per hour isn't much overhead.

Being able to distinguish the high-churn files from the others would be great. Perhaps we can sample the past N days of build manifests to figure out which objects to include in the global manifest?
So I diffed the XPIs created with dozip.py vs zip. The most interesting difference, and the one I suspect is causing us problems here, is that the files are marked as 'text' in zip's version, but marked as 'binary' in dozip.py's version. This corresponds to the 'internal_attr' field in JarCdirEntry.

There are various other differences as well. I'll attach a diff of zipinfo output.
Attached patch xpis.diffSplinter Review
(In reply to Chris AtLee [:catlee] from comment #37)
> So I diffed the XPIs created with dozip.py vs zip. The most interesting
> difference, and the one I suspect is causing us problems here, is that the
> files are marked as 'text' in zip's version, but marked as 'binary' in
> dozip.py's version. This corresponds to the 'internal_attr' field in
> JarCdirEntry.

I'm going to make another guess. The non-msdos file attributes you see are actually unix permissions, 0x81B6 is 100666, i.e. world readable/writable. The version made by dozip however has permissions of 0, if the file were extracted with those permissions nothing would be able to access it. I know the zip code in Firefox does obey permissions so maybe that is the problem.
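The attribute fields being discussed can be inspected with Python's zipfile module. A small sketch: the unix permission bits live in the high 16 bits of a zip entry's external attributes, and 0x81B6 is indeed octal 100666:

```python
import stat
import zipfile


def entry_unix_mode(info):
    """Extract the unix st_mode bits from a zip entry's external attributes.

    The high 16 bits of external_attr hold the unix mode; 0x81B6 there is
    octal 100666, i.e. a regular file readable and writable by everyone.
    """
    return (info.external_attr >> 16) & 0xFFFF
```

Comparing this value between the zip- and dozip.py-produced XPIs would confirm the permissions theory directly.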
I took a different approach here. Instead of figuring out how to get dozip.py to generate XPIs compatible with native zip's output, I wrote a tool to zero out the dates inside an existing zip file. This seems to have worked well on my system for generating consistent output.

There are a bunch of places in the build system where we're generating these XPIs. I'm wondering if it's better to run the stripping process in package-tests rather than piecemeal in the various modules that are creating them?
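The attached stripziptimes.py patches the zip's bytes in place; a simpler (if slower) sketch of the same idea rewrites the archive through Python's zipfile module, forcing every entry's timestamp to a fixed date. The 2010-01-01 date matches the unzip listing earlier in this bug; preserving external_attr keeps the permission bits discussed above intact:

```python
import zipfile

FIXED_DATE = (2010, 1, 1, 0, 0, 0)


def strip_zip_times(src, dst):
    """Rewrite a zip with every entry's timestamp forced to FIXED_DATE.

    This re-creates the archive rather than patching bytes in place like
    the attached stripziptimes.py, but the effect on timestamps is the same.
    """
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for info in zin.infolist():
            data = zin.read(info.filename)
            new_info = zipfile.ZipInfo(info.filename, date_time=FIXED_DATE)
            # Preserve permissions and compression method; only dates change.
            new_info.external_attr = info.external_attr
            new_info.compress_type = info.compress_type
            zout.writestr(new_info, data)
```

Two runs over the same input then produce byte-identical output regardless of when the files were built, which is what the content-addressed cache needs.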
Attachment #8535922 - Attachment is obsolete: true
Attachment #8561388 - Flags: feedback?
Attachment #8561388 - Flags: feedback? → feedback?(ted)
Comment on attachment 8561388 [details] [diff] [review]
strip times from xpis

Review of attachment 8561388 [details] [diff] [review]:
-----------------------------------------------------------------

You're probably right that doing this in test-package-stage makes more sense than doing it every single place we write a zip file.

Overall I find this code hard to follow with all the bytearray manipulation it's doing. Maybe encapsulating a little bit more of the contents of main into a generator or something would make it easier to follow the flow without getting bogged down?

::: toolkit/mozapps/installer/stripziptimes.py
@@ +2,5 @@
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +
> +# This script strips timestamps from zip files

gps would probably want you to put this under python/mozbuild/mozbuild/action.

@@ +58,5 @@
> +
> +
> +def main(args):
> +    parser = argparse.ArgumentParser()
> +    parser.add_argument("zip", help="Path to zip file to write")

Technically this is "to modify" since you read it and can write it.

@@ +60,5 @@
> +def main(args):
> +    parser = argparse.ArgumentParser()
> +    parser.add_argument("zip", help="Path to zip file to write")
> +    parser.add_argument("--inplace", dest="in_place", help="update file in-place", action="store_true", default=False)
> +    parser.add_argument("-o", "--output", dest="output", help="where to write output; default is stdout")

I would make the defaults here what you're using in the tree instead of passing --inplace everywhere. Keeping an option for testing is fine, but mostly YAGNI.

@@ +71,5 @@
> +
> +    # Make a copy of the zip's data so we can modify it in memory
> +    new_data = bytearray(jarfile._data.tobytes())
> +
> +    for e in range(cdir['cdir_entries']):

Some of these variable names are not very self-describing and it makes this code kind of hard to follow.
Attachment #8561388 - Flags: feedback?(ted) → feedback+
catlee and I talked about this in person today, and we think this approach just has too many problems to be usable in practice.

However! We discussed an alternate approach that I will file shortly that should be workable.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Oh, bug 917999 already exists we'll use that.