Closed Bug 1306189 Opened 8 years ago Closed 6 years ago

Determine root cause of truncated zip files ("Check bug 1306189 for details on downloading a truncated zip file." or "[test-linux.sh:error] error unzipping mozharness.zip")

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: intermittent-bug-filer, Assigned: jhford)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file, 1 obsolete file)

This was reproducible if I run "rm -rf build" before running the Mozharness script for testing.
Assignee: nobody → armenzg
Comment on attachment 8796140 [details]
Bug 1306189 - Create upload dir before writing to it + more logging.

https://reviewboard.mozilla.org/r/82096/#review80618

Can you add the following, too?

iff --git a/testing/mozharness/mozharness/base/script.py b/testing/mozharness/mozharness/base/script.py
--- a/testing/mozharness/mozharness/base/script.py
+++ b/testing/mozharness/mozharness/base/script.py
@@ -388,19 +388,21 @@ class ScriptMixin(PlatformMixin):
         # Bug 1302237 - URLError: <urlopen error [Errno 104] Connection reset by peer>
         # Bug 1301807 - BadStatusLine: ''
         response = urllib2.urlopen(request)
 
         if parsed_url.scheme in ('http', 'https'):
             expected_file_size = int(response.headers.get('Content-Length'))
 
         self.info('Http code: {}'.format(response.getcode()))
-        for k in ('Content-Encoding', 'Content-Type', 'via', 'x-amz-cf-id',
-                  'x-amz-version-id', 'x-cache'):
+        for k in ('Content-Encoding', 'Content-Type', 'via', 'x-cache'):
             self.info('{}: {}'.format(k, response.headers.get(k)))
+        for k in response.headers.keys():
+            if k.lower().startswith('x-amz-'):
+                self.info('{}: {}'.format(k, response.headers.get(k)))
 
         file_contents = response.read()
         obtained_file_size = len(file_contents)
         self.info('Expected file size: {}'.format(expected_file_size))
         self.info('Obtained file size: {}'.format(obtained_file_size))
 
         if obtained_file_size != expected_file_size:
             raise FetchedIncorrectFilesize(

::: testing/mozharness/mozharness/base/script.py:692
(Diff revision 1)
>          try:
>              function(**kwargs)
>          except zipfile.BadZipfile:
>              # Bug 1305752 - Sometimes a good download turns out to be a
>              # corrupted zipfile. Let's upload the file for inspection
> -            filepath = os.path.join(self.query_abs_dirs()['abs_upload_dir'], url.split('/')[-1])
> +            upload_dir = self.query_abs_dirs()['abs_upload_dir']

This is still going to end up in the uploads dir, which is only copied to ~/artifacts on success.  Just use a hard-coded /home/worker/artifacts?
Attachment #8796140 - Flags: review?(dustin) → review-
Any reason to not just log all of the headers?

Can the consuming script echo a base64 encoded copy of the file if it's 1024 bytes or fewer, or the first and last 1024 bytes, again as base64 strings?  Can it also run du <file> and file <file>?

That way we can see whether it's an incomplete file or an S3 error message
from a test usw1 machine:

ubuntu@ip-172-31-9-14:~$ curl -v -L https://queue.taskcluster.net/v1/task/MhzwnYN3QuG9mhnS0ZLl7g/artifacts/public/build/target.crashreporter-symbols.zip > /dev/null
* Hostname was NOT found in DNS cache
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0*   Trying 54.235.141.80...
* Connected to queue.taskcluster.net (54.235.141.80) port 443 (#0)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
*        subject: C=US; ST=California; L=Mountain View; O=Mozilla Corporation; CN=auth.taskcluster.net
*        start date: 2016-03-17 00:00:00 GMT
*        expire date: 2019-03-22 12:00:00 GMT
*        subjectAltName: queue.taskcluster.net matched
*        issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*        SSL certificate verify ok.
> GET /v1/task/MhzwnYN3QuG9mhnS0ZLl7g/artifacts/public/build/target.crashreporter-symbols.zip HTTP/1.1
> User-Agent: curl/7.35.0
> Host: queue.taskcluster.net
> Accept: */*
> 
< HTTP/1.1 303 See Other
* Server Cowboy is not blacklisted
< Server: Cowboy
< Connection: keep-alive
< X-Powered-By: Express
< Strict-Transport-Security: max-age=7776000
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
< Access-Control-Request-Method: *
< Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
< Location: https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-west-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip
< Vary: Accept
< Content-Type: text/plain; charset=utf-8
< Content-Length: 29
< Date: Thu, 29 Sep 2016 14:20:33 GMT
< Via: 1.1 vegur
< 
* Ignoring the response-body
{ [data not shown]
100    29  100    29    0     0     35      0 --:--:-- --:--:-- --:--:--    35
* Connection #0 to host queue.taskcluster.net left intact
* Issue another request to this URL: 'https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-west-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip'
* Hostname was NOT found in DNS cache
*   Trying 54.225.222.159...
* Connected to cloud-mirror.taskcluster.net (54.225.222.159) port 443 (#1)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
*        subject: C=US; ST=California; L=Mountain View; O=Mozilla Corporation; CN=auth.taskcluster.net
*        start date: 2016-03-17 00:00:00 GMT
*        expire date: 2019-03-22 12:00:00 GMT
*        subjectAltName: cloud-mirror.taskcluster.net matched
*        issuer: C=US; O=DigiCert Inc; CN=DigiCert SHA2 Secure Server CA
*        SSL certificate verify ok.
> GET /v1/redirect/s3/us-west-1/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip HTTP/1.1
> User-Agent: curl/7.35.0
> Host: cloud-mirror.taskcluster.net
> Accept: */*
>
< HTTP/1.1 302 Found
* Server Cowboy is not blacklisted
< Server: Cowboy
< Connection: keep-alive
< X-Powered-By: Express
< Strict-Transport-Security: max-age=7776000
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT
< Access-Control-Request-Method: *
< Access-Control-Allow-Headers: X-Requested-With,Content-Type,Authorization,Accept,Origin
< Location: https://cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip
< Content-Type: application/json; charset=utf-8
< Content-Length: 259
< Etag: W/"103-b30b5cae"
< Date: Thu, 29 Sep 2016 14:20:33 GMT
< Via: 1.1 vegur
<
* Ignoring the response-body
{ [data not shown]
100   259  100   259    0     0    223      0  0:00:01  0:00:01 --:--:--   223
* Connection #1 to host cloud-mirror.taskcluster.net left intact
* Issue another request to this URL: 'https://cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip'
* Hostname was NOT found in DNS cache
*   Trying 54.231.236.4...
* Connected to cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com (54.231.236.4) port 443 (#2)
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Server hello (2):
{ [data not shown]
* SSLv3, TLS handshake, CERT (11):
{ [data not shown]
* SSLv3, TLS handshake, Server key exchange (12):
{ [data not shown]
* SSLv3, TLS handshake, Server finished (14):
{ [data not shown]
* SSLv3, TLS handshake, Client key exchange (16):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
} [data not shown]
* SSLv3, TLS handshake, Finished (20):
} [data not shown]
* SSLv3, TLS change cipher, Client hello (1):
{ [data not shown]
* SSLv3, TLS handshake, Finished (20):
{ [data not shown]
* SSL connection using ECDHE-RSA-AES128-GCM-SHA256
* Server certificate:
*        subject: C=US; ST=Washington; L=Seattle; O=Amazon.com Inc.; CN=*.s3-us-west-1.amazonaws.com
*        start date: 2016-07-18 00:00:00 GMT
*        expire date: 2017-10-26 12:00:00 GMT
*        subjectAltName: cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com matched
*        issuer: C=US; O=DigiCert Inc; OU=www.digicert.com; CN=DigiCert Baltimore CA-2 G2
*        SSL certificate verify ok.
> GET /https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FMhzwnYN3QuG9mhnS0ZLl7g%2F0%2Fpublic%2Fbuild%2Ftarget.crashreporter-symbols.zip HTTP/1.1
> User-Agent: curl/7.35.0
> Host: cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com
> Accept: */*
>
< HTTP/1.1 200 OK
< x-amz-id-2: czf57hvnPvvPN+fOlmDWEvf6+xsrpUFudHRsY42lthoEUNUb8yupL0N7ka269ZQ5KCjImPfC/I8=
< x-amz-request-id: 5790A2B5710C1FE8
< Date: Thu, 29 Sep 2016 14:20:34 GMT
< Last-Modified: Wed, 28 Sep 2016 22:11:45 GMT
< x-amz-expiration: expiry-date="Fri, 30 Sep 2016 00:00:00 GMT", rule-id="us-west-1-1-day"
< ETag: "dfd4ae0e879f8aa8cb53633fdd6d05a1"
< x-amz-meta-addresses: [{"code":200,"url":"https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/MhzwnYN3QuG9mhnS0ZLl7g/0/public/build/target.crashreporter-symbols.zip","t":"2016-09-28T22:05:48.545Z"}]
< x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/MhzwnYN3QuG9mhnS0ZLl7g/0/public/build/target.crashreporter-symbols.zip
< x-amz-meta-upstream-etag: "51c9d15315b6b7ed2d4d25a3c82c0327"
< x-amz-meta-stored: 2016-09-28T22:05:48.545Z
< Accept-Ranges: bytes
< Content-Type: application/zip
< Content-Length: 295497
* Server AmazonS3 is not blacklisted
< Server: AmazonS3
<
{ [data not shown]
100  288k  100  288k    0     0   226k      0  0:00:01  0:00:01 --:--:--  226k
* Connection #2 to host cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com left intact


so we can reproduce manually
That output is a zip file.  It has no table of contents according to unzip:

ubuntu@ip-172-31-9-14:~$ unzip x
Archive:  x
  End-of-central-directory signature not found.  Either this file is not
  a zipfile, or it constitutes one disk of a multi-part archive.  In the
  latter case the central directory and zipfile comment will be found on
  the last disk(s) of this archive.
unzip:  cannot find zipfile directory in one of x or
        x.zip, and cannot find x.ZIP, period.

The directory is at the end of a zip file, suggesting this is truncated.  It looks like it contains symbols:

ubuntu@ip-172-31-9-14:~$ strings -10 x
TestJemalloc/E62C52E5C1BE87150466003D2B28254C0/TestJemalloc.symUT       
,&T~APTj=s
c7L"x}uNh5
8l      .kMD(es
&]5.`b&]5.`b2
G~)TZ51hJX
TestTXMgr/A90500BC60591C71C8D74410D042DC6C0/TestTXMgr.symUT     
!*cBL_`vI!
3G3>=&S#*rN#>
P^UK-8`#4T
\V#cdX`|g]E
u>1t!^u!^I
eEksI6EW4E
^EOV3U\YE9a.
TestCOMPtrEq/5DF88692AF1479817997BD15E32BFA4D0/TestCOMPtrEq.symUT       
TestBlockingSocket/3BD9340DB24DDC8A3BF9C56E42ABF19D0/TestBlockingSocket.symUT   
        3).]|/1\STs
fB] 4:LPfon
y:-ve   8S\x
M1G-*F_<i6\
TestCOMArray/12771A92A039EC8FFC3A0E39B788A8B30/TestCOMArray.symUT       
<B6!(+.l3A%
Jchg.Wh|F?
TestAudioMixer/532992C60DD92D56E877D135859471B80/TestAudioMixer.symUT
the length of the file is 295497 for the broken usw1 copy, 114583749 for the working use1 copy.
Interestingly, the hexdump of these two files is wildly different

US-East-1
===============================================
0000000 4b50 0403 0014 0002 0008 0000 3c21 b151
0000010 3710 1bde 0000 5553 0000 0015 0000 6572
0000020 7466 7365 2f74 7561 6f74 616d 6974 6e6f
0000030 702e c579 6b3c db73 9238 f5df b02b a5cc
0000040 dd44 8cc8 789d 9266 faec 14ee 4e5b 2354
0000050 3a4b 4e49 3666 6293 2451 7164 914c 3e5c


US-West-1
===============================================
0000000 4b50 0403 0014 0000 0008 acf8 493c 6ccd
0000010 3a44 fa54 0000 23b6 0004 003f 001c 6554
0000020 7473 654a 616d 6c6c 636f 452f 3236 3543
0000030 4532 4335 4231 3845 3137 3035 3634 3036
0000040 3330 3244 3242 3238 3435 3043 542f 7365
0000050 4a74 6d65 6c61 6f6c 2e63 7973 556d 0954
Attachment #8796140 - Attachment is obsolete: true
Attachment #8796140 - Flags: review?(dustin)
I'm not seeing that difference:

ubuntu@ip-172-31-9-14:~$ for f in target.crashreporter-symbols.zip-us* corrupt ; do ls -l $f; hexdump $f | head; done
-rw-rw-r-- 1 ubuntu ubuntu 114583749 Sep 29 15:16 target.crashreporter-symbols.zip-use1
0000000 4b50 0403 0014 0000 0008 acf8 493c 6ccd
0000010 3a44 fa54 0000 23b6 0004 003f 001c 6554
0000020 7473 654a 616d 6c6c 636f 452f 3236 3543
0000030 4532 4335 4231 3845 3137 3035 3634 3036
0000040 3330 3244 3242 3238 3435 3043 542f 7365
0000050 4a74 6d65 6c61 6f6c 2e63 7973 556d 0954
0000060 0300 38a4 57ec 38a4 57ec 7875 000b 0401
0000070 01f4 0000 f404 0001 d400 5b5c db73 9638
0000080 d77e d0af 6ee3 a6cd 3bc5 5741 5757 b639
0000090 edd3 5c99 b13c b333 2fb5 9029 6d04 644e
-rw-rw-r-- 1 ubuntu ubuntu 295497 Sep 29 15:19 target.crashreporter-symbols.zip-usw1
0000000 4b50 0403 0014 0000 0008 acf8 493c 6ccd
0000010 3a44 fa54 0000 23b6 0004 003f 001c 6554
0000020 7473 654a 616d 6c6c 636f 452f 3236 3543
0000030 4532 4335 4231 3845 3137 3035 3634 3036
0000040 3330 3244 3242 3238 3435 3043 542f 7365
0000050 4a74 6d65 6c61 6f6c 2e63 7973 556d 0954
0000060 0300 38a4 57ec 38a4 57ec 7875 000b 0401
0000070 01f4 0000 f404 0001 d400 5b5c db73 9638
0000080 d77e d0af 6ee3 a6cd 3bc5 5741 5757 b639
0000090 edd3 5c99 b13c b333 2fb5 9029 6d04 644e
-rw-rw-r-- 1 ubuntu ubuntu 114583749 Sep 29 15:14 target.crashreporter-symbols.zip-usw2
0000000 4b50 0403 0014 0000 0008 acf8 493c 6ccd
0000010 3a44 fa54 0000 23b6 0004 003f 001c 6554
0000020 7473 654a 616d 6c6c 636f 452f 3236 3543
0000030 4532 4335 4231 3845 3137 3035 3634 3036
0000040 3330 3244 3242 3238 3435 3043 542f 7365
0000050 4a74 6d65 6c61 6f6c 2e63 7973 556d 0954
0000060 0300 38a4 57ec 38a4 57ec 7875 000b 0401
0000070 01f4 0000 f404 0001 d400 5b5c db73 9638
0000080 d77e d0af 6ee3 a6cd 3bc5 5741 5757 b639
0000090 edd3 5c99 b13c b333 2fb5 9029 6d04 644e

to verify, us-east-1 and us-west-2 are identical:
ubuntu@ip-172-31-9-14:~$ diff target.crashreporter-symbols.zip-us{e1,w2}
ubuntu@ip-172-31-9-14:~$ 

where was the us-east-1 download in comment #9 from?
John's going to check with docker-cloud about how we can hook up container logging to papertrail.

The logging we do have from cloud-mirror is this (from the curl output above)

< x-amz-meta-addresses: [{"code":200,"url":"https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/MhzwnYN3QuG9mhnS0ZLl7g/0/public/build/target.crashreporter-symbols.zip","t":"2016-09-28T22:05:48.545Z"}]
< x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/MhzwnYN3QuG9mhnS0ZLl7g/0/public/build/target.crashreporter-symbols.zip
< x-amz-meta-upstream-etag: "51c9d15315b6b7ed2d4d25a3c82c0327"
< x-amz-meta-stored: 2016-09-28T22:05:48.545Z

I verified in us-east-1 that the Etag for the original and the copy will not match, so that difference is not helpful.  The timestamp is useful, though, in that it lets us find the time the copy was made.

I'm running

dustin@lamport ~/tmp/cloud-mirror $ docker run -e DOCKERCLOUD_USER='...' -e DOCKERCLOUD_PASS='...' dockercloud/cli service logs -t0 -f cloud-mirror-copier 2>&1 | multilog s10485760 .

so if we see another instance of this where the source task was created after Thu Sep 29 17:11:50 UTC 2016, we should be able to see the copy in the logs.
Comment on attachment 8796311 [details]
Bug 1306189 - Add proper signature + better logging + stop trying to upload the corrupted file.

https://reviewboard.mozilla.org/r/82198/#review80818

::: testing/mozharness/mozharness/base/script.py:692
(Diff revision 1)
>          try:
>              function(**kwargs)
>          except zipfile.BadZipfile:
> -            # Bug 1305752 - Sometimes a good download turns out to be a
> -            # corrupted zipfile. Let's upload the file for inspection
> -            filepath = os.path.join(self.query_abs_dirs()['abs_upload_dir'], url.split('/')[-1])
> +            # Bug 1306189 - Sometimes a good download turns out to be a
> +            # corrupted zipfile. Let's create a signature that is easy to match
> +            self.fatal('Check bug 1306189 for details on downloading a truncated zip file.')

++
Attachment #8796311 - Flags: review?(dustin) → review+
Pushed by armenzg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/42722bb6b23c
Add proper signature + better logging + stop trying to upload the corrupted file. r=dustin
Keywords: leave-open
Passing to dustin to investigate the root issue of zip files being truncated on the server side.

Moving to the TaskCluster product and removed intermittent-orange (as the patch landed should create a unique signature).
Assignee: armenzg → dustin
Component: Mozharness → General
Product: Release Engineering → Taskcluster
QA Contact: jlund
Summary: Intermittent-infra IOError: [Errno 2] No such file or directory: '/home/worker/workspace/build/upload/target.crashreporter-symbols.zip' → Determine root cause of truncated zip files (was Intermittent-infra IOError: [Errno 2] No such file or directory: '/home/worker/workspace/build/upload/target.crashreporter-symbols.zip')
Just for sanity, did you verify that the contents of the truncated file are exactly the same as the first however many bytes of the full file?
Summary: Determine root cause of truncated zip files (was Intermittent-infra IOError: [Errno 2] No such file or directory: '/home/worker/workspace/build/upload/target.crashreporter-symbols.zip') → Determine root cause of truncated zip files ("Check bug 1306189 for details on downloading a truncated zip file.")
Sort of -- John found them to be different (comment 9) and I found them to match (comment 11)
From us-west-2, right now, I get sent directly from the queue to the bucket, with Content-Length: 666179.  From us-west-1, I get

< HTTP/1.1 200 OK
< x-amz-id-2: pIZg6YtCptfu+r/UBhEWTuxVYpU/daX18chBpk0K98qWU4hBSZeZpBTTiuP7zYoJ4ZzUvtK6aJ8=
< x-amz-request-id: 9D0ED50924968A1B
< Date: Mon, 03 Oct 2016 15:53:14 GMT
< Last-Modified: Mon, 03 Oct 2016 13:40:14 GMT
< x-amz-expiration: expiry-date="Wed, 05 Oct 2016 00:00:00 GMT", rule-id="us-west-1-1-day"
< ETag: "e9dd103fcd876c748b79284623b87fda"
< x-amz-meta-addresses: [{"code":200,"url":"https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip","t":"2016-10-03T13:34:13.563Z"}]
< x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
< x-amz-meta-upstream-etag: "f53b7be128045dd069161606704a3baa"
< x-amz-meta-stored: 2016-10-03T13:34:13.563Z
< Accept-Ranges: bytes
< Content-Type: application/zip
< Content-Length: 34380

Examining the first 160 bytes of the files shows the same content, so I expect this is a truncation.

The "upstream etag" in the above matches the etag I get with the us-west-2 request.
The artifact has the correct length in us-east-1 (https://cloud-mirror-production-us-west-1.s3-us-west-1.amazonaws.com/https%3A%2F%2Fs3-us-west-2.amazonaws.com%2Ftaskcluster-public-artifacts%2FHv4rirrYRtmcYfD974aCTw%2F0%2Fpublic%2Fbuild%2Fmozharness.zip)

Here's some logging from cloud-mirror for this us-east-1 copy, with other stuff mixed in.  I've annotated the lines I *think* are relevant with "*".

*2016-10-03T13:34:11.854317269Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main received message!
*2016-10-03T13:34:11.854327449Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main this message checks out
*2016-10-03T13:34:11.854370392Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-east-1 putting https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:11.909190552Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:s3-storage-provider upload completed
 2016-10-03T13:34:11.909262400Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:S3StorageProvider:s3_us-west-1 completed S3 upload
 2016-10-03T13:34:11.909620150Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/LvENO_XfTnW4Y31PG_dB2g/0/public/build/target.mochitest.tests.zip 74126733 bytes in 4.097237224000001 seconds
*2016-10-03T13:34:11.924573925Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-east-1 creating read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:11.931181904Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production received 0 messages
 2016-10-03T13:34:11.931280246Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production handling message efe69f48-50eb-480e-ac00-f53babc0502c
 2016-10-03T13:34:11.931660690Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main received message!
 2016-10-03T13:34:11.931680480Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main this message checks out
 2016-10-03T13:34:11.931818626Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-east-1 putting https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Pvjd56KdRI2lFw4QrshkcQ/0/public/build/fennec-52.0a1.en-US.android-arm.crashreporter-symbols.zip
 2016-10-03T13:34:11.939652691Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production received 0 messages
 2016-10-03T13:34:11.939721793Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production handling message 7abf1ee7-4517-48e6-812f-a53457ed4d36
 2016-10-03T13:34:11.939766585Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main received message!
 2016-10-03T13:34:11.939802508Z Mon, 03 Oct 2016 13:34:11 GMT cloud-proxy:main this message checks out
 2016-10-03T13:34:11.939854616Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-east-1 putting https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/L1mvNlJTRoaCZNdXO68WNQ/0/public/test_info//mochitest-devtools-chrome-chunked_errorsummary.log
*2016-10-03T13:34:11.942201426Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:follow-redirects Follwed all redirects: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
*2016-10-03T13:34:11.942975137Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:CacheManager:s3_us-east-1 created read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
*2016-10-03T13:34:11.944342169Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 starting S3 upload
*2016-10-03T13:34:11.944404824Z Mon, 03 Oct 2016 13:34:11 GMT cloud-mirror:s3-storage-provider initiating upload
 2016-10-03T13:34:11.994837279Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production removed message be98d6b4-e898-4507-8776-44d47b95856d from queue
 2016-10-03T13:34:11.994888790Z Mon, 03 Oct 2016 13:34:11 GMT queue:QueueListener:692406183521/cloud-mirror-production handler successfully processed message be98d6b4-e898-4507-8776-44d47b95856d
 2016-10-03T13:34:12.009627957Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:CacheManager:s3_us-east-1 creating read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/L1mvNlJTRoaCZNdXO68WNQ/0/public/test_info//mochitest-devtools-chrome-chunked_errorsummary.log
 2016-10-03T13:34:12.011278419Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:CacheManager:s3_us-east-1 creating read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Pvjd56KdRI2lFw4QrshkcQ/0/public/build/fennec-52.0a1.en-US.android-arm.crashreporter-symbols.zip
 2016-10-03T13:34:12.037686650Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:follow-redirects Follwed all redirects: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Pvjd56KdRI2lFw4QrshkcQ/0/public/build/fennec-52.0a1.en-US.android-arm.crashreporter-symbols.zip
 2016-10-03T13:34:12.038206536Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:CacheManager:s3_us-east-1 created read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Pvjd56KdRI2lFw4QrshkcQ/0/public/build/fennec-52.0a1.en-US.android-arm.crashreporter-symbols.zip
 2016-10-03T13:34:12.038293470Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:follow-redirects Follwed all redirects: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/L1mvNlJTRoaCZNdXO68WNQ/0/public/test_info//mochitest-devtools-chrome-chunked_errorsummary.log
 2016-10-03T13:34:12.038774397Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:CacheManager:s3_us-east-1 created read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/L1mvNlJTRoaCZNdXO68WNQ/0/public/test_info//mochitest-devtools-chrome-chunked_errorsummary.log
 2016-10-03T13:34:12.039792234Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 starting S3 upload
 2016-10-03T13:34:12.039919796Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 starting S3 upload
 2016-10-03T13:34:12.039923300Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:s3-storage-provider initiating upload
 2016-10-03T13:34:12.040095973Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:s3-storage-provider initiating upload
 2016-10-03T13:34:12.341981977Z Mon, 03 Oct 2016 13:34:12 GMT statsum Submitting 27 data-points with dimensionality 3 to project: cloud-mirror
 2016-10-03T13:34:12.343265117Z Mon, 03 Oct 2016 13:34:12 GMT statsum Submitting data-points to: https://statsum.taskcluster.net/v1/project/cloud-mirror
 2016-10-03T13:34:12.601867785Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:s3-storage-provider upload completed
 2016-10-03T13:34:12.601929824Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 completed S3 upload
 2016-10-03T13:34:12.602083131Z Mon, 03 Oct 2016 13:34:12 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/L1mvNlJTRoaCZNdXO68WNQ/0/public/test_info//mochitest-devtools-chrome-chunked_errorsummary.log 0 bytes in 0.5631460070000001 seconds
 2016-10-03T13:34:12.680216542Z Mon, 03 Oct 2016 13:34:12 GMT queue:QueueListener:692406183521/cloud-mirror-production removed message 7abf1ee7-4517-48e6-812f-a53457ed4d36 from queue
 2016-10-03T13:34:12.680273554Z Mon, 03 Oct 2016 13:34:12 GMT queue:QueueListener:692406183521/cloud-mirror-production handler successfully processed message 7abf1ee7-4517-48e6-812f-a53457ed4d36
 2016-10-03T13:34:13.200873325Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production received 0 messages
 2016-10-03T13:34:13.200919578Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production handling message 6fdf775a-4c02-46ca-bfc8-12ecac8b413d
 2016-10-03T13:34:13.201003478Z Mon, 03 Oct 2016 13:34:13 GMT cloud-proxy:main received message!
 2016-10-03T13:34:13.201025230Z Mon, 03 Oct 2016 13:34:13 GMT cloud-proxy:main this message checks out
 2016-10-03T13:34:13.201065105Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 putting https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
*2016-10-03T13:34:13.256635114Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:s3-storage-provider upload completed
*2016-10-03T13:34:13.256873767Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 completed S3 upload
*2016-10-03T13:34:13.257013780Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 666179 bytes in 1.31393689 seconds
Here's the same for the us-west-1 copy, at 13:34:13 just as the S3 metadata promised:

 2016-10-03T13:34:13.201003478Z Mon, 03 Oct 2016 13:34:13 GMT cloud-proxy:main received message!
 2016-10-03T13:34:13.201025230Z Mon, 03 Oct 2016 13:34:13 GMT cloud-proxy:main this message checks out
 2016-10-03T13:34:13.201065105Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 putting https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:13.256635114Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:s3-storage-provider upload completed
 2016-10-03T13:34:13.256873767Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 completed S3 upload
 2016-10-03T13:34:13.257013780Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 666179 bytes in 1.31393689 seconds
 2016-10-03T13:34:13.270749360Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 creating read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:13.284721575Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:follow-redirects Follwed all redirects: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:13.285119191Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 created read stream for https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip
 2016-10-03T13:34:13.286021891Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:S3StorageProvider:s3_us-west-1 starting S3 upload
 2016-10-03T13:34:13.286078670Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:s3-storage-provider initiating upload
 2016-10-03T13:34:13.338140036Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production removed message 143042b3-ecd9-4d07-8ba5-b36d01b7422a from queue
 2016-10-03T13:34:13.338204244Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production handler successfully processed message 143042b3-ecd9-4d07-8ba5-b36d01b7422a
 2016-10-03T13:34:13.348434800Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:s3-storage-provider upload completed
 2016-10-03T13:34:13.348513742Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:S3StorageProvider:s3_us-east-1 completed S3 upload
 2016-10-03T13:34:13.348583441Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/LvENO_XfTnW4Y31PG_dB2g/0/public/build/target.mochitest.tests.zip 74126733 bytes in 4.7369786970000005 seconds
 2016-10-03T13:34:13.425401839Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production removed message 5c37f227-e3a5-4855-963f-0357f84f5cb9 from queue
 2016-10-03T13:34:13.425428269Z Mon, 03 Oct 2016 13:34:13 GMT queue:QueueListener:692406183521/cloud-mirror-production handler successfully processed message 5c37f227-e3a5-4855-963f-0357f84f5cb9
 2016-10-03T13:34:13.664843637Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:s3-storage-provider upload completed
 2016-10-03T13:34:13.664907727Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:S3StorageProvider:s3_us-west-1 completed S3 upload
 2016-10-03T13:34:13.664994984Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 666179 bytes in 0.379772142 seconds

note that it indicates the correct size in the completion message.  This overlaps, too, with the us-east-1 copy .. maybe relevant?
That last log line comes from

https://github.com/taskcluster/cloud-mirror/blob/master/src/cache-manager.js#L106
 this.debug(`uploaded ${rawUrl} ${m.bytes} bytes in ${duration/1000} seconds`);

where m is a stream meter.  So it's measuring the total bytes read from the input stream.  I don't see any handling of Content-Length, but then the SDK is using multipart uploads, so maybe that's not appropriate.

There's a similar report here (but for the PHP SDK):
  https://github.com/aws/aws-sdk-php/issues/564
Some broader grepping of the logs shows an interesting set of phenomena:

 2016-10-03T13:34:13.257013780Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 666179 bytes in 1.31393689 seconds
 2016-10-03T13:34:13.664994984Z Mon, 03 Oct 2016 13:34:13 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 666179 bytes in 0.379772142 seconds
 2016-10-03T13:40:13.141331892Z Mon, 03 Oct 2016 13:40:13 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Hv4rirrYRtmcYfD974aCTw/0/public/build/mozharness.zip 34380 bytes in 359.57729642099997 seconds

(trolling through treeherder for other instances:)

 2016-10-03T13:42:05.009741282Z Mon, 03 Oct 2016 13:42:05 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 159453196 bytes in 7.567680615 seconds
 2016-10-03T13:42:05.704680409Z Mon, 03 Oct 2016 13:42:05 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 159453196 bytes in 8.143963533 seconds
 2016-10-03T13:42:15.031537867Z Mon, 03 Oct 2016 13:42:15 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 159453196 bytes in 4.500056646 seconds
 2016-10-03T13:42:15.269754483Z Mon, 03 Oct 2016 13:42:15 GMT cloud-mirror:CacheManager:s3_us-west-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 159453196 bytes in 4.203620871 seconds
 2016-10-03T13:42:34.842992886Z Mon, 03 Oct 2016 13:42:34 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 159453196 bytes in 7.959859654 seconds
 2016-10-03T13:47:02.016014542Z Mon, 03 Oct 2016 13:47:02 GMT cloud-mirror:CacheManager:s3_us-east-1 uploaded https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/Wd54dlTSSJmXfPW4kKQimw/0/public/build/target.crashreporter-symbols.zip 0 bytes in 304.693212035 seconds

I've summed all of the available data up in https://gist.github.com/djmitche/c2dd3630960f4784e822a7f0c54295c9

So it looks like we do multiple copies of each object within a few minutes of one another, and in some cases the last of those ends with an error after 300s+ with only a truncated file read.  But in at least one case (SlX0CIF8QiaNy6oL78cOcA target.web-platform.tests.zip) the stream-meter reports reading the correct number of bytes, but not writing them.

In no case do I see "error from stream-meter" in the logs, but I'm not sure if that means there are no errors in the stream, or just no errors in the stream *meter*.
John, can I toss this over to you?  I think we've narrowed the problem to incomplete transfers from S3 (where the logged sizes are wrong) or to S3 (where the logged sizes are correct but the resulting S3 object is truncated).

Maybe there's more error-handling possible here?  If not, I suspect that the "Content-Length" header from the source stream is reliable enough, so checking that against the other content lengths (from the stream meter and maybe a final HEAD request against the copied object?) would help detect the issue.  I see that the SDK's upload method can take a ContentLength parameter, too.  That might cause it to do some better error-checking and retrying of uploads.

I was surprised to see the multiple copies of the same object to the same region -- does that seem wrong to you, too?

I doubt we can reliably test failed uploads in integration tests, but I can help re-create failures manually, if that's useful.
Whoops, helps to adjust bug fields too!

Also, I'm happy to keep working on this, too -- just at the end of where I can be very useful without your (John's) expertise.
Flags: needinfo?(jhford)
One more thing :)

If it would be useful, we could scan all three artifact buckets and look for objects with the same keys but different sizes, to find more instances of this error.
Summary: Determine root cause of truncated zip files ("Check bug 1306189 for details on downloading a truncated zip file.") → Determine root cause of truncated zip files ("Check bug 1306189 for details on downloading a truncated zip file." or "[test-linux.sh:error] error unzipping mozharness.zip")
Hi John, any updates about this?
Anything we can do in our side?
Assignee: dustin → jhford
Depends on: 1309907
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #38)
> Hi John, any updates about this?
> Anything we can do in our side?

Hi Armen, I've been making some code changes which should hopefully make this easier to debug, as well as a couple small improvements which should lessen the impact.  They have not landed yet.

I'm also still debugging the problem.  Right now, it seems as though something really strange is happening for gzip and deflate encoded resources.  I've added a new check that verifies that the number of bytes that are uploaded to S3 is the same as the Content-Length: header on the original request.  For most files this check is perfect.

The problem is that for Gzip and Deflate encoded files, we seem to reliably uploading more bytes than we're downloading, which is weird.  In this case, it's 'http://httpbin.org/gzip'.  For my tests here, I'm seeing

{
  "gzipped": true, 
  "headers": {
    "Accept-Encoding": "*", 
    "Host": "httpbin.org"
  }, 
  "method": "GET", 
  "origin": "178.19.212.229"
}

We're reliably getting issues.  The upstream copy of the file is:
localhost:~/taskcluster/cloud-mirror $ hexdump expected.gz | tee expected.hex
0000000 ef1f bdbf 0008 ef07 bdbf bfef 57bd ef02
0000010 bdbf ef25 bdbf 0e31 bfef 20bd 0410 5e7b
0000020 bfef efbd bdbf bfef 12bd 6824 ef12 bdbf
0000030 614b bfef 0fbd bfef 07bd 3812 4501 ef38
0000040 bdbf 4d2f ef2c bdbf 473d ef5c bdbf bfef
0000050 efbd bdbf 4a6e ef6b bdbf bfef 5cbd 062b
0000060 bfef 6bbd 1f6e ef3c bdbf bfef 12bd 033e
0000070 bfef 55bd bfef 26bd ef51 bdbf bfef 7bbd
0000080 bfef 7cbd ef19 bdbf bfef efbd bdbf 144b
0000090 2701 bfef 77bd ce45 72b4 ef27 bdbf bfef
00000a0 efbd bdbf 3172 bfef 22bd bfef d9bd ef8f
00000b0 bdbf 7237 bfef 7ebd d30f 1cb8 ef1b bdbf
00000c0 bfef 63bd ef2e bdbf bfef efbd bdbf bfef
00000d0 efbd bdbf bfef 38bd bfef efbd bdbf 0373
00000e0 bfef 5dbd bfef 00bd ef66 bdbf bfef 64bd
00000f0 bfef 00bd 0000                         
00000f6


And the Amazon copy of the file is:
localhost:~/taskcluster/cloud-mirror $ hexdump actual.gz | tee actual.hex
0000000 ef1f bdbf 0008 ef05 bdbf bfef 57bd ef02
0000010 bdbf ef25 bdbf 0e31 bfef 20bd 0410 5e7b
0000020 bfef efbd bdbf bfef 12bd 6824 ef12 bdbf
0000030 614b bfef 0fbd bfef 07bd 3812 4501 ef38
0000040 bdbf 4d2f ef2c bdbf 473d ef5c bdbf bfef
0000050 efbd bdbf 4a6e ef6b bdbf bfef 5cbd 062b
0000060 bfef 6bbd 1f6e ef3c bdbf bfef 12bd 033e
0000070 bfef 55bd bfef 26bd ef51 bdbf bfef 7bbd
0000080 bfef 7cbd ef19 bdbf bfef efbd bdbf 144b
0000090 2701 bfef 77bd ce45 72b4 ef27 bdbf bfef
00000a0 efbd bdbf 3172 bfef 22bd bfef d9bd ef8f
00000b0 bdbf 7237 bfef 7ebd d30f 1cb8 ef1b bdbf
00000c0 bfef 63bd ef2e bdbf bfef efbd bdbf bfef
00000d0 efbd bdbf bfef 38bd bfef efbd bdbf 0373
00000e0 bfef 5dbd bfef 00bd ef66 bdbf bfef 64bd
00000f0 bfef 00bd 0000                         
00000f6

And the differences are: 

localhost:~/taskcluster/cloud-mirror $ diff actual.hex expected.hex 
1c1
< 0000000 ef1f bdbf 0008 ef05 bdbf bfef 57bd ef02
---
> 0000000 ef1f bdbf 0008 ef07 bdbf bfef 57bd ef02


I'm not sure what's going on with ef05 vs ef07.  I'm about to re-read the zlib spec [1], but from my last time reading it, this was something to do with a difference in the generation time of the archive, though that should not be happening as nothing should be touching the gzip encoding, but I am hitting the /gzip endpoint two different times, and it might correctly be setting the time-of-generation header in the GZIP file.  That wouldn't explain the difference in length and bytes uploaded numbers, but is a good first step.

[1] http://www.zlib.org/rfc-gzip.html#file-format
I notice that this problem seems to happen in batches, affecting only one particular revision and platform at a time. That seems odd to me; does it make sense to anyone else?

Examples:
Oct 24, fx-team,  215f9686117673a2c914ed207bc7da9bb8d741ad, win32-pgo, 16 failures
Oct 21, autoland, 4b2363bd94e72797913817ae8bb9195afd978169, linux64-debug, 32 failures
Oct 18, fx-team,  dc89484d4b45abf442162e5ea2dd46f9de40197d, linux64-opt, 20 failures
Actually, maybe it's not odd: I suppose it just suggests that a particular file is not accessible, for a certain time period. (I had been thinking of this bug as an intermittent network problem, a server being down or a connection timing out. Now I see that it is more about a specific file, for a period of time.)
Right -- once a file is mirrored incorrectly, it will "poison" a bunch of downstream jobs.

By the way, John's got a fix for this brewing -- https://github.com/taskcluster/cloud-mirror/pull/23
this had a large spike yesterday of 59 failures reported- we might want to focus on this as it is becoming more of a nuisance for developers and sheriffs.
John Ford is heads down working on this issue and he believes it should be resolved with https://github.com/taskcluster/cloud-mirror/pull/23
I've deployed the change that I believe will fix this issue today.
Flags: needinfo?(jhford)
The 7 days trend looks in the right direction.

Good job jhford!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Armen, we should probably back out the debug logging that's in this bug.
Status: RESOLVED → REOPENED
Flags: needinfo?(armenzg)
Resolution: FIXED → ---
Depends on: 1320938
I created bug 1320938 for the backout.
Flags: needinfo?(armenzg)
jhford, getting this on your radar.  Not sure if you're aware of any issues that might have attributed to this on September 27th.
Flags: needinfo?(jhford)
No aware of anything that would've had a specific effect on Sept 27.  When we have the new artifact api is being used in production for these builds, finding the system on which the root of this issue occurs should be easy because we'll have consistency failures either when uploading or when downloading the bundles.
Flags: needinfo?(jhford)
Whiteboard: [stockwell infra]
The immediate issue here is fixed.

I think that root-causing this particular issue is too late, and the artifact API will surface any related issues later, at which point we won't remember to look at this bug :)
Status: REOPENED → RESOLVED
Closed: 8 years ago6 years ago
Resolution: --- → FIXED
No longer blocks: 1433059
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: