Closed Bug 1305752 Opened 5 years ago Closed 5 years ago

Intermittent-infra BadZipfile: File is not a zip file

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(firefox52 fixed)

RESOLVED DUPLICATE of bug 1306189
Tracking Status
firefox52 --- fixed

People

(Reporter: philor, Assigned: armenzg)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1300345 +++

https://treeherder.mozilla.org/logviewer.html#?job_id=4062041&repo=autoland

 [task 2016-09-27T14:32:11.573744Z] 14:32:11     INFO - Fetch https://queue.taskcluster.net/v1/task/fJ6_ZueeTBCcSt22Qb-AHA/artifacts/public/build/target.common.tests.zip into memory

 [task 2016-09-27T14:32:11.872192Z] 14:32:11     INFO - Expected file size: 16970

 [task 2016-09-27T14:32:11.876818Z] 14:32:11     INFO - Running post-action listener: _resource_record_post_action

 [task 2016-09-27T14:32:11.876885Z] 14:32:11     INFO - Running post-action listener: set_extra_try_arguments

 [task 2016-09-27T14:32:11.877239Z] 14:32:11     INFO - [mozharness: 2016-09-27 14:32:11.873795Z] Finished download-and-extract step (failed)

 [task 2016-09-27T14:32:11.880050Z] 14:32:11    FATAL - Uncaught exception: Traceback (most recent call last):

 [task 2016-09-27T14:32:11.881801Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 2020, in run

 [task 2016-09-27T14:32:11.882774Z] 14:32:11    FATAL -     self.run_action(action)

 [task 2016-09-27T14:32:11.885250Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 1959, in run_action

 [task 2016-09-27T14:32:11.886042Z] 14:32:11    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)

 [task 2016-09-27T14:32:11.889723Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 1899, in _possibly_run_method

 [task 2016-09-27T14:32:11.890950Z] 14:32:11    FATAL -     return getattr(self, method_name)()

 [task 2016-09-27T14:32:11.893371Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/scripts/desktop_unittest.py", line 531, in download_and_extract

 [task 2016-09-27T14:32:11.894292Z] 14:32:11    FATAL -     suite_categories=target_categories)

 [task 2016-09-27T14:32:11.897164Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 553, in download_and_extract

 [task 2016-09-27T14:32:11.908863Z] 14:32:11    FATAL -     self._download_test_packages(suite_categories, extract_dirs)

 [task 2016-09-27T14:32:11.909757Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 464, in _download_test_packages

 [task 2016-09-27T14:32:11.911448Z] 14:32:11    FATAL -     extract_dirs=unpack_dirs)

 [task 2016-09-27T14:32:11.912220Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 684, in download_unpack

 [task 2016-09-27T14:32:11.913196Z] 14:32:11    FATAL -     function(**kwargs)

 [task 2016-09-27T14:32:11.914962Z] 14:32:11    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 556, in unzip

 [task 2016-09-27T14:32:11.915713Z] 14:32:11    FATAL -     with zipfile.ZipFile(compressed_file) as bundle:

 [task 2016-09-27T14:32:11.921691Z] 14:32:11    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 714, in __init__

 [task 2016-09-27T14:32:11.925240Z] 14:32:11    FATAL -     self._GetContents()

 [task 2016-09-27T14:32:11.926851Z] 14:32:11    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 748, in _GetContents

 [task 2016-09-27T14:32:11.929961Z] 14:32:11    FATAL -     self._RealGetContents()

 [task 2016-09-27T14:32:11.937317Z] 14:32:11    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 763, in _RealGetContents

 [task 2016-09-27T14:32:11.942097Z] 14:32:11    FATAL -     raise BadZipfile, "File is not a zip file"

 [task 2016-09-27T14:32:11.943868Z] 14:32:11    FATAL - BadZipfile: File is not a zip file

 [task 2016-09-27T14:32:11.945209Z] 14:32:11    FATAL - Running post_fatal callback...

 [task 2016-09-27T14:32:11.948238Z] 14:32:11    FATAL - Exiting -1
gps: I remember we decided to not retry to unzip a file if what we downloaded is corrupted. If the extraction fails, would you be OK with trying *once* more to download the file into memory and try again to extract?

In this log, we downloaded [1][2] with a expected file size of 16970 (lies!) and failed to unzip.
I downloaded the file on my machine and it extracts normally.

[1] https://queue.taskcluster.net/v1/task/fJ6_ZueeTBCcSt22Qb-AHA/artifacts/public/build/target.common.tests.zip
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=4062041&repo=autoland#L582
Flags: needinfo?(gps)
Depends on: 1305775
We have a bit more information in the logs [1] from bug 1305775, however, it does not help determine what is causing this.
> Expected file size: 574026
> Content-Encoding None
> Content-Type application/zip
> Http code 200

Any ideas on what could be going on?

I think we might want to try to download once more and then unzip.

[1] https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=4110637#L614
 [task 2016-09-28T03:53:31.179372Z] 03:53:31     INFO - Fetch https://queue.taskcluster.net/v1/task/SDlWxl5OQTOX9u4ZNytuTA/artifacts/public/build/target.reftest.tests.zip into memory
 [task 2016-09-28T03:53:31.463397Z] 03:53:31     INFO - Expected file size: 574026
 [task 2016-09-28T03:53:31.464822Z] 03:53:31     INFO - Content-Encoding None
 [task 2016-09-28T03:53:31.464887Z] 03:53:31     INFO - Content-Type application/zip
 [task 2016-09-28T03:53:31.465545Z] 03:53:31     INFO - Http code 200
 [task 2016-09-28T03:53:31.472609Z] 03:53:31     INFO - Running post-action listener: _resource_record_post_action
 [task 2016-09-28T03:53:31.473238Z] 03:53:31     INFO - Running post-action listener: set_extra_try_arguments
 [task 2016-09-28T03:53:31.473324Z] 03:53:31     INFO - [mozharness: 2016-09-28 03:53:31.472948Z] Finished download-and-extract step (failed)
 [task 2016-09-28T03:53:31.475168Z] 03:53:31    FATAL - Uncaught exception: Traceback (most recent call last):
 [task 2016-09-28T03:53:31.475815Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 2022, in run
 [task 2016-09-28T03:53:31.476019Z] 03:53:31    FATAL -     self.run_action(action)
 [task 2016-09-28T03:53:31.476597Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 1961, in run_action
 [task 2016-09-28T03:53:31.477142Z] 03:53:31    FATAL -     self._possibly_run_method(method_name, error_if_missing=True)
 [task 2016-09-28T03:53:31.477611Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 1901, in _possibly_run_method
 [task 2016-09-28T03:53:31.478145Z] 03:53:31    FATAL -     return getattr(self, method_name)()
 [task 2016-09-28T03:53:31.478636Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/scripts/desktop_unittest.py", line 531, in download_and_extract
 [task 2016-09-28T03:53:31.479144Z] 03:53:31    FATAL -     suite_categories=target_categories)
 [task 2016-09-28T03:53:31.479671Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 553, in download_and_extract
 [task 2016-09-28T03:53:31.480235Z] 03:53:31    FATAL -     self._download_test_packages(suite_categories, extract_dirs)
 [task 2016-09-28T03:53:31.480689Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/mozilla/testing/testbase.py", line 464, in _download_test_packages
 [task 2016-09-28T03:53:31.481245Z] 03:53:31    FATAL -     extract_dirs=unpack_dirs)
 [task 2016-09-28T03:53:31.481753Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 686, in download_unpack
 [task 2016-09-28T03:53:31.482256Z] 03:53:31    FATAL -     function(**kwargs)
 [task 2016-09-28T03:53:31.482816Z] 03:53:31    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 558, in unzip
 [task 2016-09-28T03:53:31.483309Z] 03:53:31    FATAL -     with zipfile.ZipFile(compressed_file) as bundle:
 [task 2016-09-28T03:53:31.483806Z] 03:53:31    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 714, in __init__
 [task 2016-09-28T03:53:31.484365Z] 03:53:31    FATAL -     self._GetContents()
 [task 2016-09-28T03:53:31.484900Z] 03:53:31    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 748, in _GetContents
 [task 2016-09-28T03:53:31.485393Z] 03:53:31    FATAL -     self._RealGetContents()
 [task 2016-09-28T03:53:31.485887Z] 03:53:31    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 763, in _RealGetContents
 [task 2016-09-28T03:53:31.486413Z] 03:53:31    FATAL -     raise BadZipfile, "File is not a zip file"
 [task 2016-09-28T03:53:31.486939Z] 03:53:31    FATAL - BadZipfile: File is not a zip file
 [task 2016-09-28T03:53:31.487421Z] 03:53:31    FATAL - Running post_fatal callback...
 [task 2016-09-28T03:53:31.487918Z] 03:53:31 FATAL - Exiting -1
(sandbox) dustin@lamport ~/p/m-c (bug1269443) $ wget https://queue.taskcluster.net/v1/task/SDlWxl5OQTOX9u4ZNytuTA/artifacts/public/build/target.reftest.tests.zip
...
-rw-rw-r-- 1 dustin dustin 32102631 Sep 28 03:45 target.reftest.tests.zip

and

(sandbox) dustin@lamport ~/p/m-c (bug1269443) $ curl -L -v https://queue.taskcluster.net/v1/task/SDlWxl5OQTOX9u4ZNytuTA/artifacts/public/build/target.reftest.tests.zip >/dev/null
...
< HTTP/1.1 200 OK
< Content-Type: application/zip
< Content-Length: 32102631

so somehow this file is truncated on the server side when it is read.  I don't think S3 does that (its writes are atomic).. yet there it is!

I don't think retrying will help here.

John, do you have any idea how this might occur?  I might guess an error message, but the status is 200 and the size is 574026 bytes, which is a lot for an error response!  If you can dig up some cloud-mirror logging for this request, or see a way to add some logging output that might help debug, that'd be great.

Armen, what would you think about dropping the file contents as a TC artifact (in ~/artifacts) when this occurs?  It would be good to know if the contents are a prefix of the zipfile, or some XML blob, or what..
Flags: needinfo?(gps) → needinfo?(jhford)
Really, that all sounds somewhat similar to bug 1219934.
Which of these fields would be useful?
(Pdb) pp(response.headers.keys())
['content-length',
 'via',
 'x-cache',
 'accept-ranges',
 'server',
 'last-modified',
 'connection',
 'etag',
 'x-amz-cf-id',
 'date',
 'x-amz-version-id',
 'content-type',
 'age']
(Pdb) pp(response.headers.values())
['32102631',
 '1.1 2e30fde54499f7dddca8f5e56939db61.cloudfront.net (CloudFront)',
 'Hit from cloudfront',
 'bytes',
 'AmazonS3',
 'Wed, 28 Sep 2016 03:45:18 GMT',
 'close',
 '"098bdf06f370a606a75637840c4c0b3d"',
 'jjJzvka1TC6MwPTxn3CD3KzWv7bPfHFnQCJoNCLu0cjWoggh3X1kQg==',
 'Wed, 28 Sep 2016 15:22:41 GMT',
 'jwmF11fp84D3PjyPIbbVORt_v1ecEQvR',
 'application/zip',
 '182']
Armen, x-amz-* would be useful, as well as content-length (already logged), via and x-cache.  In fact, all of it would be useful.  Yes, this sounds a lot like bug 1219934 but I think we need to get to the bottom of it -- this is not a failure mode AWS services should be producing.
I checked the cloudfront logs for the download in comment 3.  It was not in cloudfront, so bug 1306040 is probably unrelated.  Aremen's going to try to get more headers and a copy of the corrupt file.
Flags: needinfo?(jhford)
Assignee: nobody → armenzg
This is what the output will look like bar my development exception raising:
12:57:05     INFO - Fetch https://queue.taskcluster.net/v1/task/SDlWxl5OQTOX9u4ZNytuTA/runs/0/artifacts/public%2Fbuild%2Ftarget.jsshell.zip into memory
12:57:05     INFO - Http code: 200
12:57:05     INFO - Content-Encoding: None
12:57:05     INFO - Content-Type: application/zip
12:57:05     INFO - via: 1.1 70bf723bd71722ac07a40107e9b08c63.cloudfront.net (CloudFront)
12:57:05     INFO - x-amz-cf-id: 2Q8tkGRI9XAoOZs9g-XCyDXKJQDHWqUrMaGil9wPNemil9Gd7irU8g==
12:57:05     INFO - x-amz-version-id: RrdLOF_lcmNQdfBVjfzMCaqofCMQ5IAK
12:57:05     INFO - x-cache: Hit from cloudfront
12:57:08     INFO - Expected file size: 8904439
12:57:08     INFO - Obtained file size: 8904439
12:57:08     INFO - Storing corrupted file to /home/armenzg/repos/firefox/testing/mozharness/build/upload/public%2Fbuild%2Ftarget.jsshell.zip
12:57:08    FATAL - Traceback (most recent call last):
12:57:08    FATAL - 
12:57:08    FATAL -   File "/home/armenzg/repos/firefox/testing/mozharness/mozharness/base/script.py", line 689, in download_unpack
12:57:08    FATAL -     function(**kwargs)
12:57:08    FATAL - 
12:57:08    FATAL -   File "/home/armenzg/repos/firefox/testing/mozharness/mozharness/base/script.py", line 559, in unzip
12:57:08    FATAL -     raise zipfile.BadZipfile('hey')
12:57:08    FATAL - 
12:57:08    FATAL - BadZipfile: hey
12:57:08    FATAL - 
12:57:08    FATAL - Running post_fatal callback...
12:57:08    FATAL - Exiting -1
Are we using TLS (https://) everywhere in the artifact upload, download, and even internal proxying lifecycle? TLS provides additional prevention against stream corruption and will provide a buffer against things like bit flips.

Regarding CloudFront, when I worked on a Firefox hotfix add-on to download a Firefox installer, we were seeing a surprising number of hash verification failures on installer downloads from CloudFront. Don't over-estimate the ability of S3 or CloudFront to actually deliver bits accurately, sadly :/
Comment on attachment 8795792 [details]
Bug 1305752 - If the file we download is a corrupted zip file store it for inspection.

https://reviewboard.mozilla.org/r/81734/#review80370
Attachment #8795792 - Flags: review?(dustin) → review+
Yes, https everywhere for artifact handling.  But you're right, this could just be CF barfs (in which case, bug 1306040 will help) or S3 barfs (in which case, well, that's what you get for using Other People's Computers, right?)
Pushed by armenzg@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7eda0dd1c0a8
If the file we download is a corrupted zip file store it for inspection. r=dustin
https://hg.mozilla.org/mozilla-central/rev/7eda0dd1c0a8
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
We're down to 0 since last Thursday.
Well, are we?  There are 100 different bugs for this issue, similar to the docker-image link intermittent (which is now closer to 1000 bugs).  Filing a new bug for every 2-3 instances of an infra issue is a great way to make it seem like our intermittent rates are low, but it doesn't actually *make* those intermittent rates low.
Resolution: FIXED → DUPLICATE
Duplicate of bug: 1306189
Whiteboard: [stockwell infra]
re: comment 23

These were caused by a bad file being copied.  John purged the file from cloud-mirror and things returned to normal.
You need to log in before you can comment on or make changes to this bug.