Open Bug 1323112 Opened 8 years ago Updated 4 months ago

Intermittent raise BadZipfile("Bad CRC-32 for file %r" % self.name)

Categories

(Release Engineering :: Applications: MozharnessCore, defect)

defect
Not set
normal

Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: bulk-close-intermittents, intermittent-failure, Whiteboard: [stockwell infra])

Attachments

(1 file)

969 bytes, text/x-python-script
Details
this bug spiked this past weekend, specifically December 19th on win8- Why does this seem to show up on Sunday evenings only?
this is win8 debug and we get an error unpacking:
BadZipfile: Bad CRC-32 for file 'web-platform/tests/media-source/webm/test-v-128k-320x240-24fps-8kfr.webm' 

it seems to be all related to:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=863c2b61bd27bb6099104933134d3be7c052551a&filter-searchStr=win

but I download the .zip file and unzip or python zipfile.extract() results in success.  Is this a timing issue?  Maybe the .zip file is not generated when we start running tests?  It looks as if the file was generated about 4 minutes prior, maybe that is not enough time and there is a pending handle on the .zip file?

what is odd is that retriggers on the failed jobs still result in the same failure.  it would be nice to figure out why our automation fails when it works manually.

Looking at the source code where we fail, I see a comment related to this mentioning a retry- but there is no code to retry:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#563

:armenzg, I see that you touched this code recently, could you weigh in on the comment at line 563 and your thoughts on this specific error?
I forgot to ni Armen:)
Flags: needinfo?(armenzg)
Attached file unzip.py
The comment referred to is probably incorrect.
We don't retry for BadZipfile; we abort and dump the exception:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#682-687

41/44 of the tagged jobs are in reference to a specific revision [1], thus, the spike.

As whimboo points out we can't reproduce this issue locally.
My assumption is that the file we download is not the same one that cloud mirror gives to the test machine or that the extract() method behaves slightly different on Windows than on my Mac machine. The attached code tries to extract the code in a similar way as Mozharness would do.

Loaning a Windows machine and trying there could yield more information.

Notice that in Mozharness we actually read the file into memory rather than downloading to disk. I assume this should not have an impact.

Notice that it also specifically affects web platform tests rather than all suites and even then, not all of them failed.
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=863c2b61bd27bb6099104933134d3be7c052551a&filter-searchStr=windows%20x64%20debug%20web&group_state=expanded

I've also noticed that for different jobs, it happened on different files (target.mochitest.tests.zip vs firefox-53.0a1.en-US.win64.web-platform.tests.zip) and even the same file worked later on.

It failed:
09:26:38     INFO - Fetch https://queue.taskcluster.net/v1/task/OCoe-sWdQCqEKUobEwDwjw/artifacts/public/build/firefox-53.0a1.en-US.win64.web-platform.tests.zip into memory
09:26:57    FATAL - BadZipfile: Bad CRC-32 for file 'web-platform/tests/media-source/webm/test-v-128k-320x240-24fps-8kfr.webm'

It worked (I don't understand why the machine would use 21 hours instead of 9):
21:34:42     INFO - Fetch https://queue.taskcluster.net/v1/task/OCoe-sWdQCqEKUobEwDwjw/artifacts/public/build/firefox-53.0a1.en-US.win64.common.tests.zip into memory
https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-central-win64-debug/1482121475/mozilla-central_win8_64-debug_test-web-platform-tests-e10s-3-bm112-tests1-windows-build4.txt.gz

How far should we investigate this? My assumption is this doesn't happen often enough to require engineering time.

Possible solutions:
* Download the file once more and try one more time
* Dump the exact URL

STR:
* wget https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/public/build/target.mochitest.tests.zip
* python unzip.py

[1]
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1323112&startday=2016-12-11&endday=2016-12-20


[task 2016-12-12T20:23:36.365854Z] 20:23:36     INFO - Downloading and extracting to /home/worker/workspace/build/tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, mochitest/* from https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/public/build/target.mochitest.tests.zip
[task 2016-12-12T20:23:36.373149Z] 20:23:36     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/public/build/target.mochitest.tests.zip'}, attempt #1
[task 2016-12-12T20:23:36.375902Z] 20:23:36     INFO - Fetch https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/public/build/target.mochitest.tests.zip into memory
[task 2016-12-12T20:23:45.672087Z] 20:23:45     INFO - Expected file size: 74845730
[task 2016-12-12T20:23:45.675772Z] 20:23:45     INFO - Obtained file size: 74845730
[task 2016-12-12T20:23:57.953344Z] 20:23:57    FATAL - Traceback (most recent call last):
[task 2016-12-12T20:23:57.956006Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.960371Z] 20:23:57    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 684, in download_unpack
[task 2016-12-12T20:23:57.962159Z] 20:23:57    FATAL -     function(**kwargs)
[task 2016-12-12T20:23:57.963859Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.965584Z] 20:23:57    FATAL -   File "/home/worker/workspace/mozharness/mozharness/base/script.py", line 566, in unzip
[task 2016-12-12T20:23:57.967284Z] 20:23:57    FATAL -     bundle.extract(entry, path=extract_to)
[task 2016-12-12T20:23:57.969611Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.971416Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 952, in extract
[task 2016-12-12T20:23:57.973149Z] 20:23:57    FATAL -     return self._extract_member(member, path, pwd)
[task 2016-12-12T20:23:57.974846Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.976527Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 997, in _extract_member
[task 2016-12-12T20:23:57.978190Z] 20:23:57    FATAL -     shutil.copyfileobj(source, target)
[task 2016-12-12T20:23:57.979782Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.981449Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
[task 2016-12-12T20:23:57.983074Z] 20:23:57    FATAL -     buf = fsrc.read(length)
[task 2016-12-12T20:23:57.984668Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.986423Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 583, in read
[task 2016-12-12T20:23:57.988105Z] 20:23:57    FATAL -     data = self.read1(n - len(buf))
[task 2016-12-12T20:23:57.989793Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.993687Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 643, in read1
[task 2016-12-12T20:23:57.995514Z] 20:23:57    FATAL -     self._update_crc(data, eof=eof)
[task 2016-12-12T20:23:57.997256Z] 20:23:57    FATAL - 
[task 2016-12-12T20:23:57.999044Z] 20:23:57    FATAL -   File "/usr/lib/python2.7/zipfile.py", line 598, in _update_crc
[task 2016-12-12T20:23:58.000835Z] 20:23:58    FATAL -     raise BadZipfile("Bad CRC-32 for file %r" % self.name)
[task 2016-12-12T20:23:58.002456Z] 20:23:58    FATAL - 
[task 2016-12-12T20:23:58.004297Z] 20:23:58    FATAL - BadZipfile: Bad CRC-32 for file 'mochitest/browser/devtools/client/inspector/markup/test/lib_jquery_1.6_min.js'
Flags: needinfo?(armenzg)
jhford: hey John, is there a way that Mozharness could figure out exactly where a file is downloaded from TaskCluster?
I believe users outside of Amazon get a copy of the file from a different location.
It might be useful to debug when we can't uncompress a file.

Could you also check if you have any issues unzipping these two files with the script attached? Notice that I hardcoded the path on the script (I got lazy):
https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/public/build/target.mochitest.tests.zip
https://queue.taskcluster.net/v1/task/OCoe-sWdQCqEKUobEwDwjw/artifacts/public/build/firefox-53.0a1.en-US.win64.web-platform.tests.zip

Thanks!
Flags: needinfo?(jhford)
we had 100+ instances happen today- luckily this happens in spurts, but this is nasty when it does happen.
(In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #6)
> jhford: hey John, is there a way that Mozharness could figure out exactly
> where a file is downloaded from TaskCluster?

Yes, mozharness would need to log the Location: headers of all redirects when resolving an artifact URL.  This means that more than likely, mozharness would need to implement its own redirection handling, or instrument the existing implementation.

> I believe users outside of Amazon get a copy of the file from a different
> location.

Yes, that's 100% by design.  All requests which originate from AWS's us-west-2 get served raw S3 urls.  All requests which originate from AWS's other regions get served cloud-mirror urls.  All remaining requests get CDN urls.

A cloud-url redirects either to the us-west-2 url if the request times out, or else it redirects to an s3 in the same region as the request originated from.

All queue artifact urls get at least one 300-series redirect and all handling code should handle an arbitraty number of redirects.

> It might be useful to debug when we can't uncompress a file.

Yes.  I am working to reimplement a large part of the Taskcluster artifact api this quarter.  One of the goals I have for that is developing a better way to download artifacts, which might include implementing a small program which can be used to do all the download/retry/redirect logic.  Once this work is done, we'll better be able to ensure that artifacts are stored and uploaded only if valid.

> Could you also check if you have any issues unzipping these two files with
> the script attached? Notice that I hardcoded the path on the script (I got
> lazy):
> https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/
> public/build/target.mochitest.tests.zip
> https://queue.taskcluster.net/v1/task/OCoe-sWdQCqEKUobEwDwjw/artifacts/
> public/build/firefox-53.0a1.en-US.win64.web-platform.tests.zip

From where would I be downloading this file?  My desktop (cdn), us-west-2 or a different aws region?

> Thanks!

You're welcome!
Flags: needinfo?(jhford)
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #12)
> (In reply to Armen Zambrano [:armenzg] (EDT/UTC-4) from comment #6)
> > Could you also check if you have any issues unzipping these two files with
> > the script attached? Notice that I hardcoded the path on the script (I got
> > lazy):
> > https://queue.taskcluster.net/v1/task/J-hzrcDWQXepJLWcBmryIQ/artifacts/
> > public/build/target.mochitest.tests.zip
> > https://queue.taskcluster.net/v1/task/OCoe-sWdQCqEKUobEwDwjw/artifacts/
> > public/build/firefox-53.0a1.en-US.win64.web-platform.tests.zip
> 
> From where would I be downloading this file?  My desktop (cdn), us-west-2 or
> a different aws region?
> 
I believe I was running this script from my local machine.

FYI I will be gone Friday until the end of March.
this comes in spurts, :gbrown, do you think this is really infra related, or something in our harnesses that we could fix with more time.
Flags: needinfo?(gbrown)
I think it is infra related. I'm hoping jhford's work - comment 12 - brings about resolution here.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell infra]
:jhford, could you give an update here on your work to fix this issue?  about 2 months ago you mentioned some work that would help out here- we had a large spike yesterday and it would be good to get an update.
Flags: needinfo?(jhford)
Possible causes of this issue from Taskcluster should be detectable and/or resolved by the new artifact api that I'm working on.
Flags: needinfo?(jhford)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → INCOMPLETE
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INCOMPLETE
Recent failure log:

https://treeherder.mozilla.org/logviewer.html#?job_id=188522941&repo=autoland&lineNumber=263

01:33:55     INFO - Downloaded 72536800 bytes.
01:33:55     INFO - Setting property build_url to https://queue.taskcluster.net/v1/task/Fkau-fvyQ--uz8ljj9672w/artifacts/public/build/target.dmg
01:33:55     INFO - retry: Calling query_symbols_url with args: (), kwargs: {'raise_on_failure': True}, attempt #1
01:33:55     INFO - Setting property symbols_url to https://queue.taskcluster.net/v1/task/Fkau-fvyQ--uz8ljj9672w/artifacts/public/build/target.crashreporter-symbols.zip
01:33:55     INFO - Downloading and extracting to /Users/cltbld/tasks/task_1531808327/build/symbols these dirs * from https://queue.taskcluster.net/v1/task/Fkau-fvyQ--uz8ljj9672w/artifacts/public/build/target.crashreporter-symbols.zip
01:33:55     INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': 'https://queue.taskcluster.net/v1/task/Fkau-fvyQ--uz8ljj9672w/artifacts/public/build/target.crashreporter-symbols.zip'}, attempt #1
01:33:55     INFO - Fetch https://queue.taskcluster.net/v1/task/Fkau-fvyQ--uz8ljj9672w/artifacts/public/build/target.crashreporter-symbols.zip into memory
01:33:58     INFO - Content-Length response header: 143977258
01:33:58     INFO - Bytes received: 143977258
01:34:00    FATAL - Traceback (most recent call last):
01:34:00    FATAL - 
01:34:00    FATAL -   File "/Users/cltbld/tasks/task_1531808327/mozharness/mozharness/base/script.py", line 729, in download_unpack
01:34:00    FATAL -     function(**kwargs)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/Users/cltbld/tasks/task_1531808327/mozharness/mozharness/base/script.py", line 609, in unzip
01:34:00    FATAL -     bundle.extract(entry, path=extract_to)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 1028, in extract
01:34:00    FATAL -     return self._extract_member(member, path, pwd)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 1084, in _extract_member
01:34:00    FATAL -     shutil.copyfileobj(source, target)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 49, in copyfileobj
01:34:00    FATAL -     buf = fsrc.read(length)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 632, in read
01:34:00    FATAL -     data = self.read1(n - len(buf))
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 692, in read1
01:34:00    FATAL -     self._update_crc(data, eof=eof)
01:34:00    FATAL - 
01:34:00    FATAL -   File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 647, in _update_crc
01:34:00    FATAL -     raise BadZipfile("Bad CRC-32 for file %r" % self.name)
01:34:00    FATAL - 
01:34:00    FATAL - BadZipfile: Bad CRC-32 for file 'XUL/841C1CED9493390CABFB4175DF40C05D0/XUL.sym'
01:34:00    FATAL - 
01:34:00    FATAL - Running post_fatal callback...
01:34:00    FATAL - Exiting -1
01:34:00     INFO - Running post-action listener: _resource_record_post_action
01:34:00     INFO - Running post-action listener: find_tests_for_coverage
01:34:00     INFO - Running post-action listener: find_tests_for_verification
01:34:00     INFO - Running post-action listener: set_extra_try_arguments
01:34:00     INFO - Running post-action listener: setup_coverage_tools
01:34:00     INFO - [mozharness: 2018-07-17 08:34:00.897748Z] Finished download-and-extract step (failed)
01:34:00     INFO - Running post-run listener: _resource_record_post_run
01:34:00     INFO - Running post-run listener: copy_logs_to_upload_dir
01:34:00     INFO - Copying logs to upload dir...
01:34:00     INFO - mkdir: /Users/cltbld/tasks/task_1531808327/build/upload/logs
[taskcluster 2018-07-17T08:34:00.929Z] Exit Code: 255
[taskcluster 2018-07-17T08:34:00.930Z] === Task Finished ===
[taskcluster 2018-07-17T08:34:00.930Z] Task Duration: 1m6.694790973s
[taskcluster 2018-07-17T08:34:01.283Z] Uploading artifact public/logs/localconfig.json from file logs/localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:01.945Z] Uploading artifact public/logs/log_critical.log from file logs/log_critical.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:02.309Z] Uploading artifact public/logs/log_error.log from file logs/log_error.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:02.690Z] Uploading artifact public/logs/log_fatal.log from file logs/log_fatal.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:03.077Z] Uploading artifact public/logs/log_info.log from file logs/log_info.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:03.439Z] Uploading artifact public/logs/log_raw.log from file logs/log_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster 2018-07-17T08:34:03.834Z] Uploading artifact public/logs/log_warning.log from file logs/log_warning.log with content encoding "gzip", mime type "text/plain" and expiry 2019-07-17T08:14:04.703Z
[taskcluster:error] Uploading error artifact public/test_info from file build/blobber_upload_dir with message "Could not read directory '/Users/cltbld/tasks/task_1531808327/build/blobber_upload_dir'", reason "file-missing-on-worker" and expiry 2019-07-17T08:14:04.703Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory '/Users/cltbld/tasks/task_1531808327/build/blobber_upload_dir'
[taskcluster:error] exit status 255
[taskcluster:error] file-missing-on-worker: Could not read directory '/Users/cltbld/tasks/task_1531808327/build/blobber_upload_dir'
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
https://wiki.mozilla.org/Bug_Triage#Intermittent_Test_Failure_Cleanup
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → INCOMPLETE
See Also: → 1760700
Assignee: nobody → gbrown
Status: REOPENED → RESOLVED
Closed: 6 years ago2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

(In reply to Natalia Csoregi [:nataliaCs] from comment #64)

Recent log: https://treeherder.mozilla.org/logviewer?job_id=397311129&repo=mozilla-central&lineNumber=1593

[task 2022-11-22T12:08:47.221Z] 12:08:47     INFO - Downloading https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/d7qh90YzQgahXSkDa8opJg/artifacts/public/build/target.zip to Z:\task_166911850942579\build\installer.zip
[task 2022-11-22T12:08:47.221Z] 12:08:47     INFO - retry: Calling _download_file with args: (), kwargs: {'url': 'https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/d7qh90YzQgahXSkDa8opJg/artifacts/public/build/target.zip', 'file_name': 'Z:\\task_166911850942579\\build\\installer.zip'}, attempt #1
[task 2022-11-22T12:08:48.679Z] 12:08:48     INFO - Downloaded 156469142 bytes.

If I download locally, I get a file of the exact same size.

[task 2022-11-22T12:10:57.212Z] 12:10:57    ERROR -  mozinstall.mozinstall.InstallError: Failed to install "Z:\task_166911850942579\build\installer.zip (Bad CRC-32 for file 'firefox/fonts/TwemojiMozilla.ttf')"

Locally, I can unzip without any problem. Also, other tasks used the same zip without incident (but 3 tasks failed exactly the same way, on the same zip and the same member, TwoemojiMozilla.ttf).

Duplicate of this bug: 1760700
See Also: 1760700
Duplicate of this bug: 1678300

Most failures from the last year have been on Windows, especially in Windows HW tests.

Status: REOPENED → RESOLVED
Closed: 2 years ago1 year ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Assignee: gbrown → nobody
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: