corrupt complete mar: _lzma.LZMAError: Corrupt input data
Categories
(Release Engineering :: Release Automation: Updates, defect)
Tracking
(Not tracked)
People
(Reporter: mozilla, Assigned: mozilla)
References
Details
Attachments
(1 file)
(I wasn't sure where to file this bug.)
The previous instance of this was in bug 1695898, for win64 lt repacks in a nightly graph. Now we're hitting it on mozilla-release for macos64 en-US builds in the Firefox 87.0 release graph. The issue is twofold:
- During the repackage task, we somehow create a mar file that downstream partial update generation tasks can't decompress (_lzma.LZMAError: Corrupt input data).
- We don't detect that the mar file is invalid, so the repackage task goes green and unblocks a bunch of downstream tasks. By the time we find the error in the partial update generation task, we have run many downstream tasks; rerunning this upstream task means we need to re-run all downstream tasks in order, or we end up with an invalid nightly or release that can break all subsequent nightlies or releases.
I propose we:
- Find why we're creating broken mar files, and fix it, and/or
- Detect that the mar file is broken during the repackage task, and retry or fail out if it is.
https://firefox-ci-tc.services.mozilla.com/tasks/euS2QZzKQze4u5kaYU6kaw is the failing partial update generation task in Fx87.0, and https://firefox-ci-tc.services.mozilla.com/tasks/Y2kolD9GTO25C2NX3INsWw/runs/0 is the repackage task that produced the broken mar file.
Comment 1 (Assignee) • 3 years ago
I think we're compressing in https://searchfox.org/mozilla-central/source/tools/update-packaging/make_full_update.sh during the repackage.
:agashlin suggests:
That's not great... The easiest thing would be to do a trial decompression (unxz -t) of the data before packing it into the MAR.
We could add that call into the make_full_update.sh script, or we could inspect the full mar at the end of the mar.py module in repackage.
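The `unxz -t` trial decompression can also be done in-process with Python's stdlib lzma module, which avoids shelling out from the packaging script. A minimal sketch (the function name `xz_data_is_valid` is hypothetical, not existing repackage code), assuming we have the raw xz-compressed bytes before they are packed into the MAR:

```python
import lzma

def xz_data_is_valid(data: bytes) -> bool:
    """Trial-decompress an xz stream, like `unxz -t`, discarding the output.

    Returns False if the stream is corrupt (LZMAError) or truncated
    (decompression succeeds so far but the stream never terminates).
    """
    dec = lzma.LZMADecompressor()
    try:
        dec.decompress(data)
    except lzma.LZMAError:
        return False
    # A truncated stream raises no error; require a clean end-of-stream marker.
    return dec.eof
```

The repackage task could call this on each compressed entry (or on the fully extracted MAR contents) and retry or fail out on False, instead of going green with a broken artifact.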
He also says:
Is there any way to determine whether this job happened to be running on the same hardware as the one that failed two weeks ago? Can you report the machines to AWS for them to run a diagnostic?
I don't know the answer to this; I asked the #taskcluster matrix channel.
Comment 2 (Assignee) • 3 years ago
Aryx noticed the nightly repackage worker and the recent release repackage worker were in different AWS regions, which rules out the same hardware.
Comment 3 (Assignee) • 3 years ago
Lots of discussion in the Matrix #install-update channel. Currently we're looking at a possible issue with the taskcluster-artifacts CDN; if that's the cause, an in-task check during repackage wouldn't solve the problem.
Comment 4 (Assignee) • 3 years ago
So the mac complete mars appear to be the same file between runs: the sha256 in the cot artifacts stays the same, and agashlin found the md5 etag was the same, probably ruling out an upload error. This was not the case in the win64 lt bustage in bug 1695898 - the sha256 changed between runs there.
We currently suspect the CDN may be caching a partial download. It could also point to partial-update weirdness, though I don't see how. agashlin did see:
Maybe, for some reason it goes through firefoxci.taskcluster-artifacts.net the first (failing) time, but s3.us-west-2.amazonaws.com the second (successful) time
Testing for a corrupt mar in-task wouldn't fix the mac issue we hit in fx87, but it may have caught the win64 lt issue we hit in bug 1695898.
Comment 5 • 3 years ago
If someone can summarize the findings so far, that would be useful. Ideally we could narrow this down to one of:
- data is uploaded incorrectly
- data changes at rest in AWS S3 or CloudFront
- data is downloaded incorrectly
Knowing the nature of the corruption would be useful, too: is the data truncated, but correct until the point of truncation? Or are there bytes differing between good and bad files?
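Characterizing the corruption as suggested above (truncation versus differing bytes) is straightforward once a good and a bad copy of the artifact are in hand. A hypothetical helper, not part of any existing tooling, illustrating the distinction:

```python
def classify_corruption(good: bytes, bad: bytes):
    """Compare a known-good artifact against a suspect copy.

    Returns ("differs", offset) at the first mismatching byte,
    ("truncated", length) if the suspect is a clean prefix/extension,
    or ("identical", None) if the two are byte-for-byte equal.
    """
    n = min(len(good), len(bad))
    for i in range(n):
        if good[i] != bad[i]:
            return ("differs", i)
    if len(good) != len(bad):
        return ("truncated", n)
    return ("identical", None)
```

A "truncated" result would point at an interrupted transfer or partial CDN cache; "differs" mid-stream would point at bit flips in transit or at rest.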
The CDN is a pretty standard CloudFront distribution with an S3 bucket as backend. The queue's REST API redirects to the CloudFront URL unless the caller's IP is in the same region as the bucket, in which case it hands out the bucket URL.
Comment 6 • 3 years ago
I'm still struggling to understand what's going wrong here. I had a look at both runs in https://firefox-ci-tc.services.mozilla.com/tasks/H1JYT38VSImbAOP64SzY7w, and those two runs appear to have created different target.complete.mar files, as verified by the sha256's in chain-of-trust.json. I downloaded both from whatever CloudFront (CDN) node is closest to me (response headers suggest that's ATL56-C1), and got full files with matching sha256's.
Looking at the tasks linked in comment 0, https://firefox-ci-tc.services.mozilla.com/tasks/Y2kolD9GTO25C2NX3INsWw has two runs, both of which produced the same target.complete.mar, of 98864178 bytes with sha256 bd0078554b488937eedf5b6587d7a4eb10d1516e126660333ca8b0a8247c6bfd. Downloading those via the CDN (again from ATL56-C1) gives files with the correct length and sha256. The subsequent task downloads it, ending with
2021-03-15 19:14:02,293 - INFO - Downloaded https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Y2kolD9GTO25C2NX3INsWw/artifacts/public/build/target.complete.mar, 98864178 bytes in 3 seconds
which is the correct length, at least. That download was performed on an EC2 instance in us-east-1, so it would have been via CloudFront, but probably not the ATL56-C1 node. Running mar -v on that file seems to work just fine.
Looking at funsize.py, nothing looks amiss. Here's the traceback:
File "/home/worker/bin/funsize.py", line 279, in download_and_verify_mars
  m.extract(downloads[url]["extracted_path"])
File "/usr/local/lib/python3.8/dist-packages/mardor/reader.py", line 146, in extract
  write_to_file(self.extract_entry(e, decompress), f)
File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 103, in write_to_file
  for block in src:
File "/usr/local/lib/python3.8/dist-packages/mardor/reader.py", line 126, in extract_entry
  for block in stream:
File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 229, in auto_decompress_stream
  for block in src:
File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 181, in xz_decompress_stream
  decoded = dec.decompress(block)
_lzma.LZMAError: Corrupt input data
(using mar==3.1.0)
So the error appears to be coming straight from the (binary) LZMA library. Confirming versions in the docker image used in CI and on my host:
worker@f761feb421f9:/$ apt list liblzma-dev
Listing... Done
liblzma-dev/bionic,now 5.2.2-1.3 amd64 [installed]
worker@f761feb421f9:/$ exit
lamport ~/p/m-c (c00d2b6) $ apt list liblzma-dev
Listing... Done
liblzma-dev/bionic,now 5.2.2-1.3 amd64 [installed]
And, indeed:
worker@342d9574cd9a:/$ mar -v /tmp/run0.mar
Verification OK
So, the boundaries we have on the issue are that the correct bytes made it to S3 at one point (since they are present now). Either those bytes were changed somewhere between S3 and the invocation of MarReader.extract, or there's an intermittent issue with the extraction. The latter might be caused by bad RAM on an EC2 instance, for example, although I'd expect that to affect very few cases -- how frequently does this happen?
I'd suggest adding a sha256 hash to the funsize download and printing the hash after the download. Then, next time this issue occurs, you can determine whether the corruption occurred during the download or after.
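The hash-after-download suggestion amounts to streaming the artifact through hashlib and logging the digest before anything extracts it. A minimal sketch, assuming this is wired into the funsize download path (the function name `sha256_of_file` is illustrative, not the actual funsize.py code):

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream-hash a downloaded artifact in chunks so large MARs
    don't need to fit in memory; log the result right after download."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing this logged digest against the sha256 recorded in chain-of-trust.json would immediately distinguish "corrupted during download" from "corrupted after download".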
Comment 7 (Assignee) • 3 years ago
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4a37d885ee55
log sha256 of downloads in funsize.py r=releng-reviewers,bhearsum DONTBUILD
Comment 9 • 3 years ago
bugherder
Comment hidden (Intermittent Failures Robot)
Comment 11 (Assignee) • 3 years ago
We haven't seen this in 5 months, so we haven't triggered the extra logging.
Resolving until we see this again.