Closed Bug 1698670 Opened 3 years ago Closed 3 years ago

corrupt complete mar: _lzma.LZMAError: Corrupt input data

Categories: Release Engineering :: Release Automation: Updates (defect)

Tracking: Not tracked

Status: RESOLVED INCOMPLETE

People: Reporter: mozilla; Assigned: mozilla

Attachments: 1 file

(I wasn't sure where to file this bug.)

The previous instance of this was in bug 1695898, for win64 lt repacks in a nightly graph. Now we're hitting it on mozilla-release for macos64 en-US builds in the Firefox 87.0 release graph. The issue is twofold:

  1. During the repackage task, we somehow create a mar file that downstream partial update generation tasks can't decompress (_lzma.LZMAError: Corrupt input data).
  2. We don't detect that the mar file is invalid, so the repackage task goes green and unblocks a bunch of downstream tasks. By the time the partial update generation task hits the error, many downstream tasks have already run; rerunning the upstream repackage task means re-running all of those downstream tasks in order, or we end up with an invalid nightly or release that can break all subsequent nightlies or releases.

I propose we:

  1. Find out why we're creating broken mar files and fix it, and/or
  2. Detect that the mar file is broken during the repackage task, and retry or fail out if it is.

https://firefox-ci-tc.services.mozilla.com/tasks/euS2QZzKQze4u5kaYU6kaw is the failing partial update generation task in Fx87.0, and https://firefox-ci-tc.services.mozilla.com/tasks/Y2kolD9GTO25C2NX3INsWw/runs/0 is the repackage task that produced the broken mar file.

I think we're doing the compression in https://searchfox.org/mozilla-central/source/tools/update-packaging/make_full_update.sh during the repackage task.

:agashlin suggests:

That's not great... The easiest thing would be to do a trial decompression (unxz -t) of the data before packing it into the MAR.

We could add that call into the make_full_update.sh script, or we could inspect the full mar at the end of the mar.py module in repackage.
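A minimal sketch of what such an in-task check could look like, assuming the mardor package (the same one funsize uses) is importable where the repackage code runs; the helper name and the exact hook point are illustrative, not actual repackage code:

import lzma
import tempfile

from mardor.reader import MarReader

def assert_mar_extracts(mar_path):
    """Fail the task if any entry of the freshly built MAR can't be decompressed."""
    with open(mar_path, "rb") as f:
        with MarReader(f) as m:
            with tempfile.TemporaryDirectory() as tmp:
                try:
                    # Same extract() call the downstream funsize task makes;
                    # doing it here surfaces _lzma.LZMAError before any
                    # downstream tasks are unblocked.
                    m.extract(tmp)
                except lzma.LZMAError as e:
                    raise RuntimeError("corrupt complete mar %s: %s" % (mar_path, e))

That would let the repackage task retry or fail out, per point 2 of the proposal above.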

He also says:

Is there any way to determine whether this job happened to be running on the same hardware as the one that failed two weeks ago? Can you report the machines to AWS for them to run a diagnostic?

I don't know the answer to this; I asked the #taskcluster matrix channel.

Aryx noticed the nightly repackage worker and the recent release repackage worker were in different AWS regions, which rules out the same hardware.

Lots of discussion in matrix #install-update. Currently we're poking at a possible issue with the taskcluster-artifacts CDN; if that's the cause, a check in the repackage task wouldn't solve the problem.

So the mac complete mars appear to be the same file between runs: the sha256 in the cot (chain of trust) artifacts stays the same, and agashlin found the md5 etag was the same, probably ruling out an upload error. This was not the case in the win64 lt bustage in bug 1695898 - the sha256 changed between runs there.

We currently suspect the CDN caching a partial download, or something similar. It could also point to partial update weirdness, but I don't know how. agashlin did see:

Maybe, for some reason it goes through firefoxci.taskcluster-artifacts.net the first (failing) time, but s3.us-west-2.amazonaws.com the second (successful) time

Testing for a corrupt mar in-task wouldn't fix the mac issue we hit in fx87, but it may have caught the win64 lt issue we hit in bug 1695898.

If someone can summarize the findings so far, that would be useful. Ideally we could narrow this down to one of:

  • data is uploaded incorrectly
  • data changes at rest in AWS S3 or CloudFront
  • data is downloaded incorrectly

Knowing the nature of the corruption would be useful, too: is the data truncated, but correct until the point of truncation? Or are there bytes differing between good and bad files?
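To answer the truncation-vs-differing-bytes question, comparing a known-good copy against a suspect copy would be enough; a small sketch (the file names are hypothetical):

def compare_files(good_path, bad_path, chunk_size=1 << 20):
    """Report whether bad_path is truncated or has differing bytes vs good_path."""
    offset = 0
    with open(good_path, "rb") as good, open(bad_path, "rb") as bad:
        while True:
            g = good.read(chunk_size)
            b = bad.read(chunk_size)
            if g != b:
                for i, (x, y) in enumerate(zip(g, b)):
                    if x != y:
                        return "bytes differ starting at offset %d" % (offset + i)
                # One stream is a prefix of the other: truncation.
                return "truncated at offset %d" % (offset + min(len(g), len(b)))
            if not g:
                return "files are identical"
            offset += len(g)

print(compare_files("target.complete.mar.good", "target.complete.mar.bad"))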

The CDN is a pretty standard CloudFront distribution with an S3 bucket as backend. The queue's REST API redirects to the CloudFront URL unless the caller's IP is in the same region as the bucket, in which case it hands out the bucket URL.

I'm still struggling to understand what's going wrong here. I had a look at both runs in
https://firefox-ci-tc.services.mozilla.com/tasks/H1JYT38VSImbAOP64SzY7w
and those two runs appear to have created different target.complete.mars, as verified by the sha256s in chain-of-trust.json. I downloaded both from whatever CloudFront (CDN) node is closest to me (response headers suggest that's ATL56-C1), and got full files with matching sha256s.

Looking at the tasks linked in comment 0, https://firefox-ci-tc.services.mozilla.com/tasks/Y2kolD9GTO25C2NX3INsWw has two runs, both of which produced the same target.complete.mar, of 98864178 bytes with sha256 bd0078554b488937eedf5b6587d7a4eb10d1516e126660333ca8b0a8247c6bfd. Downloading those via CDN (again from ATL56-C1) gives files with the correct length and sha256. The subsequent task downloads it, ending with

2021-03-15 19:14:02,293 - INFO - Downloaded https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Y2kolD9GTO25C2NX3INsWw/artifacts/public/build/target.complete.mar, 98864178 bytes in 3 seconds

which is the correct length, at least. That download was performed on an EC2 instance in us-east-1, so it would have been via CloudFront, but probably not the ATL56-C1 node. Running mar -v on that file seems to work just fine.

Looking at funsize.py, nothing looks amiss. Here's the traceback:

  File "/home/worker/bin/funsize.py", line 279, in download_and_verify_mars
    m.extract(downloads[url]["extracted_path"])
  File "/usr/local/lib/python3.8/dist-packages/mardor/reader.py", line 146, in extract
    write_to_file(self.extract_entry(e, decompress), f)
  File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 103, in write_to_file
    for block in src:
  File "/usr/local/lib/python3.8/dist-packages/mardor/reader.py", line 126, in extract_entry
    for block in stream:
  File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 229, in auto_decompress_stream
    for block in src:
  File "/usr/local/lib/python3.8/dist-packages/mardor/utils.py", line 181, in xz_decompress_stream
    decoded = dec.decompress(block)
_lzma.LZMAError: Corrupt input data

(using mar==3.1.0)

So the error appears to be coming straight from the (binary) LZMA library. Confirming versions in the docker image used in CI and on my host:

worker@f761feb421f9:/$ apt list liblzma-dev
Listing... Done
liblzma-dev/bionic,now 5.2.2-1.3 amd64 [installed]
worker@f761feb421f9:/$ exit
lamport ~/p/m-c (c00d2b6) $ apt list liblzma-dev
Listing... Done
liblzma-dev/bionic,now 5.2.2-1.3 amd64 [installed]

And, indeed:

worker@342d9574cd9a:/$ mar -v /tmp/run0.mar 
Verification OK

So, the boundaries we have on the issue are that the correct bytes made it to S3 at one point (since they are present now). Either those bytes were changed somewhere between S3 and the invocation of MarReader.extract, or there's an intermittent issue with the extraction. The latter might be caused by bad RAM on an EC2 instance, for example, although I'd expect that to affect very few cases -- how frequently does this happen?

I'd suggest adding a sha256 hash to the funsize download and printing the hash after the download. Then, next time this issue occurs, you can determine whether the corruption occurred during the download or after.
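Roughly what that could look like in funsize.py (a sketch only; the logger and helper names here are placeholders, not the actual funsize code):

import hashlib
import logging

log = logging.getLogger(__name__)

def get_hash(path, hash_type="sha256"):
    """Hash a downloaded file in chunks so large MARs don't need to fit in memory."""
    h = hashlib.new(hash_type)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            h.update(chunk)
    return h.hexdigest()

# After the download completes, before MarReader.extract():
# log.info("sha256 of %s: %s", dest_path, get_hash(dest_path))

Logging the digest right after the download means the next failure immediately shows whether the bytes were already bad on arrival or were corrupted afterwards.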

Keywords: leave-open
Assignee: nobody → aki
Status: NEW → ASSIGNED
Pushed by asasaki@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4a37d885ee55
log sha256 of downloads in funsize.py r=releng-reviewers,bhearsum DONTBUILD

We haven't seen this in 5 months, so we haven't triggered the extra logging.
Resolving until we see this again.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → INCOMPLETE
Keywords: leave-open