Closed Bug 1430535 Opened 6 years ago Closed 6 years ago

Partial updates failing with "destination file crc X does not match expected crc Y"

Categories

(Release Engineering :: General, defect)
Priority: Not set
Severity: normal

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)


I first started looking at this because of update verify failures in DevEdition 59.0b1, for partial updates but not completes. Then I found it's present on Nightly too. While this doesn't block updates it does slow them down a lot.

Here's how it manifests, e.g. in a last-update.log:
...
PREPARE PATCH Contents/Info.plist
EXECUTE PATCH Contents/_CodeSignature/CodeResources
LoadSourceFile: destination file crc -1001769479 does not match expected crc 262245848
LoadSourceFile failed
### execution failed
...

update.status contains "failed: 4". The specific file being patched depends on the platform; on win64 it's xul.dll.sig. Firefox/Nightly goes on to download the complete update and apply it.
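For anyone unfamiliar with the failure mode: before applying a patch, the updater computes a checksum of the file already on disk and compares it with the value recorded when the partial was generated; on a mismatch LoadSourceFile fails and the client falls back to the complete update. A minimal Python sketch of that check (illustrative only; the actual updater is C++ and, per the comments below, uses the bzip2 CRC table rather than zlib's):

import ctypes
import zlib

def file_crc32_signed(path):
    # Compute a CRC-32 over the whole file and report it as a signed 32-bit
    # integer, which is how the updater log prints it (hence the negative
    # "destination file crc -1001769479" above).
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            crc = zlib.crc32(chunk, crc)
    return ctypes.c_int32(crc).value

# expected_crc comes from the partial's patch metadata; if the value computed
# from the file on disk differs, the patch is rejected:
#   if file_crc32_signed(path) != expected_crc: raise RuntimeError("LoadSourceFile failed")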

We see this in telemetry for Nightly:
https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=bucket-4!bucket-9&cumulative=0&end_date=2018-01-12&keys=!__none__!__none__&max_channel_version=nightly%252F59&measure=UPDATE_STATUS_ERROR_CODE_PARTIAL_STARTUP&min_channel_version=nightly%252F59&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-11-13&trim=1&use_submission_date=0
with the number of submissions growing rapidly on 2018-01-11.

The regression range appears to be
> https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=4db166f0442dddc5b9011c722d7499501fedf283&tochange=e61c4485494ec4823da22217f665d73858c57e35
because it's OK in the 20180110221942 mac nightly, and fails in 20180111100722. It's non-trivial to reproduce this unless you have access to the update server and can set up a test channel that ensures a partial update (or apply MAR files manually).

Nothing is popping out to me from that set of changes. The two candidates I initially suspected
Bug 1418425 - Remove bz2 and use the zlib crc32 function instead
Bug 1401995 - Migrate funsize to python3
are too late and early respectively. 

Simon, were there any other funsize changes which might fit ? Perhaps we can find something in the logs for generating the funsize images.
Flags: needinfo?(sfraser)
This is probably me. I saw that during the dev phase when I was trying to replace the bz2 crc with zlib's crc32. This is why we haven't made that change yet (we are just using the crc table from bz2: https://hg.mozilla.org/mozilla-central/rev/2e95ebdcae6d).

However, I did update bsdiff, as I didn't see any issue in CI.
https://hg.mozilla.org/mozilla-central/rev/47fbcaf5ad05
Maybe this is the issue.
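For reference, a small sketch (Python 3, standard library only) of why the two checksums can't be swapped transparently: bzip2 and zlib both use CRC-32 with the same generator polynomial, but bzip2's variant processes bits MSB-first (no reflection), so the two functions return different values for the same input, and a producer/consumer pair that mixes them reports exactly the kind of mismatch seen in comment 0:

import zlib

def bzip2_crc32(data):
    # CRC-32/BZIP2: same polynomial as zlib.crc32, but not bit-reflected.
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc ^ 0xFFFFFFFF

sample = b"123456789"
print(hex(bzip2_crc32(sample)))  # bzip2-style CRC
print(hex(zlib.crc32(sample)))   # zlib-style CRC, a different value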
(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #0)

> 
> Nothing is popping out to me from that set of changes. The two candidates I
> initially suspected
> Bug 1418425 - Remove bz2 and use the zlib crc32 function instead
> Bug 1401995 - Migrate funsize to python3
> are too late and early respectively. 
> 
> Simon, were there any other funsize changes which might fit ? Perhaps we can
> find something in the logs for generating the funsize images.

The async one is plausible if there are odd side effects from running them in parallel, but if bsdiff changed then that's a strong candidate, as that's the tool that does the grunt work.
Flags: needinfo?(sfraser)
It does look like it is bsdiff. With that change first appearing in 20180112220334, I tried these (partial) update paths:

20180111100722 --> 20180111220102 works
20180111100722 --> 20180112100121 does too
20180111100722 --> 20180112220334 does as well, unexpectedly with bsdiff modified now
20180111100722 --> 20180113100255 doesn't

IIRC we build mbsdiff as part of nightlies and upload it to
 https://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/mar-tools/linux64/mbsdiff
We pull that back into funsize to generate the partial, and there's a 4-hour max-age header on that, so it's plausible that any change in the crc function would only show up one nightly later.
Hitting this on 58.0 update testing: https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-release-win32/release-mozilla-release_firefox_win32_update_verify-bm72-build1-build274.txt.gz

as well as on 59.0b1 again.

Unclear how, since it was assumed the backout was in place prior to the start of the release.
We needed to back out, then have a nightly run to build a new copy of mbsdiff. It looks like the next nightly was having intermittent signing errors, so it may not have uploaded the right files yet. We did ask cloudops to replace mbsdiff on ftp.m.o, which they did, but something may be caching it.
s3://tc-gp-private-1d-us-east-1/releng/mbsdiff-cache/ is the new hypothesis, confidence 90%. 

I added S3 caching of mbsdiff artifacts to speed up funsize tasks, and of course the source and destination files are the same, so it's not even running the new, fixed mbsdiff; it's just using the cached copy of the results.

Short term: Will ask taskcluster team to purge the cache, so we don't have to wait for the 24h expiry period. 

Longer term: I will improve the caching so that it takes account of tooling differences.
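Sketching that longer-term idea: include a digest of the mbsdiff binary itself in the cache key, so a rebuilt tool can never be served stale patch results. The names and layout below are hypothetical, not the actual funsize code:

import hashlib

def partial_cache_key(mbsdiff_path, src_path, dst_path):
    # Hypothetical S3 key for a cached patch result: it covers the source
    # file, the destination file, *and* the mbsdiff binary, so replacing
    # mbsdiff invalidates old entries instead of silently reusing them.
    h = hashlib.sha256()
    for path in (mbsdiff_path, src_path, dst_path):
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
    return "releng/mbsdiff-cache/%s" % h.hexdigest()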
I've just added https://github.com/mozilla-releng/releasewarrior-2.0/wiki/Purging-the-Partials-Cache as a note on how to clear the partials artifact cache.
We had green update verify jobs in 59.0b1 build10 \o/. Telemetry for nightly isn't conclusive yet as it only goes up to Sunday builds.
I think we can close out this specific bug that was blocking releases. If there is a longer-term fix, perhaps we should file a follow-up?
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
I will report follow-up bugs.
Just one question: Johan told me that update verification is going to be part of regular CI in the next couple of months. Correct?
I just tried to update from the 1/16/2018 build to today's Nightly and I'm getting the 'Up to date' message again.
Reopening due to failure to update again today.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
That may be related to bug 1431110. It would be helpful if you could make sure app.update.log is set to true, and report the https://aus5.mozilla.org URL being used (look in the Browser Console).
Thanks Nick - this is the spew from app.update.log:

1516231454012	Toolkit.Telemetry	WARN	GCTelemetry::Number of properties exceeded in the GC telemetry data ping
AUS:SVC Checker: checkForUpdates, force: true
AUS:SVC Checker:getUpdateURL - update URL: https://aus5.mozilla.org/update/6/Firefox/59.0a1/20180116220110/WINNT_x86_64-msvc-x64/en-US/nightly/Windows_NT%2010.0.0.0.16299.192%20(x64)(noBug1296630v1)(nowebsense)/ISET:SSE4_2,MEM:8096/default/default/update.xml?force=1
AUS:SVC Checker:checkForUpdates - sending request to: https://aus5.mozilla.org/update/6/Firefox/59.0a1/20180116220110/WINNT_x86_64-msvc-x64/en-US/nightly/Windows_NT%2010.0.0.0.16299.192%20(x64)(noBug1296630v1)(nowebsense)/ISET:SSE4_2,MEM:8096/default/default/update.xml?force=1
AUS:SVC Checker:onLoad - request completed downloading document
AUS:SVC Checker:onLoad - number of updates available: 0
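For reference, a quick way to see what the update server is returning for that URL is to fetch it directly and count the <update> elements, mirroring the "number of updates available" line above. An illustrative Python snippet (not part of any Mozilla tooling), using the URL copied from the log:

from urllib.request import urlopen
from xml.etree import ElementTree

# URL copied verbatim from the AUS:SVC log lines above.
url = ("https://aus5.mozilla.org/update/6/Firefox/59.0a1/20180116220110/"
       "WINNT_x86_64-msvc-x64/en-US/nightly/"
       "Windows_NT%2010.0.0.0.16299.192%20(x64)(noBug1296630v1)(nowebsense)/"
       "ISET:SSE4_2,MEM:8096/default/default/update.xml?force=1")

with urlopen(url) as response:
    updates = ElementTree.parse(response).findall("update")
print("number of updates available:", len(updates))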
OK, it's definitely bug 1431110 preventing data from getting into the update server. The latest it knows about for win64 is the build you're using; same for win32. Linux & Mac are up to date.
Closing this again. The unrelated Windows update issue should also be fixed now, per the other bug.
Status: REOPENED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Blocks: 1443036
No longer blocks: 1443036