Bug 1396337 (Closed): opened 8 years ago, closed 2 years ago

CDN Max Age causes frequent sha512 mismatches on firefox nightly

Categories: Release Engineering :: Release Automation, defect, P1

Tracking: Not tracked

Status: RESOLVED WORKSFORME

People: Reporter: graham; Assignee: Unassigned

Details: Whiteboard: [releaseduty]

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Build ID: 20170902100317

Steps to reproduce:

1. Fetch checksums:

    nixos$ curl https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-57.0a1.en-US.linux-x86_64.checksums | grep firefox-57.0a1.en-US.linux-x86_64.tar.bz2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100  6737  100  6737    0     0   6737      0  0:00:01 --:--:--  0:00:01 26734
    a36165adeb097c09194e4ba46c8c94a65aabe747dad991b0131539bdc73dad86efadf242d0d708e27d0a9275b3da19996e795fc694363af4be7326c94de19f79 sha512 63398393 firefox-57.0a1.en-US.linux-x86_64.tar.bz2

2. Fetch nightly and verify:

    nixos$ curl -O https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-57.0a1.en-US.linux-x86_64.tar.bz2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 60.4M  100 60.4M    0     0  8845k      0  0:00:07  0:00:07 --:--:-- 8569k
    nixos$ sha512sum firefox-57.0a1.en-US.linux-x86_64.tar.bz2
    4ec4e8c2c4b05eb7db002139229922d9c5a2590c33d114505632d22c5e0fdbdd0ff58ef36739d1debfee9cf007531762205d78208d7d11a175421057608a8ab2  firefox-57.0a1.en-US.linux-x86_64.tar.bz2

Oops! Mismatch!
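The manual check above can be scripted. That check is: grep the .checksums file, run sha512sum, and compare the two by eye. Below is a minimal Python sketch. It assumes only the line layout visible in the grep output (`<digest> <algorithm> <size> <filename>`). The function names are hypothetical and not part of any Mozilla or nixpkgs-mozilla tool. The sketch compares the advertised size (63398393 bytes here) first, because a size difference alone shows the download is not the build the checksums describe.

```python
import hashlib
from pathlib import Path
from typing import Optional, Tuple


def advertised_entry(checksums_text: str, filename: str,
                     algorithm: str = "sha512") -> Optional[Tuple[str, int]]:
    """Find (digest, size) for `filename` in a .checksums file whose lines
    look like: <digest> <algorithm> <size> <filename> (assumed layout)."""
    for line in checksums_text.splitlines():
        parts = line.split()
        if len(parts) != 4 or parts[1] != algorithm:
            continue
        # Compare basenames in case the listed name carries a directory prefix.
        if parts[3].rsplit("/", 1)[-1] == filename:
            return parts[0], int(parts[2])
    return None


def file_digest(path: Path, algorithm: str = "sha512") -> str:
    """Hash in 1 MiB chunks so a 60 MB tarball is never held in memory at once."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify(path: Path, checksums_text: str) -> bool:
    """True only if both the size and the sha512 match the advertised entry."""
    entry = advertised_entry(checksums_text, path.name)
    if entry is None:
        raise KeyError(f"{path.name} is not listed in the checksums file")
    digest, size = entry
    # A size mismatch already proves a different build, so skip hashing.
    if path.stat().st_size != size:
        return False
    return file_digest(path) == digest
```

A CI job or overlay updater could call `verify()` and refuse the file, or retry later, instead of failing on a confusing hash error.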
Looks like the max age on the CDN is pretty long:

    nixos$ curl -I https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-57.0a1.en-US.linux-x86_64.tar.bz2
    HTTP/1.1 200 OK
    Content-Type: application/x-tar
    Content-Length: 63402604
    Connection: keep-alive
    Date: Sat, 02 Sep 2017 14:51:41 GMT
    x-amz-replication-status: COMPLETED
    Last-Modified: Sat, 02 Sep 2017 11:53:42 GMT
    ETag: "a0d5208ddd4a8d799b658ce9503d8d52"
    Cache-Control: public, max-age=14400
    x-amz-version-id: krhR3XJpTsGJplsrA9s5yh9p6UXspUUB
    Accept-Ranges: bytes
    Server: AmazonS3
    Age: 10176
    X-Cache: Hit from cloudfront
    Via: 1.1 044470188efe7aea5c8537e1416e3d92.cloudfront.net (CloudFront)
    X-Amz-Cf-Id: qF0H3dksgrWaeamsv_FrRbP3woVare-jUzHrvs98WGu5xp94nuQwGQ==

Actual results:

The sha512 of https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-57.0a1.en-US.linux-x86_64.tar.bz2 didn't match the advertised checksum in https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-central/firefox-57.0a1.en-US.linux-x86_64.checksums.

Expected results:

The checksums should match. Perhaps reducing the max age, or invalidating the CDN contents when the file is replaced, would help.

I found this because the firefox-overlay in https://github.com/mozilla/nixpkgs-mozilla/ requires matching sha512s.
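The two headers that matter here are `Cache-Control: public, max-age=14400` and `Age: 10176`. CloudFront had already held this copy for 10176 s, so it would keep serving it as fresh for another 14400 - 10176 = 4224 s (about 70 minutes). If the .checksums file and the tarball entered the cache at different times, they can describe different builds for up to max-age seconds.

A simplified Python sketch of that arithmetic follows. It loosely follows HTTP caching rules: `s-maxage` is preferred for shared caches, then `max-age`. It ignores `Expires`, heuristic freshness and clock skew, and it assumes exact-case header names. The helper names are hypothetical.

```python
from typing import Mapping, Optional


def max_age(cache_control: str) -> Optional[int]:
    """Extract the freshness lifetime from a Cache-Control value such as
    'public, max-age=14400'. s-maxage wins when present, since it applies
    to shared caches like a CDN."""
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value.strip('"')
    for key in ("s-maxage", "max-age"):
        if directives.get(key, "").isdigit():
            return int(directives[key])
    return None


def remaining_freshness(headers: Mapping[str, str]) -> Optional[int]:
    """Seconds the cached copy stays fresh: lifetime minus the Age header.
    Returns None when no max-age/s-maxage is present."""
    lifetime = max_age(headers.get("Cache-Control", ""))
    if lifetime is None:
        return None
    age = int(headers.get("Age", "0"))
    return max(lifetime - age, 0)


headers = {"Cache-Control": "public, max-age=14400", "Age": "10176"}
print(remaining_freshness(headers))  # 4224
```

With the 300 s max-age discussed later in this bug, the same calculation caps the stale window at 5 minutes.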
We've also seen reports about this on Reddit and other social media, e.g. https://www.reddit.com/r/firefox/comments/6x7ccl/well_that_was_terrifying_i_just_downloaded_an/

Jeremy, any ideas? In bug 1344572 we reduced max-age to 14400. Can we reduce it even further, to something short like 15 minutes? Could we flush the CDN caches when we update files inside the latest directories (related to bug 1393990)?
Flags: needinfo?(oremj)
Feel free to reduce the max-age on nightlies, or anything that replaces files instead of creating new ones. Somewhere in the 5-15 minute range should be fine. Let me know if you make the change and I can check to see how it affects cache hit rate.
Flags: needinfo?(oremj)
Mihai, what would be involved in having different cache settings per destination in beetmover?
Flags: needinfo?(mtabara)
(In reply to Chris AtLee [:catlee] from comment #3)
> Mihai, what would be involved in having different cache settings per
> destination in beetmover?

I'm not sure what you mean here. If you mean destination as in a specific path for each artifact, like [1], then:

  • short term: we can probably hack in setting the cache header depending on the prefix of `path` here [0]. If it is under `nightly`, set one value; otherwise, set another.
  • long term: see below.

If you mean different buckets in general (~nightly vs ~candidates), we could slightly rewrite the beetmover templates to be something like:

    locations:
      destination: A
      cache: B  (a more generic value coming from constants.py)

and change the logic within beetmover accordingly. This would give much better granularity and a generic way of setting the cache control.

Does this make sense, or did I misunderstand your question?

[0]: https://github.com/mozilla-releng/beetmoverscript/blob/master/beetmoverscript/script.py#L247
[1]: https://github.com/mozilla-releng/beetmoverscript/blob/master/beetmoverscript/templates/firefox_nightly.yml#L15
Flags: needinfo?(mtabara)
The end result I'd like to get to is for nightlies to have a short max age, and candidates/releases would have longer. I like your suggestion of modifying the templates to allow each template to override the default cache control settings.
It's .../nightly/latest-mozilla-central/... and friends that need a shorter max age to fix this bug. I think .../nightly/YYYY/MM/... can keep a long cache, unless there is some use case where we might overwrite those locations that I've forgotten.
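Here is a sketch of the per-destination policy discussed above, written in Python for illustration only. It is not beetmoverscript's code. The regexes and the `cache_control_for` name are hypothetical. The TTLs are taken from this thread:

- 300 s for overwritten latest-* paths, within the 5-15 minute range suggested above.
- The existing 14400 s default for everything else.
- 43200 s for dated nightly paths, which are never overwritten.

```python
import re

# Hypothetical policy table, not beetmoverscript's real configuration.
# Paths that get overwritten in place (latest-*) get a short TTL.
LATEST_NIGHTLY = re.compile(r"(^|/)nightly/latest-[^/]+/")
# Dated nightly directories (nightly/YYYY/MM/...) are written once.
DATED_NIGHTLY = re.compile(r"(^|/)nightly/\d{4}/\d{2}/")


def cache_control_for(path: str) -> str:
    """Pick a Cache-Control header value for an upload destination."""
    if LATEST_NIGHTLY.search(path):
        return "public, max-age=300"    # 5 minutes: stale window stays small
    if DATED_NIGHTLY.search(path):
        return "public, max-age=43200"  # 12 hours: content never changes
    return "public, max-age=14400"      # 4 hours: the existing default
```

The returned string would be stored as the object's Cache-Control metadata at upload time. For example, boto3's S3 `put_object` accepts a `CacheControl` argument. Per-template overrides would replace the regex table with values read from each template.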
Status: UNCONFIRMED → NEW
Component: General → Release Automation
Ever confirmed: true
Priority: -- → P2
QA Contact: catlee
Whiteboard: [releaseduty]
Priority: P2 → P1

As I said in https://bugzilla.mozilla.org/show_bug.cgi?id=1483637, the proper fix is to invalidate the cache when a new build is pushed. This would even allow setting longer caching periods (e.g. back to 24 h), since cache expiration would then be controlled manually anyway.

The lambda function created in bug 1393990 might be worth another look here, e.g. call it from beetmover.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #9)

> The lambda function created in bug 1393990 might be worth another look here, e.g. call it from beetmover

This is a great idea. Maybe we should chain that behavior to the existing beetmover-checksums job, which runs at the end of the promotion graph. Ah wait, we don't have that in nightly graphs yet. Until shippable builds are implemented, it sounds like we need to backport the "post-beetmover-dummy" tasks or similar. Or maybe we should just create a separate task. Either way, I'd definitely avoid calling that lambda function 500 times for all the locales in the graph.
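If the purge runs once per graph instead of once per locale, all uploaded paths can be folded into a single request. The sketch below assumes the CDN is CloudFront, as the `X-Cache` and `Via` headers in the report indicate. It builds the `InvalidationBatch` payload of CloudFront's CreateInvalidation API directly, instead of going through the lambda function from bug 1393990, whose interface isn't described here. The function name and the "latest-*" rule are hypothetical. Only latest-* directories are purged, each as one trailing-wildcard path, because dated nightly uploads land in new locations with nothing stale to evict.

```python
from typing import Iterable, Optional


def latest_invalidation_batch(uploaded: Iterable[str],
                              caller_reference: str) -> Optional[dict]:
    """Build one CloudFront InvalidationBatch for a whole nightly graph.

    Every upload under a latest-* directory collapses into a single
    "<dir>/*" wildcard path. Other uploads are skipped because they go to
    new, never-cached locations. Returns None if nothing needs purging."""
    dirs = set()
    for key in uploaded:
        path = "/" + key.lstrip("/")  # CloudFront paths must start with "/"
        parent = path.rpartition("/")[0]
        if any(seg.startswith("latest-") for seg in parent.split("/")):
            dirs.add(parent + "/*")
    if not dirs:
        return None
    items = sorted(dirs)
    return {
        "Paths": {"Quantity": len(items), "Items": items},
        # Should be unique per distinct request (e.g. the build ID);
        # CloudFront uses it to recognise retries of the same invalidation.
        "CallerReference": caller_reference,
    }
```

The result could then be passed to `boto3.client("cloudfront").create_invalidation(DistributionId=..., InvalidationBatch=batch)`, where the distribution ID is deployment-specific. A purge still leaves a short window between upload and invalidation completing, so a short max-age remains useful alongside it.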

Quick follow-up here: I had a conversation with CloudOps earlier today, and the conclusions were:

  • we could call the lambda programmatically if we wanted to
  • we would most likely hit API limits if we called it from each beetmover job, rather than once at the end of the release graph
  • we still need to check whether invalidating recursive items has any costs (money, time, or otherwise)
  • we could call the CDN purge after the beetmover jobs have transferred the files, and then once more after beetmover-checksums, but that still has issues (regex tweaks for just the latter, a remaining time window in which a mismatch can occur, etc.)
  • we currently set a cache-control max-age of 4 h for all artifacts that beetmover uploads. We could lower that to 1 h or 5 min for nightly artifacts and leave it as is for the rest (or not set it at all and fall back to 24 h or whatever the default is)

Bottom line: we should explore setting a smaller max-age for nightly artifacts, and ideally solve this without the lambda CDN cache-purge call.

As far as I can tell, max-age is 300 s on files in latest-mozilla-central, and 43200 s on dated files like https://archive.mozilla.org/pub/firefox/nightly/2023/05/2023-05-31-04-17-23-mozilla-central/firefox-115.0a1.en-US.linux-x86_64.tar.bz2

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME