esr102 update-verify failing due to stale responses from balrog
Categories
(Release Engineering :: Release Automation: Updates, defect, P2)
Tracking
(Not tracked)
People
(Reporter: jcristau, Unassigned)
References
Details
Attachments
(1 file)
7.29 KB,
application/json
|
Details |
We initially didn't have the rules properly set up on esr-localtest-next, but even after we did, some UV tasks kept failing, e.g. https://firefoxci.taskcluster-artifacts.net/duMtwYDiQ2-oRrNAPymUnQ/2/public/logs/live_backing.log
[task 2022-06-21T08:53:06.429Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1
[task 2022-06-21T08:53:06.429Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '17636', 'Rule-Data-Version': '23', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Mon, 20 Jun 2022 20:49:58 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '379', 'Age': '43388', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}
[...]
[task 2022-06-21T08:56:05.462Z] Using https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1
[task 2022-06-21T08:56:05.464Z] Retrieving 'https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1' from cache...
[task 2022-06-21T08:56:05.468Z] Got this response:
[task 2022-06-21T08:56:05.468Z] <?xml version="1.0"?>
[task 2022-06-21T08:56:05.468Z] <updates>
[task 2022-06-21T08:56:05.468Z] <update appVersion="91.11.0" buildID="20220617193124" detailsURL="https://www.mozilla.org/fr/firefox/91.11.0/releasenotes/" displayVersion="91.11.0esr" type="minor">
[task 2022-06-21T08:56:05.468Z] <patch type="complete" URL="https://archive.mozilla.org/pub/firefox/candidates/91.11.0esr-candidates/build1/update/linux-x86_64/fr/firefox-91.11.0esr.complete.mar" hashFunction="sha512" hashValue="c52581d7e94b78f6620cd150b5818a9754d1db888f23e6d5fa7c7f0f887f7fe23dc7c7e11965860a62f246a592e92659e023aa4e1ebcc684fec5f1075ea2cd9c" size="62097260"/>
[task 2022-06-21T08:56:05.468Z] </update>
[task 2022-06-21T08:56:05.471Z] </updates>
[task 2022-06-21T08:56:05.471Z]
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: expected buildID 20220620182119 does not match actual 20220617193124
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: [68.5.0 fr complete] download_mars returned non-zero exit code: 1
Same request locally gets the correct response:
$ curl -i https://aus5.mozilla.org/update/3/https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1
HTTP/2 200
server: nginx/1.20.1
date: Tue, 21 Jun 2022 09:29:55 GMT
content-type: text/xml; charset=utf-8
content-length: 569
vary: Accept-Encoding
rule-id: 17782
rule-data-version: 3
strict-transport-security: max-age=31536000;
x-content-type-options: nosniff
content-security-policy: default-src 'none'; frame-ancestors 'none'
x-proxy-cache-status: MISS
via: 1.1 google
cache-control: public,max-age=90
alt-svc: clear
<?xml version="1.0"?>
<updates>
<update appVersion="102.0" buildID="20220620182119" detailsURL="https://www.mozilla.org/fr/firefox/102.0/releasenotes/" displayVersion="102.0esr" type="minor">
<patch type="complete" URL="https://archive.mozilla.org/pub/firefox/candidates/102.0esr-candidates/build1/update/linux-x86_64/fr/firefox-102.0esr.complete.mar" hashFunction="sha512" hashValue="b8c2a3ac8ab84a3b3d40b5b572c83f64858aeafb1fe133785e371e68c8949f41dcff80903aa4401940873d39b1fda9e3861386eb1730444daa935162a7b856cc" size="62063823"/>
</update>
</updates>
Comment 1•2 years ago
|
||
:dlactin noted:
We do have 'Serve while stale' enabled and set to 1 day on the Cloud CDN for Balrog: https://cloud.google.com/cdn/docs/serving-stale-content
So we could see old content being served if the origin is having issues. Just going to see if anything lines up in the logs.
Reporter | ||
Comment 2•2 years ago
|
||
Did you find evidence of origin issues when this happened? (Nothing jumped out at me in grafana when I was looking at this, but I can easily have missed something...)
Comment 3•2 years ago
|
||
Comment 4•2 years ago
|
||
I was able to track down the log entries in BQ and they match the pattern that is described here: https://cloud.google.com/cdn/docs/serving-stale-content
- Content requested happens to already be cached, we have it configured to serve stale for up to 1 day past the age in the headers. So the stale content is served.
- A new revalidation request is made to the origin server with the
Cloud-CDN-Google
user agent to revalidate cache.
I misspoke when I said the old content would be served only in the case of errors:
Serving stale content avoids latency when the cache gets a request for content that has just expired. Rather than having to wait for a synchronous revalidation to the backend, the cache serves the stale content that has just expired, while triggering a separate revalidation.
I have attached the request logs for one of the entries.
Reporter | ||
Comment 5•2 years ago
•
|
||
This happened again, this time with beta (103.0b5): https://treeherder.mozilla.org/jobs?repo=mozilla-beta&revision=ff6aefbc9eb1d58bfcc3dcdddeafa326b4d5da06&searchStr=release-update-verify
https://firefoxci.taskcluster-artifacts.net/Lt53GafMQFGfg3z21gscQw/0/public/logs/live_backing.log updated the firefox-beta-localtest rule:
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Balrog request to https://aus4-admin.mozilla.org/api/rules/firefox-beta-localtest
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Data sent: {"mapping": "Firefox-103.0b5-build1", "data_version": 1017}
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Using cached token
2022-07-05 21:13:47,067 - urllib3.connectionpool - DEBUG - https://aus4-admin.mozilla.org:443 "POST /api/rules/firefox-beta-localtest HTTP/1.1" 200 26
2022-07-05 21:13:47,068 - balrogclient.api - DEBUG - REQUEST STATS: {"timestamp": 1657055627.068059, "method": "POST", "url": "https://aus4-admin.mozilla.org/api/rules/firefox-beta-localtest", "status_code": 200, "elapsed_secs": 0.17641258239746094}
Then a lot of the update-verify tasks failed, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/QsLjquyJTH6cfwjZPntHIA:
run 0:
[task 2022-07-05T22:08:32.600Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/98.0/20220304153049/Linux_x86-gcc3/en-CA/beta-localtest/default/default/default/update.xml?force=1
[task 2022-07-05T22:08:32.600Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:29:51 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '13121', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}
run 1:
[task 2022-07-05T23:01:57.838Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/99.0/20220315185755/Linux_x86-gcc3/en-CA/beta-localtest/default/default/default/update.xml?force=1
[task 2022-07-05T23:01:57.838Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:27:19 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '16478', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}
Both responses were cached between 18:27 and 18:29 UTC.
Reporter | ||
Comment 6•2 years ago
|
||
Looks like a staging release on try (https://treeherder.mozilla.org/jobs?repo=try&revision=d3bab0305e441ec1e6408c0775bc605517941478&searchStr=uv) explains the timing of the cached responses.
Comment 7•2 years ago
|
||
Checking in since this is a P1 with no assignee: What's the plan? Do we expect bug 1778728 to resolve most issues?
Reporter | ||
Comment 8•2 years ago
|
||
Probably lower priority now.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 11•1 year ago
|
||
I'm hitting this on stage very frequently, which makes me wonder if it has something to do with how quickly pods are cycled in and out. If individual pods don't expire caches properly I could easily see this being a symptom.
Reporter | ||
Comment 12•1 year ago
|
||
The "serve while stale" is reduced from 1 day to 1 minute now, which should make this go away.
https://github.com/mozilla-services/cloudops-infra/pull/5044
Description
•