Closed Bug 1775226 Opened 2 years ago Closed 1 year ago

esr102 update-verify failing due to stale responses from balrog

Categories

(Release Engineering :: Release Automation: Updates, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jcristau, Unassigned)

References

Details

Attachments

(1 file)

We initially didn't have the rules properly set up on esr-localtest-next, but even after we did, some update-verify (UV) tasks kept failing, e.g. https://firefoxci.taskcluster-artifacts.net/duMtwYDiQ2-oRrNAPymUnQ/2/public/logs/live_backing.log

[task 2022-06-21T08:53:06.429Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1
[task 2022-06-21T08:53:06.429Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '17636', 'Rule-Data-Version': '23', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Mon, 20 Jun 2022 20:49:58 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '379', 'Age': '43388', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}
[...]
[task 2022-06-21T08:56:05.462Z] Using  https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1
[task 2022-06-21T08:56:05.464Z] Retrieving 'https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1' from cache...
[task 2022-06-21T08:56:05.468Z] Got this response:
[task 2022-06-21T08:56:05.468Z] <?xml version="1.0"?>
[task 2022-06-21T08:56:05.468Z] <updates>
[task 2022-06-21T08:56:05.468Z]     <update appVersion="91.11.0" buildID="20220617193124" detailsURL="https://www.mozilla.org/fr/firefox/91.11.0/releasenotes/" displayVersion="91.11.0esr" type="minor">
[task 2022-06-21T08:56:05.468Z]         <patch type="complete" URL="https://archive.mozilla.org/pub/firefox/candidates/91.11.0esr-candidates/build1/update/linux-x86_64/fr/firefox-91.11.0esr.complete.mar" hashFunction="sha512" hashValue="c52581d7e94b78f6620cd150b5818a9754d1db888f23e6d5fa7c7f0f887f7fe23dc7c7e11965860a62f246a592e92659e023aa4e1ebcc684fec5f1075ea2cd9c" size="62097260"/>
[task 2022-06-21T08:56:05.468Z]     </update>
[task 2022-06-21T08:56:05.471Z] </updates>
[task 2022-06-21T08:56:05.471Z] 
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: expected buildID 20220620182119 does not match actual 20220617193124
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: [68.5.0 fr complete] download_mars returned non-zero exit code: 1
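The failing check itself is simple: update-verify compares the buildID advertised in the served update.xml against the expected one. A minimal sketch of that comparison (a hypothetical helper, not the actual update-verify code):

```python
# Sketch of the buildID check update-verify performs (hypothetical
# helper, not the actual tooling). It parses the served update.xml
# and compares the advertised buildID against the expected one.
import xml.etree.ElementTree as ET

def check_build_id(update_xml: str, expected_build_id: str) -> bool:
    """Return True if the served update.xml advertises the expected buildID."""
    root = ET.fromstring(update_xml)
    update = root.find("update")
    if update is None:
        raise ValueError("no <update> element in response")
    actual = update.get("buildID")
    if actual != expected_build_id:
        print(f"TEST-UNEXPECTED-FAIL: expected buildID "
              f"{expected_build_id} does not match actual {actual}")
        return False
    return True

# The stale response from the log above advertises the old 91.11.0esr build:
stale = """<?xml version="1.0"?>
<updates>
    <update appVersion="91.11.0" buildID="20220617193124" type="minor"/>
</updates>"""
```
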

Same request locally gets the correct response:

$ curl -i 'https://aus5.mozilla.org/update/3/Firefox/68.5.0/20200206211857/Linux_x86_64-gcc3/fr/esr-localtest-next/default/default/default/update.xml?force=1'
HTTP/2 200 
server: nginx/1.20.1
date: Tue, 21 Jun 2022 09:29:55 GMT
content-type: text/xml; charset=utf-8
content-length: 569
vary: Accept-Encoding
rule-id: 17782
rule-data-version: 3
strict-transport-security: max-age=31536000;
x-content-type-options: nosniff
content-security-policy: default-src 'none'; frame-ancestors 'none'
x-proxy-cache-status: MISS
via: 1.1 google
cache-control: public,max-age=90
alt-svc: clear

<?xml version="1.0"?>
<updates>
    <update appVersion="102.0" buildID="20220620182119" detailsURL="https://www.mozilla.org/fr/firefox/102.0/releasenotes/" displayVersion="102.0esr" type="minor">
        <patch type="complete" URL="https://archive.mozilla.org/pub/firefox/candidates/102.0esr-candidates/build1/update/linux-x86_64/fr/firefox-102.0esr.complete.mar" hashFunction="sha512" hashValue="b8c2a3ac8ab84a3b3d40b5b572c83f64858aeafb1fe133785e371e68c8949f41dcff80903aa4401940873d39b1fda9e3861386eb1730444daa935162a7b856cc" size="62063823"/>
    </update>
</updates>

:dlactin noted:

We do have 'Serve while stale' enabled and set to 1 day on the Cloud CDN for Balrog: https://cloud.google.com/cdn/docs/serving-stale-content
So we could see old content being served if the origin is having issues. Just going to see if anything lines up in the logs.

Did you find evidence of origin issues when this happened? (Nothing jumped out at me in grafana when I was looking at this, but I can easily have missed something...)

Flags: needinfo?(dlactin)

I was able to track down the log entries in BQ and they match the pattern that is described here: https://cloud.google.com/cdn/docs/serving-stale-content

  1. The requested content already happens to be cached; we have the CDN configured to serve stale content for up to 1 day past the max-age in the headers, so the stale content is served.
  2. A new revalidation request is made to the origin server with the Cloud-CDN-Google user agent to refresh the cache.

I misspoke when I said the old content would be served only in the case of errors:

Serving stale content avoids latency when the cache gets a request for content that has just expired. Rather than having to wait for a synchronous revalidation to the backend, the cache serves the stale content that has just expired, while triggering a separate revalidation.
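The headers on the failing esr-localtest-next response line up with that behaviour. A rough sketch of the staleness math, using the header values from the log above (this is my interpretation of the serve-while-stale behaviour, not Cloud CDN's actual implementation):

```python
# Rough staleness check from response headers (an interpretation of the
# serve-while-stale behaviour, not Cloud CDN's actual implementation).
import re

def staleness_seconds(headers: dict) -> int:
    """How far past max-age the cached copy is; <= 0 means still fresh."""
    age = int(headers.get("Age", "0"))
    m = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    max_age = int(m.group(1)) if m else 0
    return age - max_age

# Headers from the failing esr-localtest-next response:
headers = {"Age": "43388", "Cache-Control": "public,max-age=90"}
stale_by = staleness_seconds(headers)
assert stale_by == 43298   # ~12 hours past max-age
assert stale_by <= 86400   # within the 1-day serve-while-stale window,
                           # so the CDN served the stale copy
```
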

I have attached the request logs for one of the entries.

Flags: needinfo?(dlactin)

This happened again, this time with beta (103.0b5): https://treeherder.mozilla.org/jobs?repo=mozilla-beta&revision=ff6aefbc9eb1d58bfcc3dcdddeafa326b4d5da06&searchStr=release-update-verify

The task at https://firefoxci.taskcluster-artifacts.net/Lt53GafMQFGfg3z21gscQw/0/public/logs/live_backing.log updated the firefox-beta-localtest rule:

2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Balrog request to https://aus4-admin.mozilla.org/api/rules/firefox-beta-localtest
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Data sent: {"mapping": "Firefox-103.0b5-build1", "data_version": 1017}
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Using cached token
2022-07-05 21:13:47,067 - urllib3.connectionpool - DEBUG - https://aus4-admin.mozilla.org:443 "POST /api/rules/firefox-beta-localtest HTTP/1.1" 200 26
2022-07-05 21:13:47,068 - balrogclient.api - DEBUG - REQUEST STATS: {"timestamp": 1657055627.068059, "method": "POST", "url": "https://aus4-admin.mozilla.org/api/rules/firefox-beta-localtest", "status_code": 200, "elapsed_secs": 0.17641258239746094}

Then a lot of the update-verify tasks failed, e.g. https://firefox-ci-tc.services.mozilla.com/tasks/QsLjquyJTH6cfwjZPntHIA:
run 0:

[task 2022-07-05T22:08:32.600Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/98.0/20220304153049/Linux_x86-gcc3/en-CA/beta-localtest/default/default/default/update.xml?force=1
[task 2022-07-05T22:08:32.600Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:29:51 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '13121', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}

run 1:

[task 2022-07-05T23:01:57.838Z] Finished downloading https://aus5.mozilla.org/update/3/Firefox/99.0/20220315185755/Linux_x86-gcc3/en-CA/beta-localtest/default/default/default/update.xml?force=1
[task 2022-07-05T23:01:57.838Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:27:19 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '16478', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}

Both responses were cached between 18:27 and 18:29 UTC.
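Those cache-fill times fall straight out of the headers: the Date header is when the origin generated the response (i.e. when it entered the cache), and Date + Age is when the stale copy was actually served. A quick sanity check against the two runs above:

```python
# Back out when each response entered the cache and when it was served,
# from the Date and Age headers in the logs above.
from datetime import timedelta
from email.utils import parsedate_to_datetime

def cache_times(date_header: str, age_header: str):
    cached_at = parsedate_to_datetime(date_header)  # origin generated it
    served_at = cached_at + timedelta(seconds=int(age_header))
    return cached_at, served_at

# run 0: Date 18:29:51 GMT, Age 13121s
cached, served = cache_times("Tue, 05 Jul 2022 18:29:51 GMT", "13121")
assert served.strftime("%H:%M:%S") == "22:08:32"  # matches the task timestamp

# run 1: Date 18:27:19 GMT, Age 16478s
cached, served = cache_times("Tue, 05 Jul 2022 18:27:19 GMT", "16478")
assert served.strftime("%H:%M:%S") == "23:01:57"  # matches the task timestamp
```
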

Severity: -- → S2
Priority: -- → P1

Looks like a staging release on try (https://treeherder.mozilla.org/jobs?repo=try&revision=d3bab0305e441ec1e6408c0775bc605517941478&searchStr=uv) explains the timing of the cached responses.

Depends on: 1778728
See Also: → 1779105

Checking in since this is a P1 with no assignee: What's the plan? Do we expect bug 1778728 to resolve most issues?

Flags: needinfo?(jcristau)

Probably lower priority now.

Flags: needinfo?(jcristau)
Priority: P1 → P2
See Also: → 1190779

I'm hitting this on stage very frequently, which makes me wonder if it has something to do with how quickly pods are cycled in and out. If individual pods don't expire caches properly I could easily see this being a symptom.

The "serve while stale" window has been reduced from 1 day to 1 minute, which should make this go away.
https://github.com/mozilla-services/cloudops-infra/pull/5044
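With max-age=90 on the update.xml responses, the fix shrinks the worst-case window in which a stale rule can be served. Back-of-the-envelope, assuming the served age is bounded by max-age plus the serve-while-stale setting:

```python
# Worst-case age of a served response, assuming it is bounded by
# max-age plus the Cloud CDN serve-while-stale window.
MAX_AGE = 90  # seconds, from Cache-Control on the update.xml responses

before = MAX_AGE + 86400  # serve-while-stale was 1 day
after = MAX_AGE + 60      # now 1 minute

assert before == 86490    # ~24 hours of possible staleness
assert after == 150       # 2.5 minutes of possible staleness
```
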

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED