Open Bug 1775226 Opened 5 months ago Updated 4 months ago

esr102 update-verify failing due to stale responses from balrog


(Release Engineering :: Release Automation: Updates, defect, P2)


(Not tracked)


(Reporter: jcristau, Unassigned)




(1 file)

We initially didn't have the rules properly set up on esr-localtest-next, but even after we did, some UV tasks kept failing, e.g.

[task 2022-06-21T08:53:06.429Z] Finished downloading
[task 2022-06-21T08:53:06.429Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '17636', 'Rule-Data-Version': '23', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Mon, 20 Jun 2022 20:49:58 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '379', 'Age': '43388', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}
[task 2022-06-21T08:56:05.462Z] Using
[task 2022-06-21T08:56:05.464Z] Retrieving '' from cache...
[task 2022-06-21T08:56:05.468Z] Got this response:
[task 2022-06-21T08:56:05.468Z] <?xml version="1.0"?>
[task 2022-06-21T08:56:05.468Z] <updates>
[task 2022-06-21T08:56:05.468Z]     <update appVersion="91.11.0" buildID="20220617193124" detailsURL="" displayVersion="91.11.0esr" type="minor">
[task 2022-06-21T08:56:05.468Z]         <patch type="complete" URL="" hashFunction="sha512" hashValue="c52581d7e94b78f6620cd150b5818a9754d1db888f23e6d5fa7c7f0f887f7fe23dc7c7e11965860a62f246a592e92659e023aa4e1ebcc684fec5f1075ea2cd9c" size="62097260"/>
[task 2022-06-21T08:56:05.468Z]     </update>
[task 2022-06-21T08:56:05.471Z] </updates>
[task 2022-06-21T08:56:05.471Z] 
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: expected buildID 20220620182119 does not match actual 20220617193124
[task 2022-06-21T08:56:05.474Z] TEST-UNEXPECTED-FAIL: [68.5.0 fr complete] download_mars returned non-zero exit code: 1

Same request locally gets the correct response:

$ curl -i
HTTP/2 200 
server: nginx/1.20.1
date: Tue, 21 Jun 2022 09:29:55 GMT
content-type: text/xml; charset=utf-8
content-length: 569
vary: Accept-Encoding
rule-id: 17782
rule-data-version: 3
strict-transport-security: max-age=31536000;
x-content-type-options: nosniff
content-security-policy: default-src 'none'; frame-ancestors 'none'
x-proxy-cache-status: MISS
via: 1.1 google
cache-control: public,max-age=90
alt-svc: clear

<?xml version="1.0"?>
    <update appVersion="102.0" buildID="20220620182119" detailsURL="" displayVersion="102.0esr" type="minor">
        <patch type="complete" URL="" hashFunction="sha512" hashValue="b8c2a3ac8ab84a3b3d40b5b572c83f64858aeafb1fe133785e371e68c8949f41dcff80903aa4401940873d39b1fda9e3861386eb1730444daa935162a7b856cc" size="62063823"/>

:dlactin noted:

We do have 'Serve while stale' enabled and set to 1 day on the Cloud CDN for Balrog:
So we could see old content being served if the origin is having issues. Just going to see if anything lines up in the logs.

Did you find evidence of origin issues when this happened? (Nothing jumped out at me in grafana when I was looking at this, but I can easily have missed something...)

Flags: needinfo?(dlactin)

I was able to track down the log entries in BQ and they match the pattern that is described here:

  1. Content requested happens to already be cached, we have it configured to serve stale for up to 1 day past the age in the headers. So the stale content is served.
  2. A new revalidation request is made to the origin server with the Cloud-CDN-Google user agent to revalidate cache.

I misspoke when I said the old content would be served only in the case of errors:

Serving stale content avoids latency when the cache gets a request for content that has just expired. Rather than having to wait for a synchronous revalidation to the backend, the cache serves the stale content that has just expired, while triggering a separate revalidation.

I have attached the request logs for one of the entries.

Flags: needinfo?(dlactin)

This happened again, this time with beta (103.0b5): updated the firefox-beta-localtest rule:

2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Balrog request to
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Data sent: {"mapping": "Firefox-103.0b5-build1", "data_version": 1017}
2022-07-05 21:13:46,891 - balrogclient.api - DEBUG - Using cached token
2022-07-05 21:13:47,067 - urllib3.connectionpool - DEBUG - "POST /api/rules/firefox-beta-localtest HTTP/1.1" 200 26
2022-07-05 21:13:47,068 - balrogclient.api - DEBUG - REQUEST STATS: {"timestamp": 1657055627.068059, "method": "POST", "url": "", "status_code": 200, "elapsed_secs": 0.17641258239746094}

Then a lot of the update-verify tasks failed, e.g.
run 0:

[task 2022-07-05T22:08:32.600Z] Finished downloading
[task 2022-07-05T22:08:32.600Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:29:51 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '13121', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}

run 1:

[task 2022-07-05T23:01:57.838Z] Finished downloading
[task 2022-07-05T23:01:57.838Z] {'Server': 'nginx/1.20.1', 'Vary': 'Accept-Encoding', 'Rule-ID': '25', 'Rule-Data-Version': '1017', 'Strict-Transport-Security': 'max-age=31536000;', 'X-Content-Type-Options': 'nosniff', 'Content-Security-Policy': "default-src 'none'; frame-ancestors 'none'", 'X-Proxy-Cache-Status': 'MISS', 'Content-Encoding': 'gzip', 'Via': '1.1 google', 'Date': 'Tue, 05 Jul 2022 18:27:19 GMT', 'Content-Type': 'text/xml; charset=utf-8', 'Content-Length': '433', 'Age': '16478', 'Cache-Control': 'public,max-age=90', 'Alt-Svc': 'clear'}

Both responses were cached between 18:27 and 18:29 UTC.

Severity: -- → S2
Priority: -- → P1

Looks like a staging release on try ( explains the timing of the cached responses.

Depends on: 1778728
See Also: → 1779105

Checking in since this is a P1 with no assignee: What's the plan? Do we expect bug 1778728 to resolve most issues?

Flags: needinfo?(jcristau)

Probably lower priority now.

Flags: needinfo?(jcristau)
Priority: P1 → P2
You need to log in before you can comment on or make changes to this bug.