Closed Bug 896299 Opened 11 years ago Closed 11 years ago

Highwinds returns 'HTTP/1.1 408 Request Timeout' during RelEng release automation

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Details

[We've spoken about this on IRC a couple of times; time to have it in a bug]

Since we switched from Akamai to Highwinds as our CDN in North America we routinely get 408 timeout errors when trying to verify bouncer links. For example, see the end of http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/23.0b7-candidates/build1/logs/release-mozilla-beta-final_verification-bm66-build1-build6.txt.gz:

Fri Jul 19 03:54:29 PDT 2013:                  mar url: http://download.mozilla.org/?product=firefox-23.0b7-partial-23.0b5&os=osx&lang=nb-NO&force=1

Fri Jul 19 03:54:30 PDT 2013:      The HTTP headers were:
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 302 Found
Fri Jul 19 03:54:30 PDT 2013:          Server: Apache
Fri Jul 19 03:54:30 PDT 2013:          X-Backend-Server: bouncer3.webapp.phx1.mozilla.com
Fri Jul 19 03:54:30 PDT 2013:          Cache-Control: max-age=15
Fri Jul 19 03:54:30 PDT 2013:          Content-Type: text/html; charset=UTF-8
Fri Jul 19 03:54:30 PDT 2013:          Date: Fri, 19 Jul 2013 10:21:47 GMT
Fri Jul 19 03:54:30 PDT 2013:          Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/23.0b7/update/mac/nb-NO/firefox-23.0b5-23.0b7.partial.mar
Fri Jul 19 03:54:30 PDT 2013:          Transfer-Encoding: chunked
Fri Jul 19 03:54:30 PDT 2013:          Connection: Keep-Alive
Fri Jul 19 03:54:30 PDT 2013:          Set-Cookie: dmo=10.8.81.216.1374229307467326; path=/; expires=Sat, 19-Jul-14 10:21:47 GMT
Fri Jul 19 03:54:30 PDT 2013:          X-Cache-Info: caching
Fri Jul 19 03:54:30 PDT 2013:          
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 408 Request Timeout
Fri Jul 19 03:54:30 PDT 2013:          Date: Fri, 19 Jul 2013 10:22:01 GMT
Fri Jul 19 03:54:30 PDT 2013:          Connection: close
Fri Jul 19 03:54:30 PDT 2013:          Accept-Ranges: bytes
Fri Jul 19 03:54:30 PDT 2013:          Cache-Control: no-cache
Fri Jul 19 03:54:30 PDT 2013:          Content-Length: 0
Fri Jul 19 03:54:30 PDT 2013:          Content-Type: application/octet-stream
Fri Jul 19 03:54:30 PDT 2013:          X-HW-Deprecated: 1374229321.pj007s2
Fri Jul 19 03:54:30 PDT 2013:          X-HW: 1374229306.dop001.sj2.t,1374229321.cds007.sj2.p

We're doing HEAD requests for files the CDN won't have seen before, and many of these requests are made in parallel. If we rerun the job as little as 30 minutes later, we don't get errors.

Is Highwinds converting the HEAD into a GET, or are they just much more aggressive about timeouts than Akamai?
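For anyone digging through these logs by hand, a rough sketch of pulling the HTTP status codes out of a final_verification log so 408s can be flagged (assumptions: the log format shown above; the function name is mine, not part of the release tooling):

```python
import re

def find_status_codes(log_text):
    """Scan final_verification log output for HTTP status lines
    (e.g. 'HTTP/1.1 302 Found') and return the numeric codes."""
    return [int(s) for s in re.findall(r"HTTP/1\.\d (\d{3})", log_text)]

sample = """\
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 302 Found
Fri Jul 19 03:54:30 PDT 2013:          Server: Apache
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 408 Request Timeout
"""
codes = find_status_codes(sample)
print(codes)         # [302, 408]
print(408 in codes)  # True -> this mar URL hit the CDN timeout
```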
I have a ticket open with Highwinds on this already (SUP-174774 for reference). I will update here when I have more details.
I've made a small tweak to their timeout config, and am inquiring about larger changes.

Based on a cursory reading of the documentation for Akamai and Highwinds, it appears that this is how they work:

Akamai makes the initial request and 3 retries, with 5-second timeouts each

Highwinds defaults to the initial request plus 1 retry, 15-second timeouts each


I've changed Highwinds to do the initial request plus 2 retries, 10-second timeouts each... split the difference. I'm hesitant to go further unilaterally without input from them.
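To put numbers on the difference, a quick back-of-the-envelope calculation (my arithmetic based on the settings above, not anything from the CDN docs) of the worst-case time an edge spends waiting on the origin before giving up under each policy:

```python
def worst_case_wait(attempts, timeout_s):
    """Total wait if every attempt (initial request + retries) times out."""
    return attempts * timeout_s

# Akamai: initial request + 3 retries, 5-second timeouts
print(worst_case_wait(1 + 3, 5))    # 20 seconds
# Highwinds default: initial request + 1 retry, 15-second timeouts
print(worst_case_wait(1 + 1, 15))   # 30 seconds
# New Highwinds config: initial request + 2 retries, 10-second timeouts
print(worst_case_wait(1 + 2, 10))   # 30 seconds
```

The total worst-case wait barely changes, but Akamai takes more, shorter shots at the origin, which helps when failures are transient.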

I'm also asking about "Origin Shielding", which is their way of doing a mid-tier cache... essentially their edge nodes all query one of their POPs for our content, which then queries us. I believe right now we are not using anything like this. Akamai has a mid-tier cache, but I don't know how it works specifically... it has the same basic goal, but I suspect a different implementation.
Highwinds confirmed my reading is correct. We have also made some additional changes beyond what I described in comment 2:

1) Origin Shielding is enabled. This means all requests to the Origin (us) will come from a single POP, and that POP will then serve the others. This reduces their load on us, improves overall cache hit rate, and thus reduces the likelihood of timeouts reaching the end user.

2) The "negative TTL" for timeout errors has been reduced from 60 seconds to 20 seconds. Any request that does result in a 408 will be remembered for a much shorter time window, so any transient issues are more quickly overcome.
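My understanding of how that negative TTL behaves, as a simplified sketch (the cache class and injectable clock are hypothetical illustrations, not Highwinds' actual implementation):

```python
class NegativeCache:
    """Remember an origin error for `ttl_s` seconds; within that window,
    serve the cached error instead of retrying the origin."""
    def __init__(self, ttl_s, clock):
        self.ttl_s = ttl_s
        self.clock = clock        # injectable time source, for testing
        self.error_until = None

    def record_error(self):
        # e.g. the origin returned/produced a 408 at the current time
        self.error_until = self.clock() + self.ttl_s

    def should_retry_origin(self):
        return self.error_until is None or self.clock() >= self.error_until

now = [0]
cache = NegativeCache(ttl_s=20, clock=lambda: now[0])
cache.record_error()                 # timeout recorded at t=0
now[0] = 10
print(cache.should_retry_origin())   # False: still inside the 20 s window
now[0] = 25
print(cache.should_retry_origin())   # True: window expired, origin retried
```

With the old 60-second TTL, the request at t=25 would still have been served the cached error.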


With these 2 changes plus my retry/TTL changes from comment 2, I believe we should be in much better shape. If errors persist, please let us know how frequent they are compared to before, so we can tell whether we're on the right track.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard