Closed Bug 896299 Opened 11 years ago Closed 11 years ago

Highwinds returns 'HTTP/1.1 408 Request Timeout' during RelEng release automation

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Unassigned)

Details

[We've spoken about this on IRC a couple of times; time to have it in a bug]

Since we switched from Akamai to Highwinds as our CDN in North America we routinely get 408 timeout errors when trying to verify bouncer links. For example, see the end of http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/23.0b7-candidates/build1/logs/release-mozilla-beta-final_verification-bm66-build1-build6.txt.gz:

Fri Jul 19 03:54:29 PDT 2013:                  mar url: http://download.mozilla.org/?product=firefox-23.0b7-partial-23.0b5&os=osx&lang=nb-NO&force=1

Fri Jul 19 03:54:30 PDT 2013:      The HTTP headers were:
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 302 Found
Fri Jul 19 03:54:30 PDT 2013:          Server: Apache
Fri Jul 19 03:54:30 PDT 2013:          X-Backend-Server: bouncer3.webapp.phx1.mozilla.com
Fri Jul 19 03:54:30 PDT 2013:          Cache-Control: max-age=15
Fri Jul 19 03:54:30 PDT 2013:          Content-Type: text/html; charset=UTF-8
Fri Jul 19 03:54:30 PDT 2013:          Date: Fri, 19 Jul 2013 10:21:47 GMT
Fri Jul 19 03:54:30 PDT 2013:          Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/23.0b7/update/mac/nb-NO/firefox-23.0b5-23.0b7.partial.mar
Fri Jul 19 03:54:30 PDT 2013:          Transfer-Encoding: chunked
Fri Jul 19 03:54:30 PDT 2013:          Connection: Keep-Alive
Fri Jul 19 03:54:30 PDT 2013:          Set-Cookie: dmo=10.8.81.216.1374229307467326; path=/; expires=Sat, 19-Jul-14 10:21:47 GMT
Fri Jul 19 03:54:30 PDT 2013:          X-Cache-Info: caching
Fri Jul 19 03:54:30 PDT 2013:          
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 408 Request Timeout
Fri Jul 19 03:54:30 PDT 2013:          Date: Fri, 19 Jul 2013 10:22:01 GMT
Fri Jul 19 03:54:30 PDT 2013:          Connection: close
Fri Jul 19 03:54:30 PDT 2013:          Accept-Ranges: bytes
Fri Jul 19 03:54:30 PDT 2013:          Cache-Control: no-cache
Fri Jul 19 03:54:30 PDT 2013:          Content-Length: 0
Fri Jul 19 03:54:30 PDT 2013:          Content-Type: application/octet-stream
Fri Jul 19 03:54:30 PDT 2013:          X-HW-Deprecated: 1374229321.pj007s2
Fri Jul 19 03:54:30 PDT 2013:          X-HW: 1374229306.dop001.sj2.t,1374229321.cds007.sj2.p

We're doing HEAD requests for files the CDN won't have seen before, and many of these requests are made in parallel. If we rerun the job as little as 30 minutes later, we don't get errors.

Is Highwinds converting the HEAD into a GET, or are they just much more aggressive about timeouts than Akamai?
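For anyone digging through these logs by hand, a rough sketch of pulling the HTTP status codes out of a final_verification log so 408s can be flagged (assumptions: the log format shown above; the function name is mine, not part of the release tooling):

```python
import re

def find_status_codes(log_text):
    """Scan final_verification log output for HTTP status lines
    (e.g. 'HTTP/1.1 302 Found') and return the numeric codes."""
    return [int(s) for s in re.findall(r"HTTP/1\.\d (\d{3})", log_text)]

sample = """\
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 302 Found
Fri Jul 19 03:54:30 PDT 2013:          Server: Apache
Fri Jul 19 03:54:30 PDT 2013:          HTTP/1.1 408 Request Timeout
"""
codes = find_status_codes(sample)
print(codes)         # [302, 408]
print(408 in codes)  # True -> this mar URL hit the CDN timeout
```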
I have a ticket open with Highwinds on this already (SUP-174774 for reference). I will update here when I have more details.
I've made a small tweak to their timeout config, and am inquiring about larger changes.

Based on a cursory reading of the documentation for Akamai and Highwinds, it appears that this is how they work:

Akamai makes the initial request and 3 retries, with 5-second timeouts each

Highwinds defaults to the initial request plus 1 retry, 15-second timeouts each


I've changed Highwinds to do the initial request plus 2 retries, 10-second timeouts each... split the difference. I'm hesitant to go further unilaterally without input from them.
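To put numbers on the difference, a quick back-of-the-envelope calculation (my arithmetic based on the settings above, not anything from the CDN docs) of the worst-case time an edge spends waiting on the origin before giving up under each policy:

```python
def worst_case_wait(attempts, timeout_s):
    """Total wait if every attempt (initial request + retries) times out."""
    return attempts * timeout_s

# Akamai: initial request + 3 retries, 5-second timeouts
print(worst_case_wait(1 + 3, 5))    # 20 seconds
# Highwinds default: initial request + 1 retry, 15-second timeouts
print(worst_case_wait(1 + 1, 15))   # 30 seconds
# New Highwinds config: initial request + 2 retries, 10-second timeouts
print(worst_case_wait(1 + 2, 10))   # 30 seconds
```

The total worst-case wait barely changes, but Akamai takes more, shorter shots at the origin, which helps when failures are transient.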

I'm also asking about "Origin Shielding", which is their way of doing a mid-tier cache... essentially their edge nodes all query one of their POPs for our content, which then queries us. I believe right now we are not using anything like this. Akamai has a mid-tier cache, but I don't know how it works specifically... it has the same basic goal, but I suspect a different implementation.
Highwinds confirmed my reading is correct. We have also made some additional changes beyond what I described in comment 2:

1) Origin Shielding is enabled. This means all requests to the Origin (us) will come from a single POP, and that POP will then serve the others. This reduces their load on us, improves overall cache hit rate, and thus reduces the likelihood of timeouts reaching the end user.

2) The "negative TTL" for timeout errors has been reduced from 60 seconds to 20 seconds. Any request that does result in a 408 will be remembered for a much shorter time window, so any transient issues are more quickly overcome.
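My understanding of how that negative TTL behaves, as a simplified sketch (the cache class and injectable clock are hypothetical illustrations, not Highwinds' actual implementation):

```python
class NegativeCache:
    """Remember an origin error for `ttl_s` seconds; within that window,
    serve the cached error instead of retrying the origin."""
    def __init__(self, ttl_s, clock):
        self.ttl_s = ttl_s
        self.clock = clock        # injectable time source, for testing
        self.error_until = None

    def record_error(self):
        # e.g. the origin returned/produced a 408 at the current time
        self.error_until = self.clock() + self.ttl_s

    def should_retry_origin(self):
        return self.error_until is None or self.clock() >= self.error_until

now = [0]
cache = NegativeCache(ttl_s=20, clock=lambda: now[0])
cache.record_error()                 # timeout recorded at t=0
now[0] = 10
print(cache.should_retry_origin())   # False: still inside the 20 s window
now[0] = 25
print(cache.should_retry_origin())   # True: window expired, origin retried
```

With the old 60-second TTL, the request at t=25 would still have been served the cached error.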


With these 2 changes plus my retry/TTL changes from comment 2, I believe we should be in much better shape. If errors persist, please let us know how frequent they are compared to before, so we can tell whether we're on the right track.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard