Closed
Bug 1281500
Opened 9 years ago
Closed 9 years ago
[prod] Investigate - bouncer-tests are sporadically failing due to proxy timeouts
Categories
(Infrastructure & Operations :: Infrastructure: Other, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mbrandt, Assigned: bhourigan)
WebQA's bouncer tests are failing intermittently; the failures surface as a proxy timeout error.
* example of a failing testrun - https://webqa-ci.mozilla.com/view/Buildmaster/job/bouncer.prod/22144/
* test runs can be viewed here - https://webqa-ci.mozilla.com/view/Buildmaster/job/bouncer.prod/
E AssertionError: Failing URL: https://download-installer.cdn.mozilla.net/pub/firefox/releases/38.5.1esr/win32/en-US/Firefox%20Setup%2038.5.1esr.exe.
E Error message: HTTPSConnectionPool(host='download-installer.cdn.mozilla.net', port=443): Max retries exceeded with url: /pub/firefox/releases/38.5.1esr/win32/en-US/Firefox%20Setup%2038.5.1esr.exe (Caused by ProxyError('Cannot connect to proxy.', timeout('timed out',)))
Reporter
Updated•9 years ago
Flags: needinfo?(cshields)
Updated•9 years ago
Assignee: nobody → infra
Component: Bouncer → Infrastructure: Other
Product: Webtools → Infrastructure & Operations
QA Contact: cshields
Version: Trunk → unspecified
Comment 1•9 years ago
Last night while this was going on, I tried to determine whether the failures were tied to a specific proxy (there are currently 4 behind round-robin DNS). The problem occurred on all of them.
One thing to note: the webqa1 hits seem to be VERY bursty, ramping up to a couple thousand TCP ports on a proxy before calming down to double digits (normal). I haven't yet found any limit that this would be exhausting.
Reporter
Comment 2•9 years ago
What's the current status of the investigation?
Comment 3•9 years ago
Is this still happening? Stephen mentioned it went away, and nothing (knowingly) changed. :(
Flags: needinfo?(cshields)
Reporter
Comment 4•9 years ago
It does appear to have stopped occurring. I do wish we understood root cause. oremj, is there more we can do or should we bump this to wfm?
Flags: needinfo?(oremj)
Comment 5•9 years ago
If it's failing on the proxy, there isn't anything I can do. Comment 1 indicates this may be partially caused by overloading the proxy by running too many tests at once. Do we need to proxy requests? Maybe this could be run from a jenkins slave that runs in AWS?
Flags: needinfo?(oremj)
Comment 6•9 years ago
(In reply to Corey Shields [:cshields] from comment #3)
> Is this still happening? Stephen mentioned it went away, and nothing
> (knowingly) changed. :(
It's returned, sporadically; see https://webqa-ci.mozilla.com/view/Bouncer/job/bouncer.prod/lastFailedBuild/consoleFull
Is there any reason to think that the changes in bug 1278930 would have any effect(s)?
Flags: needinfo?(cshields)
Assignee
Comment 7•9 years ago
(In reply to Stephen Donner [:stephend] from comment #6)
> (In reply to Corey Shields [:cshields] from comment #3)
> > Is this still happening? Stephen mentioned it went away, and nothing
> > (knowingly) changed. :(
>
> It's returned, sporadically; see
> https://webqa-ci.mozilla.com/view/Bouncer/job/bouncer.prod/lastFailedBuild/consoleFull
>
> Is there any reason to think that the changes in bug 1278930 would have any
> effect(s)?
Hi Stephen
I've been meaning to follow up with you on this. I performed the work referenced in bug 1278930, and those were OS-level / security updates only. No major changes were made to Squid or its configuration.
We've seen sporadic reports of this in the past. We cannot reproduce the failures ourselves, so I've requested a test VM in bug 1283105 to be used for load testing and performance tuning. I expect to make additional progress on this after the holiday weekend, and I will keep you updated as progress is made.
Until then, please feel free to ping me here, on IRC, or via email, but my response may be delayed. If it's an emergency, please escalate to the MOC and they will handle the escalation path.
Flags: needinfo?(cshields)
Looks like this was fixed over in bug 1289697 - how to mark? Root cause seems to be a duplicate...
Assignee
Comment 10•9 years ago
:stephend, I'm glad that this also fixed your issue. I'm sorry we didn't catch this sooner. It was a simple yet difficult problem to track down. I'll mark this as fixed with a see-also to bug 1289697. Since the other bug is moco only, I'll paste my summary below for everyone to see.
For everyone's edification I will provide context and technical background on the exact problem that the proxies were running into.
Historically, our proxies were heavily underutilized, with throughput below 1 request/s and 1 kB/s averaged over 30 days. Proxy load has since increased as we move more and more things behind them.
The QA team's workload is unlike the others: it sends large bursts of traffic, which were causing Squid to exhaust its socket descriptors. Squid had a hard limit of 4k, which is the system default. Because the workload was intermittent and we had no way to reproduce it on demand, effective troubleshooting was difficult.
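To illustrate the failure mode described above: a process that reaches its RLIMIT_NOFILE soft limit cannot open further sockets or files, and the kernel returns EMFILE ("Too many open files"). The sketch below is not Squid code. It temporarily lowers the limit of the current Python process to a small, arbitrary value and opens /dev/null until the limit is hit. The function name and the value 64 are assumptions made for illustration, and the `resource` module is Unix-only.

```python
import errno
import os
import resource


def open_until_exhausted(soft_limit):
    """Lower the soft nofile limit, open descriptors until EMFILE, return the count.

    Hypothetical helper for illustration only; it restores the original
    limit and closes everything it opened before returning.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    try:
        while True:
            # Every open file or socket consumes one descriptor.
            held.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        if exc.errno != errno.EMFILE:
            raise
        # EMFILE: the per-process descriptor limit has been reached.
        return len(held)
    finally:
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    opened = open_until_exhausted(64)
    print(f"opened {opened} descriptors before hitting EMFILE")
```

The count comes out below the limit because descriptors that were already open (stdin, stdout, stderr, and so on) count against it too. A bursty client load can hit a 4k limit in the same way.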
We raised the soft/hard descriptor limit to 256k, which is half of the system's global 512k limit. These systems are dedicated to squid, so it makes sense to allow squid a high degree of parallelism. After the change was pushed via puppet I confirmed that the nofile limit is effectively 256k via a gdb trick:
# gdb -p <PID of squid>
(gdb) set $rlim = &{0ll, 0ll}
(gdb) print getrlimit(7, $rlim)
$1 = 0
(gdb) print *$rlim
$2 = {262144, 262144}
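As an alternative to attaching gdb, on Linux the effective limits of a running process can also be read from /proc/<pid>/limits. (The 7 passed to getrlimit in the gdb session is the value of RLIMIT_NOFILE on x86 Linux.) The sketch below parses the "Max open files" row. The helper names and the sample text are illustrative assumptions, with the sample values chosen to match the 262144 reported by gdb.

```python
import re

# Sample text in the layout of /proc/<pid>/limits on Linux (illustrative values).
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max open files            262144               262144               files
Max locked memory         65536                65536                bytes
"""


def _to_int(value):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_nofile(limits_text):
    """Return (soft, hard) from the 'Max open files' row of a limits file."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns are separated by runs of two or more spaces.
            fields = re.split(r"\s{2,}", line.strip())
            return _to_int(fields[1]), _to_int(fields[2])
    raise ValueError("no 'Max open files' row found")


def read_nofile(pid):
    """Read the nofile limits of a live process (Linux only)."""
    with open(f"/proc/{pid}/limits") as fh:
        return parse_nofile(fh.read())


if __name__ == "__main__":
    print(parse_nofile(SAMPLE_LIMITS))
```

On a proxy host, `read_nofile(<PID of squid>)` would be the call to use. Reading another user's process may require root.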
After this was fully deployed, :whimboo confirmed that the tests are now passing and that Squid can handle the load it is given.
Assignee: infra → bhourigan
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED