Marking this as "critical" severity because we need a stable, performant app/app environment, ahead of tomorrow's PHX1 migration for SUMO on staging - see bug 1212082. Our sumo.stage.saucelabs job is suffering from timeouts/failures, mostly in the form of "read timeouts" on port 443, with our default timeout of 10 seconds. This corresponds with high "Request Queuing," times in the 13,600ms range (!) as I understand New Relic's charts for staging. New Relic: https://rpm.newrelic.com/accounts/263620/applications/2779107 Latest Web QA test results for SUMO: https://webqa-ci.mozilla.com/view/SUMO/ (See screenshot.)
:mythmon, who can we get/do you need, to help you help us, with this? Thanks!
Here's an example test run: https://webqa-ci.mozilla.com/view/SUMO/job/sumo.stage.saucelabs/701/HTML_Report/
From our IRC conversation, I believe this started around October 17th, which doesn't correlate with any changes I know about. The most recent deploy to stage was the 15th, according to the deploy logs at http://supportadm.private.phx1.mozilla.com/chief/support.stage/logs/?C=M;O=D My only idea is that the increased latency from the webheads to the database is causing very slow page loads, but this doesn't match up timing wise. It also seems unlikely that it could add over 10 seconds of request queuing time, and New Relic isn't showing that increase as DB time.
Snippets was experiencing a similar problem. It was traced back to the tests calling pages that were very database heavy. Since the move to SCL3 is in progress, the route from the webheads to the database has much higher latency than it used to be. This made the Snippet's page load too slow for the tests, causing the timeout. I'm now much more confident in my earlier hypothesis that this is related to the increased database latency. Further, we ran the Jenkins tests again and they passed, so this is a transient problem that is hard to reproduce. If the problem comes back before the move, I recommend we mark the failing tests as xfail until after the route between the servers and the databases is fast again. I don't think this should block the move to SCL3.
(In reply to Mike Cooper [:mythmon] from comment #4) > Snippets was experiencing a similar problem. It was traced back to the tests > calling pages that were very database heavy. Since the move to SCL3 is in > progress, the route from the webheads to the database has much higher > latency than it used to be. This made the Snippet's page load too slow for > the tests, causing the timeout. > > I'm now much more confident in my earlier hypothesis that this is related to > the increased database latency. Further, we ran the Jenkins tests again and > they passed, so this is a transient problem that is hard to reproduce. > > If the problem comes back before the move, I recommend we mark the failing > tests as xfail until after the route between the servers and the databases > is fast again. I don't think this should block the move to SCL3. Actually, while I don't think we still have root cause, we do have a specific problem + remedy, thanks to :cyliang: 3:28 PM <cyliang> stephend mythmon: so, it looks like some apache processes on sumo stage with signal 2 ( which is novel to me) https://cliang.pastebin.mozilla.org/8850062 3:28 PM <cyliang> And there was a core-dump this morning. 3:29 PM <stephend> cyliang: ouch :-( 3:29 PM <mythmon> cyliang: signal 2 is sigint, aka ctrl+c 3:29 PM <cyliang> mythmon: which is weird in the apache logs. 3:30 PM <mythmon> huh. especially spread over so many hours. 3:30 PM <jakem> (http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history for those following along) 3:31 PM <mythmon> cyliang: i think we have apache set to kill workers after a certain number of responses. could it be using sigint for that? 3:34 PM <cyliang> So, I did whack apache in the head on SUMO. And, to be clear: after she did that, we started passing again on SUMO staging, and haven't-yet (knock on wood) failed, since: https://webqa-ci.mozilla.com/view/SUMO/job/sumo.stage.saucelabs/
As an important aside, the "similar problem" Mike mentions (and I first noted, on limited information, myself) is, rather, that pytest-mozwebqa's default timeout  for its use of the Python Requests module (to check that a particular dev/staging/prod) instance is up, and ready to run tests) is 10 seconds, which is/should be reasonable. I can confirm that on both snippets.stage (see bug 1217249) and sumo.stage.* jobs we saw "E ReadTimeout: HTTPSConnectionPool(host='support.allizom.org', port=443): Read timed out. (read timeout=10)" errors, whose root cause(s) might both be related to the strange/temporary PHX1 <-> SCL3 network setup, or not. (Hard to tell.) What *is* clear in this case is that SUMO's WSGI/Apache processes on staging were problematic, were kicked/restarted, and things have been good since; for snippets.stage (again, see bug above), the network-traversal + code issues are more-clearly problematic and the latter will be addressed.  https://github.com/mozilla/pytest-mozwebqa/blob/master/pytest_mozwebqa/pytest_mozwebqa.py#L18
Note that without this timeout set on the request, it would default to waiting forever for a response, which we've had issues with in the past. I do think 10 seconds without any response is something we should address rather than ignore by increasing our timeout. I'm not entirely against increasing this timeout, or perhaps making it possible to override, but we've stopped development of pytest-mozwebqa now in favour of pytest-selenium, so I'd want to make any changes to the latter and migrate any projects to the new plugin.
Re-resolving and re-assigning to give :cyliang credit for kicking WSGI/Apache, which was the symptom of the problem we were seeing, here.