Closed Bug 1161675 Opened 9 years ago Closed 9 years ago

Middleware timeouts on staging

Categories

(Socorro Graveyard :: Middleware, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stephend, Unassigned)


Details

(Whiteboard: [fromAutomation])

Both manually and through our Selenium automation, we're seeing a ton of really slow GETs and/or actual timeouts, just doing normal reports/queries.

11:34 AM <•phrawzty> 11:33:51 < nagios-phx1> (IRC) Tue 11:33:51 PDT [1116] socorro-mware1.stage.webapp.phx1.mozilla.com:SSH is CRITICAL: CRITICAL - Socket timeout after 15 seconds (http://m.mozilla.org/SSH)

See also https://saucelabs.com/jobs/e1f9bf1b36e140d68c1cf8464d8de0b7 and https://webqa-ci.mozilla.com/job/socorro.stage.saucelabs/189/
From a recent socorro.stage failure:

AssertionError: Base URL did not return status code 200 or 401. (URL: https://crash-stats.allizom.org, Response: 500, Headers: {'content-length': '1266', 'content-encoding': 'gzip', 'x-backend-server': 'socorro2.stage.webapp.phx1.mozilla.com', 'vary': 'Cookie,Accept-Encoding', 'server': 'Apache', 'connection': 'close', 'set-cookie': 'anoncsrf=FEnkSnieQnEot2LquER9YClIhbJ7id1d; expires=Thu, 07-May-2015 09:06:19 GMT; httponly; Max-Age=7200; Path=/; secure', 'date': 'Thu, 07 May 2015 07:06:18 GMT', 'x-frame-options': 'DENY', 'content-type': 'text/html; charset=utf-8'})
We're on new hardware these days. Can we close this bug now?

Note, we're still on crash-stats.mocotoolsstaging.net instead of crash-stats.allizom.org unfortunately.
Flags: needinfo?(stephen.donner)
(In reply to Peter Bengtsson [:peterbe] from comment #2)
> We're on new hardware these days. Can we close this bug now?
> 
> Note, we're still on crash-stats.mocotoolsstaging.net instead of
> crash-stats.allizom.org unfortunately.

Hard to tell; IIRC, this might've been due to something that Adrian fixed, search-performance-wise? Also, though, a retest to close this issue out is probably in order, and that seems to depend on bug 1177806, no?
Flags: needinfo?(stephen.donner)
Saw this again in https://webqa-ci.mozilla.com/job/socorro.stage.saucelabs/650/HTML_Report/ - where are we at with this, Peter?
Flags: needinfo?(peterbe)
(In reply to Stephen Donner [:stephend] - please :need-info? me! from comment #4)
> Saw this again in
> https://webqa-ci.mozilla.com/job/socorro.stage.saucelabs/650/HTML_Report/ -
> where are we at with this, Peter?

What did that test do that took longer than 10 seconds? 

I'd love to have a report that shows the average and median response times per URL pattern. That'd give us focus on what to attack first.
Flags: needinfo?(peterbe)
This was a read timeout: "Once your client has connected to the server and sent the HTTP request, the read timeout is the number of seconds the client will wait for the server to send a response. (Specifically, it’s the number of seconds that the client will wait between bytes sent from the server. In 99.9% of cases, this is the time before the server sends the first byte)." from http://docs.python-requests.org/en/latest/user/advanced/#timeouts
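
As a reference, here's a minimal sketch of how such a timeout is expressed with the requests library (the 10-second read value and the stage URL are taken from this bug; the connect value is purely illustrative):

import requests

# First tuple element is the connect timeout, second is the read timeout
# described above (seconds the client waits between bytes from the server).
try:
    response = requests.get("https://crash-stats.allizom.org/", timeout=(3.05, 10))
    print(response.status_code)
except requests.exceptions.ReadTimeout:
    # The connection was made, but the server stalled for more than 10 seconds.
    print("read timeout")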

These tests aren't able to provide statistics on response times, but that's perhaps something we could add if it would help to investigate this. I wonder if it's something we could set up with Pingdom or a similar service.
Perhaps I'm straying from the topic at hand but that would be really interesting to see. 

A chart of what the response times are per URL over time. Then we'd be able to see where things are most f'ed up.
Peter - is there anything you can tell us about this particular test + the middleware responses that makes it fail more often than others in our suite?

https://webqa-ci.mozilla.com/job/socorro.stage.saucelabs/lastCompletedBuild/testReport/tests.test_layout/TestLayout/test_that_products_are_sorted_correctly/history/

Here's the test: https://github.com/mozilla/Socorro-Tests/blob/master/tests/test_layout.py#L15
Flags: needinfo?(peterbe)
No good reason. It's just the home page being sadly unreliable :(


I'd love to be able to run that whole suite on my laptop. But instead of crash-stats.allizom.org I want to use localhost:8000 

Where are the instructions for that? Or can someone guide me so I don't have to spend too long understanding how to get it to work locally.
Flags: needinfo?(peterbe)
(In reply to Peter Bengtsson [:peterbe] from comment #9)
> No good reason. It's just the home page being sadly unreliable :(
> 
> 
> I'd love to be able to run that whole suite on my laptop. But instead of
> crash-stats.allizom.org I want to use localhost:8000 
> 
> Where are the instructions for that? Or can someone guide me so I don't have
> to spend too long understanding how to get it to work locally.

Peter, that's one of the main benefits (among others) of upgrading/converting to pytest-selenium, IIRC - Cue: Dave Hunt, author of pytest-selenium, and Matt, Socorro lead
(In reply to Stephen Donner [:stephend] - please :need-info? me! from comment #10)
> (In reply to Peter Bengtsson [:peterbe] from comment #9)
> > No good reason. It's just the home page being sadly unreliable :(
> > 
> > 
> > I'd love to be able to run that whole suite on my laptop. But instead of
> > crash-stats.allizom.org I want to use localhost:8000 
> > 
> > Where are the instructions for that? Or can someone guide me so I don't have
> > to spend too long understanding how to get it to work locally.
> 
> Peter, that's one of the main benefits (among others) of
> upgrading/converting to pytest-selenium, IIRC - Cue: Dave Hunt, author of
> pytest-selenium, and Matt, Socorro lead

/me dim. So how do I run it?
This suite isn't using pytest-selenium yet, but you can still run against a local instance by providing a base URL. Hopefully the following will help you to run the tests locally.

Clone the repo:
> git clone https://github.com/mozilla/Socorro-Tests.git
> cd Socorro-Tests

Create a virtual environment:
> virtualenv .venv
> source .venv/bin/activate

Install the dependencies:
> pip install -r requirements.txt

Run the tests:
> py.test --driver Firefox --baseurl http://localhost:8000

If you want to run a specific test file, pass the path as the final argument. If you want a specific test, use the -k argument with the value of the test name, but this will match partial test names too.
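
For example, a hypothetical invocation reusing the test file and test name referenced earlier in this bug:

> py.test --driver Firefox --baseurl http://localhost:8000 tests/test_layout.py
> py.test --driver Firefox --baseurl http://localhost:8000 -k test_that_products_are_sorted_correctly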

Note that some tests are likely dependent on data being present, and some won't run unless you provide the --destructive command line option (these tests modify/create/delete data). More information on running the tests can be found in the README: https://github.com/mozilla/Socorro-Tests/blob/master/README.md
It works! As in, I was able to run the tests. 

However, it doesn't work great locally. I think what happens is that it loads up a page but doesn't stick around to wait until all the AJAX requests on that page have finished. Because I run the website with `./manage.py runserver`, which is notoriously single-threaded and can only handle one request at a time, if it loads a page that starts 5 AJAX requests on page load, the Django runserver has to deal with them one at a time. If the first 4 AJAX requests are slow, the 5th one might time out. And whilst those 5 requests have started (but not yet been responded to by Django), if you kill the browser you'll get a lot of errors like this: https://gist.github.com/peterbe/684522ec1e5ec0506335

I think that if we're going to stress-test to see where the weaknesses are, we should use a different tool. My current favorite is locust.io (because Swedish people wrote it, of course). It doesn't do any Selenium testing at all; it's for testing individual URLs, i.e. you have to write a test that opens a specific AJAX query, for example.

Also, locust supports making concurrent queries, so if you want to run it against your laptop version of Socorro you have to start it with gunicorn or uwsgi locally.
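
For what it's worth, a minimal locustfile sketch of what that could look like (the URL paths are purely illustrative, and the class names follow the locust 0.x API; newer locust releases use HttpUser instead of HttpLocust):

from locust import HttpLocust, TaskSet, task

class UserBehavior(TaskSet):
    @task(3)
    def home(self):
        # Home page, weighted heavier than the search task.
        self.client.get("/")

    @task(1)
    def search(self):
        # Illustrative AJAX-style query; substitute real URL patterns here.
        self.client.get("/api/SuperSearch/?product=Firefox")

class WebsiteUser(HttpLocust):
    task_set = UserBehavior
    min_wait = 1000  # milliseconds between simulated-user requests
    max_wait = 5000

Then point it at a local gunicorn/uwsgi instance with something like:

> locust -f locustfile.py --host http://localhost:8000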

But what would be really interesting is to find out what a list of realistic URLs would be. We could do that by looking at a day's worth of Nginx log files and stripping everything that isn't a GET and isn't a static asset.
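
A rough sketch of that filtering, assuming the standard combined Nginx access-log format and a hypothetical file name:

import re
from collections import Counter

# Crude filters: request-line extraction plus an extension blacklist for static assets.
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+"')
STATIC = re.compile(r"\.(css|js|png|jpg|gif|ico|svg|woff2?)(\?|$)")

counts = Counter()
with open("access.log") as log:  # hypothetical log file name
    for line in log:
        m = REQUEST.search(line)
        if m and m.group("method") == "GET" and not STATIC.search(m.group("path")):
            counts[m.group("path")] += 1

# The most-requested non-static GET URLs: candidates for a realistic URL list.
for path, hits in counts.most_common(20):
    print(hits, path)
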
By the way, we're getting off-topic on this bug. I think the immediate solution is either...

* accept that stage is flaky
** it's stage
** it only has 1 webhead

* upgrade stage to have more webheads
** prod uses 7 webheads so we can steal one from there :)

If you really think it's worth demanding more from stage in terms of reliability (and not just QA'ing features) we can assign this bug to JP.
Apologies for being late to the party; I'm playing a bit of catch-up here. At this time, the primary goal is to have tests that reliably test stage for regressions, to prevent them from making it to production.

If I'm not mistaken, we've chosen 10 seconds as the maximum time within which our sites should return content to a user/client. I'm hesitant to change this. The UI test APIs aren't intended to do load testing, but they do expect a response from the server in a reasonable amount of time.

The best-case scenario is that stage is a solid platform for running our tests against [manual + automation]. For the short/medium term, solution #2 is in our best interest:
> * upgrade stage to have more webheads

++ Off-topic, but locust is a great load-testing tool!
* Yes, let's not talk about load/stress testing anymore. That's for another day. Sorry for starting the off-topic discussion.

* Let's agree to not worry too much about stage being down. It's not designed to be as reliable as prod. Having said that, JP is investigating the reliability of prod for its uptime. His first step is to add New Relic monitoring for better insight, I think.

* If the purpose of the automation suite is to test for regressions, then we should keep it to that. It's not a server-reliability tool. Arguably, super-slow queries are automation regression work, but the problem here was with even connecting to the server.

* Just because I resolve this doesn't mean we are done worrying about stage. But this bug is getting confusing and infested with noise. Need. Fresh. Bug air.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Socorro → Socorro Graveyard