Closed Bug 718218 Opened 13 years ago Closed 13 years ago

search performance regression in 2.4

Categories

(Socorro :: General, task)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: rhelmer, Unassigned)

Attachments

(1 file)

We're tracking a performance regression that the Selenium tests are showing in production following the 2.4 push (bug 717055).

What we know so far:

A run we believe to be equivalent on production 2.3.5.1, from Jan 11th:
http://qa-selenium.mv.mozilla.com:8080/job/socorro.prod/1741/testReport/

A run from just now (looks similar to last night):
http://qa-selenium.mv.mozilla.com:8080/view/Socorro/job/socorro.prod/1736/testReport/
We have a number of failures (that we believe to be timeouts), and the following test suites seem to have regressed in performance:

tests.test_smoke_tests
  before: 1 min 11 sec
  after:  6 min 5 sec

tests.test_search
  before: 2 min 11 sec
  after:  13 min
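
To put those numbers on a common scale, here is a minimal sketch that converts the durations quoted above to seconds and computes the slowdown factor. The helper names (`to_seconds`, `slowdown`) are made up for illustration and are not part of Socorro-Tests or Jenkins; only the four timings come from the reports above.

```python
import re


def to_seconds(duration: str) -> int:
    """Convert a duration such as '6 min 5 sec' or '13 min' to seconds."""
    minutes = re.search(r"(\d+)\s*min", duration)
    seconds = re.search(r"(\d+)\s*sec", duration)
    total = int(minutes.group(1)) * 60 if minutes else 0
    total += int(seconds.group(1)) if seconds else 0
    return total


def slowdown(before: str, after: str) -> float:
    """Ratio of the 'after' duration to the 'before' duration."""
    return to_seconds(after) / to_seconds(before)


# Timings copied from the two test reports quoted in this comment.
timings = {
    "tests.test_smoke_tests": ("1 min 11 sec", "6 min 5 sec"),
    "tests.test_search": ("2 min 11 sec", "13 min"),
}

for suite, (before, after) in timings.items():
    print(f"{suite}: {to_seconds(before)}s -> {to_seconds(after)}s "
          f"({slowdown(before, after):.1f}x slower)")
```

Both suites come out roughly five to six times slower, which is consistent with a shared cause rather than one slow test.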

It looks like postgresql has big load-average spikes that coincide with the test timeouts (graph attached), and we get nagios alerts from the front-end and mware servers.
Testing individual search queries showed no change in execution time.  So something else is going on.  Setting up a performance test environment on stagedb.
This test seems to be the single worst offender, and it'd make sense since it's doing a much larger query:

https://github.com/mozilla/Socorro-Tests/commit/32d53ffb58d6800db8837bc2e814d1ab41c98978

I see I am implicated in this; I was trying to suggest a version that wouldn't expire as fast as Aurora, but did not ask what it was used for :)

Can someone please try changing this to 11.0a2 instead?
(In reply to Robert Helmer [:rhelmer] from comment #2)
> This test seems to be the single worst offender, and it'd make sense since
> it's doing a much larger query:
> 
> https://github.com/mozilla/Socorro-Tests/commit/32d53ffb58d6800db8837bc2e814d1ab41c98978
> 
> I see I am implicated in this, I was trying to suggest a version that
> wouldn't expire as fast as Aurora, but did not ask what it was used for :)
> 
> Can someone please try changing this to 11.0a2 instead?

I just filed a pull request: 
https://github.com/mozilla/Socorro-Tests/pull/82
Pull request merged, we're running a test against prod:
http://qa-selenium.mv.mozilla.com:8080/job/socorro.prod/1745/
Just FYI I reversed the links here:

(In reply to Robert Helmer [:rhelmer] from comment #0)
> A run we believe to be equivalent on production 2.3.5.1, from Jan 11th:

http://qa-selenium.mv.mozilla.com:8080/view/Socorro/job/socorro.prod/1736/testReport/
 
> A run from just now (looks similar to last night):
> We have a number of failures (that we believe to be timeouts), and the
> following test suites seem to have regressed in performance:

http://qa-selenium.mv.mozilla.com:8080/job/socorro.prod/1741/testReport/
I think we tracked this down to at least one cause: test_that_filter_for_browser_results has a |while| condition that checks whether a "browser_icon" element is present and then clicks the "Next" button for each set of search results. The problem is that it never really clicks Next, because that condition is never true. As a stopgap, we're commenting out the offending while/assert and re-running:

https://github.com/mozilla/Socorro-Tests/commit/36a5f9fcac7c189495fd3c413c99e0bac9e700ee

(We watched that test running in a Sauce Labs video, which led us to this conclusion.)
The problematic test mentioned in comment 6 did turn out to be the problem; it was cycling through all pages looking for a browser icon (which has never existed for the advanced search results page).
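
To illustrate the failure mode (a loop that keeps paging while looking for an element that never appears), here is a minimal sketch. It does not reproduce the actual Socorro-Tests code, which is not shown in this bug: `FakeResultsPage`, `find_icon_unbounded`, `find_icon_bounded`, and the `max_pages` cap are all hypothetical, and the 200-page result set is made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeResultsPage:
    """Hypothetical stand-in for a paginated search results page."""
    total_pages: int
    pages_with_icon: frozenset = frozenset()  # empty: the icon never exists
    current: int = 1
    clicks: int = 0  # counts simulated "Next" clicks (page loads)

    def has_browser_icon(self) -> bool:
        return self.current in self.pages_with_icon

    def has_next(self) -> bool:
        return self.current < self.total_pages

    def click_next(self) -> None:
        self.current += 1
        self.clicks += 1


def find_icon_unbounded(page) -> bool:
    """Page until the icon shows up; walks every page if it never does."""
    while not page.has_browser_icon():
        if not page.has_next():
            return False
        page.click_next()
    return True


def find_icon_bounded(page, max_pages: int = 3) -> bool:
    """Same check, but inspects at most max_pages pages before giving up."""
    for checked in range(1, max_pages + 1):
        if page.has_browser_icon():
            return True
        if checked == max_pages or not page.has_next():
            return False
        page.click_next()
    return False


# Assumed scenario: a large result set where the icon is never present.
slow = FakeResultsPage(total_pages=200)
fast = FakeResultsPage(total_pages=200)
print(find_icon_unbounded(slow), slow.clicks)
print(find_icon_bounded(fast), fast.clicks)
```

In this sketch the unbounded version loads every remaining page, and the cost grows with the size of the result set; capping the number of pages inspected keeps the test's cost independent of how many results the query returns.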

Yesterday the test was switched to use 9.0.1 instead of 10.0a1 (my suggestion, since Aurora changes so frequently). However, this drove the DB load so high that things started timing out and unrelated tests failed. This has been reverted.

We now have a test run that is comparable to the run on the 11th we were comparing to:
http://qa-selenium.mv.mozilla.com:8080/job/socorro.prod/1747/

We discussed this and have some ideas for making the tests more representative for perf comparisons, and for using a different methodology (running the same set of tests right before and after a push, for instance). We can cover this in the post-mortem.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
For fun: https://saucelabs.com/jobs/6f38bb2836dcb1db1812e8a532e03de0 is the problematic test running in Sauce Labs. The video is only archived for ~30 days, but if I can, I'll attach it here as an FLV, too.
Verified ... we'll need to give some love to our test suite. A bit of house cleaning is in order.

Awesome work digging down through this! Woot
Status: RESOLVED → VERIFIED