Closed
Bug 1429546
Opened 7 years ago
Closed 7 years ago
[ops infra socorro] loadtest webapp in new infra
Categories
(Socorro :: Infra, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: willkg, Assigned: willkg)
The new infrastructure is different in a few ways from the old infrastructure. We should loadtest the webapp nodes in the new infrastructure.
This bug covers writing up a loadtesting plan for the webapp nodes that we'll use for -stage and -prod.
Assignee
Comment 1•7 years ago
Grumpy: Is this something you want to take on? We're thinking we'll need this in February, plus you'll probably get more familiar with Socorro out of it.
Flags: needinfo?(chartjes)
Comment 2•7 years ago
Yes, would be glad to help out with this effort in February.
Flags: needinfo?(chartjes)
Updated•7 years ago
QA Contact: chartjes
Comment 3•7 years ago
As mentioned in an IRC conversation with willkg, I need some time with someone familiar with the system to determine which endpoints need to be hit for a load test.
Who would be the appropriate person for this project?
Flags: needinfo?(willkg)
Assignee
Comment 4•7 years ago
Definitely worth looking at the e2e-tests and basing the load test on that.
Flags: needinfo?(willkg)
Comment 5•7 years ago
Created a repo for these tests
https://github.com/mozilla-services/socorro-load-tests
Assignee
Comment 6•7 years ago
I traded emails with Chris just now. I'm going to take this on and try to get it done this week.
Assignee: nobody → willkg
Status: NEW → ASSIGNED
Assignee
Comment 7•7 years ago
I threw together a test plan based on what Chris started:
https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#
I fleshed out the socorro-load-tests code:
https://github.com/willkg/socorro-load-tests
I ran a test-the-load-test test, followed by a 1-hour test against -new-stage.
Copying from my email to socorro-dev:
"""
Last night, I picked up where Chris left off and threw together a
rough load test plan for the webapp. Then today, I took the code that
Chris had started, and fleshed it out to a point where I could run a
test-the-load-test test and a rough load test to get a feel for what
things looked like.
Webapp load test plan is here:
https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#
During the test-the-load-test test, I determined that I could probably
get enough req/s from my laptop, so it was sufficient to run the load
test from there. Then I did a 1-hour load test running from my laptop
against the -new-stage webapp nodes.
Short summary:
1. the webapp in -new-stage exceeds the 1x (3 req/s) and 3x (9 req/s)
targets--peak of 14 req/s
2. Datadog graphs suggest the webapp cluster is scaling nicely--scaled
at 10m and 20m for a total of 4 nodes
3. there weren't any errors in Sentry or non-200 responses
The only concern is that 0.9% of the requests timed out. The load test
code sets a timeout of 10 seconds which isn't very long. It's not
clear why these requests timed out (connection? waiting for response?
ES garbage collection?).
More details on the 1-hour load test from my laptop:
https://docs.google.com/document/d/1d-WqjrzMhjwMSzr_TFoyA4gYy7KF0p0Qld7wR0pDLQI/edit#heading=h.2swl8ar501gi
My thoughts at this stage:
I think there's enough evidence to suggest we're fine and that we
don't need to do anything further.
Does that make sense to you? What's important to pursue further?
"""
I think this is good enough, but will iterate further if there are interesting things we should pursue. I'll keep the bug open until there's consensus we should be done.
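For anyone who wants to redo the numbers, here is a rough sketch of how a run like this can be tallied: apply a per-request timeout, count the outcomes, and compare the average rate to the 1x (3 req/s) and 3x (9 req/s) targets. It is not the code in socorro-load-tests. The names `timed_request`, `run_batch`, and `summarize` are made up for illustration, `do_request` stands in for whatever coroutine actually hits the webapp, and the sketch reports an average rate rather than the peak rate quoted above.

```python
import asyncio
from collections import Counter

REQUEST_TIMEOUT = 10  # seconds; the per-request timeout mentioned above


async def timed_request(do_request, timeout=REQUEST_TIMEOUT):
    """Await one request coroutine and classify the outcome.

    do_request is a hypothetical zero-argument coroutine function that
    returns an HTTP status code.
    """
    try:
        status = await asyncio.wait_for(do_request(), timeout)
    except asyncio.TimeoutError:
        return "timeout"
    return "ok" if status == 200 else "non-200"


async def run_batch(do_request, count, timeout=REQUEST_TIMEOUT):
    """Fire `count` requests concurrently and tally the outcomes."""
    outcomes = await asyncio.gather(
        *(timed_request(do_request, timeout) for _ in range(count))
    )
    return Counter(outcomes)


def summarize(tally, elapsed_seconds, baseline_rps=3.0):
    """Compare the average request rate to the 1x and 3x targets."""
    total = sum(tally.values())
    rps = total / elapsed_seconds
    return {
        "req_per_s": rps,
        "timeout_pct": 100.0 * tally["timeout"] / total if total else 0.0,
        "meets_1x": rps >= baseline_rps,
        "meets_3x": rps >= 3 * baseline_rps,
    }
```

Keeping timeouts in their own bucket, separate from non-200 responses, is what makes a figure like the 0.9% above easy to read off at the end of a run.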
Assignee
Comment 8•7 years ago
Mike and Miles raised an eyebrow at the FAILURES/Timeouts.
I did another iteration of load testing focusing on those and comparing -new-stage to -stage. The timeouts are definitely timeouts, though it's not clear what specifically is timing out. Both environments show timeouts during a loadtest, but -stage is significantly worse than -new-stage.
I'm not entirely sure how to differentiate between the various types of timeouts. aiohttp raises a TimeoutError() with no explanation. I think to figure out more, I'd have to switch tools. It's a mystery, but I think we're ok with leaving it mysterious.
Everything else about the load test looks fine. Given that, I'm going to mark this FIXED.
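If someone does want to dig into the timeouts later, one possible approach is to time the connect phase and the wait-for-response phase separately, so a timeout can at least be attributed to one or the other. This is only a sketch using the standard library's asyncio streams, not aiohttp and not the load test code. The function name `probe` and its default timeouts are assumptions for illustration, it speaks plain TCP (no TLS), and it cannot tell an ES garbage collection pause apart from any other slow response.

```python
import asyncio


async def probe(host, port, request_bytes,
                connect_timeout=2.0, read_timeout=10.0):
    """Send one raw request and report which phase, if any, timed out.

    Returns "connect-timeout", "read-timeout", or "ok". Other failures
    (for example a refused connection) are raised as usual.
    """
    try:
        reader, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), connect_timeout
        )
    except asyncio.TimeoutError:
        return "connect-timeout"

    try:
        writer.write(request_bytes)
        await writer.drain()
        try:
            # Wait only for the first line of the response (the status line).
            # An empty read (server closed the connection) also counts as "ok"
            # here; a real tool would want to distinguish that case.
            await asyncio.wait_for(reader.readline(), read_timeout)
        except asyncio.TimeoutError:
            return "read-timeout"
        return "ok"
    finally:
        writer.close()
```

Tallying the three outcomes across a run against both -stage and -new-stage would show whether the timeouts cluster in connection setup or in waiting for a response.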
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED