Load test input.stage for testing major release scale

Status: RESOLVED FIXED
Product: mozilla.org Graveyard
Component: Server Operations
Opened: 7 years ago
Last modified: 3 years ago

People

(Reporter: aakashd, Assigned: cshields)

(Reporter)

Description

7 years ago
The Input team is working on getting ready for major release support for Firefox 4 and, after doing some number crunching with SUMO Webtrends data, the expectation is that we'll need to handle up to 50x the load we're receiving now.

The previous max (with no drops) we've hit so far on our staging setup is 40 requests per second. So, I'd like to see what would be required to not completely falter on the current hardware.

The timetable for getting this done is 2/4 at the latest.


Note: This is a follow-up from bug 606964
(Reporter)

Updated

7 years ago
Summary: Load test input.stage for Major Release → Load test input.stage for testing major release scale
(Reporter)

Comment 1

7 years ago
Sorry, it's not clear in the reported bug, but these would be write requests posting to one of our submission forms.
(Assignee)

Comment 2

7 years ago
I'm sorry, are you asking for a load test of 50x your current hit rate by 2/4 or are you asking to scale your infrastructure to handle 50x by then??  The latter seems a bit unrealistic for such a short time if looked at on a linear scale, but we will have to see how much you are hitting right now compared to your current capacity.
(Reporter)

Comment 3

7 years ago
> I'm sorry, are you asking for a load test of 50x your current hit rate by 2/4
> or are you asking to scale your infrastructure to handle 50x by then??

I'm asking for a load test by 2/4.
Comment 4

7 years ago
cc: cshields@mozilla.com
(In reply to comment #2)
> I'm sorry, are you asking for a load test of 50x your current hit rate by 2/4
> or are you asking to scale your infrastructure to handle 50x by then??  The
> latter seems a bit unrealistic for such a short time if looked at on a linear
> scale, but we will have to see how much you are hitting right now compared to
> your current capacity.

We should see what we can get now. I think a lot of the scaling will happen on the app side as well as infra, but we should find out how much we can handle.

Sustained 50x won't happen; it'll likely be spikes, if it even gets that close at all. We just need to handle them gracefully... even if that means having a fail-pet once in a while.
Comment 5

7 years ago
We can provide a number of POST requests, along with valid user agents, that we can use to flood our staging instance and see how fast these requests get absorbed until either the database or the app dies.

As a second test, we could explore how varying GET requests on the dashboard side of the app hold up, but I think the real traffic can be expected on the write side of our app, so that's our immediate concern: when release users see they can leave feedback and all decide to tell us how much they like* Firefox 4 at once.

* or so we hope
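One quick way to see how fast POSTs get absorbed under increasing concurrency is a ramp with ApacheBench. This is only a sketch of the idea, not the harness that was actually used; the endpoint, form body, and CSRF handling below are simplified assumptions:

  # body.txt holds one urlencoded form body, e.g. the fields from the example
  # curl call later in this bug (the posted csrfmiddlewaretoken must match the cookie)
  for c in 10 25 50 100; do
    ab -n 500 -c "$c" \
       -p body.txt -T "application/x-www-form-urlencoded" \
       -C csrftoken=1 \
       -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b11pre) Gecko/20110129 Firefox/4.0" \
       http://input.stage.mozilla.com/en-US/release/feedback
  done
  # note: ab replays the same body every time, so this measures raw absorption
  # only, not realistic unique submissions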
Comment 6

7 years ago
My Netsparker Community Edition tool easily brings down the (current) Input staging site (and no, I don't run it that way any longer, for obvious reasons), but perhaps that's a place to start -- it has an "attack" mode which tends to max out HTTP sessions server-side.

http://www.mavitunasecurity.com/communityedition/
(Assignee)

Comment 7

7 years ago
Load testing against stage is a bit pointless; you're in a shared environment there, on a different hardware platform than your production.

What we will do is spin up a third webhead, pp-app-input03 (since you will likely need it anyway), and load test it directly, outside of the production VIP.
(Assignee)

Comment 8

7 years ago
We have a production node up and serving Input independently of the production cluster, and we can test against this node.  This is pending verification that the site is working properly.  (Ping me on #input for the IP to test; it looks fine to me from my limited poking.)

Stephen, at what point were you able to bring down the staging site?  (how many hits, etc)
Assignee: server-ops → cshields
Comment 9

7 years ago
(In reply to comment #8)
> We have a production node up and serving input independent of the production
> cluster, and can test against this node.  This is pending verification that the
> site is working properly.  (ping me on #input for the IP to test..  it looks
> fine to me from my limited poking)
> 
> Stephen, at what point were you able to bring down the staging site?  (how many
> hits, etc)

No clue; I was scolded and immediately stopped running the tool.  I can work with you to run a controlled test, if you'd like?
(Assignee)

Comment 10

7 years ago
We have pp-app-input03 set up, and it is on its own VIP for load testing.  It backs into a copy of the Input db (input_load_test on the phx1 a01 cluster).  The copy was taken at around 08:30 2/1/11 mv time.

The temporary VIP for this is 63.245.217.55 and is set up to answer for input.mozilla.com.

Please do not load test this without someone from IT watching -- since the a01 db cluster is production, we will want to keep an eye on it.
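For anyone pointing a test client at the temporary VIP, one way to do it without touching DNS (a sketch; this assumes plain HTTP on port 80, which is not stated above) is to send the expected Host header or add a local hosts entry:

  # hit the VIP directly while presenting the hostname it answers for
  curl -H "Host: input.mozilla.com" http://63.245.217.55/

  # or map the hostname to the VIP locally (remove the entry when done)
  echo "63.245.217.55 input.mozilla.com" | sudo tee -a /etc/hosts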
Comment 11

7 years ago
For write testing, I suggest using cURL or the like to issue a POST like a user would.  I'll come up with three calls (rating, broken website, suggestion), all of which will behave like AJAX submissions on the release feedback page.

For the record, three things are important:
- CSRF needs to validate.
- The request needs to contain some comment, but must not add the same comment more than once (use a counter?).
- The user agent needs to fake a Firefox 4 release.
Comment 12

7 years ago
curl http://input.stage.mozilla.com/en-US/release/feedback \
  -A "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b11pre) Gecko/20110129 Firefox/4.0" \
  -d "startup=1&crashy=2&pageload=3&features=1&responsive=3&csrfmiddlewaretoken=1&type=4" \
  -b csrftoken=1
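A minimal sketch of how a flood could be scripted around that call (an assumption on my part, not the harness that was used; in particular, the "description" field name for the comment text is a guess and should be replaced with whatever the form actually calls it):

  #!/bin/bash
  # Flood the release feedback endpoint with unique comments (sketch only).
  URL="http://input.stage.mozilla.com/en-US/release/feedback"
  UA="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0b11pre) Gecko/20110129 Firefox/4.0"

  for i in $(seq 1 1000); do
    # counter in the comment keeps submissions from being identical;
    # cookie and posted CSRF token match, and the UA fakes Firefox 4
    curl -s -o /dev/null -w "%{http_code}\n" "$URL" \
      -A "$UA" \
      -b csrftoken=1 \
      -d "startup=1&crashy=2&pageload=3&features=1&responsive=3&csrfmiddlewaretoken=1&type=4&description=load+test+comment+$i" &
    (( i % 50 == 0 )) && wait   # cap at ~50 in-flight requests
  done
  wait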
(Assignee)

Comment 13

7 years ago
Using a jmeter test provided by Fred, I benchmarked with 20 seamicro nodes and can top out at an error-free rate of around 1,770 submissions per minute.  The load on the webhead reaches around 6.5 (which is healthy for that hardware).

As I understand it, the current submission rate is about 8 per minute (spread across 2 webheads of the same build).

With this in mind, we can add this third webhead to the production pool, and we should have enough power to handle roughly 5,000 submissions per minute.

Is this acceptable?
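For anyone checking the arithmetic behind "roughly 5,000": this assumes throughput scales more or less linearly across identical webheads, which is my reading of the comment rather than something measured:

  echo $(( 3 * 1770 ))   # 5310 submissions/min with three webheads -> "roughly 5,000"
  echo $(( 5310 / 8 ))   # ~660x the current ~8 submissions/min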
Comment 14

7 years ago
(In reply to comment #13)
> Using a jmeter test provided by Fred, I benchmarked with 20 seamicro nodes and
> can top out at an error-free rate of around 1,770 submissions per minute.  The
> load on the webhead reaches around 6.5 (which is healthy for that hardware).
> 
> As I understand it, the current submission rate is about 8 per minute (spread
> across 2 webheads of the same build).
> 
> With this in mind, we can add this third webhead to the production pool, and we
> should have enough power to handle roughly 5,000 submissions per minute.
> 
> Is this acceptable?

If we're doing 8/m and we can do 1770/m without errors - then we shouldn't need to upgrade anything.

If that's the case I'd not add any hardware, we can use those nodes elsewhere.

Fred, can you verify that I'm understanding these results correctly?
(Assignee)

Comment 15

7 years ago
(In reply to comment #14)
> If we're doing 8/m and we can do 1770/m without errors - then we shouldn't need
> to upgrade anything.
> 
> If that's the case I'd not add any hardware, we can use those nodes elsewhere.

I still plan on adding the third node for redundancy's sake.  I want to be able to lose a node completely without stressing the rest and risking the site.
Comment 16

7 years ago
(In reply to comment #15)
> (In reply to comment #14)
> > If we're doing 8/m and we can do 1770/m without errors - then we shouldn't need
> > to upgrade anything.
> > 
> > If that's the case I'd not add any hardware, we can use those nodes elsewhere.
> 
> I still plan on adding the third node for redundancy's sake.  I want to be able
> to lose a node completely without stressing the rest and risking the site.

Yes, that's how I understood it as well. Thanks for running that test, Corey.

The only thing we did not test this time is what happens with more complicated term extraction (i.e., extracting nouns from feedback items), but we do have a switch we can flip to turn this feature off if it gets too hard for the boxes to handle.

Calling this one fixed. I believe we are in good shape for the release.
Status: NEW → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard