Closed Bug 895558 Opened 11 years ago Closed 11 years ago

Load test snippets VMs in production

Categories

(Snippets :: Service, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hoosteeno, Unassigned)

Once we have deployed the application to snippets 6/7, we should run a load test using real traffic. We'll put these servers in the pool and verify that they handle an evenly balanced share of traffic (watching New Relic for data). We can even throttle snippets 1/2/3 and see how 4/5/6/7 behave with the entire load. They should be quite capable of it, since two servers handled the load in bug 887284 comment 19.
:jakem suggests that we may also wish to experiment with TTL settings during this test, since we may be able to tune them to make each server much more performant:

"Is it just a TTL problem? Snippets don't change that frequently... if
the problem is just the hit rate is not good enough and we can't handle
the misses, we can probably cache for up to an hour with little negative
effect."
Hit rate on old app and new app is about 90-92%, consistently.

TTL is currently 90 seconds, set somewhere within the app (not in Apache or overridden by Zeus).

The current 2 VMs are sufficient, but only barely... if one of them fails, the other will probably not have quite enough capacity to handle all the traffic alone. So let's spin up those other 2 nodes as planned.

However, let's also try a larger TTL. Can we do 3/4/5/10 minutes, instead of 1.5 min? That'll probably help the hit rate a few %... going from 90% to 95% cuts the volume of traffic that goes to the backend servers in half, so a small change here can help a lot.
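
To make the arithmetic behind that claim concrete, here is a minimal sketch. The function name is hypothetical and the hit rates are just the round figures quoted above, not measurements from this test.

```python
def backend_share(hit_rate_percent: int) -> int:
    """Percentage of requests that miss the cache and reach the backend."""
    if not 0 <= hit_rate_percent <= 100:
        raise ValueError("hit rate must be between 0 and 100 percent")
    return 100 - hit_rate_percent


before = backend_share(90)  # misses at a 90% hit rate
after = backend_share(95)   # misses at a 95% hit rate
print(f"backend traffic: {before}% -> {after}% of requests")
print(f"relative backend load: {after / before}")
```

The backend only sees the misses, so what matters is the change in miss rate (10% to 5%), not the change in hit rate.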
The max-age for the requests is controlled by a setting called SNIPPET_HTTP_MAX_AGE, which defaults to 90 (in seconds). I recommend controlling it via that setting in local.py rather than updating the app code.
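
A sketch of what that override might look like in local.py. The setting name comes from the comment above; the 5-minute value is only an example from the 3/4/5/10 minute range suggested earlier, not a value anyone has agreed on.

```python
# local.py (deployment-specific settings)
# Override the default of 90 seconds. The 5-minute value below is an
# illustrative assumption, not a tested recommendation.
SNIPPET_HTTP_MAX_AGE = 5 * 60  # seconds
```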
We ran the following test:

1. added snippets 4/5/6/7 to the production snippets pool and observed their behavior
 -> each of the new hosts carried 1/7th of the production load without any problem

2. drained traffic from snippets 1/2/3 and observed the new nodes' behavior
 -> each of the new hosts carried 1/4th of the production load without any problem

3. drained traffic from snippets 5 and observed the remaining nodes
 -> each of the remaining new hosts carried 1/3rd of the production load without any problem
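
For reference, the per-host shares in the three steps above follow directly from the pool size, assuming the load balancer spreads traffic evenly. A rough sketch (the function name and step labels are made up for illustration):

```python
from fractions import Fraction


def per_host_share(hosts_in_pool: int) -> Fraction:
    """Share of total traffic each host carries under even balancing."""
    if hosts_in_pool < 1:
        raise ValueError("the pool needs at least one host")
    return Fraction(1, hosts_in_pool)


# Pool sizes for the three steps of the test.
steps = [
    ("step 1: snippets 1-7 in the pool", 7),
    ("step 2: snippets 1/2/3 drained", 4),
    ("step 3: snippets 5 also drained", 3),
]

for label, hosts in steps:
    print(f"{label}: {per_host_share(hosts)} of production load per host")

# How much more each host carries in step 3 than in step 1.
increase = per_host_share(3) / per_host_share(7)
print(f"step 3 vs step 1: {increase} times the per-host load")
```

So by step 3 each remaining host was carrying 7/3 (a bit over twice) the traffic it saw in step 1.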

This was all expected, but it is good confirmation that we have appropriately sized infrastructure for a successful launch (scheduled for 2013-07-30 at 1pm Mountain time).

We did discover and open one additional bug, but it is not blocking launch since it is not constrained to snippets: https://bugzilla.mozilla.org/show_bug.cgi?id=898548

Regarding the TTL configuration, I have opened a separate bug for adjusting that setting. It also is not blocking launch: https://bugzilla.mozilla.org/show_bug.cgi?id=898553
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED