Closed Bug 1093202 Opened 10 years ago Closed 10 years ago

[Tiles][Back-end] Load testing the Tiles web-submission component

Categories

(Content Services Graveyard :: General, defect)

All
Mer
defect
Not set
normal

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: kthiessen, Assigned: mostlygeek)

Details

This bug is mostly for posterity -- a place to record load test results.  Bugs discovered as a result of this load testing will be marked to block this bug.
Day 1: 2014-10-28


beeswithmachineguns is a better tool for this than blitz.io.
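
For reference, a minimal sketch of driving beeswithmachineguns for a run like this one, wrapping its CLI from Python. The instance count, security group, key pair, login user, and target URL are placeholders, not the actual values used in this test:

    # Sketch only: wraps the "bees" CLI; all names and numbers are placeholders.
    import subprocess

    TARGET = "http://stage-tiles.example.com/"  # hypothetical endpoint

    def run(cmd):
        print("$ " + " ".join(cmd))
        subprocess.check_call(cmd)

    # Spin up 50 attacking instances ("bees").
    run(["bees", "up", "-s", "50", "-g", "load-test",
         "-k", "loadtest-keypair", "-l", "ec2-user"])

    # Fire ~1.2M requests at the target with 250 concurrent connections,
    # mirroring the request volume mentioned below.
    run(["bees", "attack", "-u", TARGET, "-n", "1200000", "-c", "250"])

    # Tear the hive down afterwards so the instances stop billing.
    run(["bees", "down"])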

I did, in fact, get the AutoScaler to generate a new set of 3 servers.
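
For context, a minimal sketch (using boto3, with placeholder names and thresholds) of the kind of simple scaling policy that produces that scale-out-by-3 behavior; the actual stage-tiles configuration may well differ:

    # Sketch only: a simple "add 3 instances on high CPU" policy.
    # Group name, alarm thresholds, and periods are assumptions.
    import boto3

    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    # Add 3 instances to the group whenever the alarm below fires.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName="stage-tiles-web",   # placeholder ASG name
        PolicyName="scale-out-on-cpu",
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=3,
        Cooldown=300,
    )

    # Fire when average CPU across the group exceeds 60% for 5 minutes.
    cloudwatch.put_metric_alarm(
        AlarmName="stage-tiles-high-cpu",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "stage-tiles-web"}],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=1,
        Threshold=60.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[policy["PolicyARN"]],
    )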

Sustained load of ~3K req/sec over about 15 minutes.  (4 requests failed out of 1.2 million.  Yay!)

Detailed logs and other supporting data to follow tomorrow morning.

So far, I'd say status looks green, but more testing is definitely called for.
Status: NEW → ASSIGNED
Day 2: 2014-10-29


Status: still green, no blockers.  AutoScaling continues to work as designed.

By grafting siege onto BWMG, Benson has created bees with siege engines! (Yay Ben!) 50 of these t2.micro instances can generate a load of 6.5K RPS.
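
Each bee runs something along these lines (those are real siege flags, but the concurrency, duration, and URL here are illustrative guesses):

    # Sketch of a single bee's siege invocation; values are placeholders.
    import subprocess

    subprocess.check_call([
        "siege",
        "-b",          # benchmark mode: no delay between requests
        "-c", "130",   # concurrent simulated users on this instance
        "-t", "15M",   # run for 15 minutes
        "http://stage-tiles.example.com/",  # hypothetical target URL
    ])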

This caused AutoScaling to ramp up, but nothing fell over.

By tweaking concurrency limits, I hope to push this strategy to 10K RPS today; at that point, if we use a hive of 100 rather than 50, we should be able to get 20K.
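
The arithmetic behind those targets, assuming throughput scales roughly linearly with hive size (which is exactly the assumption these runs are testing):

    # Back-of-the-envelope projection from today's numbers.
    observed_rps = 6500
    hive_size = 50
    per_bee_now = observed_rps / hive_size   # ~130 RPS per t2.micro today
    per_bee_tuned = 200                      # assumed rate after concurrency tweaks
    print(per_bee_tuned * 50)                # ~10K RPS with the current hive of 50
    print(per_bee_tuned * 100)               # ~20K RPS with a hive of 100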

Suggestions and encouragement welcome.  I will also be filing a bug somewhere to record these results.

Onward!
--KT.
Day 3: 2014-10-30


Hive of 50 t2.micros replaced with 10 c3.larges.  Managed to get 9K RPS out of those; scaling works as expected.

Small number of 500s (<0.1% of requests) as the ELBs scale up -- not sure that's avoidable.  In practice, I suspect it will just mean a re-fetch; anyone care to confirm?
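
To make the "re-fetch" assumption concrete: a client that retries once or twice on a transient 5xx would paper over errors at this rate. A generic sketch, not the actual Tiles client code:

    # Generic retry-on-5xx sketch; URL, timeout, and retry counts are arbitrary.
    import time
    import requests

    def fetch_with_retry(url, retries=2, backoff=0.5):
        for attempt in range(retries + 1):
            resp = requests.get(url, timeout=5)
            if resp.status_code < 500:
                return resp              # success, or a non-retryable client error
            time.sleep(backoff * (attempt + 1))
        return resp                      # give up after the final attempt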

Anyone who would like to is welcome to look at the Stackdriver graphs at https://app.stackdriver.com/groups/11386/stage-tiles for the last few days and tell me if anything looks fishy -- I'm mostly looking for catastrophic failures, and other people will have a better idea of how small glitches will affect the system.

Yesterday's last run used 20 c3.larges and got to 16K RPS for a moment, but couldn't sustain it.

Starting off today with 30 c3.larges, turning down the concurrency a little bit, and we'll see if we can get a sustained 15-20K RPS.  Wish me luck.

--KT.
Day 4: 2014-10-31  Status: YELLOW


Folks --

There does, in fact, appear to be a performance/functionality cliff at around 16.5K RPS ... it's reachable as a peak but not sustainable.

I'd like someone with better knowledge of the system than mine to take a look at the last few runs and swap theories about why that cliff happens there, when I've been able to sustain slightly lower rates.

I hate to bury this in the 'bad news on Friday evening' timeslot, but that's the way it has worked out.

I'll bring this up first thing in the Monday 10:30 meeting, and request help from others in the group.

Thanks,
--KT.
Depends on: 1093204
Day 5: 2014-11-03  Status: GREEN

After Benson replaced the one ELB with a set of three behind DNS round-robin, I was able to get a sustained 15-20K RPS total out of 30 c3.large wasps.  I'm going to run a couple of bake-in tests overnight, but the object lesson here seems to be that 12K req/sec or so per ELB is a safe maximum.
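
For anyone curious what "a set of three behind DNS round-robin" can look like, here is a rough Route 53 sketch with equal-weight CNAME records (zone ID, record name, and ELB hostnames are placeholders; Benson's actual change may have been made differently):

    # Sketch only: publish three ELBs under one name with equal weights.
    import boto3

    route53 = boto3.client("route53")
    elbs = [
        "tiles-elb-1.us-east-1.elb.amazonaws.com",
        "tiles-elb-2.us-east-1.elb.amazonaws.com",
        "tiles-elb-3.us-east-1.elb.amazonaws.com",
    ]

    changes = [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "tiles.stage.example.com.",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": "elb-%d" % i,  # required when one name has several records
            "Weight": 1,                    # equal weights approximate round-robin
            "ResourceRecords": [{"Value": elb}],
        },
    } for i, elb in enumerate(elbs)]

    route53.change_resource_record_sets(
        HostedZoneId="Z3EXAMPLE",           # placeholder hosted zone
        ChangeBatch={"Changes": changes},
    )

With three records answering for one name, each ELB sees roughly a third of the traffic, which lines up with the ~12K req/sec-per-ELB ceiling observed above.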

If the bake-in tests succeed, I'll close out bug 1093204 and we'll proceed to the next hurdle.
Product: Mozilla Services → Content Services
The work for the initial deployment is complete; raising the load ceiling will be covered in bug 1093204.
I'm going to remove the dependency, however -- the 16.5K RPS ceiling was adequate for the initial deployment.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
No longer depends on: 1093204
Resolution: --- → FIXED
Marking this as VERIFIED to get it off my 'todo' query.
Status: RESOLVED → VERIFIED