[test] load test Eliot
Categories: Eliot :: General, task, P2
Tracking: Not tracked
People: Reporter: willkg; Assigned: willkg
Attachments: 4 files
We should create and run a load test on Eliot.
The load should be based on existing Tecken symbolication load.
We should run the load test on a single node to see whether it dies and how, what its performance characteristics are under load, and what kind of load it can handle.
We should run the load test on the cluster to see whether the cluster scales up and down in burst-of-requests conditions. That seems to be common for symbolication usage in Tecken.
Comment 1•4 years ago
Eliot is on stage now, so I'm going to grab this and work out a load test suite.
Comment 2•4 years ago
We've got a bunch of requests covered in the systemtest, and we can probably use molotov to wire those up into a load test.
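As a rough illustration of the idea (not the actual molotov setup), driving the existing systemtest requests concurrently can be sketched with the stdlib; `symbolicate` here is a hypothetical stand-in for a systemtest request, not a real client:

```python
# Sketch: drive a pool of workers against a request function the way a
# load test would. A real run would POST to the service via molotov or
# locust instead of calling a local stand-in.
from concurrent.futures import ThreadPoolExecutor
import time

def symbolicate(payload):
    # Hypothetical stand-in for an HTTP POST to the symbolication endpoint.
    time.sleep(0.001)  # simulate a bit of network latency
    return {"status": 200, "payload": payload}

def run_load(payloads, workers=10):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(symbolicate, payloads))
    elapsed = time.monotonic() - start
    failures = sum(1 for r in results if r["status"] != 200)
    return {"requests": len(results), "failures": failures, "elapsed": elapsed}

stats = run_load([{"stack": i} for i in range(100)], workers=10)
print(stats["requests"], stats["failures"])
```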
Comment 3•3 years ago
Comment 4•3 years ago
Comment 5•3 years ago
Comment 6•3 years ago
I didn't get what I wanted out of molotov, so I switched to locust. I also moved all the load test stuff to:
https://github.com/mozilla-services/tecken-loadtests/
Here are some comparison load tests using locust with 50 users (ramped up at 1/s) for 1 minute; the percentile columns are response times in milliseconds:
locust -f locust-eliot/testfile.py --host "${HOST}" --users 50 --run-time "1m" --print-stats --headless
Name | # reqs | # fails | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Tecken prod (cold cache) | 251 | 0 | 2200 | 7200 | 8700 | 9500 | 15000 | 18000 | 21000 | 26000 | 29000 | 29000 | 29000 |
Tecken prod (hot cache) | 875 | 0 | 580 | 940 | 1300 | 1500 | 4800 | 9300 | 15000 | 18000 | 24000 | 24000 | 24000 |
Eliot prod (cold cache) | 2654 | 0 | 170 | 210 | 250 | 290 | 2000 | 4100 | 5400 | 6500 | 8400 | 9200 | 9200 |
Eliot prod (hot cache) | 5775 | 0 | 180 | 240 | 290 | 320 | 430 | 540 | 770 | 2700 | 3500 | 4800 | 4800 |
"Tecken" here refers to Mozilla Symbols Server at https://symbols.mozilla.org/ which has a downloads API and an uploads API.
"Eliot" here refers to the new Mozilla Symbolication Server at https://symbolication.services.mozilla.com/ which only has a symbolication API and isn't in use by anyone yet.
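For context, a symbolication request in the v5 format looks roughly like this. The module name and debug ID below are made-up values, and the field layout is my understanding of the v5 API rather than a spec quote:

```python
import json

# A hypothetical minimal v5 symbolication request: one job with one
# module and one stack frame. memoryMap pairs a debug filename with a
# debug ID; each frame is [module_index, module_offset].
request = {
    "jobs": [
        {
            "memoryMap": [["xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2"]],
            "stacks": [[[0, 0x1010]]],
        }
    ]
}

# This is the JSON body that would be POSTed to the symbolication API.
body = json.dumps(request)
print(len(body) > 0)
```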
I've got 1,035 stack samples I'm using, so there are a fair number of repeat requests. Request results aren't cached, so a repeat request is just like any other request, but it's more likely to use modules that are already in the cache. Since the per-node cache kept growing and never hit the maximum cache size (which would force evictions), it's likely that most or all of the modules were in cache in the "hot cache" tests.
Stack samples are derived from stacks for recent Firefox nightly crash reports. There are two builds per day in the nightly channel, so I claim there's more variety here with regard to modules used than if we were looking at the release channel. I also claim that using sequential crash reports from the Firefox nightly channel is "representative of real-life usage". At a minimum, it's the order in which Socorro processed incoming crash reports.
I also did a 10-minute test with 60 users; percentile columns are response times in milliseconds:
locust -f locust-eliot/testfile.py --host "${HOST}" --users 60 --run-time "10m" --print-stats --headless
Name | # reqs | # fails | 50% | 66% | 75% | 80% | 90% | 95% | 98% | 99% | 99.9% | 99.99% | 100% |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Eliot prod (hot cache) | 58861 | 0 | 250 | 280 | 310 | 340 | 430 | 530 | 660 | 750 | 5300 | 39000 | 39000 |
From the Grafana dashboard, I observed a few things for the Eliot tests:
- no errors--all the requests completed successfully
- the disk cache grew, but didn't hit its maximum size
- most of the symbolication requests pulled sym files from the cache
- there were no sym parse errors
- downloading sym files predominantly took under a second
- parsing sym files took under 4 seconds, with the majority under 1 second
A couple of things I thought were interesting here:
- Eliot with a hot cache is finishing 90% of requests in under a second. This is encouraging. Large bursts of symbolication requests from crash ping symbolication and other batch jobs like that should do better with Eliot than they currently do with Tecken.
- Parsing sym files takes the bulk of the time. I forget whether I'm keeping track of how long it takes to parse a symcache file. I think this is the step worth spending some time on to reduce overall symbolication time. Maybe it's worth having an in-memory LRU cache for symcache files?
- I don't know whether I'm keeping track of how much time is spent outside of downloading and parsing sym files. Maybe it's worth reordering lookups so we do all the lookups against a specific module at the same time?
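The in-memory LRU idea can be sketched in a few lines; `parse_symcache_from_disk` is a hypothetical stand-in for the real parsing step, not Eliot's actual code:

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive symcache parsing step; the
# real version would read and parse the on-disk symcache file.
def parse_symcache_from_disk(module_key):
    return {"module": module_key, "symbols": "..."}

# Keep the N most recently used parsed symcaches in memory so repeat
# lookups against the same module skip the parse entirely.
@lru_cache(maxsize=128)
def parse_symcache(module_key):
    return parse_symcache_from_disk(module_key)

parse_symcache("xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2")
parse_symcache("xul.pdb/44E4EC8C2F41492B9369D6B9A059577C2")  # LRU hit
print(parse_symcache.cache_info().hits)
```

The `maxsize` would need tuning against node memory, since parsed symcaches for large modules can be big.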
If we were to do the load test again:
- We should have 100,000 samples and go through them sequentially rather than picking a random sample for every scenario. The "cold cache" test should run against a newly deployed system; the symcache cache is on disk, so new nodes start with an empty cache.
- We should prove the claim that using the nightly channel results in more variety of modules used. Regardless, we should measure the variety of modules used across the samples, the number of modules in a sample, and the number of frames in a sample's stack. These numbers affect what a "request" looks like. Maybe we build sample stacks so they always use 2 modules and 10 frames?
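Going through the samples sequentially rather than randomly can be done with a shared iterator across simulated users; a minimal sketch (the sample payloads here are made up):

```python
import itertools
import threading

# Shared, thread-safe sequential iterator over the sample payloads.
# Each simulated user calls next_sample() instead of random.choice(),
# so samples are exercised in order across all users and wrap around.
samples = [{"stack_id": i} for i in range(5)]
_iterator = itertools.cycle(samples)
_lock = threading.Lock()

def next_sample():
    with _lock:
        return next(_iterator)

picked = [next_sample() for _ in range(7)]
print([s["stack_id"] for s in picked])
```

With 100,000 samples and a bounded test duration, `itertools.cycle` could be swapped for a plain `iter()` so each sample is used at most once.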
Comment 7•3 years ago
Comment 8•3 years ago
Comment 9•3 years ago
Comment 10•3 years ago
I think that captures the things future me wants to see. Marking this as FIXED.
Comment 11•1 year ago
Moving to Eliot product.