Closed Bug 1674107 Opened 4 years ago Closed 3 years ago

[test] load test Eliot

Categories

(Eliot :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

We should create and run a load test on Eliot.

The load should be based on existing Tecken symbolication load.

We should run the load test on a single node to see whether it dies (and how), what its performance characteristics are under load, and what kind of load it can handle.

We should run the load test on the cluster to see whether the cluster scales up and down in burst-of-requests conditions. Bursts of requests seem to be common for symbolication usage in Tecken.

Eliot is on stage now, so I'm going to grab this and work out a load test suite.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

We've got a bunch of things in the systemtest and we can probably use molotov to just wire that up into a load test.

https://molotov.readthedocs.io/en/stable/
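
For reference, a minimal molotov scenario for this could look something like the sketch below. This is hypothetical, not the actual systemtest code; the endpoint and the payload file are placeholders.

    # loadtest.py -- hypothetical molotov sketch, not the real systemtest code
    import json

    from molotov import scenario

    # Canned symbolication payload; in a real test this would come from
    # captured Tecken symbolication requests. The filename is a placeholder.
    with open("sample_payload.json") as fp:
        PAYLOAD = json.load(fp)

    @scenario(weight=100)
    async def test_symbolicate(session):
        # session is the aiohttp ClientSession that molotov passes in
        async with session.post(
            "https://symbolication.services.mozilla.com/symbolicate/v5",
            json=PAYLOAD,
        ) as resp:
            assert resp.status == 200

Run with something like "molotov -w 10 -d 60 loadtest.py" (check the molotov docs for the exact flags).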

I didn't get what I wanted out of molotov, so I switched to locust. I also moved all the load test stuff to:

https://github.com/mozilla-services/tecken-loadtests/

Here are some comparison load tests using locust with 50 users (ramped up at 1/s) for 1 minute:

locust -f locust-eliot/testfile.py --host "${HOST}" --users 50 --run-time "1m" --print-stats --headless
Name                       # reqs  # fails   50%   66%   75%   80%    90%    95%    98%    99%  99.9%  99.99%   100%
Tecken prod (cold cache)      251        0  2200  7200  8700  9500  15000  18000  21000  26000  29000  29000  29000
Tecken prod (hot cache)       875        0   580   940  1300  1500   4800   9300  15000  18000  24000  24000  24000
Eliot prod (cold cache)      2654        0   170   210   250   290   2000   4100   5400   6500   8400   9200   9200
Eliot prod (hot cache)       5775        0   180   240   290   320    430    540    770   2700   3500   4800   4800

(Percentile columns are response times in milliseconds.)
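
For context, the testfile is roughly of this shape (a sketch, not the actual tecken-loadtests code; the /symbolicate/v5 endpoint and the sample-file handling are assumptions here):

    # locust-eliot/testfile.py -- rough sketch, not the actual tecken-loadtests file
    import json
    import random

    from locust import HttpUser, task

    # Captured stack samples; each one is a complete symbolication request
    # body. The filename is a placeholder.
    with open("stack-samples.json") as fp:
        SAMPLES = json.load(fp)

    class EliotUser(HttpUser):
        @task
        def symbolicate(self):
            # Pick a random sample for each request; the follow-up notes
            # below suggest switching this to sequential iteration.
            payload = random.choice(SAMPLES)
            self.client.post("/symbolicate/v5", json=payload)

The --host flag on the command line supplies the base URL, so the path here is relative.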

"Tecken" here refers to Mozilla Symbols Server at https://symbols.mozilla.org/ which has a downloads API and an uploads API.

"Eliot" here refers to the new Mozilla Symbolication Server at https://symbolication.services.mozilla.com/ which only has a symbolication API and isn't in use by anyone, yet.

I've got 1,035 stack samples I'm using, so there are a fair number of repeat requests. The request results aren't cached, so a repeat request is just like another request, but more likely to use modules that are in the cache. Since the cache for nodes kept growing and didn't hit the max cache size forcing evictions, it's likely that most/all of the modules were in cache in the "hot cache" tests.

Stack samples are derived from stacks for recent Firefox nightly crash reports. There are two builds per day in the nightly channel, so I claim there's more variety here with regard to modules used than if we were looking at the release channel. I also claim that using sequential crash reports from the Firefox nightly channel is "representative of real life usage". At a minimum, it's the order in which Socorro processed incoming crash reports.

I also did a 10 minute test with 60 users:

locust -f locust-eliot/testfile.py --host "${HOST}" --users 60 --run-time "10m" --print-stats --headless
Name                       # reqs  # fails   50%   66%   75%   80%    90%    95%    98%    99%  99.9%  99.99%   100%
Eliot prod (hot cache)      58861        0   250   280   310   340    430    530    660    750   5300  39000  39000

From the Grafana dashboard, I observed a few things for the Eliot tests:

  • no errors: all the requests completed successfully
  • the disk cache grew, but never hit its maximum size
  • most of the symbolication requests pulled sym files from cache
  • there were no sym parse errors
  • downloading sym files predominantly takes under a second
  • parsing sym files takes under 4 seconds, with the majority under 1 second

A couple of things I thought were interesting here:

  1. Eliot with a hot cache is finishing 90% of requests in under a second. This is encouraging. Large bursts of symbolication requests from symbolicating crash pings and other batch jobs like that should do better with Eliot than they currently do with Tecken.
  2. Parsing sym files takes the bulk of the time. I forget whether I'm keeping track of how long it takes to parse a symcache file. I think this is the step that's worth spending some time on to reduce overall symbolication time. Maybe it's worth having an in-memory LRU cache for symcache files? (A rough sketch of what that could look like follows this list.)
  3. I don't know whether I'm keeping track of how much time is spent on work other than downloading and parsing sym files. Maybe it's worth looking at reordering symbol lookups so we do all the lookups against a specific module at the same time?
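
On the in-memory LRU idea in item 2, here's a rough sketch of the shape that could take. The key, the size limit, and the names are all made up; the real thing would key off whatever identifies a symcache and the limit would need tuning against node memory.

    import collections

    MAX_ENTRIES = 100  # made-up limit; would need tuning against node memory

    class SymCacheLRU:
        """Hypothetical in-memory LRU for parsed symcache data,
        keyed by (debug_filename, debug_id)."""

        def __init__(self, max_entries=MAX_ENTRIES):
            self.max_entries = max_entries
            self._cache = collections.OrderedDict()

        def get(self, key):
            if key not in self._cache:
                return None
            # Mark as recently used so it survives eviction longer
            self._cache.move_to_end(key)
            return self._cache[key]

        def set(self, key, symcache):
            self._cache[key] = symcache
            self._cache.move_to_end(key)
            if len(self._cache) > self.max_entries:
                # Drop the least recently used entry
                self._cache.popitem(last=False)

The win would be skipping the parse step entirely for modules that show up in many back-to-back requests, which matches the burst-of-requests pattern above.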

If we were to do the load test again:

  1. We should have 100,000 samples and go through them sequentially rather than picking a random sample for every scenario. The "cold cache" test should run against a newly deployed system; the symcache cache is on disk, so new nodes start with an empty cache.
  2. We should prove the claim that using the nightly channel results in more variety of modules used. Regardless, we should measure the variety of modules used across the samples, the number of modules in a sample, and the number of frames in the stack of a sample. These numbers affect what a "request" looks like. Maybe we build sample stacks such that they always use 2 modules and 10 frames? (See the measurement sketch after this list.)
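
For the measurement in item 2, something like this would produce those numbers. It assumes the samples are stored as symbolicate v5-style request bodies (a "jobs" list with "memoryMap" and "stacks"), which may not match the actual sample files; the path is a placeholder.

    import json
    from collections import Counter

    with open("stack-samples.json") as fp:
        samples = json.load(fp)

    module_counts = Counter()
    modules_per_sample = []
    frames_per_stack = []

    for sample in samples:
        for job in sample.get("jobs", [sample]):
            # memoryMap is a list of [debug_filename, debug_id] pairs
            modules = {tuple(entry) for entry in job["memoryMap"]}
            module_counts.update(modules)
            modules_per_sample.append(len(modules))
            for stack in job["stacks"]:
                frames_per_stack.append(len(stack))

    def summarize(values):
        values = sorted(values)
        return values[0], values[len(values) // 2], values[-1]

    print("distinct modules across all samples:", len(module_counts))
    print("modules per sample (min/median/max):", summarize(modules_per_sample))
    print("frames per stack (min/median/max):", summarize(frames_per_stack))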

https://earthangel-b40313e5.influxcloud.net/d/a9-7FT0Zk/tecken-app-metrics?orgId=1&var-env=prod&from=1633956250671&to=1633957762361

I think that captures the things future me wants to see. Marking this as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Moving to Eliot product.

Component: Symbolication → General
Product: Tecken → Eliot