Closed Bug 1674107 Opened 4 years ago Closed 3 years ago

[test] load test Eliot

Categories

(Eliot :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(4 files)

We should create and run a load test on Eliot.

The load should be based on existing Tecken symbolication load.

We should run the load test on a single node to see whether it dies (and how), what its performance characteristics are under load, and what kind of load it can handle.

We should run the load test on the cluster to see whether the cluster scales up and down in burst-of-requests conditions. Bursts of requests seem to be common for symbolication usage in Tecken.

Eliot is on stage now, so I'm going to grab this and work out a load test suite.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

We've got a bunch of things in the systemtest and we can probably use molotov to just wire that up into a load test.

https://molotov.readthedocs.io/en/stable/
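
For reference, a minimal molotov scenario for this could look something like the sketch below. This is hypothetical, not the actual systemtest code; the endpoint and the payload file are placeholders.

    # loadtest.py -- hypothetical molotov sketch, not the real systemtest code
    import json

    from molotov import scenario

    # Canned symbolication payload; in a real test this would come from
    # captured Tecken symbolication requests. The filename is a placeholder.
    with open("sample_payload.json") as fp:
        PAYLOAD = json.load(fp)

    @scenario(weight=100)
    async def test_symbolicate(session):
        # session is the aiohttp ClientSession that molotov passes in
        async with session.post(
            "https://symbolication.services.mozilla.com/symbolicate/v5",
            json=PAYLOAD,
        ) as resp:
            assert resp.status == 200

Run with something like "molotov -w 10 -d 60 loadtest.py" (check the molotov docs for the exact flags).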

I didn't get what I wanted out of molotov, so I switched to locust. I also moved all the load test stuff to:

https://github.com/mozilla-services/tecken-loadtests/

Here are some comparison load tests using locust with 50 users (ramped up at 1/s) for 1 minute:

locust -f locust-eliot/testfile.py --host "${HOST}" --users 50 --run-time "1m" --print-stats --headless
Name                       # reqs  # fails   50%   66%   75%   80%    90%    95%    98%    99%  99.9%  99.99%   100%
Tecken prod (cold cache)      251        0  2200  7200  8700  9500  15000  18000  21000  26000  29000  29000  29000
Tecken prod (hot cache)       875        0   580   940  1300  1500   4800   9300  15000  18000  24000  24000  24000
Eliot prod (cold cache)      2654        0   170   210   250   290   2000   4100   5400   6500   8400   9200   9200
Eliot prod (hot cache)       5775        0   180   240   290   320    430    540    770   2700   3500   4800   4800

(Percentile columns are response times in milliseconds.)
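
For context, the testfile is roughly of this shape (a sketch, not the actual tecken-loadtests code; the /symbolicate/v5 endpoint and the sample-file handling are assumptions here):

    # locust-eliot/testfile.py -- rough sketch, not the actual tecken-loadtests file
    import json
    import random

    from locust import HttpUser, task

    # Captured stack samples; each one is a complete symbolication request
    # body. The filename is a placeholder.
    with open("stack-samples.json") as fp:
        SAMPLES = json.load(fp)

    class EliotUser(HttpUser):
        @task
        def symbolicate(self):
            # Pick a random sample for each request; the follow-up notes
            # below suggest switching this to sequential iteration.
            payload = random.choice(SAMPLES)
            self.client.post("/symbolicate/v5", json=payload)

The --host flag on the command line supplies the base URL, so the path here is relative.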

"Tecken" here refers to Mozilla Symbols Server at https://symbols.mozilla.org/ which has a downloads API and an uploads API.

"Eliot" here refers to the new Mozilla Symbolication Server at https://symbolication.services.mozilla.com/ which only has a symbolication API and isn't in use by anyone, yet.

I've got 1,035 stack samples I'm using, so there are a fair number of repeat requests. The request results aren't cached, so a repeat request is just like another request, but more likely to use modules that are in the cache. Since the cache for nodes kept growing and didn't hit the max cache size forcing evictions, it's likely that most/all of the modules were in cache in the "hot cache" tests.

Stack samples are derived from stacks for recent Firefox nightly crash reports. There are two builds per day in the nightly channel, so I claim there's more variety here with regard to modules used than if we were looking at the release channel. I also claim that using sequential crash reports from the Firefox nightly channel is "representative of real life usage". At a minimum, it's the order in which Socorro processed incoming crash reports.

I also did a 10 minute test with 60 users:

locust -f locust-eliot/testfile.py --host "${HOST}" --users 60 --run-time "10m" --print-stats --headless
Name                       # reqs  # fails   50%   66%   75%   80%    90%    95%    98%    99%  99.9%  99.99%   100%
Eliot prod (hot cache)      58861        0   250   280   310   340    430    530    660    750   5300  39000  39000

From the Grafana dashboard, I observed a few things for the Eliot tests:

  • no errors: all the requests completed successfully
  • the disk cache grew, but never hit its maximum size
  • most of the symbolication requests pulled sym files from cache
  • there were no sym parse errors
  • downloading sym files predominantly takes under a second
  • parsing sym files takes under 4 seconds, with the majority under 1 second

A couple of things I thought were interesting here:

  1. Eliot with a hot cache is finishing 90% of requests in under a second. This is encouraging. Large bursts of symbolication requests from symbolicating crash pings and other batch jobs like that should do better with Eliot than they currently do with Tecken.
  2. Parsing sym files takes the bulk of the time. I forget whether I'm keeping track of how long it takes to parse a symcache file. I think this is the step that's worth spending some time on to reduce overall symbolication time. Maybe it's worth having an in-memory LRU cache for symcache files? (A rough sketch of what that could look like follows this list.)
  3. I don't know whether I'm keeping track of how much time is spent on work other than downloading and parsing sym files. Maybe it's worth looking at reordering symbol lookups so we do all the lookups against a specific module at the same time?
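
On the in-memory LRU idea in item 2, here's a rough sketch of the shape that could take. The key, the size limit, and the names are all made up; the real thing would key off whatever identifies a symcache and the limit would need tuning against node memory.

    import collections

    MAX_ENTRIES = 100  # made-up limit; would need tuning against node memory

    class SymCacheLRU:
        """Hypothetical in-memory LRU for parsed symcache data,
        keyed by (debug_filename, debug_id)."""

        def __init__(self, max_entries=MAX_ENTRIES):
            self.max_entries = max_entries
            self._cache = collections.OrderedDict()

        def get(self, key):
            if key not in self._cache:
                return None
            # Mark as recently used so it survives eviction longer
            self._cache.move_to_end(key)
            return self._cache[key]

        def set(self, key, symcache):
            self._cache[key] = symcache
            self._cache.move_to_end(key)
            if len(self._cache) > self.max_entries:
                # Drop the least recently used entry
                self._cache.popitem(last=False)

The win would be skipping the parse step entirely for modules that show up in many back-to-back requests, which matches the burst-of-requests pattern above.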

If we were to do the load test again:

  1. We should have 100,000 samples and go through them sequentially rather than picking a random sample for every scenario. The "cold cache" test should run against a newly deployed system; the symcache cache is on disk, so new nodes start with an empty cache.
  2. We should prove the claim that using the nightly channel results in more variety of modules used. Regardless, we should measure the variety of modules used across the samples, the number of modules in a sample, and the number of frames in the stack of a sample. These numbers affect what a "request" looks like. Maybe we build sample stacks such that they always use 2 modules and 10 frames? (See the measurement sketch after this list.)
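
For the measurement in item 2, something like this would produce those numbers. It assumes the samples are stored as symbolicate v5-style request bodies (a "jobs" list with "memoryMap" and "stacks"), which may not match the actual sample files; the path is a placeholder.

    import json
    from collections import Counter

    with open("stack-samples.json") as fp:
        samples = json.load(fp)

    module_counts = Counter()
    modules_per_sample = []
    frames_per_stack = []

    for sample in samples:
        for job in sample.get("jobs", [sample]):
            # memoryMap is a list of [debug_filename, debug_id] pairs
            modules = {tuple(entry) for entry in job["memoryMap"]}
            module_counts.update(modules)
            modules_per_sample.append(len(modules))
            for stack in job["stacks"]:
                frames_per_stack.append(len(stack))

    def summarize(values):
        values = sorted(values)
        return values[0], values[len(values) // 2], values[-1]

    print("distinct modules across all samples:", len(module_counts))
    print("modules per sample (min/median/max):", summarize(modules_per_sample))
    print("frames per stack (min/median/max):", summarize(frames_per_stack))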

https://earthangel-b40313e5.influxcloud.net/d/a9-7FT0Zk/tecken-app-metrics?orgId=1&var-env=prod&from=1633956250671&to=1633957762361

I think that captures the things future me wants to see. Marking this as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

Moving to Eliot product.

Component: Symbolication → General
Product: Tecken → Eliot