Closed Bug 1365677 Opened 7 years ago Closed 7 years ago

Start load testing Download and Symbolication with QA

Categories

(Socorro :: Symbols, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: peterbe, Assigned: grumpy)

I'm not entirely convinced that this should block https://bugzilla.mozilla.org/show_bug.cgi?id=1365672 since we're not expecting any high load on the service any time soon. 

We *might* get some later this year, and if we do, it will be extremely high load.
It would be good to get the ball rolling on this as soon as possible though, at least to find some low-hanging fruit in terms of figuring out if things break under load.

Our goal is to test against an environment other than Dev.
Miles, 
Two questions for you to decide:

1. Should we use prod or stage to do the load testing? If they're identical in terms of machine resources, it would be nice to use stage since it'll be around and ready to try-to-break once we start using/depending on prod. 

2. What is the URL (domain) we point our load testing to?
Flags: needinfo?(miles)
Summary: Start loading testing Download and Symbolication with QA → Start load testing Download and Symbolication with QA
:grumpy,

I have some tooling [0] I've been using to bombard the service locally. I've been using this primarily to do optimization of the service because it's easier to see what to make fast when it's being hit. 

The two services we want to load test are:

1. Symbolication (sending in a JSON blob with hex addresses and expecting them to be replaced with C++ signatures from symbol files stored in S3). This is what symbolication.py does in tecken-loader.

2. Download (doing a GET on a symbol file will redirect to the symbol's canonical public URL in S3). This is what download.py does in tecken-loader.
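
To make the two request types concrete, here is a rough sketch of how a load test might build them. It is not taken from tecken-loader. The payload field names (`memoryMap`, `stacks`, `version`), the integer offsets, the `/<debug file>/<debug id>/<symbol file>` path layout, the base URL and the sample module are all assumptions for illustration.

```python
import json
from urllib.parse import quote


def build_symbolication_payload(modules, frames):
    """Build a JSON-serializable symbolication request (assumed shape).

    modules: list of (debug_filename, debug_id) tuples.
    frames: list of (module_index, offset) tuples forming one stack.
    """
    return {
        "memoryMap": [list(module) for module in modules],
        "stacks": [[list(frame) for frame in frames]],
        "version": 4,  # assumed API version
    }


def build_download_url(base_url, debug_filename, debug_id, symbol_filename):
    """Build the URL a Download test would GET (assumed path layout)."""
    path = "/".join(
        quote(part) for part in (debug_filename, debug_id, symbol_filename)
    )
    return f"{base_url.rstrip('/')}/{path}"


# Hypothetical sample data, not real load test input.
payload = build_symbolication_payload(
    [("xul.pdb", "44E4EC8C2F41492B9369D6B9A059577C2")],
    [(0, 65802), (0, 11723767)],
)
body = json.dumps(payload)
url = build_download_url(
    "http://localhost:8000",
    "xul.pdb",
    "44E4EC8C2F41492B9369D6B9A059577C2",
    "xul.sym",
)
```

The `body` string would be POSTed to the symbolication endpoint and `url` would be fetched with a plain GET. Keeping the builders free of network calls means the same functions can be reused from whichever load testing tool we settle on.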

I've never used molotov before but if you think that's good stuff perhaps we can join forces and write some scripts together and you, grumpy, can be responsible for running them. Where/How do we start? 


[0] https://github.com/peterbe/tecken-loader
Flags: needinfo?(chartjes)
Just to clarify, the purpose of a load test is not solely to find the point at which your app falls over under contrived, unrealistic load. The purpose of a load test is multi-fold:

1. it helps establish what infrastructure is needed to run the app

2. it helps establish indicators for how the app should scale

3. it gives us a baseline for how the app behaves under increasing load so that later on down the line we have a good feel for how the app will behave as the requirements and purpose change

4. assuming we run load tests in such a way that they're repeatable, we then have what we need to test architecture changes and anything else that could heavily affect the performance of the app

We definitely want some kind of load test before going to prod because otherwise we just have no idea how to answer questions related to the above things.
PS. (attention :grumpy) We *don't* need to wait (to start on this bug) for a stage and/or production environment. We can write the tests now and use Dev or local docker laptop environments.
:peterbe,

I definitely think we can take some of the stuff that you wrote in download.py and symbolication.py and make them work with molotov.

The quick-start docs are pretty good

https://molotov.readthedocs.io/en/latest/tutorial/
Flags: needinfo?(chartjes)
(In reply to Chris Hartjes [:grumpy][:chartjes] from comment #5)
> :peterbe,
> 
> I definitely think we can take some of the stuff that you wrote in
> download.py and symbolication.py and make them work with molotov.
> 
> The quick-start docs are pretty good
> 
> https://molotov.readthedocs.io/en/latest/tutorial/

Do you want to take a stab at it or should I?
(In reply to Peter Bengtsson [:peterbe] from comment #6)
> (In reply to Chris Hartjes [:grumpy][:chartjes] from comment #5)
> > :peterbe,
> > 
> > I definitely think we can take some of the stuff that you wrote in
> > download.py and symbolication.py and make them work with molotov.
> > 
> > The quick-start docs are pretty good
> > 
> > https://molotov.readthedocs.io/en/latest/tutorial/
> 
> Do you want to take a stab at it or should I?

Given that I don't have a running version of Tecken on my laptop, it's probably better if you give it a try.
(In reply to Peter Bengtsson [:peterbe] from comment #1)
> 1. Should we use prod or stage to do the load testing? If they're identical
> in terms of machine resources, it would be nice to use stage since it'll be
> around and ready to try-to-break once we start using/depending on prod. 
Stage and prod will be identical - same AMIs (so same code). They will be in different regions (stage is us-east-1, prod is us-west-2). Stage is the standard environment to use for this sort of thing.

> 2. What is the URL (domain) we point our load testing to?
Though not available yet, the domain will be symbols.stage.mozaws.net. I'll check back in when we have a functional stage environment for symbols/tecken.
Flags: needinfo?(miles)
This is now in. https://github.com/mozilla-services/tecken-loadtests/blob/master/loadtest.py

I'm not sure what to do next. 

Miles, are you ready to set up a Stage instance so that :grumpy can start bombarding?

:grumpy, will you take ownership of this bug now?


Note to self: I'm not entirely convinced the test is good. The business logic for whether a symbol download should be 404 or 200 depends on time, and I took a snapshot. We might have to remove the test [0] and just make sure it's EITHER 200 or 404 but nothing else.

[0]  https://github.com/mozilla-services/tecken-loadtests/blob/ceb7a0773e756a7f23f165bb77fcbbe515eec733/loadtest.py#L149-L152
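
A minimal sketch of that looser check, assuming the load test sees the final status code after any redirect has been followed. The function and constant names are made up for illustration and are not from loadtest.py.

```python
# Statuses a symbol download may legitimately end with, per the note above.
ACCEPTABLE_DOWNLOAD_STATUSES = frozenset({200, 404})


def check_download_status(status):
    """Return the status if it is 200 or 404, otherwise raise AssertionError."""
    if status not in ACCEPTABLE_DOWNLOAD_STATUSES:
        raise AssertionError(f"unexpected download status: {status}")
    return status
```

This drops the time-dependent expectation about which files exist, while a 500 or a timeout-style error still fails the run.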
Flags: needinfo?(miles)
Assignee: nobody → chartjes
:peterbe I'm happy to take ownership. I noticed that there are some features in the latest release of molotov that can help, so I will refactor the load test code to use them.
Symbols is now ready for load testing in stage. Here is some relevant info:

APM <= single node running new relic
app <= autoscaled nodes not running new relic

symbols.stage.mozaws.net <= main endpoint, hits both APM and app instances
symbols-loadtest-apm.stage.mozaws.net <= specifically for load testing, hits only APM instance
symbols-loadtest-as.stage.mozaws.net <= specifically for load testing, hits both APM and app instances

New Relic and Datadog are configured for Symbols.

https://rpm.newrelic.com/accounts/1402187/applications/52227224 <= New Relic
https://app.datadoghq.com/dash/286319/tecken <= Datadog

Logging isn't quite working yet, coming soon.

Other than that, you're ready to go!
Flags: needinfo?(miles)
Any news on this? It would be nice to know when it's going to happen so I can be on standby with the graphs and stuff.
Flags: needinfo?(chartjes)
I just need a node on the same network as the symbol server staging instance and I will be ready to do it.
Flags: needinfo?(chartjes) → needinfo?(miles)
The node is up and afaik we are good to go on load testing.
Flags: needinfo?(miles)
What are the load numbers for the current symbols server? We can probably use that as a 1x target number.

Also, it's worth finding the load numbers for when Durst was hitting the symbols server with his python script. That's probably also a good 1x target number.

For Antenna, we put the 1x, 3x, and 10x numbers into some of the graphs so that we knew what our goals were for load testing and health.
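
As a sketch of the arithmetic, the targets are just multiples of whatever baseline we measure. The baseline value below is a placeholder, not a real measurement, and the function name is hypothetical.

```python
def load_targets(baseline_rps, multipliers=(1, 3, 10)):
    """Map labels like '3x' to target requests per second."""
    return {f"{m}x": baseline_rps * m for m in multipliers}


# Placeholder baseline; substitute the measured number for the current server.
targets = load_targets(50)
```

With a baseline of 50 requests per second, `targets` holds 50, 150 and 500 for the 1x, 3x and 10x lines on the graphs.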
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #16)
> What are the load numbers for the current symbols server? We can probably use
> that as a 1x target number.
> 
> Also, it's worth finding the load numbers for when Durst was hitting the
> symbols server with his python script. That's probably also a good 1x target
> number.
> 
> For Antenna, we put the 1x, 3x, and 10x numbers into some of the graphs so
> that we knew what our goals were for load testing and health.

https://docs.google.com/document/d/1UGz4sY-WESTr_x0_6j9duKhfJHSPsRrKnhCY1PyskLQ
We have to add symbols-loadtest-apm.stage.mozaws.net and symbols-loadtest-as.stage.mozaws.net to ALLOWED_HOSTS. Right now they're returning 400 Bad Request.
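
For reference, the change amounts to something like the following in the Django settings. This is only a sketch: how the stage deployment actually populates ALLOWED_HOSTS (for example from an environment variable) is an assumption I have not checked.

```python
# Hosts Django will accept in the Host header. Requests for any other
# host are rejected with 400 Bad Request.
ALLOWED_HOSTS = [
    "symbols.stage.mozaws.net",
    "symbols-loadtest-apm.stage.mozaws.net",
    "symbols-loadtest-as.stage.mozaws.net",
]
```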
Flags: needinfo?(miles)
Yikes. That's my bad. Making the changes and pushing now.
Flags: needinfo?(miles)
To update, those hosts are now allowed properly.
This is no longer actionable. We have results (not great, but that's another story) and we have a framework for load testing. The Google doc that covers needs, targets and baselines is done and still useful.

After this we'll work on new optimizations (infra and code) and start new load testing. 

Also, one technical detail we learned from the results is that Tecken is not yet ready to handle the load from Socorro's processors. We'll deal with that after we go to prod.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED