Open Bug 1311510 Opened 8 years ago Updated 2 years ago

chrome.storage.sync: performance test of production stack for chrome.storage.sync

Categories

(WebExtensions :: Storage, defect, P3)

Tracking

(Not tracked)

People

(Reporter: glasserc, Unassigned)

References

Details

(Whiteboard: [storage], triaged)

We need to make sure that we have enough capacity in production for when this feature hits beta.
Blocks: 1311710
Component: WebExtensions: Untriaged → WebExtensions: General
Priority: -- → P3
Whiteboard: [storage]triaged
Blocks: 1220494
No longer blocks: 1311710
Would this be you Remy, or someone else?
Flags: needinfo?(rhubscher)
Hi Andy,

We have some load tests for Kinto ready here: https://github.com/mozilla-services/ailoads-kinto
QA usually runs them, but I don't know who does QA for the service side of the WebExtensions stack.

Stuart, do you know if we have a QA person who could run some load tests on the WebExtensions stack?
Flags: needinfo?(rhubscher) → needinfo?(sphilp)
Karl can take it, cc'ing him. Do we need this for a certain date?
Flags: needinfo?(sphilp)
QA Contact: kthiessen
(In reply to Stuart Philp :sphilp from comment #3)
> Karl can take it, cc'ing him. Do we need this for a certain date?

This would block the feature from landing in beta. So, sometime before that would be great. Release trains are at https://wiki.mozilla.org/RapidRelease/Calendar
I'll note that we're aiming for Firefox 53 here, which means the relevant merge date is currently 2017-03-06.

I can agree to that timeframe.
Who in Services Ops is going to be in charge of this production deployment?  Can we get them cc'ed on this bug, please, or get a pointer to another bug to use for communication with Ops?
Flags: needinfo?(eglassercamp)
More questions:

* Do we have defined desired capacities in terms of, for example, number of queries per second we want the service to stand up under?

* Do we need to co-ordinate with Ops to determine what the optimum size of the production cluster will be, or have they already made that decision?

* Who is our Ops contact for deployment verification?  Is there a stage instance for the AMO-specific cluster, or are we just using the existing https://webextensions-settings.stage.mozaws.net?

* My team are standing up the load testing apparatus today and tomorrow; we should have the first successful tests late this week or early next, and I'm hoping to have a go/no-go call by the end of next week.  Does that work with everyone's timetable?
(In reply to Karl Thiessen [:kthiessen] from comment #6)
> Who in Services Ops is going to be in charge of this production deployment? 
> Can we get them cc'ed on this bug, please, or get a pointer to another bug
> to use for communication with Ops?

I am the primary Ops on Kinto/Storage today, bobm is secondary.
(In reply to Karl Thiessen [:kthiessen] from comment #7)
> More questions:
> * Do we need to co-ordinate with Ops to determine what the optimum size of
> the production cluster will be, or have they already made that decision?
> 

As of right now production is up but with minimal resources, 3 web instances c4.large and RDS m4.large [1]. We can adjust as needed based on performance testing and how much traffic we expect to receive. Production endpoint is https://webextensions.settings.services.mozilla.com/v1/


> * Who is our Ops contact for deployment verification?  Is there a stage
> instance for the AMO-specific cluster, or are we just using the existing
> https://webextensions-settings.stage.mozaws.net?

I am the Ops contact, reach out to me with any questions. We should use https://webextensions-settings.stage.mozaws.net for testing.


[1] https://github.com/mozilla-services/cloudops-deployment/blob/master/projects/kintowe/ansible/envs/prod.yml#L15-L20
Brilliant!  Thanks, Jason.

The only outstanding question is:

* Do we have defined desired capacities in terms of, for example, number of queries per second we want the service to stand up under?

Ethan, have you got an answer for that, or can you point us in the direction of someone who does?
(In reply to Karl Thiessen [:kthiessen] from comment #10)
> Brilliant!  Thanks, Jason.
> 
> The only outstanding question is:
> 
> * Do we have defined desired capacities in terms of, for example, number of
> queries per second we want the service to stand up under?

AdBlock Plus uses this. If you take the number of AdBlock Plus users (~20 million) and multiply it by the number of times it queries in a day, you'll get the idea. But AdBlock Plus isn't moving over to this for a few releases.
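
As a rough sketch of that estimate: only the ~20 million user count comes from this comment. The per-user query count below is a hypothetical placeholder, not a measured value, and the helper name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(users: int, queries_per_user_per_day: float) -> float:
    """Average queries per second if the daily load were spread evenly."""
    return users * queries_per_user_per_day / SECONDS_PER_DAY


# ~20 million users is the figure quoted above.
# 4 queries per user per day is an assumed placeholder.
print(round(average_qps(20_000_000, 4)))
```

Real traffic is unlikely to be spread evenly, so peak load would probably sit above this average.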

Overall approx. 15% of all add-ons on the Chrome store use this API [1]. We currently have 89 add-ons using it [2].

We've explicitly stated that this API end point has no SLA around usage or performance, developers get what they get and they don't get upset.

I really don't want us to end up throwing too many resources at this, and would suggest we ramp up capacity as usage increases. I expect very little usage until it peaks when something like AdBlock Plus hits release (expected November).

It's worth noting that chrome.storage.sync only works if you are signed in through Firefox Sync. So we can probably say that a simple metric is to take the amount of traffic that syncing through Firefox Sync does and then dividing that by.

How many queries per second that translates into, I don't know. But I would be interested in the number of GETs, POSTs and PUTs on sync right now from other services, and would then suggest that by November (Firefox 57) the load on this service would be a fraction of that (amount of sync traffic / amount of add-ons using).

What numbers do other sync services handle?

What numbers can Kinto put up right now?

[1] https://github.com/andymckay/arewewebextensionsyet.com/blob/master/usage.csv#L16
[2] https://gist.github.com/andymckay/10c3a4c64ce8990b589f0ac740f65955#file-firefox-permissions-L131
Flags: needinfo?(eglassercamp)
Thank you, Andy!  That's very useful information.  I'll check with the Sync metrics team and see if I can get some related data.
I'm not sure what the policy is for putting traffic numbers in public bugs, but I have the sync numbers that Andy asked for above, and will bring them to the meeting tomorrow.
Do we want load test results in this bug, or somewhere more private (since they're likely to include performance thresholds)?
Flags: needinfo?(jthomas)
I think we should keep performance thresholds private. Sharing via Google Docs works for me, but if you want to include Datadog graphs it might be worth looking at https://app.datadoghq.com/notebook/.
Flags: needinfo?(jthomas)
https://app.datadoghq.com/notebook/list is better and has a notebook created by :miles for another project.
QA Contact: kthiessen → chartjes
We have a scenario document being used for load testing here:
   https://docs.google.com/document/d/1na-4DtECFRf0zEgJzaeK4G6MJAINfqY_UO_8rx5p_ME/edit

Please get the scenarios you want tested into that document, so that Chris can do the testing required to make sure this product is ready for release.
Hi Ethan,

Are you the best person to gather the scenarios for load testing?  I've also added Bob, who worked as the WebExtensions liaison.
Flags: needinfo?(eglassercamp)
Flags: needinfo?(bob.silverberg)
Whiteboard: [storage]triaged → [storage]
Not sure if QA plan helps - in case some of those scenarios are good perf test cases
Krupa and I have been chatting to Karl about load testing for this. What are the next steps?
Flags: needinfo?(kthiessen)
Flags: needinfo?(eglassercamp)
Flags: needinfo?(bob.silverberg)
Let's lay out a timeline of load testing/perf scenarios -- we expect n users by d date, staggered all the way up to November.

Then we can schedule a series of load tests on some sort of non-production environment -- either stage or something purpose-built in AWS.  We will also need to think about how we are going to model users for the load/perf tests -- how often are we going to allow a given add-on to hit its storage container, etc, and how much enforcement of those limits is needed?

I think the important thing here is to start early and test frequently.
Flags: needinfo?(kthiessen)
(In reply to Karl Thiessen [:kthiessen] from comment #21)
> Let's lay out a timeline of load testing/perf scenarios -- we expect n users
> by d date, staggered all the way up to November.
> 
> Then we can schedule a series of load tests on some sort of non-production
> environment -- either stage or something purpose-built in AWS.  We will also
> need to think about how we are going to model users for the load/perf tests
> -- how often are we going to allow a given add-on to hit its storage
> container, etc, and how much enforcement of those limits is needed?
> 
> I think the important thing here is to start early and test frequently.

andy, rémy - thoughts?
Flags: needinfo?(rhubscher)
Flags: needinfo?(amckay)
I guess settings in an add-on shouldn't be write heavy and people are syncing with their mobile 48% of the time and with another desktop 52%.

The Android app doesn't support the storage.sync API just yet.

So it means we will have at most 4 million users who might be using the storage sync API in one or two add-ons.


In my opinion, if the stack can handle 300 requests per second we should be fine for a while, because it means we can handle 9 million syncs per day. That would mean every user updating both add-ons every day, which is really unlikely.
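
A back-of-the-envelope check of those numbers, as a sketch only. The figure of three HTTP requests per sync is an assumption made here to reconcile 300 requests per second with roughly 9 million syncs per day; it is not stated anywhere in this bug. The function name is also made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def syncs_per_day(requests_per_second: int, requests_per_sync: int) -> int:
    """Number of complete syncs a sustained request rate can serve in a day."""
    return requests_per_second * SECONDS_PER_DAY // requests_per_sync


# 300 requests/second is the target from the comment above.
# 3 requests per sync is an ASSUMPTION, not a measured value.
capacity = syncs_per_day(300, 3)

# At most 4 million users, each syncing two add-ons once a day.
demand = 4_000_000 * 2

print(capacity, demand, capacity >= demand)
```

Under these assumptions, capacity comes out slightly above the worst-case daily demand.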
Flags: needinfo?(rhubscher)
Whiteboard: [storage] → [storage], triaged
What would be really awesome is a dashboard on Datadog that shows the amount of traffic the production instances get in terms of reads, writes and time taken. If we can handle 300 requests per second then I'm pretty happy with things, but a graph showing this would let us all see what's happening and react appropriately.
Flags: needinfo?(amckay)
Benson, do you know of any graphs or metrics for this service?
Flags: needinfo?(bwong)
Jason set up a Datadog dashboard for it [1]. If there are other metrics you'd like, I can add them to the dashboard.


[1] https://app.datadoghq.com/dash/241098/kinto-webextensions-prod?live=true&page=0&is_auto=false&from_ts=1494333525702&to_ts=1494347925702&tile_size=m
Flags: needinfo?(bwong)
Component: WebExtensions: General → WebExtensions: Storage
Product: Toolkit → WebExtensions
Severity: normal → S3