Closed Bug 1378569 Opened 7 years ago Closed 7 years ago

Increase max records per batch so ~99% of sync users are able to post all their bookmarks in a single batch.

Categories

(Cloud Services Graveyard :: Server: Sync, enhancement)


Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: markh, Assigned: rfkelly)

References

Details

(Whiteboard: [Sync Q3 OKR])

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=1&end_date=2017-07-04&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=PLACES_BOOKMARKS_COUNT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-06-12&table=0&trim=1&use_submission_date=0 shows that a 5k-record cap would cover over 98% of profiles, but note that the histogram is opt-in and not limited to Sync users - bobm may be able to get actual figures for Sync users from the servers.

Note that it may or may not make sense to try to do this as part of bug 1378567 (pro: single deployment, con: additional risk).
Mark, what do you think about adding a telemetry flag to specifically indicate "this sync failed to fit everything into a single batch"?  I'm happy to have a go at landing some (presumably fairly straightforward) code for this if you can point me in the right direction.  (Something a bit different, it'll be fun, right?)
Flags: needinfo?(markh)
For reference, the current default size of a batch is 10,000 records and 100 MB.  We depend on the client sending these numbers in the request headers, though; we don't check that the sent value of e.g. X-Weave-Total-Records actually reflects the number of records in the batch.
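To illustrate the gap described above, here is a hypothetical sketch (not the actual server code) of what a server-side consistency check on the declared total could look like; the function name and shape are assumptions for illustration only:

```python
def check_declared_totals(headers, records):
    """Compare the client-declared X-Weave-Total-Records header
    against the number of records actually received so far."""
    declared = headers.get("X-Weave-Total-Records")
    if declared is None:
        return True  # header is optional; nothing to check
    try:
        declared = int(declared)
    except ValueError:
        return False  # malformed header value
    # A batch may arrive across several POSTs, so until the batch is
    # committed we can only assert an upper bound on the record count.
    return len(records) <= declared
```

In practice the check would live in the batch-commit path, where the full record count is finally known.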

> adding a telemetry flag to specifically indicate "this sync failed to fit everything into a single batch"

To be clear, the idea here would be to take the number of sync attempts with this flag, as a percentage of the total number of sync attempts, to produce the actual rate at which syncs fail to fit everything into a single batch.
> We depend on the client sending these numbers in the request headers 

In IRC, :tcsc says that clients probably do not send these headers, but they do limit the size of their batches based on the advertised values from /info/configuration

In any case, the default limits appear to be safely larger than the numbers in that telemetry dashboard, so I think the next step here is getting more nuanced telemetry to see whether further changes are needed.
Assignee: nobody → rfkelly
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> Mark, what do you think about adding a telemetry flag to specifically
> indicate "this sync failed to fit everything into a single batch"?

The issue as I see it is that at least once (plus once each time they are node-reassigned), a user is going to try to fit *all* their records into a single batch. So ISTM that looking at the 9xth percentile of per-user WBO counts (probably excluding history) is what we should consider using.

IOW, I think we'd see significantly different counts now compared to March when we did the mass reassignment.

However, if you want to land something in the client, I guess we could just record the number of batches taken using code similar to what landed in bug 1353645 (although in this case it would be in record.js) - alternatively, we could just record the total number of records and maybe total number of bytes and deduce how many batches that must have taken.
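The "deduce how many batches" alternative above can be sketched in a few lines; this is an assumed shape (names and signature are mine, not the actual client telemetry code), computing a lower bound from the totals and the server-advertised limits:

```python
import math

def batches_needed(total_records, total_bytes,
                   max_total_records, max_total_bytes):
    """Lower bound on the number of batches required to upload
    total_records records totalling total_bytes bytes, given the
    per-batch limits advertised by the server."""
    by_records = math.ceil(total_records / max_total_records)
    by_bytes = math.ceil(total_bytes / max_total_bytes)
    # A sync always takes at least one batch, even with nothing to send.
    return max(1, by_records, by_bytes)
```

Recording only totals and deriving the batch count like this would avoid adding a probe inside the upload loop itself.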

(In reply to Ryan Kelly [:rfkelly] from comment #3)
> Anyways, the default limits appear to be safely larger than the numbers in
> that telemetry dashboard, so I think the next step here needs to be getting
> more nuanced telemetry to see whether further changes are needed.

As above though, that is likely to be mainly recording "incremental" syncs, which we do expect to rarely hit those limits. It's the first syncs I'm primarily concerned about.
Flags: needinfo?(markh)
Are incremental vs full syncs distinguished in any way in the client telemetry?
Not as far as I'm aware. It wouldn't be too difficult to differentiate in new probes though.

However, isn't it true that for bookmarks at least, we have a very high confidence that if a user was node-reassigned tomorrow, they'd re-upload exactly how many bookmarks they currently have on the server. (hrm - excluding deletions - although we do plan on storing deletions locally, so in the future we'd expect deletions to also be counted)
> However, isn't it true that for bookmarks at least, we have a very high confidence that if a
> user was node-reassigned tomorrow, they'd re-upload exactly how many bookmarks they currently have on the server. 

Yes, I expect this is fairly accurate.  Counting server-side sounds like it's probably a good place to start, although I'm still moderately interested in some client-side telemetry to see if we can monitor this over time.
(In reply to Richard Newman [:rnewman] from comment #8)
> Related to Bug 1250747?

Related, but I'm struggling to work out if it is a dupe - with one eye open, that other bug appears to be about how many records per POST (regardless of max per batch), but with the other eye open it does appear to be max per batch. So let's just bump both :)
Blocks: 1378567
No longer blocks: 1378567
Depends on: 1378567
Whiteboard: [Sync Q3 OKR]
I think we've done this analysis in the past, but let's take a fresh look at the data.  Basically we want to know the distribution of the number of items in each collection for each user.  :bobm, can we safely grab this from the db on a representative sample of storage nodes?  I'm thinking just a big dump of counts produced by something like:

   SELECT userid, collection, COUNT(id) FROM bsoN GROUP BY userid, collection;

Then we can do some post-processing to extract statistics from it.
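The post-processing step could look something like the following sketch (a nearest-rank percentile helper over the per-user counts; the function and percentile points are assumptions, chosen to match the stats reported later in this bug):

```python
def percentiles(counts, points=(0, 1, 10, 50, 90, 99, 100)):
    """Return the requested percentiles of a list of per-user item
    counts, using a simple nearest-rank method on the sorted values."""
    ordered = sorted(counts)
    n = len(ordered)
    # Clamp the rank so the 100th percentile maps to the last element.
    return [ordered[min(n - 1, (p * n) // 100)] for p in points]
```

Running this per collection over the dump yields min/1%/10%/50%/90%/99%/max rows like those in comment 15.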
Flags: needinfo?(bobm)
I'll collect a fresh dataset for these and make it available as before (https://mana.mozilla.org/wiki/display/SVCOPS/Sync+Dataset)
Flags: needinfo?(bobm)
Just to confirm, this dataset looks like it can tell me exactly what I want, thanks :bobm!  Please let me know when the fresh data is available.
(In reply to Ryan Kelly [:rfkelly] from comment #12)
> Just to confirm, this dataset looks like it can tell me exactly what I want,
> thanks :bobm!  Please let me know when the fresh data is available.

I've sent the dataset out of band as a CSV.  I can have it added to the data portal if you would prefer.  Let me know if this works.
Flags: needinfo?(rfkelly)
Thanks :bobm!  It looks like many of the rows are missing some columns in the csv, is this because that user doesn't have any data for that particular collection?  Do you have a link to the script that generates this report, so I can check my understanding of the data?
Flags: needinfo?(rfkelly) → needinfo?(bobm)
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Thanks :bobm!  It looks like many of the rows are missing some columns in
> the csv, is this because that user doesn't have any data for that particular
> collection?  

That's correct.  Missing columns indicate missing collections for that user.

> Do you have a link to the script that generates this report, so
> I can check my understanding of the data?

The perl scripts I used to generate the csv are linked in this section: https://mana.mozilla.org/wiki/display/SVCOPS/Sync+Dataset#SyncDataset-CollectingtheData
Flags: needinfo?(bobm)
I crunched the provided data dump to determine the item-count and total-size distributions for each collection, results as follows:

ITEM COUNTS (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 1 - 1 - 2 - 12 - 92 - 346 - 3924
  Addons: 1 - 1 - 1 - 2 - 8 - 25 - 199
  Clients: 1 - 1 - 1 - 1 - 1 - 3 - 41
  Crypto: 1 - 1 - 1 - 1 - 1 - 1 - 1
  Forms: 1 - 1 - 4 - 90 - 974 - 4439 - 293450
  Bookmarks: 1 - 7 - 13 - 36 - 376 - 3327 - 225197
  Prefs: 1 - 1 - 1 - 1 - 1 - 1 - 1
  History: 1 - 1 - 11 - 234 - 2945 - 11088 - 127424

TOTAL SIZES (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 167 - 472 - 1014 - 6552 - 51080 - 193720 - 2065492
  Addons: 251 - 295 - 315 - 654 - 2812 - 8331 - 66489
  Clients: 275 - 423 - 467 - 487 - 543 - 1513 - 184339
  Crypto: 339 - 339 - 339 - 339 - 339 - 363 - 379
  Forms: 187 - 231 - 968 - 21275 - 234287 - 1067475 - 66824818
  Bookmarks: 278 - 2478 - 5266 - 16705 - 199392 - 1730069 - 118639371
  Prefs: 8019 - 8379 - 8955 - 9659 - 10963 - 19387 - 166439
  History: 187 - 487 - 5761 - 121936 - 1495029 - 5451000 - 62205656

The 99th percentile for bookmarks on this box is 3327 items, but the max is 225197.

Bob, if it's straightforward, can I please get a similar dump from two different hosts for comparison?
Flags: needinfo?(bobm)
Can do.
Flags: needinfo?(bobm)
For the record, the current limits being reported in production are:

  {
    u'max_post_records': 100,
    u'max_post_bytes': 1048576,
    u'max_total_bytes': 104857600,
    u'max_total_records': 10000,
    u'max_record_payload_bytes': 262144
  }

The max_total_bytes and max_total_records values seem large enough to comfortably fit the 99th percentile of bookmarks sizes here.
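As a quick sanity check of that claim, the 99th-percentile bookmark figures from the first dump can be compared directly against the production limits (values copied from the comments above):

```python
# Production batch limits, as reported above.
limits = {"max_total_records": 10000, "max_total_bytes": 104857600}

# 99th-percentile bookmark figures from the first db dump.
bookmarks_p99 = {"records": 3327, "bytes": 1730069}

fits = (bookmarks_p99["records"] <= limits["max_total_records"]
        and bookmarks_p99["bytes"] <= limits["max_total_bytes"])
print(fits)  # True
```

Both figures are well under the limits, with roughly 3x headroom on records and 60x on bytes.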

ni? myself to repeat analysis using latest figures from :bobm
Flags: needinfo?(rfkelly)
Results from db dumps on 3 separate webheads:

ITEM COUNTS (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 1 - 1 - 2 - 20 - 153 - 521 - 21155
  Addons: 1 - 1 - 1 - 3 - 12 - 32 - 1161
  Clients: 1 - 1 - 1 - 1 - 2 - 5 - 229
  Crypto: 1 - 1 - 1 - 1 - 1 - 1 - 1
  Forms: 1 - 1 - 8 - 283 - 2549 - 9543 - 352498
  Bookmarks: 1 - 7 - 13 - 57 - 650 - 4890 - 393092
  Prefs: 1 - 1 - 1 - 1 - 1 - 1 - 1
  History: 1 - 1 - 23 - 888 - 6758 - 19959 - 299626

TOTAL SIZES (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 167 - 473 - 1122 - 11231 - 84880 - 288556 - 5976736
  Addons: 251 - 295 - 315 - 929 - 3857 - 10689 - 345035
  Clients: 251 - 423 - 467 - 487 - 1007 - 2435 - 230331
  Crypto: 339 - 339 - 339 - 339 - 341 - 364 - 379
  Forms: 167 - 231 - 1888 - 67116 - 615364 - 2314938 - 95034014
  Bookmarks: 278 - 2429 - 5394 - 27980 - 347236 - 2598413 - 220248500
  Prefs: 7655 - 8403 - 9639 - 9683 - 12051 - 23015 - 261267
  History: 187 - 774 - 12643 - 459591 - 3398008 - 9619860 - 341436609

This moves the extremes out a little, but still seems comfortably within the current upload limits.

I'm leaning towards declaring the current limits ok for >99% of users and closing this bug - :markh thoughts?
Flags: needinfo?(rfkelly) → needinfo?(markh)
SGTM, thanks.
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(markh)
Resolution: --- → WORKSFORME
Product: Cloud Services → Cloud Services Graveyard