
Increase max records per batch so ~99% of sync users are able to post all their bookmarks in a single batch.

Status

Cloud Services :: Server: Sync
RESOLVED WORKSFORME
Opened: 3 months ago
Last modified: a month ago

People

(Reporter: markh, Assigned: rfkelly)

Tracking

(Depends on: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [Sync Q3 OKR])

(Reporter)

Description

3 months ago
https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=1&end_date=2017-07-04&keys=__none__!__none__!__none__&max_channel_version=nightly%252F56&measure=PLACES_BOOKMARKS_COUNT&min_channel_version=null&processType=*&product=Firefox&sanitize=1&sort_keys=submissions&start_date=2017-06-12&table=0&trim=1&use_submission_date=0 shows that a 5k-record limit covers over 98% of users, but note that this histogram is opt-in and is not limited to Sync users - bobm may be able to get actual figures for Sync users from the servers.

Note that it may or may not make sense to do this as part of bug 1378567 (pro: single deployment; con: additional risk).
(Assignee)

Comment 1

2 months ago
Mark, what do you think about adding a telemetry flag to specifically indicate "this sync failed to fit everything into a single batch"?  I'm happy to have a go at landing some (presumably fairly straightforward) code for this if you can point me in the right direction.  (Something a bit different - it'll be fun, right?)
Flags: needinfo?(markh)
(Assignee)

Comment 2

2 months ago
For reference, the current default size of a batch is 10,000 records and 100 MB.  We depend on the client sending these numbers in the request headers, though; we don't check that the sent value of e.g. X-Weave-Total-Records actually reflects the number of records in the batch.

> adding a telemetry flag to specifically indicate "this sync failed to fit everything into a single batch"

To be clear, the idea here would be to take the number of sync attempts with this flag, as a percentage of the total number of sync attempts, to produce the actual rate at which syncs fail to fit everything into a single batch.
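As a concrete sketch of that calculation (the counter names here are illustrative, not actual telemetry probe names):

```python
def single_batch_failure_rate(flagged_syncs, total_syncs):
    """Fraction of sync attempts flagged as unable to fit all
    records into a single batch (names are illustrative)."""
    if total_syncs == 0:
        return 0.0
    return flagged_syncs / total_syncs

# e.g. 150 flagged syncs out of 20,000 attempts -> 0.0075 (0.75%)
```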
(Assignee)

Comment 3

2 months ago
> We depend on the client sending these numbers in the request headers 

In IRC, :tcsc says that clients probably do not send these headers, but they do limit the size of their batches based on the advertised values from /info/collections.

Anyway, the default limits appear to be safely larger than the numbers in that telemetry dashboard, so I think the next step here is getting more nuanced telemetry to see whether further changes are needed.
Assignee: nobody → rfkelly
(Reporter)

Comment 4

2 months ago
(In reply to Ryan Kelly [:rfkelly] from comment #1)
> Mark, what do you think about adding a telemetry flag to specifically
> indicate "this sync failed to fit everything into a single batch"?

The issue as I see it is that at least once (plus once each time they are node-reassigned), a user is going to try to fit *all* their records into a single batch. So ISTM that looking at the 9xth percentile of WBO counts per user (probably excluding history) is what we should consider using.

IOW, I think we'd see significantly different counts now compared to March when we did the mass reassignment.

However, if you want to land something in the client, I guess we could just record the number of batches taken using code similar to what landed in bug 1353645 (although in this case it would be in record.js) - alternatively, we could just record the total number of records and maybe total number of bytes and deduce how many batches that must have taken.
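For that second option, deducing a lower bound on the batch count from the totals is simple arithmetic; a sketch (the function name is made up, defaults taken from the limits quoted in comment 2):

```python
import math

def batches_needed(total_records, total_bytes,
                   max_total_records=10000,
                   max_total_bytes=100 * 1024 * 1024):
    """Lower bound on the number of batches a sync must have
    taken, given per-batch limits (defaults from comment 2)."""
    by_records = math.ceil(total_records / max_total_records)
    by_bytes = math.ceil(total_bytes / max_total_bytes)
    return max(by_records, by_bytes, 1)

# e.g. 25,000 records in any number of bytes needs at least 3 batches
```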

(In reply to Ryan Kelly [:rfkelly] from comment #3)
> Anyways, the default limits appear to be safely larger than the numbers in
> that telemetry dashboard, so I think the next step here needs to be getting
> more nuanced telemetry to see whether further changes are needed.

As above though, that is likely to mainly record "incremental" syncs, which we expect to rarely hit those limits. It's the first syncs I'm primarily concerned about.
Flags: needinfo?(markh)
(Assignee)

Comment 5

2 months ago
Are incremental vs full syncs distinguished in any way in the client telemetry?
(Reporter)

Comment 6

2 months ago
Not as far as I'm aware. It wouldn't be too difficult to differentiate in new probes though.

However, isn't it true that for bookmarks at least, we have very high confidence that if a user were node-reassigned tomorrow, they'd re-upload exactly as many bookmarks as they currently have on the server? (hrm - excluding deletions - although we do plan on storing deletions locally, so in the future we'd expect deletions to also be counted.)
(Assignee)

Comment 7

2 months ago
> However, isn't it true that for bookmarks at least, we have a very high confidence that if a
> user was node-reassigned tomorrow, they'd re-upload exactly how many bookmarks they currently have on the server. 

Yes, I expect this is fairly accurate.  Counting server-side sounds like it's probably a good place to start, although I'm still moderately interested in some client-side telemetry to see if we can monitor this over time.
Comment 8

2 months ago
Related to Bug 1250747?
(Reporter)

Comment 9

2 months ago
(In reply to Richard Newman [:rnewman] from comment #8)
> Related to Bug 1250747?

Related, but I'm struggling to work out whether it's a dupe - with one eye open, that other bug appears to be about how many records per POST (regardless of max per batch), but with the other eye open it appears to be about max per batch. So let's just bump both :)
(Assignee)

Updated

2 months ago
Blocks: 1378567
(Assignee)

Updated

2 months ago
No longer blocks: 1378567
Depends on: 1378567
Whiteboard: [Sync Q3 OKR]
(Assignee)

Comment 10

2 months ago
I think we've done this analysis in the past, but let's take a fresh look at the data.  Basically we want to know the distribution of the number of items in each collection for each user.  :bobm, can we safely grab this from the db on a representative sample of storage nodes?  I'm thinking just a big dump of counts produced by something like:

   SELECT userid, collection, COUNT(id) FROM bsoN GROUP BY userid, collection;

Then we can do some post-processing to extract statistics from it.
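The post-processing could be as simple as a nearest-rank percentile over the per-user counts; a rough sketch (sample values are illustrative):

```python
def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) over a list of counts."""
    ordered = sorted(values)
    k = round(p / 100 * (len(ordered) - 1))
    return ordered[k]

# per-user bookmark counts for one collection (illustrative)
counts = [1, 7, 13, 36, 376, 3327, 225197]
print(percentile(counts, 50))  # 36
print(percentile(counts, 99))  # 225197
```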
Flags: needinfo?(bobm)
Comment 11

2 months ago
I'll collect a fresh dataset for these and make it available as before (https://mana.mozilla.org/wiki/display/SVCOPS/Sync+Dataset)
Flags: needinfo?(bobm)
(Assignee)

Comment 12

2 months ago
Just to confirm, this dataset looks like it can tell me exactly what I want, thanks :bobm!  Please let me know when the fresh data is available.
Comment 13

2 months ago
(In reply to Ryan Kelly [:rfkelly] from comment #12)
> Just to confirm, this dataset looks like it can tell me exactly what I want,
> thanks :bobm!  Please let me know when the fresh data is available.

I've sent the dataset out of band as a CSV.  I can have it added to the data portal if you would prefer.  Let me know if this works.
Flags: needinfo?(rfkelly)
(Assignee)

Comment 14

a month ago
Thanks :bobm!  It looks like many of the rows are missing some columns in the csv, is this because that user doesn't have any data for that particular collection?  Do you have a link to the script that generates this report, so I can check my understanding of the data?
Flags: needinfo?(rfkelly) → needinfo?(bobm)
Comment 15

a month ago
(In reply to Ryan Kelly [:rfkelly] from comment #14)
> Thanks :bobm!  It looks like many of the rows are missing some columns in
> the csv, is this because that user doesn't have any data for that particular
> collection?  

That's correct.  Missing columns indicate missing collections for that user.

> Do you have a link to the script that generates this report, so
> I can check my understanding of the data?

The perl scripts I used to generate the csv are linked in this section: https://mana.mozilla.org/wiki/display/SVCOPS/Sync+Dataset#SyncDataset-CollectingtheData
Flags: needinfo?(bobm)
(Assignee)

Comment 16

a month ago
I crunched the provided data dump to determine item-count and total-size statistics for each collection; results as follows:

ITEM COUNTS (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 1 - 1 - 2 - 12 - 92 - 346 - 3924
  Addons: 1 - 1 - 1 - 2 - 8 - 25 - 199
  Clients: 1 - 1 - 1 - 1 - 1 - 3 - 41
  Crypto: 1 - 1 - 1 - 1 - 1 - 1 - 1
  Forms: 1 - 1 - 4 - 90 - 974 - 4439 - 293450
  Bookmarks: 1 - 7 - 13 - 36 - 376 - 3327 - 225197
  Prefs: 1 - 1 - 1 - 1 - 1 - 1 - 1
  History: 1 - 1 - 11 - 234 - 2945 - 11088 - 127424

TOTAL SIZES (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 167 - 472 - 1014 - 6552 - 51080 - 193720 - 2065492
  Addons: 251 - 295 - 315 - 654 - 2812 - 8331 - 66489
  Clients: 275 - 423 - 467 - 487 - 543 - 1513 - 184339
  Crypto: 339 - 339 - 339 - 339 - 339 - 363 - 379
  Forms: 187 - 231 - 968 - 21275 - 234287 - 1067475 - 66824818
  Bookmarks: 278 - 2478 - 5266 - 16705 - 199392 - 1730069 - 118639371
  Prefs: 8019 - 8379 - 8955 - 9659 - 10963 - 19387 - 166439
  History: 187 - 487 - 5761 - 121936 - 1495029 - 5451000 - 62205656

The 99th percentile for bookmarks on this box is 3327 items, but the max is 225197.

Bob, if it's straightforward, can I please get a similar dump from two different hosts for comparison?
Flags: needinfo?(bobm)
Comment 17

a month ago
Can do.
Flags: needinfo?(bobm)
(Assignee)

Comment 18

a month ago
For the record, the current limits being reported in production are:

  {
    u'max_post_records': 100,
    u'max_post_bytes': 1048576,
    u'max_total_bytes': 104857600,
    u'max_total_records': 10000,
    u'max_record_payload_bytes': 262144,
  }

The max_total_bytes and max_total_records values seem large enough to comfortably fit the 99th percentile of bookmarks sizes here.
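As a quick sanity check, comparing those limits against the 99th-percentile bookmarks figures from comment 16:

```python
# Advertised per-batch limits vs. 99th-percentile bookmarks
# figures (values copied from comment 16 and the limits above).
max_total_records = 10000
max_total_bytes = 104857600

p99_bookmark_items = 3327
p99_bookmark_bytes = 1730069

fits = (p99_bookmark_items <= max_total_records
        and p99_bookmark_bytes <= max_total_bytes)
print(fits)  # True: 99th-percentile bookmark sets fit in one batch
```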

ni? myself to repeat analysis using latest figures from :bobm
Flags: needinfo?(rfkelly)
(Assignee)

Comment 19

a month ago
Results from db dumps on 3 separate webheads:

ITEM COUNTS (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 1 - 1 - 2 - 20 - 153 - 521 - 21155
  Addons: 1 - 1 - 1 - 3 - 12 - 32 - 1161
  Clients: 1 - 1 - 1 - 1 - 2 - 5 - 229
  Crypto: 1 - 1 - 1 - 1 - 1 - 1 - 1
  Forms: 1 - 1 - 8 - 283 - 2549 - 9543 - 352498
  Bookmarks: 1 - 7 - 13 - 57 - 650 - 4890 - 393092
  Prefs: 1 - 1 - 1 - 1 - 1 - 1 - 1
  History: 1 - 1 - 23 - 888 - 6758 - 19959 - 299626

TOTAL SIZES (min - 1% - 10% - 50% - 90% - 99% - max):
  Passwords: 167 - 473 - 1122 - 11231 - 84880 - 288556 - 5976736
  Addons: 251 - 295 - 315 - 929 - 3857 - 10689 - 345035
  Clients: 251 - 423 - 467 - 487 - 1007 - 2435 - 230331
  Crypto: 339 - 339 - 339 - 339 - 341 - 364 - 379
  Forms: 167 - 231 - 1888 - 67116 - 615364 - 2314938 - 95034014
  Bookmarks: 278 - 2429 - 5394 - 27980 - 347236 - 2598413 - 220248500
  Prefs: 7655 - 8403 - 9639 - 9683 - 12051 - 23015 - 261267
  History: 187 - 774 - 12643 - 459591 - 3398008 - 9619860 - 341436609

This moves the extremes out a little, but everything still seems comfortably within the current upload limits.

I'm leaning towards declaring the current limits ok for >99% of users and closing this bug - :markh thoughts?
Flags: needinfo?(rfkelly) → needinfo?(markh)
(Reporter)

Comment 20

a month ago
SGTM, thanks.
Status: NEW → RESOLVED
Last Resolved: a month ago
Flags: needinfo?(markh)
Resolution: --- → WORKSFORME