
Possible improvements to sync and loadtesting based on log crunching

Status: NEW (Unassigned)
Product: Cloud Services
Component: Server: Sync
Priority: P4
Severity: normal
Opened: 4 years ago
Last modified: 2 years ago

People: Reporter: rfkelly; Assignee: Unassigned
Tracking: Depends on: 1 bug; Blocks: 1 bug
Firefox Tracking Flags: Not tracked
Whiteboard: [qa+]

(Reporter)

Description

4 years ago
I worry that part of the problems we're currently seeing with sync are due to changing access patterns, which our loadtests don't cover.  One example: the loadtest does not use sortindex, at all.  But IIRC this is used by android client devices.

How can we fix it?  I'd like to get a snapshot of the nginx access logs during a Monday Traffic Spike, so that I can trawl through and extract some aggregate statistics and patterns.  The bigger the better.  Is this allowable per log access policy?  I'd be happy with an anonymized version as long as each user account remained distinguishable.

This would also help inform access patterns for PiCL, a nice side-effect.
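For reference, a rough sketch of the kind of aggregation I have in mind, assuming a more-or-less standard nginx access log where the request line reads "GET /1.1/<uid>/storage/<collection>?... HTTP/1.1" (the filename and field layout below are assumptions, not the real log format):

import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

REQUEST_RE = re.compile(r'"GET (\S*/storage\S*) HTTP')

param_counts = Counter()
total = 0

# Count how often each query parameter shows up on GET /storage requests.
with open("access.log-anonymized") as log:   # hypothetical filename
    for line in log:
        m = REQUEST_RE.search(line)
        if not m:
            continue
        total += 1
        for param in parse_qs(urlparse(m.group(1)).query):
            param_counts[param] += 1

for param, n in param_counts.most_common():
    print("%s: %.1f%%" % (param, 100.0 * n / total))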
Kind of like scraping user data huh? 
This would be really good for correcting/customizing the load test.
Whiteboard: [qa+]
I don't think anything uses sortindex - we went to a bunch of effort to confirm this before removing the indexes on that.

If the logparsing does show up queries that are using it, we're going to need to revisit.
(Reporter)

Comment 3

4 years ago
(In reply to Toby Elliott [:telliott] from comment #2)
> I don't think anything uses sortindex - we went to a bunch of effort to
> confirm this before removing the indexes on that.

Well....

 (moz)rfk@durian:~$ grep "GET.*storage" access.log-20130520-anonymized | wc -l
 3990380
 (moz)rfk@durian:~$ 
 (moz)rfk@durian:~$ grep "GET.*storage.*sort=index" access.log-20130520-anonymized | wc -l
 207442
 (moz)rfk@durian:~$ 
 rfk@durian:lib(master *)$ python -c "print 207442.0 / 3990380 * 100"
 5.19855251881

That's around 5% of reads using sortindex.  I vaguely recall a claim from :rnewman that it's used on Android but not on Desktop, which lines up with what I see in this log.  Desktop:

 (moz)rfk@durian:~$ grep "GET.*storage" access.log-20130520-anonymized | grep FxSync | wc -l
 3515813
 (moz)rfk@durian:~$ 
 (moz)rfk@durian:~$ grep "GET.*storage.*sort=index" access.log-20130520-anonymized | grep FxSync | wc -l
 108

Android:

 (moz)rfk@durian:~$ grep "GET.*storage" access.log-20130520-anonymized | grep Android | wc -l
 441394
 (moz)rfk@durian:~$ grep "GET.*storage.*sort=index" access.log-20130520-anonymized | grep Android | wc -l
 206151


Not exactly a smoking gun for all our performance issues, but a good indication that our loadtests are well out of date...
(Reporter)

Comment 4

4 years ago
Morphing this into a dev bug to get the loadtests updated based on the data provided by ops.
Component: Operations → Server: Sync
Summary: sample sync access logs for rfkelly; to make loadtests more closely match the real world → Make syncstorage loadtests more closely match the real world
Great. Let's get some logs and I'll crunch them while you're out.
(Reporter)

Comment 6

4 years ago
(And the remaining ~1000 users of sort=index are things like Firefox Home, Foxbrowser, Bookmarks for iPad, and other third-party clients)
:bobm, :telliott, :rfkelly
Just an FYI on the use of PMO for storing this stuff...
I think Bob was right - as of this morning I have no access to /home/bobm or /home/rfkelly. So PMO must have some process for checking/resetting permissions.

If that zipped file is available someplace else, send me an email about it. Thanks.
So, the first, and probably most important, piece of data from the log crunching: our average history GET is returning 128K worth of data, and peaks as high as 13M.

It turns out that FF Desktop never uses the limit= parameter (Android does, <3 rnewman). Are we still marshalling the entire result set before sending the response? PHP used to stream it as we moved through the set, but I remember early python didn't. 

Bobm is working on having the zlb add a limit=1000 to all history requests that don't have one. That should be sufficient for the awesomebar.
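For the record, here's roughly how size numbers like these can be pulled out of the log, assuming each line ends with '... HTTP/1.x" <status> <bytes_sent>' (nginx's $body_bytes_sent); the real field positions may differ:

import re

SIZE_RE = re.compile(r'"GET \S*/storage/history\S* HTTP/[\d.]+" \d{3} (\d+)')

sizes = []
# Collect the response size of every history GET in the sample.
with open("access.log-anonymized") as log:   # hypothetical filename
    for line in log:
        m = SIZE_RE.search(line)
        if m:
            sizes.append(int(m.group(1)))

if sizes:
    print("history GETs:", len(sizes))
    print("average bytes:", sum(sizes) // len(sizes))
    print("peak bytes:", max(sizes))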
Use of query parameters in GET requests:

10.4% newer
24.2% full
2.5% limit
2.5% sort (though not quite exactly the same. About 400 queries had index with no sort)

Of the sorts, 1.1% were newest. The rest were index sorts (no 'oldest'). 

0 queries made use of parentid, predecessorid, index_above or index_below.

37 (out of 8M+) used offset. They all came from variants on "Apache-HttpClient/UNAVAILABLE (java 1.4)" user-agent and had an offset of 0. 

(also 3.2% use v=, but that's not relevant to us)


I think that we could strip out sort=index at the zlb from the queries. Evidence suggests that it's not doing anything for the client, and it's got to be making a fair bit of extra work on the db. The SQLAlchemy definition of the table suggests that there's no index on that column (though that might not match the actual table definition. bobm?). Sort is only coming from Android, iDevices and EmergentWeave. 

rnewman - is that sort being used for anything, or is it just a useful default? Would stripping it cause problems? It's possible that this isn't all that relevant - history queries with a sort index in general are sufficiently better constructed that they're averaging 5s vs 36s without an index.
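Purely to illustrate the idea (the real change would live at the zlb, not in the app), dropping the parameter would look something like this as a WSGI middleware in front of syncstorage - a sketch only, not a proposed patch:

from urllib.parse import parse_qsl, urlencode

class StripSortIndex(object):
    """Illustration: drop sort=index from incoming query strings."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        qs = environ.get("QUERY_STRING", "")
        if "sort=index" in qs:
            kept = [(k, v) for k, v in parse_qsl(qs)
                    if not (k == "sort" and v == "index")]
            environ["QUERY_STRING"] = urlencode(kept)
        return self.app(environ, start_response)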
Flags: needinfo?(rnewman)
Note that that 5s vs 36s is also the difference between queries with a limit and those without. Basically, it's how Android is hitting us vs how FF desktop is hitting us.
Summary: Make syncstorage loadtests more closely match the real world → Possible improvements to sync and loadtesting based on log crunching
GET frequencies (with average query time):

collection_usage -> 1108 (0%) -> (6.95)
quota -> 1189 (0%) -> (0.42)
adblockplus -> 1251 (0%) -> (2.39)
addons -> 24106 (0%) -> (1.80)
prefs -> 25069 (0%) -> (1.69)
passwords -> 140877 (1%) -> (1.81)
forms -> 150905 (1%) -> (3.77)
bookmarks -> 170148 (2%) -> (5.20)
history -> 251042 (2%) -> (28.37)
tabs -> 524188 (6%) -> (0.03)
crypto/keys -> 656200 (7%) -> (2.46)
clients -> 756987 (8%) -> (3.17)
meta/global -> 1288679 (15%) -> (0.07)
collections -> 4422113 (52%) -> (0.05)

Note that info/collection_counts - a potentially expensive query - was called once in the sample, so it appears to be safely unused.
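If we feed this back into the loadtest, the simplest change is to pick the target collection for each simulated GET with weights taken straight from the table above - a sketch, not the actual loadtest code:

import random

# Observed GET mix from the sample (percentages, rounded).
GET_WEIGHTS = {
    "collections": 52,
    "meta/global": 15,
    "clients": 8,
    "crypto/keys": 7,
    "tabs": 6,
    "history": 2,
    "bookmarks": 2,
    "forms": 1,
    "passwords": 1,
}

def pick_collection(rng=random):
    """Choose the collection for the next simulated GET."""
    names = list(GET_WEIGHTS)
    return rng.choices(names, weights=list(GET_WEIGHTS.values()), k=1)[0]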

One potential improvement here - should we consider a writethrough cache for clients? What's our memcache volume looking like, bobm? Obviously we'd need to make sure we had sufficient space, since we don't want it bumping out anything else.



It's pretty obvious what's going to memcache and what's going to the db here.
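To make the write-through idea concrete, a minimal sketch using python-memcached - the key layout, TTL, and db helpers are made up for illustration:

import json
import memcache  # python-memcached

mc = memcache.Client(["127.0.0.1:11211"])
CLIENTS_CACHE_TTL = 21 * 24 * 3600  # matches the 21-day client record TTL

def _key(userid):
    return "clients:%s" % userid

def write_client_records(db, userid, records):
    # Write-through: hit the db first, then refresh the cache.
    db.store_clients(userid, records)          # hypothetical db call
    mc.set(_key(userid), json.dumps(records), time=CLIENTS_CACHE_TTL)

def read_client_records(db, userid):
    cached = mc.get(_key(userid))
    if cached is not None:
        return json.loads(cached)
    records = db.load_clients(userid)          # hypothetical db call
    mc.set(_key(userid), json.dumps(records), time=CLIENTS_CACHE_TTL)
    return records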
Flags: needinfo?(bobm)
Same data for POSTs:

adblockplus -> 1470 (0%) -> (3.03)
addons -> 20440 (0%) -> (5.02)
prefs -> 91219 (2%) -> (1.62)
passwords -> 256671 (5%) -> (3.05)
bookmarks -> 293189 (6%) -> (6.77)
forms -> 516008 (11%) -> (6.13)
clients -> 640703 (14%) -> (2.89)
tabs -> 1117549 (25%) -> (0.04)
history -> 1478979 (33%) -> (8.65)

We've long held that one of the main sources of pain is the write volume, but it's notable that POSTs aren't as bad as GETs. It may be that the "get everything" requests we're seeing from FF desktop are what's killing us, rather than the effort of getting everything onto the platters.

The constant updating of forms data is really surprising. As is clients. What are we doing to the client record that requires so much updating?
PUTs:

clients -> 673 (1%) -> (2.14)
crypto/keys -> 8613 (18%) -> (5.60)
meta/global -> 38119 (80%) -> (2.33)

I'm hoping that the bad performance here is just general database load, because these are writing one record each.

The clients PUTs are all iDevices. Not sure why FFDesktop (which is PUTting meta/global) is POSTing client requests. 

Total numbers of puts are obviously irrelevant from a loadtesting standpoint.
DELETEs:

/adblockplus -> 451 (0%) -> (0.69)
/addons -> 8286 (5%) -> (1.21)
/ -> 9006 (6%) -> (4.84)
/tabs -> 9112 (6%) -> (0.03)
/prefs -> 10448 (6%) -> (1.01)
/history -> 11873 (7%) -> (10.53)
/passwords -> 14496 (9%) -> (2.28)
/bookmarks -> 22336 (14%) -> (10.51)
/clients -> 22835 (15%) -> (5.97)
/forms -> 39799 (26%) -> (2.95)

Interesting that it isn't so bad, especially deleting the whole thing. There are a fair number (2%) that just die at 30s, though. 

24% of the queries include one or more ids. I broke down the counts, but there's nothing that interesting there - 25% have one, 25% have 100 and everything else in between just sort of drops off to small numbers. Where it took a long time seems random - 1 and 100 had almost exactly the same query times.

IDs appear to really help the query:

with:

/clients -> 445 (1%) -> (2.77)
/history -> 463 (1%) -> (1.97)
/addons -> 661 (1%) -> (1.64)
/bookmarks -> 2762 (7%) -> (1.33)
/passwords -> 3784 (10%) -> (1.03)
/forms -> 28324 (77%) -> (1.25)


without:

/clients -> 22390 (19%) -> (6.03)
/history -> 11410 (10%) -> (10.88)
/addons -> 7625 (6%) -> (1.17)
/passwords -> 10712 (9%) -> (2.71)
/bookmarks -> 19574 (17%) -> (11.81)
/forms -> 11475 (10%) -> (7.13)

(Interesting that forms uses ids so much)

Zero deletes used any of the documented parameters besides ids.
(In reply to Toby Elliott [:telliott] from comment #11)

> Note that info/collection_counts - a potentially expensive query - was
> called once in the sample, so it appears to be safely unused.

collection_counts is used infrequently by Android Sync to decide whether you have too many bookmarks to safely sync.

(In reply to Toby Elliott [:telliott] from comment #9)
> Use of query parameters in GET requests:
> 
> 10.4% newer
> 24.2% full
> 2.5% limit
> 2.5% sort (though not quite exactly the same. About 400 queries had index
> with no sort)

What about ids? All requests from clients should be one of these:

* Give me everything, sorted by index.
* Give me everything since I last synced. (newer)
* I failed last time, so give me these ids. (ids)
* Give me all the IDs, then give me these ids, these ids, and these ids. (ids)


> Of the sorts, 1.1% were newest. The rest were index sorts (no 'oldest'). 
> 
> 0 queries made us of parentid, predecessorid, index_above or index_below. 

Correctly so; we were going to remove these from the protocol!


> rnewman - is that sort being used for anything, or is it just a useful
> default? Would stripping it cause problems?

It improves the 'quality' of Sync for clients, both in combination with limit and in first syncs.

It definitely applies for history; we get your most frecent history first. It applies to an extent for bookmarks; by fetching in index order we grab roots and folders first, then frequently used bookmarks, and so on. We aim to use sortindex *more* in the future (Bug 723241).

Our own clients don't really use sortindex for anything other than bookmarks and history, so you can safely ignore it for other collections, but you would be violating the protocol by doing so.


(In reply to Toby Elliott [:telliott] from comment #12)
> Same data for POSTs:
> 
> adblockplus -> 1470 (0%) -> (3.03)
> addons -> 20440 (0%) -> (5.02)
> prefs -> 91219 (2%) -> (1.62)
> passwords -> 256671 (5%) -> (3.05)
> bookmarks -> 293189 (6%) -> (6.77)
> forms -> 516008 (11%) -> (6.13)
> clients -> 640703 (14%) -> (2.89)
> tabs -> 1117549 (25%) -> (0.04)
> history -> 1478979 (33%) -> (8.65)
> 
> We've long held that one of the main sources of pain is the write volume,
> but it's notable that POSTs aren't as bad as GETs. It may be that the
> "get everything" requests we're seeing from FF desktop are what's killing
> us, rather than the effort of getting everything onto the platters.
> 
> The constant updating of forms data is really surprising.

Form data is touched frequently -- any time you type something in a text box on desktop, or a form history entry is removed, you're going to incur a POST to forms on the next sync.


> As is clients.
> What are we doing to the client record that requires so much updating?

const CLIENTS_TTL = 1814400; // 21 days
const CLIENTS_TTL_REFRESH = 604800; // 7 days

  _syncStartup: function _syncStartup() {
    // Reupload new client record periodically.
    if (Date.now() / 1000 - this.lastRecordUpload > CLIENTS_TTL_REFRESH) {
      this._tracker.addChangedID(this.localID);
      this.lastRecordUpload = Date.now() / 1000;
    }
    SyncEngine.prototype._syncStartup.call(this);
  },


(In reply to Toby Elliott [:telliott] from comment #13)

> The clients PUTs are all iDevices. Not sure why FFDesktop (which is PUTting
> meta/global) is POSTing client requests. 

Desktop *only* PUTs meta/global. Everything else is POST.
Flags: needinfo?(rnewman)
Thanks for the info. 

25K out of 8 million GETs used the ids param, so it's barely a blip. Of those, the ID count caps out at 50 (11%), and 62% request only one ID. 10% request 2, and then the curve drops off big-time until the spike at 50. Oddly, 41 requests had an ids parameter with no ids!
Oh, and the id-based gets are fast - mostly sub-second.
So then the confusion is: why do apparently 85% of requests not include newer?
(In reply to Richard Newman [:rnewman] from comment #18)
> So then the confusion is: why do apparently 85% of requests not include
> newer?

My theory:

Most collections don't change most of the time, particularly for single-device users. There will be more POSTs than GETs, thanks to info/collections.

Requests will omit newer if:

* A client has been node-reassigned
* A reset has occurred
* They're a mobile XUL device (apparently very rare)
* It's a first sync
* Firefox quit or crashed before saving prefs, so the next launch is a first sync, too.
(In reply to Richard Newman [:rnewman] from comment #18)
> So then the confusion is: why do apparently 85% of requests not include
> newer?

Slight improvement here - same numbers on just /storage requests (as opposed to /info get requests):

22% newer
51.1% full
5.3% limit
5.3% sort (though not quite exactly the same. About 400 queries had index with no sort)

So better, though still lower than I think we were expecting.
This one's just kind of depressing. I've managed to do a little ghetto session analysis, and here are the client counts per user:

1 -> 1100363 (83)
2 -> 189571 (14)
3 -> 22889 (1)
4 -> 2678 (0)
5 -> 364 (0)
6 -> 74 (0)
7 -> 16 (0)
8 -> 11 (0)
9 -> 7 (0)
10 -> 1 (0)
12 -> 1 (0)
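(One way to approximate a breakdown like this - not necessarily how the numbers above were produced - is to count distinct User-Agent strings per anonymized userid; the regex below assumes a combined-format log line.)

import re
from collections import Counter, defaultdict

# request line, status, bytes, referer, user-agent (combined log format)
LINE_RE = re.compile(
    r'"\w+ /1\.1/(\w+)/\S* HTTP/[\d.]+" \d{3} \d+ "[^"]*" "([^"]*)"')

agents_per_user = defaultdict(set)
with open("access.log-anonymized") as log:   # hypothetical filename
    for line in log:
        m = LINE_RE.search(line)
        if m:
            userid, agent = m.groups()
            agents_per_user[userid].add(agent)

histogram = Counter(len(agents) for agents in agents_per_user.values())
for n_clients, n_users in sorted(histogram.items()):
    print(n_clients, "->", n_users)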
> One potential improvement here - should we consider a writethrough cache for
> clients? What's our memcache volume looking like, bobm? Obviously we'd need
> to make sure we had sufficient space, since we don't want it bumping out
> anything else.

We're just starting to see the beginnings of encroaching memory pressure on the caching layer in PHX1.  Since the SCL2 migration there is now replica vBucket ejection, but we aren't yet seeing active vBucket ejection, so everything being served from the caching layer is coming directly from memory still.  

We have additional capacity standing-by, and will be adding it soon.  I'm all for moving the high volume and rapidly changing traffic into the caching layer.
Flags: needinfo?(bobm)
(Reporter)

Comment 23

4 years ago
(In reply to Bob Micheletto [:bobm] from comment #22)
> > One potential improvement here - should we consider a writethrough cache for
> > clients? What's our memcache volume looking like, bobm? Obviously we'd need
> > to make sure we had sufficient space, since we don't want it bumping out
> > anything else.
> 
> We're just starting to see the beginnings of encroaching memory pressure on
> the caching layer in PHX1.  Since the SCL2 migration there is now replica
> vBucket ejection, but we aren't yet seeing active vBucket ejection, so
> everything being served from the caching layer is coming directly from
> memory still.  

If we start getting cache misses in prod, we're in for some hurt - the memcached access lib is known to block the gevent event loop.  That's OK when it's just blocking on a read from memory, but a lot less fun when it's blocking on something being fetched off disk.
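(Illustration only, not a proposal: one way to keep a blocking memcached client from stalling the gevent loop is to run its calls on the hub's native thread pool, e.g.:)

import gevent
import memcache  # python-memcached, a blocking client

mc = memcache.Client(["127.0.0.1:11211"])

def cache_get(key):
    # Runs mc.get in a real OS thread, so other greenlets keep running
    # even if the memcached server has to fetch the value off disk.
    return gevent.get_hub().threadpool.apply(mc.get, (key,))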
(Reporter)

Comment 24

4 years ago
(In reply to Toby Elliott [:telliott] from comment #8)
> So, the first, and probably most important, piece of data from the log
> crunching: our average history GET is returning 128K worth of data, and
> peaks as high as 13M.
> 
> It turns out that FF Desktop never uses the limit= parameter (Android does,
> <3 rnewman). Are we still marshalling the entire result set before sending
> the response?

Yes.  We build the entire JSON response in-memory as a string, then output it to the client in one go.  IIRC the same is true for the application/newlines output format.
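A rough sketch of what incremental output could look like instead - not the actual syncstorage code, just the shape of the change:

import json

def stream_newlines(rows):
    # Serialize and yield each record as soon as it is read, instead of
    # concatenating the whole result set into one giant string.
    for row in rows:
        yield json.dumps(row) + "\n"

# In a WSGI app the generator itself can be returned as the response body,
# e.g. return stream_newlines(db.iter_collection(userid, "history"))
# (db.iter_collection is a hypothetical streaming accessor).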
(Reporter)

Comment 25

4 years ago
(In reply to Richard Newman [:rnewman] from comment #19)
> (In reply to Richard Newman [:rnewman] from comment #18)
> > So then the confusion is: why do apparently 85% of requests not include
> > newer?
> 
> My theory:
> 
> Most collections don't change most of the time, particularly for
> single-device users. There will be more POSTs than GETs, thanks to
> info/collections.
> 
> Requests will omit newer if:
> 
> * A client has been node-reassigned
> * A reset has occurred
> * They're a mobile XUL device (apparently very rare)
> * It's a first sync
> * Firefox quit or crashed before saving prefs, so the next launch is a first
> sync, too.

So given the recent problems, these percentages may not be representative of "normal" use - some portion of them would be attributed to re-assignments and resets as part of dealing with the server issues.  Still much lower than I was expecting though.
Following up on comment 8, here's a map of log(10) history GET response sizes:

0 -> 47169
1 -> 14756
2 -> 14564
3 -> 48665
4 -> 88754
5 -> 31249
6 -> 5810
7 -> 74

Yes, that's 74 requests that returned 10M+ each.
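(For reference, the bucketing is just the integer part of log10 of the response size; given the list of sizes gathered earlier, something like:)

import math
from collections import Counter

def log10_histogram(sizes):
    # e.g. log10_histogram([950, 12000, 130000]) == Counter({2: 1, 4: 1, 5: 1})
    return Counter(int(math.log10(s)) for s in sizes if s > 0)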
same log(10) size map of bookmarks:

1 -> 4064
2 -> 13024
3 -> 36906
4 -> 15399
5 -> 8106
6 -> 1981
7 -> 11
(Reporter)

Comment 28

4 years ago
Per Bug 878021, we're seeing some gevent blocking in prod that hasn't been witnessed in stage.  This may be because the loadtests don't generate queries that are big enough, or badly-formed enough.
(Reporter)

Updated

4 years ago
Blocks: 907479
Anything we want to do here for Sync 1.5 load testing?
Priority: -- → P3
Priority: P3 → P4