Closed Bug 1324600 Opened 7 years ago Closed 6 years ago

Android sync not committing batch uploads

Categories

(Cloud Services Graveyard :: Server: Sync, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: mostlygeek, Assigned: mostlygeek)

References

Details

Seeing this on the go-syncstorage prototype server. An android client is creating new batches but never committing them.

User Agent: Firefox AndroidSync 1.52.0a2.0 (Firefox)

Cleaned logs: 

> [2016-12-19T01:07:10+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47101
> [2016-12-19T01:07:10+00:00]  "POST  /1.5/xxx/storage/history?batch=b1815    HTTP/1.1"  202  47878
> [2016-12-19T01:08:31+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106758
> [2016-12-19T01:08:31+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1816  HTTP/1.1"  202  48024
> ...
> [2016-12-19T01:13:38+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47161
> [2016-12-19T01:13:38+00:00]  "POST  /1.5/xxx/storage/history?batch=b1823    HTTP/1.1"  202  47905
> [2016-12-19T01:14:59+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106689
> [2016-12-19T01:15:00+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1824  HTTP/1.1"  202  48036

So far this client has created about 2000 batches that eventually expire and get cleaned up.
I dug through the server logs and it seems to be affect 3 other users using the same UA version.
Note: another user seems to have created close to 5500 batches that have never committed.
Assignee: nalexander → gkruglov
It looks like the client figured out that it's in the batching mode. That's good good.
It used the supplied token (at least once). It did not send a batch=TOKEN&commit=true payload (empty or with some records).

Client processes a stream of records to upload, and groups them into "payloads". Multiple payloads make up a batch (according to limits outlined in info/configuration). Once we fill up a payload, or a batch, we flush it. If we filled up a batch, we also commit it.

So, client will send a commit=true param in a few cases.
- If there are no more records to upload ("storeDone" has been called), and we have records we haven't uploaded yet.
- If there are no more records to upload ("storeDone" has been called), we've uploaded all of the records, but we're in a batching mode, and we've processed at least one record (that is, batch is marked as "needs commit").
- If the batch becomes full while processing an incoming record.
- If an incoming record is too large to fit within batch, and fills up the next batch (that must be very rare, if at all possible with per-record size limits).

Logic for "no more records to upload, need to commit?" resides here: [0]

Couple of theories:
- RepositorySession's storeDone() is never called, so we never hit `noMoreRecordsToUpload`.
--> from the logs in Comment 0, clients seem to move on to the next stage, implying that storeDone() && storeDone(timestamp) are called, which would mean that we do go through the "commitIfNecessaryAfterLastPayload" logic ([0]).
- BatchMeta's needsCommit is wrong. Or uploader's inBatchingMode is wrong. Or the if statements are messing up somehow.

[0] https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java?q=BatchingUploader.java&redirect_type=direct#164
(In reply to Benson Wong [:mostlygeek] from comment #1)
> Note: another user seems to have created close to 5500 batches that have
> never committed.

Did you see any android clients actually commit a batch? It would help to know if some or all of them are behaving incorrectly.

Also, are logs from Comment 0 in the right order? It's a bit strange, we normally sync bookmarks before we process history (incremental sync stuff which adds a "recent history" stage before bookmarks didn't land yet.)
Flags: needinfo?(bwong)
> Did you see any android clients actually commit a batch?

Sort of. I see a lot of batch=true&commit=true in the logs. However I never see a batch made up of more than one request (batch=<id>&commit=true) ever commit. 

Also it seems to be affecting several versions of Android: 

$ gunzip -c access.log-20161220.gz | grep 'Firefox AndroidSync' | grep batch=b |  grep -o '"Firefox[^"]*"' | sort | uniq -c
    209 "Firefox AndroidSync 1.51.0.0 (Firefox)"
    252 "Firefox AndroidSync 1.52.0a2.0 (Firefox)"
     34 "Firefox AndroidSync 1.53.0a1.0 (Firefox)"

> Also, are logs from Comment 0 in the right order?

Yep. Here's a longer dump

[2016-12-19T01:18:11+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106746
[2016-12-19T01:18:11+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1828  HTTP/1.1"  202  47928
[2016-12-19T01:18:23+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  46963
[2016-12-19T01:18:23+00:00]  "POST  /1.5/xxx/storage/history?batch=b1829    HTTP/1.1"  202  48001
[2016-12-19T01:19:49+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106833
[2016-12-19T01:19:49+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1830  HTTP/1.1"  202  47964
[2016-12-19T01:20:00+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47155
[2016-12-19T01:20:00+00:00]  "POST  /1.5/xxx/storage/history?batch=b1831    HTTP/1.1"  202  47845
[2016-12-19T01:21:26+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  107148
[2016-12-19T01:21:26+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1832  HTTP/1.1"  202  48060
[2016-12-19T01:21:38+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47089
[2016-12-19T01:21:38+00:00]  "POST  /1.5/xxx/storage/history?batch=b1833    HTTP/1.1"  202  48013
[2016-12-19T01:23:05+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106812
[2016-12-19T01:23:05+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1834  HTTP/1.1"  202  48030
[2016-12-19T01:23:17+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47086
[2016-12-19T01:23:17+00:00]  "POST  /1.5/xxx/storage/history?batch=b1835    HTTP/1.1"  202  47971
[2016-12-19T01:24:42+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106797
[2016-12-19T01:24:43+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1836  HTTP/1.1"  202  47883
[2016-12-19T01:24:54+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  46984
[2016-12-19T01:24:54+00:00]  "POST  /1.5/xxx/storage/history?batch=b1837    HTTP/1.1"  202  48022
[2016-12-19T01:26:14+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106920
[2016-12-19T01:26:14+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1838  HTTP/1.1"  202  48093
[2016-12-19T01:26:26+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47173
[2016-12-19T01:26:26+00:00]  "POST  /1.5/xxx/storage/history?batch=b1839    HTTP/1.1"  202  48055
[2016-12-19T01:27:47+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106806
[2016-12-19T01:27:47+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1840  HTTP/1.1"  202  47880
[2016-12-19T01:27:58+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47047
[2016-12-19T01:27:59+00:00]  "POST  /1.5/xxx/storage/history?batch=b1841    HTTP/1.1"  202  48046
[2016-12-19T01:29:19+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106902
[2016-12-19T01:29:20+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1842  HTTP/1.1"  202  47904
[2016-12-19T01:29:31+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47122
[2016-12-19T01:29:32+00:00]  "POST  /1.5/xxx/storage/history?batch=b1843    HTTP/1.1"  202  47965
[2016-12-19T01:30:57+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=true   HTTP/1.1"  202  106734
[2016-12-19T01:30:58+00:00]  "POST  /1.5/xxx/storage/bookmarks?batch=b1844  HTTP/1.1"  202  47913
[2016-12-19T01:31:09+00:00]  "POST  /1.5/xxx/storage/history?batch=true     HTTP/1.1"  202  47167
[2016-12-19T01:31:09+00:00]  "POST  /1.5/xxx/storage/history?batch=b1845    HTTP/1.1"  202  48025
Flags: needinfo?(bwong)
Maybe related, but I just realized that while debugging this on the gosync server I saw a Last-Modified timestamp change (bumped forward) in a 2** response to a non-commit payload (can't remember if response code was 200 or 202 now... I think it was 202). That doesn't seem right at all, given that we follow X-I-U-S mechanics? I'd expect to see a 412.

Android batching code recognizes this as a failure to store record(s) - see [0], but in a subtly different way than if it was an actual HTTP failure.

[0] https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadUploadDelegate.java#
Flags: needinfo?(bwong)
(In reply to :Grisha Kruglov from comment #5)
> Maybe related, but I just realized that while debugging this on the gosync
> server I saw a Last-Modified timestamp change (bumped forward) in a 2**
> response to a non-commit payload (can't remember if response code was 200 or
> 202 now... I think it was 202). That doesn't seem right at all, given that
> we follow X-I-U-S mechanics? I'd expect to see a 412.
> 
> Android batching code recognizes this as a failure to store record(s) - see
> [0], but in a subtly different way than if it was an actual HTTP failure.
> 
> [0]
> https://dxr.mozilla.org/mozilla-central/source/mobile/android/services/src/
> main/java/org/mozilla/gecko/sync/repositories/uploaders/
> PayloadUploadDelegate.java#

Although x-if-unmodified-since handling on the server side seems fine (https://github.com/mozilla-services/go-syncstorage/blob/3ef4c055c055184565eab3bf89be33196cdc044c/web/sentNotModified.go#L69)...
My local client's flow:
1) client makes a request to /1.5/****/storage/history?batch=true
  xius header: "x-if-unmodified-since: 1484029350.280" (local collection's l-m timestamp)
  ->> response: 202, batch token=b133, last-modified=1.48403162384E9, ...otherstuff...
2) client makes a consequent request to /1.5/****/storage/history?batch=b133
  xius header: "x-if-unmodified-since: 1484031623.840" (previously returned l-m timestamp)
  ->> response: 202, batch token=b133, last-modified=1.48403190377E9, ...otherstuff...

Observe that:
- 1st request was processed as a 202 by the server, even though x-i-u-s values passed up was lower than l-m value returned. I'd expect to see 412 here.
- 1st response was processed just fine by the client as well, even though returned l-m value was greater than x-i-u-s value. Nevertheless, client records the returned l-m value, and uses it later:
- While processing the second 202 response, client sees that new l-m value is different from previously recorded l-m value (specifically, greater). This wasn't a commit request, so client didn't expect l-m value to change yet, so it considers this request a failure, and marks every subsequent request a failure so as to not make them (no requests are made to the server for this collection in this sync past this point), and moves onto the next collection. This explains the fact that in logs in Comment 4 you only ever see one batch=true and then only one batch=b*** request for each collection, even though there are probably multiple batches worth of data for each collection.

So it seems that there are a few different things going on here.
1) client is not handling starting L-M/x-i-u-s values correctly. When we make the first batch=true POST, we pass in collection's last-modified timestamp. We do this to avoid data loss on the server due to concurrent modifications, and this will only work if our starting collection's last-modified value (what we pass as the first x-i-u-s header value) is something sane... e.g. our "latest known remote state of the collection". Bug 1291821 switches collections from sort=index to sort=oldest, making this easier.
2) it doesn't seem that server is not handling x-i-u-s timestamp correctly in this case - we should be seeing 412s in the 1st request! Looking at the Go code, could it be just that it's case sensitive while processing header values? [1] gets the header as `r.Header.Get("X-If-Unmodified-Since")`, while android passes a header as "x-if-unmodified-since" (note non-capitals)... Docs say [2] that header fields are case insensitive, so all of this should just work; http/2 demands that header fields are lower case [3]. And, this PR ([4]) clarifies that net/http's Header.Get is indeed case-insensitive... So I'm not sure what's going on on the server side.

[1] https://github.com/mozilla-services/go-syncstorage/blob/3ef4c055c055184565eab3bf89be33196cdc044c/web/sentNotModified.go#L26
[2] https://tools.ietf.org/html/rfc7230#section-3.2
[3] http://httpwg.org/specs/rfc7540.html#HttpHeaders
[4] https://go-review.googlesource.com/#/c/33530/
It seems the source of the problem is the ambiguity of X-Last-Modified and how it's handled by the android client: 

Go-sync server, the X-L-M value is: 

 - batch=true (creating a batch)
   > the current time [1]

 - batch=b1 (append to batch)
   > the current time [1]

 - batch=b1&commit=true (append and commit)
   > the last modified time of the collection [2]

 - batch=true&commit=true (create / commit)
   > the last modified time of the collection [2]

Python server, the X-L-M value is: 

 - batch=true (creating a batch)
   > Not set [3]

 - batch=b1 (append to batch)
   > Not set [3]

 - batch=b1&commit=true (append and commit)
   > the last modified time of the collection [4]

 - batch=true&commit=true (create / commit)
   > the last modified time of the collection [4]

However, the Android client seems to expect that X-L-M:

  - must exist, even for batch=true OR batch=b1 [5]
  - must be the last modified ts of the collection [6]

[1] https://github.com/mozilla-services/go-syncstorage/blob/195f8b9/web/syncUserHandler.go#L993
[2] https://github.com/mozilla-services/go-syncstorage/blob/195f8b9/web/syncUserHandler.go#L984
[3] https://github.com/mozilla-services/server-syncstorage/blob/bcf86ca/syncstorage/views/__init__.py#L401-L402
[4] https://github.com/mozilla-services/server-syncstorage/blob/bcf86ca/syncstorage/views/__init__.py#L416-L417
[5] https://dxr.mozilla.org/mozilla-central/rev/7011ed1427de2b6f075c46cc6f4618d3e9fcd2a4/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/PayloadUploadDelegate.java#62-68
[6] https://dxr.mozilla.org/mozilla-central/rev/7011ed1427de2b6f075c46cc6f4618d3e9fcd2a4/mobile/android/services/src/main/java/org/mozilla/gecko/sync/repositories/uploaders/BatchingUploader.java#262
Flags: needinfo?(bwong)
:rfk

Is my understanding of the python code accurate in Comment 8? 
  - For batch create/append, X-Last-Modified is not sent to the client
  - For batch commits, the collections last modified time is sent to the client
Flags: needinfo?(rfkelly)
I think the way the android client expect server responses is the right way. 

For batch create/append responses: 

 - X-Last-Modified always exists
 - X-Last-Modified is the last time the collection was updated

This way a client can abort a batch upload if another client modified the collection in the mean time. It doesn't have to wait until the commit to be told of a conflict.
Bug 1250189 Comment 75 and following are relevant here.

Returning the current timestamp when a change has not been made is never correct -- a subsequent request will see X-L-M going *back in time*, because the change wasn't committed, and clients expect these values to be monotonically increasing.

The API docs (which aren't entirely updated to match the batching API):

"This header gives the last-modified time of the target resource as seen during processing of the request, and will be included in all success responses (200, 201, 204). When given in response to a write request, this will be equal to the server’s current time and to the new last-modified time of any BSOs created or changed by the request."

The trick is in recognizing that all batch operations prior to commit are not really writes: they're preparation for a write.
> However, the Android client seems to expect that X-L-M:
>
>  - must exist, even for batch=true OR batch=b1 [5]
>  - must be the last modified ts of the collection [6]

This is the correct interpretation, so it sounds like both servers are in error in different ways here.  It should always exist and should always tell you the time at which a change was last successfully made to the collection.  (And indeed, an uncommitted batch should not count as a "change" because it's not visible to other clients)
Flags: needinfo?(rfkelly)
Fixed on go-sync with PR 155 [1]. It should behave as android expects in Comment 10. 

[1] https://github.com/mozilla-services/go-syncstorage/pull/155
> Python server, the X-L-M value is: 
> 
> - batch=true (creating a batch)
>   > Not set [3]
>
> - batch=b1 (append to batch)
>   > Not set [3]

I added some tests to ensure that the batch write responses have the expected X-Last-Modified header:


  https://github.com/mozilla-services/server-syncstorage/pull/55/files

The python server passed these tests without any modification, so either the tests are wrong, or the above summary of python server behavior is inaccurate. :mostlygeek can you please try these tests out to see whether they correctly detect the problem in the original go version, and correctly pass for your modified version?
Flags: needinfo?(bwong)
> The python server passed these tests without any modification, so either the tests are wrong,
> or the above summary of python server behavior is inaccurate.

Aha, the python server has this code to ensure that X-Last-Modified is present on all responses:

  https://github.com/mozilla-services/server-syncstorage/blob/bcf86caa65f63b0019a7ed5e36a0fb9d10eebd91/syncstorage/views/renderers.py#L25
(In reply to Ryan Kelly [:rfkelly] from comment #14)

> The python server passed these tests without any modification…

I find this tremendously reassuring.
Assignee: gkruglov → bwong
Component: Android Sync → Server: Sync
Product: Android Background Services → Cloud Services
go-sync 1.6.0r1 ... 

FAILS:

======================================================================
FAIL: test_batch_partial_update (syncstorage.tests.functional.support.LiveTestCases)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "syncstorage/tests/functional/test_storage.py", line 1531, in test_batch_partial_update
    self.assertEquals(orig_ts, float(resp.headers['X-Last-Modified']))
AssertionError: 1484104595.74 != 1484104595.76

----------------------------------------------------------------------
Ran 8 tests in 0.702s

with current master: 

PASS
Flags: needinfo?(bwong)
Deployed go-syncstorage:1.6.1r1 to production. 
Thanks for everybody's help on this issue.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
The fix appears to be working. The updated container go-syncstorage:1.6.1r1 was deployed at Wed 2017-01-11 18:27:58 UTC and a little while later: 

> [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 64171 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.201 0.011
> [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 81069 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.253 0.011
> [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 81616 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.256 0.013
> [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 82529 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.266 0.018
> [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 61596 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.203 0.021
> [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 69888 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.233 0.025
> [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 68707 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.222 0.016
> [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 60712 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.205 0.017
> [2017-01-11T18:54:10+00:00] "POST /1.5/58084743/storage/history?batch=b307&commit=true HTTP/1.1" 200 1551 67491 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.416 0.215

It is pretty interesting to see that the Android client sends multiple parallel requests.
> It is pretty interesting to see that the Android client sends multiple parallel requests.

... that or the timestamp format is not granular enough to show milliseconds. 

Appears to be fixed for AndroidSync 1.51.0.0 as well: 

> [2017-01-11T20:56:54+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 61437 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.519 0.010
> [2017-01-11T20:56:55+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 58743 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.483 0.014
> [2017-01-11T20:56:56+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 57690 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.632 0.012
> [2017-01-11T20:56:56+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 55578 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.453 0.013
> [2017-01-11T20:56:57+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 61333 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.543 0.014
> [2017-01-11T20:56:58+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 60877 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.571 0.012
> [2017-01-11T20:56:59+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 59325 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.513 0.013
> [2017-01-11T20:57:00+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 64172 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.555 0.014
> [2017-01-11T20:57:01+00:00] "POST /1.5/xxx/storage/history?batch=b733&commit=true HTTP/1.1" 200 1551 61221 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.686 0.208
(In reply to Benson Wong [:mostlygeek] from comment #20)
> The fix appears to be working. The updated container go-syncstorage:1.6.1r1
> was deployed at Wed 2017-01-11 18:27:58 UTC and a little while later: 
> 
> > [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 64171 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.201 0.011
> > [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 81069 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.253 0.011
> > [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 81616 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.256 0.013
> > [2017-01-11T18:54:08+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 82529 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.266 0.018
> > [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 61596 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.203 0.021
> > [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 69888 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.233 0.025
> > [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 68707 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.222 0.016
> > [2017-01-11T18:54:09+00:00] "POST /1.5/58084743/storage/history?batch=b307 HTTP/1.1" 202 1566 60712 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.205 0.017
> > [2017-01-11T18:54:10+00:00] "POST /1.5/58084743/storage/history?batch=b307&commit=true HTTP/1.1" 200 1551 67491 "Firefox AndroidSync 1.53.0a1.0 (Firefox)" 0.416 0.215
> 
> It is pretty interesting to see that the Android client sends multiple
> parallel requests.

I think that makes sense. As Android is processing a stream of records and builds up batches out of them, it will send these batches off whenever we hit per-batch limits. We try to process records as quickly as possible, so we'll be firing off requests pretty close in time to each other. Millisecond resolution will help see that better.
(In reply to Benson Wong [:mostlygeek] from comment #21)
> > It is pretty interesting to see that the Android client sends multiple parallel requests.
> 
> ... that or the timestamp format is not granular enough to show
> milliseconds. 
> 
> Appears to be fixed for AndroidSync 1.51.0.0 as well: 
> 
> > [2017-01-11T20:56:54+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 61437 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.519 0.010
> > [2017-01-11T20:56:55+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 58743 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.483 0.014
> > [2017-01-11T20:56:56+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 57690 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.632 0.012
> > [2017-01-11T20:56:56+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 55578 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.453 0.013
> > [2017-01-11T20:56:57+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 61333 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.543 0.014
> > [2017-01-11T20:56:58+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 60877 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.571 0.012
> > [2017-01-11T20:56:59+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 59325 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.513 0.013
> > [2017-01-11T20:57:00+00:00] "POST /1.5/xxx/storage/history?batch=b733 HTTP/1.1" 202 1566 64172 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.555 0.014
> > [2017-01-11T20:57:01+00:00] "POST /1.5/xxx/storage/history?batch=b733&commit=true HTTP/1.1" 200 1551 61221 "Firefox AndroidSync 1.51.0.0 (Firefox)" 0.686 0.208

It's awesome to see it working away, committing those batches! \o/
Found another instance of android failing to commit batches: 
The UA is: Firefox AndroidSync 1.51.0.0 (Firefox)

This is the broken user:

> [2017-01-19T23:03:55+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1"  202 1567 62094
> [2017-01-19T23:03:56+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b8613 HTTP/1.1" 202 1590 71046
> [2017-01-19T23:10:29+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1"  202 1567 62331
> [2017-01-19T23:10:30+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b8616 HTTP/1.1" 202 1590 70866
> [2017-01-20T00:15:18+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1"  202 1567 62235
> [2017-01-20T00:15:19+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b8633 HTTP/1.1" 202 1590 70983
> [2017-01-20T04:25:39+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1"  202 1567 62223
> [2017-01-20T04:25:40+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b8635 HTTP/1.1" 202 1590 70902

Unlike last time we have users with the same UA working successfully: 

> [2017-01-19T05:34:09+00:00] "POST /1.5/user1/storage/history?batch=true HTTP/1.1" 202 1557 65861 
> [2017-01-19T05:34:09+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 73612 
> [2017-01-19T05:34:10+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 122523 
> [2017-01-19T05:34:11+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 67960 
> [2017-01-19T05:34:12+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 59574 
> [2017-01-19T05:34:13+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 53329 
> [2017-01-19T05:34:14+00:00] "POST /1.5/user1/storage/history?batch=b283 HTTP/1.1" 202 1557 56821 
> [2017-01-19T05:34:14+00:00] "POST /1.5/user1/storage/history?batch=b283&commit=true HTTP/1.1" 200 756 30156 
> 
> 
> [2017-01-19T12:00:07+00:00] "POST /1.5/user2/storage/history?batch=true HTTP/1.1" 202 1557 62498 
> [2017-01-19T12:00:09+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 58896 
> [2017-01-19T12:00:10+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 52738 
> [2017-01-19T12:00:11+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 64894 
> [2017-01-19T12:00:12+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 56290 
> [2017-01-19T12:00:13+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 65727 
> [2017-01-19T12:00:14+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 67884 
> [2017-01-19T12:00:14+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 56762 
> [2017-01-19T12:00:15+00:00] "POST /1.5/user2/storage/history?batch=b518 HTTP/1.1" 202 1557 55559 
> [2017-01-19T12:00:16+00:00] "POST /1.5/user2/storage/history?batch=b518&commit=true HTTP/1.1" 200 1551 61846
Status: RESOLVED → REOPENED
Flags: needinfo?(gkruglov)
Resolution: FIXED → ---
Comment #24 compared bookmarks vs history. Filtering for just bookmark batches...

Another user with failed upload batches: 

> (9 requests)
> [2017-01-23T18:03:24+00:00] "POST /1.5/user3/storage/bookmarks?batch=true HTTP/1.1" 202 1555 54598
> [2017-01-23T18:03:25+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 55517
> [2017-01-23T18:03:25+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 56507
> [2017-01-23T18:03:26+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 60571
> [2017-01-23T18:03:26+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 57581
> [2017-01-23T18:03:27+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 57745
> [2017-01-23T18:03:27+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 59164
> [2017-01-23T18:03:28+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 57769
> [2017-01-23T18:03:28+00:00] "POST /1.5/user3/storage/bookmarks?batch=b181 HTTP/1.1" 202 1566 59811
> 
> (9 requests)
> [2017-01-23T18:03:29+00:00] "POST /1.5/user3/storage/bookmarks?batch=true HTTP/1.1" 202 1566 62664
> [2017-01-23T18:03:30+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 58211
> [2017-01-23T18:03:30+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 56100
> [2017-01-23T18:03:31+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 59131
> [2017-01-23T18:03:32+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 59237
> [2017-01-23T18:03:32+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 59039
> [2017-01-23T18:03:32+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 56662
> [2017-01-23T18:03:32+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 59508
> [2017-01-23T18:03:33+00:00] "POST /1.5/user3/storage/bookmarks?batch=b183 HTTP/1.1" 202 1566 59247

It does work:

> [2017-01-26T18:19:22+00:00] "POST /1.5/user4/storage/bookmarks?batch=true HTTP/1.1" 202 1551 38532
> [2017-01-26T18:19:24+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 40405  
> [2017-01-26T18:19:27+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 39388  
> [2017-01-26T18:19:28+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 48123  
> [2017-01-26T18:19:29+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 47847  
> [2017-01-26T18:19:30+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 47178  
> [2017-01-26T18:19:31+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 47104  
> [2017-01-26T18:19:32+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7 HTTP/1.1" 202 1564 48468  
> [2017-01-26T18:19:34+00:00] "POST /1.5/user4/storage/bookmarks?batch=b7&commit=true HTTP/1.1" 200 1056 31712


Grisha in bug 1320772, a too big BSO brakes the batch. Could something similar to be happening here? The BSO list hits a bookmark that is too large and aborts the batch on the client side?
It will always be possible for a client to fail mid-upload, and there's no mechanism to say we're aborting a batch, so you will definitely always see log entries like this. 

So long as they're not looping rapidly (which doesn't seem to be the case), or comprise a large proportion of our user base, investigating this is simply about routine quality improvement -- it's unlikely to be a bug. Prior to batch uploads these users would simply have errored out while preparing an HTTP body, leaving the server in an inconsistent state: the system works!

Possible causes:

- Malformed data in the local database.
- Failure to serialize a record (e.g., too large).
- Out of memory.
- Network disconnection or unreliability.
- Timeout hit.
(In reply to Benson Wong [:mostlygeek] from comment #25)
> Grisha in bug 1320772, a too big BSO brakes the batch. Could something
> similar to be happening here? The BSO list hits a bookmark that is too large
> and aborts the batch on the client side?

If a single record is larger than max_post_bytes defined by info/configuration, we won't upload it, but neither will we fail to commit. The client will let the flow of record complete (so, committing a batch if necessary), but it will mark that collection's sync as "failed", meaning that the next sync will re-upload data from the failed sync... Hopefully succeeding this time around.

Generally speaking, there is a wide variety of causes, as Richard said in Comment 26. Benson, are you seeing in the logs that a client is persistently failing to commit, never succeeding - or failing sporadically, or succeeding eventually after a series of failures, or..?
Flags: needinfo?(gkruglov)
(In reply to :Grisha Kruglov from comment #27)
> Benson, are you seeing in the logs that a client is persistently
> failing to commit, never succeeding - or failing sporadically, or succeeding
> eventually after a series of failures, or..?

I grepped a wide range of longs. The user3 from comment #25 was eventually able to upload that large batch. In between the failures they upgraded to Android 1.52.0.0. So I don't think we need to worry about that user. 

The user from Comment #24 seems to still be experiencing the issue. They also upgraded to 1.52.0.0 but it didn't help them. 

User agent: Firefox AndroidSync 1.52.0.0 (Firefox)

> [2017-01-30T02:43:01+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b9774 HTTP/1.1" 202 1590 71037 
> [2017-01-30T02:44:50+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1" 202 1567 62115  
> [2017-01-30T02:44:50+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b9777 HTTP/1.1" 202 1590 70884 
> [2017-01-30T03:07:29+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1" 202 1567 62193  
> [2017-01-30T03:07:30+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b9780 HTTP/1.1" 202 1590 71040 
> [2017-01-30T03:19:55+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1" 202 1567 62151  
> [2017-01-30T03:19:56+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b9782 HTTP/1.1" 202 1590 70908 
> [2017-01-30T16:51:06+00:00] "POST /1.5/xxx/storage/bookmarks?batch=true HTTP/1.1" 202 1567 62208  
> [2017-01-30T16:51:07+00:00] "POST /1.5/xxx/storage/bookmarks?batch=b9821 HTTP/1.1" 202 1590 71037

This user seems to have been failing on their batch upload for a few weeks. The batch id increments sequentially so they have attempted 9821 times so far. 

AFAICT this is the only user experiencing this problem on the server.
Sheesh. 

s/I grepped a wide range of longs/I grepped a wider range of logs/
Status: REOPENED → RESOLVED
Closed: 7 years ago6 years ago
Resolution: --- → INVALID
Product: Cloud Services → Cloud Services Graveyard
You need to log in before you can comment on or make changes to this bug.