Android Sync hangs on sync when server has missing (404) meta/global record

RESOLVED FIXED

Status

()

defect
RESOLVED FIXED
7 years ago
2 years ago

People

(Reporter: nalexander, Assigned: nalexander)

Tracking

unspecified
All
Android
Points:
---

Firefox Tracking Flags

(blocking-fennec1.0 beta+)

Details

(Whiteboard: [qa+])

Attachments

(12 attachments, 7 obsolete attachments)

1.76 KB, text/plain
Details
22.62 KB, text/plain
Details
5.17 KB, text/plain
Details
3.70 KB, text/plain
Details
5.10 KB, patch
rnewman
: review+
Details | Diff | Splinter Review
328.67 KB, text/plain
Details
185.76 KB, text/plain
Details
3.34 KB, patch
Details | Diff | Splinter Review
17.82 KB, text/plain
Details
8.41 KB, text/plain
Details
5.01 KB, text/plain
Details
27.77 KB, patch
Details | Diff | Splinter Review
On finding a missing (404) meta/global record, `FetchMetaGlobalStage` calls `GlobalSession.processMissingMetaGlobal()`.

`GlobalSession.processMissingMetaGlobal()`calls `GlobalSession.freshStart()` which calls `GlobalSession.freshStart(GlobalSession, FreshStartDelegate)` which calls `GlobalSession.wipeServer(GlobalSession, WipeServerDelegate)`.

The wipe server callback `onWiped` calls `MetaGlobal.upload(MetaGlobalDelegate)`, which has not been implemented.  `MetaGlobal.upload(MetaGlobalDelegate)` calls `SyncStorageRecordRequest.deferPut(JSONObject)` with parameter `null` (instead of generating a valid meta/global record body -- this is most of the content of Bug 709313).

`SyncStorageRecordRequest.deferPut(JSONObject)` spawns a new thread that calls `put` with the parameter.  This put tries calls `.toJSONObject()` on the parameter, which should throw `NullPointerException`.  This exception doesn't appear to be caught anywhere, and isn't reported.  Instead, we just die and never alert the monitor to fail the sync.
(In reply to Nick Alexander :nalexander from comment #1)
> Created attachment 621732 [details]
> Suggested minimal patch against m-c

This patch makes `MetaGlobal.upload(MetaGlobalDelegate)</tt> fail immediately.  This will abort the sync immediately.  Android Sync clients will continue to fail until Android stops them syncing or a Desktop client initializes the server with a correct meta/global record.
Just clarifying for triage: the purpose of this bug is to be an immediate stop-gap that we can get into a beta to avoid incorrect behavior before Bug 745430 and Bug 745431 are ready.

The idea is to minimize risk.
Nick's been doing some analysis of failure cases as part of his work on those bugs. We're tracking those here:

https://wiki.mozilla.org/User:Anaaktgeboren/NativeSync#Cases_that_need_testing_for_blank_server_and_corrupted_server
Posted file first sync with patch 2 (obsolete) —
Attachment #621832 - Attachment is obsolete: true
(In reply to Nick Alexander :nalexander from comment #11)
> Created attachment 621862 [details]
> Suggested patch against mozilla-aurora

This patch is longer because I commented things out as block comments.  I think this makes it easier to see exactly what has been removed.  I also made the session abort messages different (and hopefully more instructive) in each case.  See attached logs.
Attachment #621733 - Attachment is obsolete: true
Attachment #621734 - Attachment is obsolete: true
Attachment #621833 - Attachment is obsolete: true
Extracted from earlier communiques:

This will let us "do no harm" until we're ready to land stable release blocker code, which will include a more thorough solution to this case (namely, doing the right thing!) and address the remaining scenarios that we need to cover eventually, but are too involved to band-aid.

In my opinion this hits the sweet spot of very low risk and tidy damage-avoidance, and so I'd like to get this landed before we cut a beta if we can.
Attachment #621862 - Attachment is obsolete: true
Attachment #622011 - Flags: review+
Attachment #622011 - Flags: approval-mozilla-aurora?
Assignee: nobody → nalexander
Status: NEW → ASSIGNED
STRs via irc:

rnewman: you can test by deleting an account through Account Portal (if it's a newer account, it's more likely to not time out!)
[09:54am] rnewman: assuming that just wipes data
[09:55am] rnewman: failing that, do the "delete server" case here: https://wiki.mozilla.org/Services/Sync/Snippets#Partially_corrupt_a_server
[09:55am] tchung: rnewman: so what's the str -- sync account on android, delete account from AP, and try to resync on android?
[09:57am] rnewman: I suggest using the snippet instead, because on reflection I don't know precisely what AP does
[09:57am] rnewman: but yes
[09:57am] rnewman: setup, sync, clear server from desktop (and quit desktop to avoid another sync), sync android, verify from log that it doesn't upload a meta/global or hang
Whiteboard: [qa+]
My test procedure:

*** verify trivial crypto/keys changes are processed correctly
   - sync device to existing account
   - bumped crypto/keys modified time using
Components.utils.import("resource://services-sync/main.js");
Components.utils.import("resource://services-sync/resource.js");
let r = new Resource(Weave.Service.storageURL + "crypto/keys");
let g = r.get();
r.put(g);
   - device sync completed just fine, and correctly fetched fresh keys quietly:
I GlobalSession(13112)        Running next stage ensureKeysStage (org.mozilla.gecko.sync.stage.EnsureCrypto5KeysStage@411f9870)...
I EnsureC5KeysStage(13112)    Fetching fresh collection keys for this session.
D dalvikvm(13112)             GC_CONCURRENT freed 556K, 10% free 8254K/9095K, paused 2ms+4ms
I EnsureC5KeysStage(13112)    New default key is the same as old default key; returning changed individual keys.

*** verify deleted crypto/keys record correctly aborts sync
   - sync device to existing account
   - deleted crypto keys using
Components.utils.import("resource://services-sync/main.js");
Components.utils.import("resource://services-sync/resource.js");
function deletePath(path) {
  let resource = new Resource(Weave.Service.storageURL + path);
  resource.setHeader("X-Confirm-Delete", "1");
  return resource.delete();
}
deletePath("crypto/keys");
   - device sync correctly aborted with
W GlobalSession(19091)        Aborting sync: aborting after finding keys missing because uploading fresh crypto/keys record is not yet implemented.

*** verify missing meta/global record correctly aborts sync
   - sync device to existing account
   - wipe server, either using Account Portal or
Components.utils.import("resource://services-sync/main.js");
Components.utils.import("resource://services-sync/resource.js");
function deletePath(path) {
  let resource = new Resource(Weave.Service.storageURL + path);
  resource.setHeader("X-Confirm-Delete", "1");
  return resource.delete();
}
deletePath("");
   - device sync correctly aborted with
W GlobalSession(19091)        Aborting sync: aborting freshStart because uploading fresh meta/global record is not yet implemented.
dear mobile triage: as per rnewman's email to drivers this morning, we desire to uplift this, so as to do no harm while we fix teh release blockers related to the following issues: <https://wiki.mozilla.org/User:Anaaktgeboren/NativeSync#Cases_that_Android_Sync_does_not_handle:_blank_and_corrupted_servers>
blocking-fennec1.0: --- → ?
blocking-fennec1.0: ? → beta+
Attachment #622011 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
mfinkle & rnewman's race dropped the flag :)
blocking-fennec1.0: --- → beta+
didnt see the meta/global issue reported in comment 0, but here's my log.   Tested on HTC Desire, rnewman's try build.
I get an expected 401 Unauthorized.   No hang and Sync account on Android is saying "Sync is currently experiencing problems. It will be back shortly."
Posted file logcat 2
I reran the tests again, and this time i'm hitting the blank server during sync.

STR: Nexus S, rnewman's build --> sync android, on desktop run snippets (but dont kill desktop), and force sync on android afterwards.

05-08 21:41:42.031: INFO/SyncAdapter(718): Forced sync: overruling remaining backoff of 218937ms.
05-08 21:41:42.031: INFO/SyncAdapter(718): Got onPerformSync. Extras bundle is Bundle[{ignore_backoff=true, ignore_settings=true, force=true}]
05-08 21:41:42.031: INFO/SyncAdapter(718): Account name: testme@me.com
05-08 21:41:42.031: DEBUG/SyncAdapter(718): Invalidating auth token.
05-08 21:41:42.265: INFO/SyncAuthService(718): AccountManager.get(org.mozilla.gecko.sync.setup.SyncAuthenticatorService@40713fb0)
05-08 21:41:42.277: INFO/SyncAdapter(718): Waiting on sync monitor.
05-08 21:41:42.304: INFO/SyncAuthService(718): Setting username. Null? false
05-08 21:41:42.308: INFO/SyncAuthService(718): Setting Sync Key. Null? false
05-08 21:41:42.410: INFO/SyncAdapter(718): AccountManagerCallback invoked.
05-08 21:41:42.410: DEBUG/SyncAdapter(718): Username: 5hdzttjtqbtegdx53wirzg5rxzsybfon
05-08 21:41:42.410: DEBUG/SyncAdapter(718): Server:   https://auth.services.mozilla.com/
05-08 21:41:42.410: DEBUG/SyncAdapter(718): Password? true
05-08 21:41:42.410: DEBUG/SyncAdapter(718): Key?      true
05-08 21:41:42.460: INFO/SyncAdapter(718): Performing sync.
05-08 21:41:42.460: INFO/GlobalSession(718): GlobalSession initialized with bundle Bundle[{ignore_backoff=true, ignore_settings=true, force=true}]
05-08 21:41:42.480: INFO/SyncConfiguration(718): Set clusterURL from bundle: https://phx-sync322.services.mozilla.com/
05-08 21:41:42.480: INFO/SyncConfiguration(718): Set syncID from bundle: m6PuivGPdMBh
05-08 21:41:42.480: INFO/SyncAdapter(718): Stage completed: idle
05-08 21:41:42.484: INFO/GlobalSession(718): Running next stage checkPreconditions (org.mozilla.gecko.sync.stage.CheckPreconditionsStage@4072d4f8)...
05-08 21:41:42.484: INFO/SyncAdapter(718): Stage completed: checkPreconditions
05-08 21:41:42.484: INFO/GlobalSession(718): Running next stage ensureClusterURL (org.mozilla.gecko.sync.stage.EnsureClusterURLStage@4072afe8)...
05-08 21:41:42.484: INFO/EnsureClusterURLStage(718): Cluster URL is already set and not stale. Continuing with sync.
05-08 21:41:42.484: INFO/SyncAdapter(718): Stage completed: ensureClusterURL
05-08 21:41:42.484: INFO/GlobalSession(718): Running next stage fetchInfoCollections (org.mozilla.gecko.sync.stage.FetchInfoCollectionsStage@4071df70)...
05-08 21:41:42.953: INFO/InfoCollections(718): info/collections is {}
05-08 21:41:42.953: INFO/FetchInfoCollStage(718): Got timestamps: 
05-08 21:41:42.953: INFO/SyncAdapter(718): Stage completed: fetchInfoCollections
05-08 21:41:42.957: INFO/GlobalSession(718): Running next stage fetchMetaGlobal (org.mozilla.gecko.sync.stage.FetchMetaGlobalStage@4071ae70)...
05-08 21:41:42.957: INFO/FetchMetaGlobalStage(718): Fetching fresh meta/global for this session.
05-08 21:41:43.070: ERROR/GPS(112): [on_request_connection][line = 1061] : Failed with NO SESSION SLOT
05-08 21:41:43.191: WARN/SSResourceDelegate(718): HTTP request failed.
05-08 21:41:43.214: WARN/SSResourceDelegate(718): HTTP response body: 404 Not Found
05-08 21:41:43.214: WARN/SSResourceDelegate(718): The resource could not be found.
05-08 21:41:43.218: WARN/GlobalSession(718): Aborting sync: aborting freshStart because uploading fresh meta/global record is not yet implemented.
05-08 21:41:43.218: INFO/SyncAdapter(718): GlobalSession indicated error. Flagging auth token as invalid, just in case.
Any reason not to land this on inbound too?
(In reply to JP Rosevear [:jpr] from comment #26)
> Any reason not to land this on inbound too?

can someone from sync dev confirm my log results is correct and addresses the original hang?
(In reply to JP Rosevear [:jpr] from comment #26)
> Any reason not to land this on inbound too?

Rolling up a final patch for inbound now.

(In reply to Tony Chung [:tchung] from comment #27)
> can someone from sync dev confirm my log results is correct and addresses
> the original hang?

Confirmed.
Whiteboard: [qa+] → [qa+][landed for beta][needs to land on inbound]
We (mconnor & I) have decided not land this on inbound, since this is just a bandaid for aurora & beta while the proper patches land.

Thanks to nick for uploading inbound patches.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [qa+][landed for beta][needs to land on inbound] → [qa+]
Product: Mozilla Services → Android Background Services
Flags: needinfo?(nalexander)
Flags: needinfo?(nalexander)
Product: Android Background Services → Firefox for Android
You need to log in before you can comment on or make changes to this bug.