sync fails in extension storage with "TypeError: record is null"

VERIFIED FIXED in Firefox 54

Status


defect
VERIFIED FIXED

People

(Reporter: tortino, Assigned: glasserc)

Tracking

Version: unspecified
Target Milestone: Firefox 55
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox53+ wontfix, firefox54 fixed, firefox55 fixed)

Details

Attachments

(2 attachments)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:55.0) Gecko/20100101 Firefox/55.0
Build ID: 20170426030329

Steps to reproduce:

Configure Sync on the latest Nightly.



Actual results:

Sync started giving errors in extension storage (it was previously running OK).

1493222040683	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for savepage-we@DW-dev
1493222040843	FirefoxAccounts	DEBUG	getOAuthToken enter
1493222040844	FirefoxAccounts	DEBUG	getOAuthToken returning a cached token
1493222041094	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for undefined
1493222041356	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for undefined
1493222041382	Sync.Engine.Extension-Storage	ERROR	Syncing storage-sync-crypto: request failed: TypeError: record is null (resource://gre/modules/ExtensionStorageSync.jsm:207:11) JS Stack trace: decode/<@ExtensionStorageSync.jsm:207:11 < TaskImpl_run@Task.jsm:319:42 < TaskImpl@Task.jsm:277:3 < asyncFunction@Task.jsm:252:14 < Task_spawn@Task.jsm:166:12 < decode@ExtensionStorageSync.jsm:206:12 < decode@ExtensionStorageSync.jsm:319:20 < async*[6]</_decodeRecord/</<@kinto-offline-client.js:1291:24 < waitForSyncCallback@async.js:98:7 < makeSpinningCallback/callback.wait@async.js:168:27 < promiseSpinningly@async.js:234:12 < _sync@extension-storage.js:40:12 < WrappedNotify@util.js:162:21 < sync@engines.js:720:5 < _syncEngine@enginesync.js:218:7 < sync@enginesync.js:165:15 < onNotify@service.js:1079:7 < WrappedNotify@util.js:162:21 < WrappedLock@util.js:118:16 < _lockedSync@service.js:1069:12 < sync/<@service.js:1061:7 < WrappedCatch@util.js:93:16 < sync@service.js:1050:5
1493222041383	Sync.Status	DEBUG	Status for engine extension-storage: error.engine.reason.unknown_fail
1493222041384	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1493222041384	Sync.ErrorHandler	DEBUG	extension-storage failed: TypeError: record is null (resource://gre/modules/ExtensionStorageSync.jsm:207:11) JS Stack trace: decode/<@ExtensionStorageSync.jsm:207:11 < TaskImpl_run@Task.jsm:319:42 < TaskImpl@Task.jsm:277:3 < asyncFunction@Task.jsm:252:14 < Task_spawn@Task.jsm:166:12 < decode@ExtensionStorageSync.jsm:206:12 < decode@ExtensionStorageSync.jsm:319:20 < async*[6]</_decodeRecord/</<@kinto-offline-client.js:1291:24 < waitForSyncCallback@async.js:98:7 < makeSpinningCallback/callback.wait@async.js:168:27 < promiseSpinningly@async.js:234:12 < _sync@extension-storage.js:40:12 < WrappedNotify@util.js:162:21 < sync@engines.js:720:5 < _syncEngine@enginesync.js:218:7 < sync@enginesync.js:165:15 < onNotify@service.js:1079:7 < WrappedNotify@util.js:162:21 < WrappedLock@util.js:118:16 < _lockedSync@service.js:1069:12 < sync/<@service.js:1061:7 < WrappedCatch@util.js:93:16 < sync@service.js:1050:5
1493222041384	Sync.Telemetry	TRACE	observed weave:engine:sync:error extension-storage
1493222041385	Sync.Doctor	INFO	Skipping check of prefs - disabled via preferences

Possible causes I can think of:
1) I just updated to the latest Nightly build: 55.0a1 (2017-04-26) (64-bit).
2) I installed https://addons.mozilla.org/it/firefox/addon/save-page-we, but it may not be the culprit, since I have uninstalled it and the error is still present.
Component: Untriaged → Firefox Sync: Backend
Product: Firefox → Cloud Services
Version: 55 Branch → unspecified
Ethan, any ideas? (Also, how do you feel about a new Bugzilla component for storage.sync?)
Component: Firefox Sync: Backend → Sync
Flags: needinfo?(eglassercamp)
Product: Cloud Services → Firefox
Summary: sync failed in extension storage → sync fails in extension storage with "TypeError: record is null"
Hi, I tried on another PC (Win7 x64) and got this error after installing the "Save Page WE" extension:

1493237555141	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for [default]
1493237555141	Sync.CollectionKeyManager	INFO	Processing downloaded per-collection keys.
1493237555141	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for fastimageresearch@usacyborg.com
1493237555141	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for zoompage-we@DW-dev
1493237555142	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for savepage-we@DW-dev
1493237555142	Sync.CollectionKeyManager	INFO	Clearing collection keys...
1493237555142	Sync.CollectionKeyManager	INFO	Saving downloaded keys.
1493237555142	Sync.CollectionKeyManager	INFO	Bumping last modified to 1492062070839
1493237555213	FirefoxAccounts	DEBUG	getOAuthToken enter
1493237555216	FirefoxAccounts	DEBUG	getOAuthToken returning a cached token
1493237555590	Sync.Engine.Extension-Storage	ERROR	Syncing storage-sync-crypto: request failed: Error: Server has been flushed. (resource://services-common/kinto-offline-client.js:1748:15) JS Stack trace: [6]</pullChanges/<@kinto-offline-client.js:1748:15 < step@kinto-offline-client.js:868:183 < [6]</_asyncToGenerator/</</step/<@kinto-offline-client.js:868:361 < waitForSyncCallback@async.js:98:7 < makeSpinningCallback/callback.wait@async.js:168:27 < promiseSpinningly@async.js:234:12 < _sync@extension-storage.js:40:12 < WrappedNotify@util.js:162:21 < sync@engines.js:720:5 < _syncEngine@enginesync.js:218:7 < sync@enginesync.js:165:15 < onNotify@service.js:1079:7 < WrappedNotify@util.js:162:21 < WrappedLock@util.js:118:16 < _lockedSync@service.js:1069:12 < sync/<@service.js:1061:7 < WrappedCatch@util.js:93:16 < sync@service.js:1050:5
1493237555592	Sync.Status	DEBUG	Status for engine extension-storage: error.engine.reason.unknown_fail
1493237555592	Sync.Status	DEBUG	Status.service: success.status_ok => error.sync.failed_partial
1493237555592	Sync.ErrorHandler	DEBUG	extension-storage failed: Error: Server has been flushed. (resource://services-common/kinto-offline-client.js:1748:15) JS Stack trace: [6]</pullChanges/<@kinto-offline-client.js:1748:15 < step@kinto-offline-client.js:868:183 < [6]</_asyncToGenerator/</</step/<@kinto-offline-client.js:868:361 < waitForSyncCallback@async.js:98:7 < makeSpinningCallback/callback.wait@async.js:168:27 < promiseSpinningly@async.js:234:12 < _sync@extension-storage.js:40:12 < WrappedNotify@util.js:162:21 < sync@engines.js:720:5 < _syncEngine@enginesync.js:218:7 < sync@enginesync.js:165:15 < onNotify@service.js:1079:7 < WrappedNotify@util.js:162:21 < WrappedLock@util.js:118:16 < _lockedSync@service.js:1069:12 < sync/<@service.js:1061:7 < WrappedCatch@util.js:93:16 < sync@service.js:1050:5
1493237555592	Sync.Telemetry	TRACE	observed weave:engine:sync:error extension-storage
1493237555593	Sync.Doctor	INFO	Skipping check of prefs - disabled via preferences
1493237555593	Sync.Doctor	INFO	Skipping check of passwords - disabled via preferences
The "record is null" error:

This error came from the decryption code, but it isn't clear to me what's happening. It's reporting that it's trying to decrypt a record that isn't there. I guess it's some kind of bug in the Kinto code, but I'm baffled. The good news is that I tried installing the Save Page WE extension and was able to reproduce the "record is null" error relatively quickly, which means I should be able to track it down.

The "server is flushed" error:

We got another report about that in bug 1350088. I still have no idea what happened there. Maybe I'll trigger that issue while I'm investigating this one? I suggest that we use that other bug to track it.

Regarding creating a Bugzilla component, I don't have any preference, but I will go along with whatever you think is best.
OK, I spent a little time poking around at this. Something seriously strange is going on. Thanks for the report, Tortino!

I'm pretty sure I can rule out the specific web extension, since the failure is related to activity in the crypto collection (which stores keys and salts). I guess adding a web extension that uses chrome.storage caused a new key to be added, and that's what triggered all the sync failures.

I'm not sure exactly what caused it, but it looks like the server keyring got deleted. My storage-sync code tries to upload a keyring using an If-Match header, and that fails with a 412 Precondition Failed response because the keyring is gone. The 412 response normally includes the conflicting record, but in this case the conflict is that the record was deleted, so it's missing. The Kinto client code sees that there isn't a record and puts in "null" to stand in for it, then invokes my storage-sync code to decrypt the null record, which fails.
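
To make the sequence concrete, here is a tiny self-contained simulation of that failure mode. This is not the real kinto.js code -- the field names ("details.existing", "ciphertext") are illustrative assumptions, not the actual wire format:

// Hypothetical sketch only -- not the real kinto.js internals.
function decodeRecord(record) {
  // The storage-sync decoder assumes it is always given a real record.
  return JSON.parse(record.ciphertext); // throws a TypeError when record is null
}

function resolveConflict(response412) {
  // When the conflicting record was deleted (a tombstone), the client
  // substitutes null for it, then still asks the decoder to decode it.
  const remote = response412.details.existing || null;
  return decodeRecord(remote);
}

// A 412 whose conflicting record is gone because the server keyring was deleted:
const deletedConflict = { status: 412, details: { existing: null } };
try {
  resolveConflict(deletedConflict);
} catch (e) {
  console.log(e.toString()); // a TypeError complaining that the record is null
}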

I changed a couple of minor things in kinto.js so it doesn't try to decode "null" and is generally more robust against it, and that basically worked. But the storage-sync code, as it stands today, tries to use the SERVER_WINS strategy when there's a conflict, and in this case, because the server keyring is gone, that isn't possible; trying to do so leads to other failures later.

I think we could add code that detects this situation and treats it like the "can't decrypt keyring" case -- wipe the server and reupload everything -- but that's just a band-aid, and I'm a little nervous about doing that without a better idea of what happened.
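
If we do add that band-aid, it would look roughly like the following sketch. It reuses the _deleteBucket() and cryptoCollection.resetSyncStatus() operations that come up later in this bug; the surrounding function and discriminator names are hypothetical, and the real patch may differ:

// Sketch of the proposed recovery, not the landed patch.
async function syncCryptoCollection(extensionStorageSync, collection) {
  try {
    await collection.sync();
  } catch (e) {
    if (!looksLikeKeyringGone(e)) {
      throw e;
    }
    // Same recovery as the "can't decrypt keyring" case: wipe everything
    // in the user's bucket on the server, forget the old server state,
    // and reupload from local data on the next sync.
    await extensionStorageSync._deleteBucket();
    await extensionStorageSync.cryptoCollection.resetSyncStatus();
    await collection.sync();
  }
}

// Placeholder discriminator; the real code would use something sturdier
// than matching on the message text.
function looksLikeKeyringGone(e) {
  return /keyring/.test(e.message) && /null|deleted|disappear/.test(e.message);
}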

As for what actually happened, I'm at a loss. The fact that it happened to my testing account eglassercamp@mozilla.com as well as my personal account is suspicious. In bug 1323240, we wiped the dev server, but this is on the prod server https://webextensions.settings.services.mozilla.com/ . The only thing in the storage-sync code that tries to delete the keyring is the "delete bucket" call, which should wipe the user's entire bucket -- but although my server keyring is gone, the collections for some web extensions are still around and still encrypted. (I'm not sure which extensions, of course, because the names are all hashed.)

Trying to look for a tombstone or deleted record in the storage-sync-crypto collection doesn't work. Doing a request for 'https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections/storage-sync-crypto/records?_since=0' gets me an empty list. Does that mean the collection itself was destroyed?

Requesting the collection itself, I get:

{
    "data": {
        "id": "storage-sync-crypto",
        "last_modified": 1493325878452
    },
    "permissions": {
        "write": [
            "fxa:..."
        ]
    }
}

That date is from only a couple of hours ago, so maybe it's something in my storage-sync code that triggered it. Or maybe the collection was blown away and recreated automatically when I tried to look at it?

I'm needinfo'ing :markh to tell me whether he thinks it's important/urgent enough to try to put in the band-aid that wipes the server when this happens and :leplatrem in case he can think of a reason why the server keyring would have disappeared without a trace, or if he has any ideas on how to continue troubleshooting this.
Flags: needinfo?(mathieu)
Flags: needinfo?(markh)
Flags: needinfo?(eglassercamp)
(In reply to Ethan Glasser-Camp (:glasserc) from comment #4)
> I'm needinfo'ing :markh to tell me whether he thinks it's important/urgent
> enough to try to put in the band-aid that wipes the server when this happens

TBH, I'm a little torn about this. On one hand, I don't think it is urgent enough to rush a change out before we know what went wrong. OTOH, it seems like this will always be unrecoverable, so ultimately it seems almost certain that we will wipe in this case even once we discover (and hopefully fix) the underlying issue. So I think we should probably spend some time trying to diagnose this further (roping in the Kinto team to help), but if it looks like we will ride the next train without a fix, we should probably add the wipe.

FYI, https://gist.github.com/mhammond/01907397057deb00edec1d6616c0a17c has some interesting data (although it is about 2 weeks old):

* around 99.7% of extension-storage syncs succeed.
* of those failing:
** 80% are a shutdown error - hopefully this is simply a "real" shutdown, although it's possible that somehow the sync is "hanging" until the user shuts down - it's not clear.
** around 20% are "Server has been flushed"
** "record is null" doesn't appear - although "json is null" does. I'm not sure if that is a clue or a red-herring.

I'll be running this report next week, so I'll let you know if any of these values have recently changed.
Flags: needinfo?(markh)
I can confirm that my 3 add-on collections haven't been touched since January, while I apparently don't have any crypto collection.

> $ http GET https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections Authorization:"Bearer ${OAUTH_BEARER_TOKEN=}"

HTTP/1.1 200 OK
Access-Control-Expose-Headers: Pragma, Last-Modified, Cache-Control, Alert, Next-Page, Backoff, Content-Length, ETag, Total-Records, Expires, Retry-After
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Length: 322
Content-Type: application/json
Date: Fri, 28 Apr 2017 10:41:03 GMT
ETag: "1485797446046"
Last-Modified: Mon, 30 Jan 2017 17:30:46 GMT
Total-Records: 3
X-Content-Type-Options: nosniff

{
    "data": [
        {
            "id": "3e0f013d2362aa4fa7b0db2fefb40011b9545d78dcca7aabc0d32f0b42f1733d", 
            "last_modified": 1485797446046
        }, 
        {
            "id": "5ce482b8983e72ae98cfe65aba2fe9ac05609b154177b3bc092d1fa2e8d78086", 
            "last_modified": 1484764312436
        }, 
        {
            "id": "f350a394d038ad1f917a7191645f8542fecb4435f2f12f43f8b6dd70f38e69e7", 
            "last_modified": 1484667715388
        }
    ]
}
If you want to try with your account:

 $ pip install PyFxA httpie
 $ fxa-client --bearer --scopes sync:addon_storage --oauth-server https://oauth.accounts.firefox.com/v1 --account-server https://api.accounts.firefox.com/v1 --auth fxa@email.address.tld --out env.sh
 $ source env.sh
 $ http GET https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections Authorization:"Bearer ${OAUTH_BEARER_TOKEN=}"
Thanks for posting the fxa-client command, Rémy, that's super useful. But I don't think this bug affects you. The collection IDs you have seem like they're quite old. Mine are all ext-[base64-encoded-junk], like "ext-Ccc4rbVqgGSk-K7sUJv_68gQdPlmqiqNkffN4uv9-Jc". I bet yours are from testing Hoverpad or something.
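
(As an aside, for anyone puzzled by the opaque IDs: the server-side collection names are derived from the extension ID, roughly along the lines of the sketch below. This is only an illustration of why they look like random junk -- the actual derivation, including which hash, key, and salt are used, lives in ExtensionStorageSync.jsm, and everything in the sketch is an assumption.)

// Hypothetical illustration only; not the real derivation.
const crypto = require("crypto");

function illustrativeCollectionId(extensionId, perUserSaltBase64) {
  const digest = crypto
    .createHmac("sha256", Buffer.from(perUserSaltBase64, "base64"))
    .update(extensionId, "utf8")
    .digest("base64")
    .replace(/\+/g, "-") // base64url, unpadded
    .replace(/\//g, "_")
    .replace(/=+$/, "");
  return `ext-${digest}`;
}

console.log(illustrativeCollectionId("savepage-we@DW-dev", "c29tZS1zYWx0"));
// -> an opaque "ext-..." string, different for every user/salt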

Tortino, could you run the commands that Rémy posted and tell me whether there is anything in the `data` response, and especially if there is a `storage-sync-crypto` in there anywhere? Just to make sure I'm looking at the same bug.

markh, if this bug reflects a disappearing storage-sync-crypto collection, then the client won't even notice until a new extension is added. I think that's what happened to me. I installed Snap Links and Redirector in late March (maybe March 29?) while investigating bug 1350088, and the sync succeeded (so they must have had keys in an uploaded keyring) every hour until I installed Save Page WE yesterday, April 27th, at 4:15 PM EDT. The fact that this happened to both my profiles is super suspicious.

One possibility is that the storage-sync-crypto collections were deleted for a bunch of users at once, which would almost certainly be a server-side issue. But we can't figure out anything that would cause that. We've tried inducing it by running certain requests against the server, for example manually deleting the bucket, but no dice yet. Another possibility is that it's failing to ever upload the keyring in the first place, for example because the transaction failed to commit, but then how did the client get the ETag to use in subsequent If-Matches?

I just tried to induce it again by resetting sync status, restarting nightly, and installing SnapLinks (which also uses chrome.storage), but it didn't seem to do anything strange. I still have a keyring.

I thought something might be going on with the prod server getting flushed (maybe that would also explain the "server was flushed" errors?) but the flush endpoint isn't present. Trying to flush it just gets me a 404.

Is it maybe possible that something happened to the database in the last month? It's a pretty far-fetched theory, because most horrible things that could happen to a database wouldn't be so selective as to remove just one storage-sync-crypto collection -- they'd blow away an entire table or even the whole database.

Another useful command -- in the browser console, you can do:

Cu.import("resource://gre/modules/ExtensionStorageSync.jsm"); extensionStorageSync.cryptoCollection.resetSyncStatus()

This will reset the sync status of the crypto collection (make the client forget about whatever is supposed to be on the server), which will make it upload the keyring as "new". On my "real" profile, that made the sync errors go away (although it doesn't necessarily mean that the synced data is correct), and I checked that the keyring is synced to the server correctly.

I thought about adding logging to ExtensionStorageSync.jsm, but by default, logging is discarded for successful syncs, and since I experienced a month of successful syncs before seeing any error, I'm not sure doing anything like that would be useful.
Flags: needinfo?(mathieu) → needinfo?(tortino)
(In reply to Ethan Glasser-Camp (:glasserc) from comment #8)
> markh, if this bug reflects a disappearing storage-sync-crypto collection,
> then the client won't even notice until a new extension is added. I think
> that's what happened to me.

I'm a little confused by the terminology here, and TBH the implementation details of the extension sync implementation remain a little fuzzy. Sync itself has a "crypto" collection, and IIUC, ExtensionStorageSync uses the same terminology (eg, "CryptoCollection") to refer to a "kinto collection". I can't remember how this relates to the "key ring", but IIRC, that key ring does use the sync "crypto collection" for the encryption keys - but how all these relate to each other is getting fuzzier in my recollection as time goes by.

But what I *do* know clearly is that the *sync* crypto collection may vanish at any time - eg, when a user resets (not changes) their FxA password, or when we perform a "node reassignment" after a server goes down or similar - and note that in March this year, we forcibly node-reassigned *every* sync user due to a server bug.

> One possibility is that the storage-sync-crypto collections were deleted for
> a bunch of users at once, which would almost certainly be a server-side
> issue. But we can't figure out anything that would cause that. We've tried
> inducing it by running certain requests against the server, for example
> manually deleting the bucket, but no dice yet. Another possibility is that
> it's failing to ever upload the keyring in the first place, for example
> because the transaction failed to commit, but then how did the client get
> the ETag to use in subsequent If-Matches?

Is it possible that losing the sync crypto collection could cause the client to delete the kinto crypto collection?

> 
> I just tried to induce it again by resetting sync status, restarting
> nightly, and installing SnapLinks (which also uses chrome.storage), but it
> didn't seem to do anything strange. I still have a keyring.

Maybe try resetting (not changing) your FxA password?

> I thought about adding logging to ExtensionStorageSync.jsm, but by default,
> logging is discarded for successful syncs, and since I experienced a month
> of successful syncs before seeing any error, I'm not sure doing anything
> like that would be useful.

Note that this issue was opened because messages were noticed in sync logs, so IMO (and within reason), the more logging the better.
(In reply to Mark Hammond [:markh] from comment #9)
> Note that this issue was opened because messages were noticed in sync logs,
> so IMO (and within reason), the more logging the better.

Note also that logging at ERROR level is enough to cause a sync error log to appear - it's not necessary for the sync to actually fail.
(In reply to Mark Hammond [:markh] from comment #9)
> I'm a little confused by the terminology here, and TBH the implementation
> details of the extension sync implementation remain a little fuzzy. Sync
> itself has a "crypto" collection, and IIUC, ExtensionStorageSync uses the
> same terminology (eg, "CryptoCollection") to refer to a "kinto collection".
> I can't remember how this relates to the "key ring", but IIRC, that key ring
> does use the sync "crypto collection" for the encryption keys - but how all
> these relate to each other is getting fuzzier in my recollection as time
> goes by.

Does https://wiki.mozilla.org/CloudServices/Sync/ExtensionStorage_Design_Doc answer any of those questions? Do you think I should add something, either to that document or some other place, to help unfuzzify the extension sync implementation? (I guess this is going off on a tangent that isn't related to this bug, but while we're here...)

For "extension storage" (i.e. chrome.storage.sync), extensions store their data in different Kinto collections. One Kinto collection (called "storage-sync-crypto") also exists for use by the chrome.storage.sync code itself. In this collection is a single record (with id: "keys") that stores the keys and salts used in other collections. I call this record the "keyring" in code and documentation. None of this should have anything to do with the Firefox Sync "crypto collection" or its own key management.

The point I was trying to make is that we only sync the Kinto "storage-sync-crypto" collection when a new extension is added, so syncs will still "succeed" even if the collection has been blown away.
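
To make the layout concrete, here is roughly what the client stores server-side. The collection names and the record id "keys" come from this bug; the field names inside the keyring record are illustrative assumptions, not the real schema (see the design doc for that):

// Illustrative only -- everything except the collection names and the
// record id "keys" is an assumed, simplified shape.
const serverSideLayout = {
  "storage-sync-crypto": [
    {
      id: "keys", // the single "keyring" record
      keys: {
        "ext-Ccc4rbVqgGSk-K7sUJv_68gQdPlmqiqNkffN4uv9-Jc": "<per-extension key>",
      },
      salts: {
        "savepage-we@DW-dev": "<salt used to derive that extension's collection name>",
      },
    },
  ],
  // Each extension's actual chrome.storage.sync data lives in its own
  // "ext-..." collection, encrypted with the key held in the keyring above.
  "ext-Ccc4rbVqgGSk-K7sUJv_68gQdPlmqiqNkffN4uv9-Jc": ["<encrypted records>"],
};

console.log(Object.keys(serverSideLayout));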

> But what I *do* know clearly is that the *sync* crypto collection may vanish
> at any time - eg, when a user resets (not changes) their FxA password, or
> when we perform a "node reassignment" after a server goes down or similar -
> and note that in March this year, we forcibly node-reassigned *every* sync
> user due to a server bug.

My understanding is that this does not happen in the same way for Kinto storage. Users are not at present even sharded, as far as I know, but instead all exist in one cluster that serves data from an RDS cluster. We do have a strategy for sharding users later as needed, but it would involve managing the user -> cluster mapping on the server side. In no circumstances do I expect Kinto collections to vanish, which is what I believe I'm seeing.

> Is it possible that losing the sync crypto collection could cause the client
> to delete the kinto crypto collection?

I guess anything is possible, but I don't really see how. The only delete that we explicitly do is of the entire "bucket" (including all collections), and that doesn't seem to be what we're seeing.

> Maybe try resetting (not changing) your FxA password?

Since I started debugging this bug, I left a copy of Nightly running with my "work" profile (eglassercamp@mozilla.com). On Friday at the end of the day, I made this request:

> http 'https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections' "Authorization":"Bearer ...."
HTTP/1.1 200 OK
Access-Control-Expose-Headers: Total-Records, Next-Page, Expires, Pragma, ETag, Last-Modified, Backoff, Alert, Content-Length, Cache-Control, Retry-After
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Length: 695
Content-Type: application/json
Date: Fri, 28 Apr 2017 19:39:23 GMT
ETag: "1493392083418"
Last-Modified: Fri, 28 Apr 2017 15:08:03 GMT
Total-Records: 8
X-Content-Type-Options: nosniff

{
    "data": [
        {
            "id": "ext-YBd-wajDRq2TVfXWT86UUWvfwlt93isC2u-V1pkTZJ4",
            "last_modified": 1493392083418
        },
        {
            "id": "ext-mhIlAiX9AvbWacInLz_PMPTv1gRgQFsVyJDoI4b8e0k",
            "last_modified": 1493388946175
        },
        {
            "id": "storage-sync-crypto",
            "last_modified": 1493325878452
        },
        {
            "id": "ext-S-AQ0QMIng5bePiHjgedlbOOobggAHae5xrPhYaCIDU",
            "last_modified": 1492711941186
        },
        {
            "id": "ext-IL2fg2T0QrHEuaMh2k-zDfJiZbj-y_Ks-0IFnubg7fs",
            "last_modified": 1492707209384
        },
        {
            "id": "ext-h_w8Dgj7fK032HuRzPQGt0xo2n92v_CaLu1OscDtWkM",
            "last_modified": 1492707209174
        },
        {
            "id": "ext-yfqyYZuvrA2nJz_MKaWbQXUfaSyci4DyJlAI-AQmTc8",
            "last_modified": 1490787265481
        },
        {
            "id": "28af1c9aa38c87a829d2017aa1866fd4aba8ddfc1d531b943087441f57b08258",
            "last_modified": 1483480157187
        }
    ]
}

You can see my "storage-sync-crypto" collection is present, as well as collections for certain webextensions (all the "ext-...." ones), plus one that seems to be left over from a previous generation of the hash mechanism (the "28af..." one).

Here's what I saw when I repeated the request this morning:

> http 'https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections' "Authorization":"Bearer ...."
HTTP/1.1 200 OK
Access-Control-Expose-Headers: Pragma, Last-Modified, Cache-Control, Alert, Next-Page, Backoff, Content-Length, ETag, Total-Records, Expires, Retry-After
Cache-Control: no-cache, no-store
Connection: keep-alive
Content-Length: 636
Content-Type: application/json
Date: Mon, 01 May 2017 13:52:28 GMT
ETag: "1493392083418"
Last-Modified: Fri, 28 Apr 2017 15:08:03 GMT
Total-Records: 7
X-Content-Type-Options: nosniff

{
    "data": [
        {
            "id": "ext-YBd-wajDRq2TVfXWT86UUWvfwlt93isC2u-V1pkTZJ4",
            "last_modified": 1493392083418
        },
        {
            "id": "ext-mhIlAiX9AvbWacInLz_PMPTv1gRgQFsVyJDoI4b8e0k",
            "last_modified": 1493388946175
        },
        {
            "id": "ext-S-AQ0QMIng5bePiHjgedlbOOobggAHae5xrPhYaCIDU",
            "last_modified": 1492711941186
        },
        {
            "id": "ext-IL2fg2T0QrHEuaMh2k-zDfJiZbj-y_Ks-0IFnubg7fs",
            "last_modified": 1492707209384
        },
        {
            "id": "ext-h_w8Dgj7fK032HuRzPQGt0xo2n92v_CaLu1OscDtWkM",
            "last_modified": 1492707209174
        },
        {
            "id": "ext-yfqyYZuvrA2nJz_MKaWbQXUfaSyci4DyJlAI-AQmTc8",
            "last_modified": 1490787265481
        },
        {
            "id": "28af1c9aa38c87a829d2017aa1866fd4aba8ddfc1d531b943087441f57b08258",
            "last_modified": 1483480157187
        }
    ]
}

The storage-sync-crypto collection is totally gone. All the other collections appear to be untouched. Their last_modifieds are even still the same. This doesn't seem like a bucket delete. Whatever is happening seems to be limited to just this collection.

Before I started the browser, I had made a couple modifications to kinto-http-client.js to dump() more information whenever talking to Kinto in any context (blocklists, storage.sync, whatever). I didn't do it in a principled way (I should have logged response headers too I guess), but I do have response bodies for every request made. I have that log now and I don't see anything in it that looks like the response to a delete. (I'd expect to see something like `{"data": {"deleted": true, ....}}`.) So my suspicion is that this isn't related to anything on the client. One thing I'm pretty confident about is that I didn't reset my password for my work profile over the weekend, so I think we can rule out that theory.

While I was there, I ran this series of commands in the browser console: (Don't try this at home!)

Cu.import("resource://gre/modules/ExtensionStorageSync.jsm")
extensionStorageSync._deleteBucket()

This _deleteBucket method is the one that should be called when the server notices that the user's password has been reset. I verified that calling it dumped this HTTP response to the console:

{"data":{"deleted":true,"last_modified":1493649904646,"id":"12b86dab-5ae4-8e21-5c56-7d6600156800"}}

Additionally, making the same HTTP request as above (to get the list of collections) now returns an empty list. So I'm pretty sure that _deleteBucket is working correctly (deleting all of the user's collections) and that my client never sent any delete request (because I would have seen it in the logs).

> Note that this issue was opened because messages were noticed in sync logs,
> so IMO (and within reason), the more logging the better.

OK, I will upload a patch that logs as ERROR when we delete the bucket, since it feels like a dangerous operation and we would like to know when it happens. I might as well also add code that handles the "keyring went missing" case -- if I'm right and this isn't related to the client at all, I will be able to observe this behavior even with the client completely shut down. Also, the fix I'm going to add will only take effect when we try to sync the storage-sync-crypto collection, which only happens when we have a new extension, so probably we'll also be able to see it "before" the fix happens.
(In reply to Ethan Glasser-Camp (:glasserc) from comment #8)
> 
> Tortino, could you run the commands that Rémy posted and tell me whether
> there is anything in the `data` response, and especially if there is a
> `storage-sync-crypto` in there anywhere? Just to make sure I'm looking at
> the same bug.

From what I can see, the commands are for a Unix/Linux shell, and I'm using Windows 10.
I'd need detailed steps to execute them.
Flags: needinfo?(tortino)
Ah, never mind, Tortino; I think I have a clearer idea of what's going on now.

I investigated sync node reassignment and that doesn't seem to be related (no node reassignment happened for my eglassercamp@mozilla.com profile during the relevant times).

We saw a strange spike in RDS I/O that could have happened at the same time; see e.g. https://app.datadoghq.com/dash/241098/kinto-webextensions-prod?live=false&page=0&is_auto=false&from_ts=1493509089627&to_ts=1493567463523&tile_size=m&fullscreen=178304222. I looked at this with leplatrem this morning, and it seems to be the effect of a user deleting their bucket.

I asked :jason to turn on RDS logging for the WebExtensions cluster, with an eye specifically for DELETE queries.

I reset the sync status of my eglassercamp@mozilla.com profile and resynced the keyring to the server. Then, assuming that my keyring would disappear again, I tried to narrow down the exact time the collection was deleted using a small polling script (fish shell) like:

while true; echo -n (date) ' '; http 'https://webextensions.settings.services.mozilla.com/v1/buckets/default/collections' "Authorization":"Bearer .........." | jq '.data | any(.id == "storage-sync-crypto")'; sleep 60s; end

This indicated that a storage-sync-crypto deletion happened for eglassercamp@mozilla.com at around 14:55-14:57 EDT. According to Datadog, this was also the time that a bucket delete happened. This also prompted a bunch of errors in Sentry that are symptoms rather than causes (see https://github.com/Kinto/kinto/issues/1202).

My current best theory is that one user's "delete bucket" command is somehow spilling over into other users' data. How or why this happens is still unknown. I'm hoping that getting the logs from RDS will shed some light on this -- some query that got screwed up somehow, or something.
Comment on attachment 8863436 [details]
Bug 1359879: add logging and robustness to key syncing,

https://reviewboard.mozilla.org/r/135210/#review138738

::: toolkit/components/extensions/ExtensionStorageSync.jsm:339
(Diff revision 1)
> +        // keyring -- wiping the bucket and reuploading everything.
> +        //
> +        // Eventually we will probably fix the bug in kinto.js, and
> +        // this will have to move somewhere else, probably in the code
> +        // that detects a resolved conflict.
> +        KeyRingEncryptionRemoteTransformer.throwServerKeyringDeleted();

Can we do this outside of the try-catch block so we don't hit the isHMACMismatch check?

::: toolkit/components/extensions/ExtensionStorageSync.jsm:368
(Diff revision 1)
>        (exc.message.indexOf(kbMessage) == 0);
>    }
> +
> +  // Generator and discriminator for keyring-was-deleted exceptions.
> +  static throwServerKeyringDeleted() {
> +    throw new Error("server keyring appears to have disappeared; we were called to decrypt null");

Can we use an error subclass for this? E.g., `class ServerKeyringDeleted extends Error {}` and then check `err instanceof ServerKeyringDeleted` below?
Attachment #8863436 - Flags: review?(kmaglione+bmo) → review+
With the logs that :jason provided, I was able to figure out what was going on and filed https://github.com/Kinto/kinto/issues/1209. This is a server-side bug, but because the patch provides recovery, I'll try to land it ASAP.
Comment on attachment 8863436 [details]
Bug 1359879: add logging and robustness to key syncing,

https://reviewboard.mozilla.org/r/135210/#review138738

> Can we do this outside of the try-catch block so we don't hit the isHMACMismatch check?

Yes, thanks!

> Can we use an error subclass for this? E.g., `class ServerKeyringDeleted extends Error {}` and then check `err instanceof ServerKeyringDeleted` below?

Yes, thanks! I didn't realize this was OK or encouraged; I was just copying e.g. https://dxr.mozilla.org/mozilla-central/rev/a748acbebbde373a88868dc02910fb2bc5e6a023/services/sync/modules/util.js#249.
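
For reference, the pattern suggested in the review looks roughly like this minimal sketch (not the exact code that landed in ExtensionStorageSync.jsm):

// Minimal sketch of the error-subclass pattern; decryption is omitted.
class ServerKeyringDeleted extends Error {
  constructor() {
    super("server keyring appears to have disappeared; we were called to decrypt null");
  }
}

function decodeKeyringRecord(record) {
  if (record === null) {
    // Distinguishable from a generic decryption failure, so the caller can
    // trigger the wipe-and-reupload recovery instead of treating it as an
    // HMAC mismatch.
    throw new ServerKeyringDeleted();
  }
  return record; // the real code decrypts here
}

try {
  decodeKeyringRecord(null);
} catch (err) {
  if (err instanceof ServerKeyringDeleted) {
    console.log("keyring is gone on the server; wipe the bucket and reupload");
  } else {
    throw err;
  }
}
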
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/c06d46d32054
add logging and robustness to key syncing, r=kmag
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/c06d46d32054
Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 55
Comment on attachment 8863436 [details]
Bug 1359879: add logging and robustness to key syncing,

Approval Request Comment
[Feature/Bug causing the regression]: The Kinto server that handles storage.sync has had some data loss and this patch adds recovery code to the client.
[User impact if declined]: Users who add new extensions will fail to sync forever.
[Is this code covered by automated tests?]: This patch adds an automated test that covers it.
[Has the fix been verified in Nightly?]: Not yet to my knowledge.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: No
[Is the change risky?]: I don't believe so
[Why is the change risky/not risky?]: The added code only takes effect in a situation that would already be a failure.
[String changes made/needed]: None
Attachment #8863436 - Flags: approval-mozilla-beta?
Duplicate of this bug: 1350088
Hi Brindusa, could you help find someone to verify whether this issue was fixed as expected on the latest Nightly build? Thanks!
Flags: qe-verify+
Flags: needinfo?(brindusa.tot)
Verified this on Windows 10 x64 with the latest Nightly, and the reported error was not reproduced. Also, I could not reproduce the bug on a Nightly 55.0a1 build from 2017-04-26.

Tortino, could you please verify if this is fixed on your side, on the latest Nightly?
Flags: needinfo?(brindusa.tot) → needinfo?(tortino)
Hello, I reinstalled SavePageWE and then synced. I got this:

1494056266969	Sync.CollectionKeyManager	INFO	Saving downloaded keys.
1494056266969	Sync.CollectionKeyManager	INFO	Bumping last modified to 1493295526588
1494056266969	Sync.Engine.Extension-Storage	INFO	Need to create keys and/or salts for ["zoompage-we@DW-dev","savepage-we@DW-dev"]
1494056266975	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for savepage-we@DW-dev
1494056267008	FirefoxAccounts	DEBUG	getOAuthToken enter
1494056267010	FirefoxAccounts	DEBUG	getOAuthToken returning a cached token
1494056267772	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for undefined
1494056268085	Sync.BulkKeyBundle	INFO	BulkKeyBundle being created for undefined
1494056268092	Sync.Engine.Extension-Storage	ERROR	Syncing storage-sync-crypto: request failed: Error: server keyring appears to have disappeared; we were called to decrypt null (resource://gre/modules/ExtensionStorageSync.jsm:110:5) JS Stack trace: ServerKeyringDeleted@ExtensionStorageSync.jsm:110:5 < decode@ExtensionStorageSync.jsm:344:13 < async*[6]</_decodeRecord/</<@kinto-offline-client.js:1291:24 < waitForSyncCallback@async.js:98:7 < makeSpinningCallback/callback.wait@async.js:168:27 < promiseSpinningly@async.js:234:12 < _sync@extension-storage.js:41:12 < WrappedNotify@util.js:161:21 < sync@engines.js:723:5 < _syncEngine@enginesync.js:219:7 < sync@enginesync.js:166:15 < onNotify@service.js:1080:7 < WrappedNotify@util.js:161:21 < WrappedLock@util.js:117:16 < _lockedSync@service.js:1070:12 < sync/<@service.js:1062:7 < WrappedCatch@util.js:92:16 < sync@service.js:1051:5
1494056268100	Services.Common.RESTRequest	TRACE	local time, offset: 1494056268099, 88
1494056268101	Services.Common.RESTRequest	TRACE	hawk auth header: Hawk id="c607ebcde82651f6e7d96535e0bef6027424eb72a4babce54570783299436810", ts="1494056268", nonce="y0GO6hX4I1w=", mac="W7Ffp5WQiqhVwBBpbcIVKClNOsr41ckXO7k9/lEyiuU="
1494056268102	Services.Common.RESTRequest	DEBUG	GET request to https://api.accounts.firefox.com/v1/session/status
1494056268102	Services.Common.RESTRequest	TRACE	HTTP Header authorization: ***** (suppressed)
1494056268102	Services.Common.RESTRequest	TRACE	HTTP Header content-type: text/plain
1494056268102	Services.Common.RESTRequest	TRACE	HTTP Header accept-language: it-IT, it, en-US, en
1494056268842	Services.Common.RESTRequest	TRACE	onStartRequest: GET https://api.accounts.firefox.com/v1/session/status
1494056268843	Services.Common.RESTRequest	TRACE	Channel for GET https://api.accounts.firefox.com/v1/session/status returned status code 0
1494056268843	Services.Common.RESTRequest	DEBUG	GET https://api.accounts.firefox.com/v1/session/status 200
1494056268843	Services.Common.RESTRequest	TRACE	GET body: {"uid":"62e5e055183a43959486f7bc9e4b8536"}
1494056268843	Hawk	DEBUG	(Response) /session/status: code: 200 - Status text: OK
1494056268844	Hawk	DEBUG	Clock offset vs https://api.accounts.firefox.com/v1: 156
1494056268845	Sync.Engine.Extension-Storage	ERROR	Couldn't decipher old keyring; deleting the default bucket and resetting sync status
1494056268845	Sync.Engine.Extension-Storage	ERROR	Deleting default bucket and everything in it
1494056268845	FirefoxAccounts	DEBUG	getOAuthToken enter
1494056268846	FirefoxAccounts	DEBUG	getOAuthToken returning a cached token
1494056275207	FirefoxAccounts	DEBUG	getOAuthToken enter
Flags: needinfo?(tortino)
Yeah, that's expected; it logs an error when it's deleting everything on the server, just to make it more obvious that it's happening. Did the next syncs afterwards fail too, or did they succeed?
The next syncs were successful.
Comment on attachment 8863436 [details]
Bug 1359879: add logging and robustness to key syncing,

This patch adds recovery code to the client for the case where sync fails in extension storage, and the fix has been verified. Beta54+. Should be in 54 beta 6.
Attachment #8863436 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Assignee: nobody → eglassercamp
:krupa asked me to write down exact steps to reproduce. Here is what would have triggered it before we fixed the server.

1. Create a profile (let's call it "profile1@mozilla.com"). Sign in to Sync and add an extension that uses chrome.storage.sync, such as SavePage WE. Force a sync.
2. Create another profile (let's call it "profile2@mozilla.com"). Sign into Sync and add the same extension. Force a sync.
3. Create a third profile. Try to sign in to "profile2@mozilla.com", but do it using a password reset, and then force a sync. This will wipe the Kinto server data for profile2@mozilla.com. Before the Kinto server fix, this would have also wiped the keyring data for profile1@mozilla.com.
4. Reopen "profile1@mozilla.com". Add another extension that uses chrome.storage.sync such as Redirector.
5. Force a sync. You will receive "record is null" or, alternatively, "server was flushed" (as in bug 1350088), depending on the Kinto server version.
Does this affect 53? Is it broken currently on release?
Flags: needinfo?(eglassercamp)
Yes, this affects 53 and it is broken on release. I sent an email to Product Integrity and Krupa wrote back saying that she didn't think it was worth trying to release a system addon to hotfix it since it would take almost as long to go through the whole process as it would to just wait for 54 to come out.
Flags: needinfo?(eglassercamp)
Well, that depends. 

We plan to release 54 on June 13th.  The system addon process should be faster than that, so it could be possible if you think the user impact isn't too bad.  

We may be doing a dot release later this week for Desktop (I'm aiming for Thursday).  And maybe for Fennec too for different reasons. So a fix here could go out this week if you have good testing end to end, and if you think the risk isn't too high. It really depends on how bad you think this is for users. You can let me know over email what information you have about how widespread the problem is. Would you know from your server stats? Or from some sort of telemetry?
Flags: needinfo?(markh)
Flags: needinfo?(eglassercamp)
Mark Hammond has some telemetry on syncs that says that as of yesterday, 99.86% of syncs of extension-storage are successful. That means that this particular error must be happening less than 0.14% of the time. At present they seem to make up about 40% of the failures, so that would be about 0.06% of the users with Firefox Sync. We can expect this to grow somewhat slowly over time as the error only starts happening when a user adds a webextension that uses chrome.storage. I think this is partly why Krupa argued that we don't need to worry about going to any extra effort.

When the error happens, the main consequence to the user is that it makes Firefox Sync look like it failed, but in fact everything will be synced correctly except for the chrome.storage.sync stuff (and I haven't really seen extensions "in the wild" using chrome.storage.sync). So the damage is very minor.

On the other hand, I think this fix is low-risk (because it only kicks in when a user was going to encounter an error anyway) and I'd be happy to have it go out for Desktop. I think shipping the fix in this bug is much lower risk than trying to write a system addon hotfix from scratch. Just let me know what you want me to do and I'll try to do it.
Flags: needinfo?(eglassercamp)
(In reply to Ethan Glasser-Camp (:glasserc) from comment #36)
> Mark Hammond has some telemetry on syncs that says that as of yesterday,
> 99.86% of syncs of extension-storage are successful.
> At present they seem to make up about 40% of the failures

Yep - I came here to say the exact same thing :)

I also believe the risk of this patch is relatively low, particularly to Sync as a whole - IOW, any risk here is likely to be limited to extension-storage and shouldn't impact other aspects of Sync (and even then, I believe the risk to extension-storage itself is fairly low)
Flags: needinfo?(markh)
OK. I agree with Krupa that it sounds right to let this ship in 54! Thanks for the extra info.
See Also: → 1364135
I still have the error "Server has been flushed." on Firefox 55 beta 12 under Win 10 64-bit after almost every step from comment 32. Here is one of the logs. Please advise whether this is expected. Thank you!
Flags: needinfo?(eglassercamp)
Hi Petruta, no that isn't expected. From your log, it seems like this error is slightly different from the previous ones in this bug. Instead of the keyring having disappeared, it's all the records for a specific collection. Could you elaborate a little bit on what steps you mean when you write "after almost every step from comment 32"? Is this an existing profile or a fresh one? I can't think of what would cause a specific collection to be deleted from a random user. Do you see the same behavior on Nightly? Maybe it would be good to open a new bug for this.
Flags: needinfo?(eglassercamp)
Sorry for not following up sooner.

I retested this on Firefox 57.0 RC and didn't encounter errors after performing the steps from comment 32.
Based on this and on comment 29, I'm removing the qe-verify+ flag and changing the status to Verified Fixed.
Status: RESOLVED → VERIFIED
Flags: qe-verify+