Closed Bug 646269 Opened 13 years ago Closed 13 years ago

Tracking bug: unexpected HMAC mismatches

Categories

(Firefox :: Sync, defect, P1)

RESOLVED FIXED

People

(Reporter: rnewman, Assigned: rnewman)

Details

(Whiteboard: [qa-])

Finally broke down and filed this as a tracking bug. See also the [sync-hmac] whiteboard tag.

We've seen several reports of HMAC issues recently.

Bug 646230: 3.6.16, Sync 1.7. Passwords and forms. (No new data for bookmarks?)
Bug 645016: 4.2a1pre (Sync 1.7). Bookmarks, passwords, forms.
Bug 646085: 4.0. Just history.

Crossing platforms and Firefox versions. In all cases, the client has the key that's on the server. In some cases the server key timestamp has changed, which implies that the other client is also attempting HMAC-failure recovery (to no avail).

The logs show that the downloaded keys are the same as the locally stored keys... and yet they cannot decrypt the records on the server.

The obvious implication is that one client at some point is generating records using a corrupt key. It never has to read them again, so the other devices are the ones to throw the errors.

(One possibility is that there was an existing problem that restarting for 1.7/3.6.16 exposed. Don't forget that.)
Hypothesis: we've noticed some odd 401s in the logs. It appears that some requests are not carrying correct auth headers. If this happens on info/collections, it might trigger a key generation and upload. If it happens on key upload, it might cause a mismatch between server and client.

Actions: try to figure out why that might be happening; test whether 401s cause inconsistent behavior.
Assignee: nobody → rnewman
Severity: normal → critical
Status: NEW → ASSIGNED
Priority: -- → P1
I believe I have managed to repro this locally through some combination of Reset Syncs on two clients, not running concurrently.

I have five thousand lines of logs to read through, but I call this progress...
Log doesn't make any sense.
The only scenario we were able to work out that would definitively trigger the log produced by rnewman is one where GET /info/collections returned either 404 or valid JSON without a crypto key present. So I'm checking production for any instances of GET /info/col 200 OK with an empty JSON response followed by GET meta/global 404, because that would indicate severe breakage that would cause the client to go bonkers in something akin to this way.
(In reply to comment #4)
> The only we were able to work out that would definitively trigger the log
> produced by rnewman would be if the GET /info/collections returned either 404,
> or valid JSON without a crypto key present.

Correction:

* meta/global 404
* info/collections 200 with no "crypto" entry.

info/collections != 200 will fail key fetching immediately.

> So I'm checking production for any
> instances of GET /info/col 200 OK with empty JSON response followed by GET
> meta/global 404, because that would indicate severe breakage that would cause
> the client to go bonkers in something akin to this way.

This would also be the scenario for a new account signup, and wouldn't demonstrate this bug. What we would like to find is someone with a well-populated info/collections but a 404 for meta/global. This would be followed by a PUT to storage/crypto/keys, but no DELETEs for other collections.
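A rough sketch of the log pattern described above, in case it helps whoever is grepping production. Everything here is hypothetical: the `Request` tuple, the normalized paths, and the `is_suspicious_session` name are assumptions for illustration, not the format of the real access logs.

```python
from typing import NamedTuple, Optional, Sequence


class Request(NamedTuple):
    """One logged request (hypothetical, simplified shape)."""
    method: str
    path: str                    # normalized, e.g. "info/collections"
    status: int
    body: Optional[dict] = None  # parsed JSON response, if any


def is_suspicious_session(requests: Sequence[Request]) -> bool:
    """True if a client uploaded new keys over a populated, unwiped account."""
    populated = False
    missing_meta = False
    key_upload = False
    for r in requests:
        if r.method == "DELETE":
            # The server was wiped, so this is not the pattern we are after.
            return False
        if r.method == "GET" and r.path == "info/collections" and r.status == 200:
            # An empty response is just a new account signup.
            populated = bool(r.body)
        elif r.method == "GET" and r.path == "meta/global" and r.status == 404:
            missing_meta = True
        elif r.method == "PUT" and r.path == "storage/crypto/keys":
            key_upload = key_upload or (populated and missing_meta)
    return key_upload


new_account = [
    Request("GET", "info/collections", 200, {}),
    Request("GET", "meta/global", 404),
    Request("PUT", "storage/crypto/keys", 200),
]
broken_account = [
    Request("GET", "info/collections", 200, {"bookmarks": 1.0, "history": 2.0}),
    Request("GET", "meta/global", 404),
    Request("PUT", "storage/crypto/keys", 200),
]
```

Under these assumptions, `new_account` is not flagged and `broken_account` is.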
Depends on: 650208
Depends on: 662067
Depends on: 662508
Depends on: 662627
Depends on: 633749
Depends on: 663799
atoll suggested I add this description here.

Background explanation: an HMAC mismatch occurs when a client downloads a record which was encrypted with a different key than the one they are holding for that collection.

In one case, it's possible for the following to happen:

* Client A uploads key K1.
* Client A uploads tab record K1(R1).
* Server is partially wiped: DB-stored data (K1) is eliminated, but memcache-stored data (K1(R1)) sticks around.
* Client B generates and uploads new key K2.
* Client B downloads tab record K1(R1).
* HMAC mismatch.
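
The sequence above can be reproduced in miniature. This is only a sketch: it assumes records carry an HMAC-SHA256 tag computed over the ciphertext, and the `make_record`/`verify_record` helpers and the dictionary standing in for the server are made up for illustration, not the client's actual code.

```python
import hashlib
import hmac
import os


def make_record(hmac_key: bytes, ciphertext: bytes) -> dict:
    """Attach an HMAC-SHA256 tag computed over the ciphertext."""
    tag = hmac.new(hmac_key, ciphertext, hashlib.sha256).hexdigest()
    return {"ciphertext": ciphertext, "hmac": tag}


def verify_record(hmac_key: bytes, record: dict) -> bool:
    """Recompute the tag with the local key and compare in constant time."""
    expected = hmac.new(hmac_key, record["ciphertext"], hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["hmac"])


# Client A uploads key K1 and tab record K1(R1).
k1 = os.urandom(32)
server = {"keys": k1, "tabs": [make_record(k1, b"opaque tab payload")]}

# Partial wipe: the DB-stored key goes away, the cached record does not.
server["keys"] = None

# Client B finds no key, so it generates and uploads K2.
k2 = os.urandom(32)
server["keys"] = k2

# Client B downloads K1(R1) and checks it against K2.
downloaded = server["tabs"][0]
client_a_ok = verify_record(k1, downloaded)  # the original key still verifies
client_b_ok = verify_record(k2, downloaded)  # HMAC mismatch
```

Note that both clients agree with the server about the current key (K2), which matches what the logs show: the keys line up, but the old record still fails verification.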

Builds after Bug 650208 are no longer susceptible to this bug, because when a new key is uploaded, all server data is deleted. However, this won't land for another 3 or 4 months (Firefox 6), so it would be valuable to avoid the situation. The most direct way to do that is to kill the contents of memcache for a migrated user. The bug to do that is Bug 664658.

(A workaround for tab HMAC problems is to temporarily disable tab sync, sync, then re-enable it.)

More broadly, any error that causes a client to generate a new key without wiping the server -- e.g., a MySQL failure that causes a 404 on a fetch for storage/crypto/keys, as we just saw for Bug 663799 -- will result in a similar kind of inconsistency, but a more pernicious one: long-lasting corrupt history and bookmark records left hanging around.

Again, the fix in Bug 650208 will cover most of these situations (which I have verified with a test case, yet to be committed).

There is one remaining situation which bears investigating: that of a mid-sync node reassignment. A 401 mid-sync could conceivably cause a "split upload", where the sync starts on one node and finishes on another. The next client to sync would upload a new key to the second node, causing corruption.

I suspect that "wipe when uploading new keys" will cover some of this, albeit inelegantly, but there is a potential for race conditions (as well as opportunity for better handling).
Depends on: 664808
Depends on: 664865
Depends on: 667306
Bug 664865 is fixed in services-central, and concludes the pair of remediations for this issue:

  * Bug 650208 (Firefox 6): wipe the server if we upload new keys
  * Bug 664865 (Firefox 7): abort and retry sync if we get a 401.

The former is the most important fix, so no need to try to carry that patch over.

When Bug 664865 gets merged, we can close this.
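
For anyone reading later, the two remediations listed above amount to roughly the following. This is a sketch against an in-memory stand-in, not the actual patches: `FakeServer`, `Unauthorized`, `upload_new_keys`, and `sync_with_retry` are hypothetical names, and the immediate-retry policy is an assumption.

```python
class Unauthorized(Exception):
    """Stands in for an HTTP 401 from the storage node."""


class FakeServer:
    """In-memory stand-in for a user's storage on one node."""

    def __init__(self):
        self.collections = {}

    def wipe(self):
        self.collections.clear()

    def put(self, collection, record_id, payload):
        self.collections.setdefault(collection, {})[record_id] = payload


def upload_new_keys(server, new_keys):
    # Bug 650208: wipe first, so no record encrypted with an
    # older key can outlive the key that produced it.
    server.wipe()
    server.put("crypto", "keys", new_keys)


def sync_with_retry(sync_once, max_attempts=2):
    # Bug 664865: a 401 aborts the whole sync instead of letting it
    # continue half-finished; the sync is then retried from the start.
    for attempt in range(1, max_attempts + 1):
        try:
            sync_once()
            return attempt
        except Unauthorized:
            if attempt == max_attempts:
                raise


server = FakeServer()
server.put("tabs", "client-a", "K1(R1)")  # stale record from an old key
upload_new_keys(server, "K2")             # leaves only crypto/keys behind
```

After `upload_new_keys`, the stale tab record is gone, so the next client to download has nothing it cannot verify.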
Depends on: 657867
Depends on: 676892
Goodbye, evil bug. All of the bugs mentioned in Comment 7 have landed, so -- as Roy Batty said -- "time to die".
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [qa-]
Depends on: 672047
Depends on: 681073
Component: Firefox Sync: Crypto → Sync
Product: Cloud Services → Firefox