Closed Bug 617268 Opened 14 years ago Closed 14 years ago

Unable to sync with Fennec nightly from Minefield nightly - SHA256 HMAC mismatch

Categories

(Firefox :: Sync, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 617709
Tracking Status
blocking2.0 --- beta8+
fennec 2.0b3+ ---

People

(Reporter: dougt, Assigned: rnewman)

References

Details

Attachments

(1 file)

I reset sync in a Minefield nightly (ae74a11a7d98), selected Sync Now then waited a hour or so.

Then i started a nightly build of Fennec and set up Sync.

It looked like it was working (the UI suggested it was), but no items were actually synced.

I quit and restarted fennec with an error console and say errors like:

Engine.*   WARN   Error Processing Record:  
  Record SHA256 HMAC mismatch:
<big number>
not
<another big number>



Then I see :

Store.forms  WARN ignoring form record update request!

This prevents me from using fennec w/ sync.
tracking-fennec: --- → 2.0b3+
OS: Mac OS X → Android
Hardware: x86 → All
Richard, Philipp suggested you might know more about this or might have time to investigate.
Assignee: nobody → rnewman
(In reply to comment #0)
> I reset sync in a Minefield nightly (ae74a11a7d98), selected Sync Now then
> waited a hour or so.
> 
> Then i started a nightly build of Fennec and set up Sync.
> 
> It looked like it was working (the UI suggested it was), but no items were
> actually synced.
> 
> I quit and restarted fennec with an error console and say errors like:
> 
> Engine.*   WARN   Error Processing Record:  
>   Record SHA256 HMAC mismatch:
> <big number>
> not
> <another big number>

I've seen a couple of reports like this, e.g. https://groups.google.com/group/mozilla-labs-weave/browse_thread/thread/4aa18b05d9097eb6.

It looks like this is another heisenbug from the aftermath of simplified crypto. Richard, can you take a look? Thanks!

> Then I see :
> 
> Store.forms  WARN ignoring form record update request!
> 
> This prevents me from using fennec w/ sync.

That's a bogus log message, has nothing to do with the probelm at hand. See bug 577565.
Component: General → Firefox Sync: Backend
OS: Android → Mac OS X
Hardware: All → x86
The SHA256 mismatch message indicates that either the sync key or collection keys on one device are not the same as on another. In essence, decryption is failing because the key is wrong.

What did the sync key look like that you typed in on Fennec? 26 characters, 16 characters, or something else?

When you click "My Sync Key" in Minefield Sync prefs, is it the same as you typed in?

Did you upgrade an existing Sync account on Minefield?

What happens if you reset Sync on Minefield and try again?
it was originally a passphrase.  At some point, a client converted it to a sync key (i think this was a forced conversion, as I don't remember ever having to make that decision).  I did a reset sync in minefield.  My sync key is in the format of:

x-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx

I just reset my device and verified that the device and minefield have the same exact sync key.  I still experience this failure.
(In reply to comment #4)

> key (i think this was a forced conversion, as I don't remember ever having to
> make that decision).  

Correct. We transparently upgrade keys in the new crypto scheme.

> I did a reset sync in minefield.  My sync key is in the
> format of:
> 
> x-xxxxx-xxxxx-xxxxx-xxxxx-xxxxx

> I just reset my device and verified that the device and minefield have the same
> exact sync key.  I still experience this failure.

Odd. That implies that the keys collection retrieved by each client is different. At least the sync keys are the same.

It's getting a little late for detailed troubleshooting, but here are two things you could do:

One:
* Send me (or append) the contents of about:sync-log and about:sync-log.1 in your Minefield profile.

Two:
* Launch a second Minefield instance with a new profile. (-no-remote -ProfileManager, create new).
* Turn on logging (see http://philikon.wordpress.com/2010/11/12/sync-in-firefox-4-0-beta-7/).
* Set up Sync with the same credentials. Does it sync correctly? If not, please send me (or append) the contents of about:sync-log.

If a fresh profile doesn't sync correctly, it means that something on the server was encrypted with a different key than is currently stored on the server. That's not supposed to happen, of course, and I'd love to figure out why.
Status: NEW → ASSIGNED
Attached file Doug's Minefield log
Doug's logs show that the one of the clients and tabs records fail SHA256-HMAC verification. This implies that:

1. The sync key is correct: the client can decrypt the keys themselves
2. Most records are encrypted with the correct key
3. Some are not.

My suspicion is that one client generated its own bulk keys, uploading its per-machine data. The other client's keys won, but a wipe on that client won't affect the per-machine records. Those just stick around causing errors.

I believe this managed to occur because of the info/collections time bug that we recently fixed on the server side, a result of an outage last week.

During the time that two clients were active, one was unaware of the new key bundles because the server timestamp was incorrect. It happily continues to believe that the keys it has are the most recent, and uploads 'bad' records and fails to decrypt those it downloads.

Again, this situation shouldn't happen.
Ordinarily, if a client generates new keys, the other clients will download them and reupload their data.

In this case, new uploaded keys didn't solve the problem, because the other client *didn't get to see them as changed* thanks to the server error.

So now both clients think they're in a good state, and aren't, but won't make the sweeping changes necessary to fix it.

Right now we don't have any way to validate the bulk keys that we download until sync-time, when we use them for decryption.

A simple solution: in any HMAC mismatch, re-fetch keys. Expensive, but might fix this problem.


A broader proposed solution/workaround, roughly sketched:

Create a sync-time causal link between a HMAC mismatch error and key inconsistency, with a harsh fix.

If a client detects a record mismatch during a sync, it should generate a new set of bulk keys and re-sync, with an extended re-sync duration to avoid thrashing.

In the case of a similar server problem, clients will continue to regenerate keys until the server starts working again, for so long as there's a mismatch. Eventually the new keys will be picked up by all clients, and syncs will complete.

The difference from the simple solution is that we keep touching the server, which will eventually bump the mtime and prompt a download on other clients.


Also:

Do we store modified times in the keys WBO itself? If the modified time for storage/keys is prior to that in the keys record, then we know we've got a server problem, or the other client's clock is way off...

Thoughts?
what server error are you talking about when you side "thanks to the server error"?
(In reply to comment #8)
> what server error are you talking about when you side "thanks to the server
> error"?

Bug 617080.
I just ran into this and can reliably reproduce this with an account on a custom server (so the Sync servers cannot be blamed):

1. Have Firefox connected to account A, sync successfully etc.
2. Now connect Firefox to account B, sync.
3. Observe HMAC failures in about:sync-log.

It looks like the old keys from account A still stuck around. So somehow the Sync Key validation in the setup wizard (which calls verifyLogin/login) succeeded but subsequently it still kept the old keys around.

We used to dump the local key cache in Weave.Service.startOver(), but it doesn't look like we're doing this to the CollectionKeys cache.

Nominating this for a 4.0b8 blocker.
blocking2.0: --- → ?
blocking2.0: ? → beta8+
(In reply to comment #10)
> I just ran into this and can reliably reproduce this with an account on a
> custom server (so the Sync servers cannot be blamed):

Richard just pointed out on IRC that this is a slightly different issue. People were seeing HMAC failures when upgrading form storage version 3 to 4, where as I saw it on startOver(). Filed bug 617478.
So... this should have fixed itself by now (ignoring the separate bug, which is already fixed), if it was tied to the timestamps not changing, and thus a degenerate state not resolving itself.

Can this be reproduced now, or did this all date back to around Friday?
I did a resync and i continue to see this problem as of today.  Also see comment #6.
(In reply to comment #6)
> Doug's logs show that the one of the clients and tabs records fail SHA256-HMAC
> verification. This implies that:
> 
> 1. The sync key is correct: the client can decrypt the keys themselves

I don't understand this. We check the HMAC *before* we decrypt.

> My suspicion is that one client generated its own bulk keys, uploading its
> per-machine data. The other client's keys won, but a wipe on that client won't
> affect the per-machine records. Those just stick around causing errors.

Right. Therefore I disagree with mconnor that the situation will fix itself. The server bug is gone but the broken records are still there (although we probably don't refetch them, so it's other clients uploading new broken records).

To unbreak yourself, dougt, have Sync generate you a new Sync Key. Then enter that Sync Key on all your other devices. If the problem still occurs, then we have a problem.
Philipp - I have already done this two times since Thursday.  I can try again, but I feel that we need a fix that doesn't involve user intervention.  People are stranded (like me, blassey, thomas, mitchell, jay, and probably a boat load of others) and the solution you propose doesn't scale, does it?
(In reply to comment #7)
> Ordinarily, if a client generates new keys, the other clients will download
> them and reupload their data.
> 
> In this case, new uploaded keys didn't solve the problem, because the other
> client *didn't get to see them as changed* thanks to the server error.
> 
> So now both clients think they're in a good state, and aren't, but won't make
> the sweeping changes necessary to fix it.

This makes sense.

> Right now we don't have any way to validate the bulk keys that we download
> until sync-time, when we use them for decryption.
> 
> A simple solution: in any HMAC mismatch, re-fetch keys. Expensive, but might
> fix this problem.

We should make sure that we never ever get there. I guess we could do this as a belt-and-suspenders fix. But we should only ever try it once. There's no point in re-fetching keys on every HMAC failure, especially if re-fetching the keys doesn't fix the problem. I like the idea of the modified time check below better, though.

> A broader proposed solution/workaround, roughly sketched:
> 
> Create a sync-time causal link between a HMAC mismatch error and key
> inconsistency, with a harsh fix.
> 
> If a client detects a record mismatch during a sync, it should generate a new
> set of bulk keys and re-sync, with an extended re-sync duration to avoid
> thrashing.
> 
> In the case of a similar server problem, clients will continue to regenerate
> keys until the server starts working again, for so long as there's a mismatch.
> Eventually the new keys will be picked up by all clients, and syncs will
> complete.
> 
> The difference from the simple solution is that we keep touching the server,
> which will eventually bump the mtime and prompt a download on other clients.

This means a problematic client that doesn't get the HMAC implementation right can cause all other clients to reupload all data over and over again. This would break us on the server.

> Also:
> 
> Do we store modified times in the keys WBO itself?

Yes. A WBO has a modified time set by the server.

> If the modified time for
> storage/keys is prior to that in the keys record, then we know we've got a
> server problem, or the other client's clock is way off...

The WBO modified time is server time and that's guaranteed to be consistent within your range of WBOs. Unless of course the server breaks like it did this weekend...
(In reply to comment #15)
> Philipp - I have already done this two times since Thursday.

Shrug! Although you may just have hit bug 617478. We'll be merging that to m-c soon. If it's gone by then (you'll have to do the dance one more time), we should be good.
"you'll have to do the dance one more time" == many existing people will not know that they have to do this and be stuck, right?  Is there going to be some error message / detection that we can do?
> > 1. The sync key is correct: the client can decrypt the keys themselves
> 
> I don't understand this. We check the HMAC *before* we decrypt.

In the course of starting a sync, we:

1. Check HMAC with sync key on keys WBO
2. Decrypt keys WBO
3. Check HMAC with default key on engine data
4. Decrypt engine data

The errors in Doug's log are in step 3. That means the keys collection is encrypted with the correct sync key.

Because we're not seeing an error in decrypting the keys object, I deduce that the mismatch is between the stored keys and the keys used to encrypt some engine-specific data.

Because it's only tabs and clients that are producing errors, I deduce that one of the clients encrypted and stored using the wrong keys, and any other bad data was overwritten or not written in the first place.


> Right. Therefore I disagree with mconnor that the situation will fix itself.
> The server bug is gone but the broken records are still there (although we
> probably don't refetch them, so it's other clients uploading new broken
> records).

Hence my suggestion to attempt to refresh our local key store if we spot HMAC problems: this would allow a broken client to dig itself out of the hole. ("Hey, all these other records are written in Aramaic! Better download a new dictionary!".)

HMAC problems indicate either a corrupted server or some kind of key mismatch.
(In reply to comment #15)
> Philipp - I have already done this two times since Thursday.  I can try again,
> but I feel that we need a fix that doesn't involve user intervention.  People
> are stranded (like me, blassey, thomas, mitchell, jay, and probably a boat load
> of others) and the solution you propose doesn't scale, does it?

Just to clarify: you've generated a new sync key and entered it on every device, twice, and the problem persists?

Have you quit Firefox on each of your devices at any point during this process?
(In reply to comment #12)
> So... this should have fixed itself by now (ignoring the separate bug, which is
> already fixed), if it was tied to the timestamps not changing, and thus a
> degenerate state not resolving itself.
> 
> Can this be reproduced now, or did this all date back to around Friday?

It wouldn't fix itself unless one of the clients re-uploaded the correct key bundle to the server, causing the others to download it on their next sync.

This should be exactly what happens when a new sync key is generated on one device, hence my puzzlement in Comment 20.

(Something similar this would happen if one client had the wrong sync key -- it would be unable to decrypt the new keys -- but sync should fail visibly at that point. That client wouldn't delete its data.)
19:41:38 <@rnewman> mconnor: just thought of an unconventional approach to perhaps resolving Bug 617268
19:42:47 <@rnewman> how about: when we see an HMAC mismatch, *download* the key bundle from the server, and if it's the same as ours, *upload* it again to touch the mtime
19:43:32 <@mconnor> rnewman: huh
19:43:34 <@rnewman> the healthy client will thus cause the time to be bumped; the unhealthy client will (if it sees a mismatch) download the correct keys
19:43:44 <@mconnor> that makes sense
19:43:44 <@rnewman> or, if it doesn't see a mismatch, another client will bump the mtime...
19:44:12 <@rnewman> of course, do this no more frequently than the longest sync time of any client we support
19:44:25 <@mconnor> means that N-1 clients will redownload the key bundle
19:44:29 <@mconnor> but
19:44:36 <@mconnor> that's not bad
19:44:38 <@rnewman> better than a full reupload
19:44:47 <@rnewman> which is what a regen would involve
19:44:53 <@mconnor> right

Philipp: thoughts?
Depends on: 617709
Duping this forward, bug 617709 is the real fix here.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE
Component: Firefox Sync: Backend → Sync
Product: Cloud Services → Firefox
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: