Closed
Bug 1345742
Opened 8 years ago
Closed 8 years ago
Sync server may intermingle batch upload data from multiple users
Categories
(Cloud Services Graveyard :: Server: Sync, enhancement)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rfkelly, Unassigned)
References
Details
Attachments
(1 file, 1 obsolete file)
I'm marking this security-sensitive because it's discussing exposure of encrypted sync data to unauthorized users. Thankfully we have client-side encryption so I don't believe there has been any exposure of plaintext data, but let's still tread a little carefully until we're totally sure we understand the impact here. We don't want any enterprising folks to go looking in their own sync data for things that belong to other people.
The python implementation of the sync batch-upload API may intermingle data for multiple users, if they happen to generate conflicting batch ids. This can result in users fetching encrypted BSOs that belong to other users, and appears to be behind the recent uptick in "HMAC verification errors" such as Bug 1343414.
Details below, but the important points are:
* This could have affected any users on a python storage node with batch uploads enabled. Batch uploads have been enabled on all python storage nodes since sometime in February IIRC.
* We have disabled the batch upload API in production, so this issue should no longer be occurring.
* We will need to node-migrate any users who were affected by this issue, as they've got junk data in their storage which will prevent them from syncing correctly. I'll file a separate bug for digging into the cleanup there.
So...what was going on here?
When we generate batch ids, we store them into a table whose primary key is (batchid, userid):
https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/storage/sql/dbconnect.py#L154
Batch ids are timestamps, so it's entirely possible for two users who sync at the same time to receive the same batch id.
However, we store the actual items for the batch in a separate table, whose columns are (batchid, item_id, ...item payload data):
https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/storage/sql/dbconnect.py#L170
Note that this table *does not have a userid column*. Thus, the items from both users' batches will wind up intermingled in this table with no way to distinguish between them. Whoever commits the batch first will receive the items from both batch uploads.
The end result is that users can wind up with BSOs in their sync data that have been encrypted with someone else's encryption key. When Firefox tries to sync that down, it will fail to decrypt it and the sync will fail.
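The schema flaw described above can be reproduced in miniature. This is a hedged sketch, not the real server code: the table and column names approximate the linked schema, and the ids and payloads are invented for illustration.

```python
# Minimal reproduction of the flaw: the batch-items table is keyed only by
# (batchid, item_id), so two users handed the same timestamp-derived batchid
# share one bucket, and a commit can only filter by batchid.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE batch_uploads (batchid INTEGER, userid INTEGER, "
           "PRIMARY KEY (batchid, userid))")
# Note: no userid column here -- this is the bug.
db.execute("CREATE TABLE batch_upload_items (batchid INTEGER, item_id TEXT, "
           "payload TEXT, PRIMARY KEY (batchid, item_id))")

batchid = 1488000000000  # timestamp-based; both users receive the same value
db.execute("INSERT INTO batch_uploads VALUES (?, ?)", (batchid, 101))
db.execute("INSERT INTO batch_uploads VALUES (?, ?)", (batchid, 202))
db.execute("INSERT INTO batch_upload_items VALUES (?, 'a', 'ciphertext-of-user-101')",
           (batchid,))
db.execute("INSERT INTO batch_upload_items VALUES (?, 'b', 'ciphertext-of-user-202')",
           (batchid,))

# Whichever user commits first can only select by batchid, so the other
# user's item comes along for the ride.
rows = db.execute("SELECT item_id, payload FROM batch_upload_items "
                  "WHERE batchid = ?", (batchid,)).fetchall()
print(rows)
```

Adding a `userid` column to the items table (and including it in the primary key and the commit-time `SELECT`) removes the ambiguity.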
Comment 1 • 8 years ago (Reporter)
> I'll file a separate bug for digging into the cleanup there.
Actually I don't think I will, let's keep it all in this bug until we've got a coherent plan.
Comment 2 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #0)
> * We have disabled the batch upload API in production, so this issue should
> no longer be occurring.
To clarify, new instances of such records should no longer happen, but users already affected by this will continue to see log entries like "Bad HMAC event detected. Attempting recovery or signaling to other clients" and a PUT to crypto/keys (of the same keys that were already there) each time they sync.
> The end result is that users can wind up with BSOs in their sync data
> that have been encrypted with someone else's encryption key. When
> Firefox tries to sync that down, it will fail to decrypt it and
> the sync will fail.
For better or worse, Sync will download it and record the fact it failed to apply the record, so try again each and every subsequent sync. Best I can tell, this doesn't cause an "error" log to be generated, so most users are unlikely to have any clue that this is happening.
Comment 3 • 8 years ago (Reporter)
Since this functionality has been disabled in production, I don't think we need to urgently rush out a fix for it. I've got a testcase that can reproduce it, and we can get it fixed and deployed on a normal timeframe with proper QA etc.
Our most urgent task is remediation for users already affected. I don't think there's anything we can do besides node-migrate all affected users, to ensure that any invalid items are cleared from their sync data.
So, can we determine which users were affected? Two ideas:
To cast a narrow net, we can look for users who generated conflicting batch ids. We'd grep through the nginx logs looking for POST requests with "batch=foobar". If we see a batchid of "foobar" associated with multiple userids, then all of those userids need to be node-migrated.
The one thing this wouldn't account for is requests like "?batch=true&commit=true" that create and commit a batch in a single request. I'm trying to determine whether this scenario could have triggered the bug or not. If we can't reassure ourselves that it didn't, we'll have to cast a wider net.
To cast a wider net, we can look for users who performed a batch upload request:
* At the same timestamp as another user (or within say +/- one second).
* Where both userids are equal modulo 10 [1]
This second option is probably our safest one, besides node-migrating *everyone*. It may be that our write volume is high enough that most users fit these criteria.
[1] Yes really: https://github.com/mozilla-services/server-syncstorage/blob/master/syncstorage/storage/sql/__init__.py#L490
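The "wider net" criteria above could be sketched roughly as follows. This is illustrative only: the parsed log-entry format and the example userids are invented, and a real run would stream parsed nginx log lines rather than a hardcoded list.

```python
# Flag any pair of batch-upload requests from different users that landed
# within one second of each other AND whose userids are congruent modulo 10
# (the sharding function linked in [1]).
from itertools import combinations

# (timestamp_seconds, userid) pairs parsed from access logs -- hypothetical data
requests = [
    (1488000000.2, 12345),
    (1488000000.9, 67895),   # within one second of the first, 12345 % 10 == 67895 % 10
    (1488000005.0, 22222),   # too far away in time
]

suspects = set()
for (t1, u1), (t2, u2) in combinations(requests, 2):
    if u1 != u2 and abs(t1 - t2) <= 1.0 and u1 % 10 == u2 % 10:
        suspects.update((u1, u2))

print(sorted(suspects))  # userids to node-migrate
```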
Comment 4 • 8 years ago (Reporter)
> The one thing this wouldn't account for, is requests like "?batch=true&commit=true" that create
> and commit a batch in a single request. I'm trying to determine whether this scenario could have
> triggered the bug or not.
I've convinced myself that such single-request batches are safe from the issue if and only if the SQL isolation level prevents reading uncommitted data. We open a transaction at the start of the request and commit it at the end. If the two transactions can't see each other's data, then one can't steal the records from the other.
Unfortunately, :bobm tells me we're running at READ UNCOMMITTED isolation level, so we don't get this protection. We'll have to cast the wider net.
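For reference, a sketch of how a stricter isolation level could be requested. This is an assumption about the shape of a fix, not the actual deployment change; the DSN is hypothetical.

```python
# With SQLAlchemy, the transaction isolation level can be set per-engine.
# Under READ COMMITTED (rather than READ UNCOMMITTED), one in-flight
# single-request batch cannot see another's uncommitted items.
from sqlalchemy import create_engine

# Equivalent raw MySQL statement:
#   SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
engine = create_engine(
    "mysql+pymysql://sync:secret@db.internal/syncstorage",  # hypothetical DSN
    isolation_level="READ COMMITTED",
)
```

Note this only protects the single-request "?batch=true&commit=true" case; the multi-request batch flow still needs the schema fix.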
Comment 5 • 8 years ago (Reporter)
For completeness, attaching my functional testcase for the server that reproduces this issue. It needs a lot of iterations to successfully trigger the bug, but pretty reliably fails for me with:
======================================================================
FAIL: test_users_with_the_same_batch_id_get_separate_data (syncstorage.tests.functional.test_storage.TestStorage)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/rfk/repos/mozilla/services/server-syncstorage/syncstorage/tests/functional/test_storage.py", line 1806, in test_users_with_the_same_batch_id_get_separate_data
self.assertEquals(resp.json, ['b'])
AssertionError: Lists differ: [u'b', u'a'] != ['b']
First list contains 1 additional elements.
First extra element 1:
a
- [u'b', u'a']
+ ['b']
-------------------- >> begin captured logging << --------------------
Comment 6 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #3)
> To cast a narrow net, we can look for users who generated conflicting batch
> ids. We'd grep through the nginx logs looking for POST requests with
> "batch=foobar". If we see a batchid of "foobar" associated with multiple
> userids, then all of those userids need to be node-migrated.
I ran this first because it's much easier to test for. Luckily the number isn't that bad: 2454 users across our server fleet over the last 30 days. Most servers haven't had batches enabled for nearly that long, which explains the large variance across servers in the frequency of batch collisions.
I calculated the data by running the following (expedient excellence) by node:
zgrep batch /media/ephemeral0/logs/nginx/access.log*| grep -v batch=true | awk -F\" '{print $4}' | ./uidcount.pl
Where uidcount.pl is:
#!/usr/bin/perl
#
# Count distinct userids per batchid; print every userid that shared a
# batchid with at least one other user.
while (<STDIN>) {
    m/1\.5\/(\d+).*batch=(.*?)[\s&]/;
    unless ($hash{$2}{$1}) {
        $hash{$2}{$1} = 1;
        $hash{$2}{'totes'}++;
    }
}
foreach $batch (keys(%hash)) {
    if ($hash{$batch}{'totes'} > 1) {
        foreach $uid (keys(%{$hash{$batch}})) { # embrace the brace
            next if $uid eq 'totes';            # skip the counter key, it's not a uid
            $found{$uid}++;
        }
    }
}
foreach $uid (keys(%found)) {
    print "$uid\n";
}
Comment 7 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #5)
Interesting, could this be what we were running into here: https://bugzilla.mozilla.org/show_bug.cgi?id=1334671#c1 ?
Comment 8 • 8 years ago
(In reply to Bob Micheletto [:bobm] from comment #6)
> (In reply to Ryan Kelly [:rfkelly] from comment #3)
> I ran this first because it's much easier to test for. Luckily the number
> isn't that bad: 2454 users across our server fleet over the last 30 days.
> Most servers haven't had batches enabled for nearly that long, which
> explains the large variance by servers in the frequency of batch collisions.
These users have been migrated.
Comment 9 • 8 years ago
(In reply to Mark Hammond [:markh] from comment #2)
> (In reply to Ryan Kelly [:rfkelly] from comment #0)
> > * We have disabled the batch upload API in production, so this issue should
> > no longer be occurring.
>
> To clarify, new instances of such records should no longer happen, but users
> already affected by this will continue to see log entries like "Bad HMAC
> event detected. Attempting recovery or signaling to other clients" and a PUT
> to crypto/keys (of the same keys that were already there) each time they
> sync.
>
> > The end result is that users can wind up with BSOs in their sync data
> > that have been encrypted with someone else's encryption key. When
> > Firefox tries to sync that down, it will fail to decrypt it and
> > the sync will fail.
>
> For better or worse, Sync will download it and record the fact it failed to
> apply the record, so try again each and every subsequent sync. Best I can
> tell, this doesn't cause an "error" log to be generated, so most users are
> unlikely to have any clue that this is happening.
Android Sync will download it, fail to verify the HMAC, and silently stop syncing. Android Sync doesn't try to recover at all, and we don't do anything to notify the user.
grisha, you might consider driving the FxA into the Separated state -- or the NeedsUpdate state -- so that the user at least knows something is up. You could add a message to the NeedsUpdate state explaining why they're there, if there isn't one already... follow-up ticket, perhaps.
Flags: needinfo?(gkruglov)
Comment 10 • 8 years ago (Reporter)
> Android Sync will download it, fail to verify the HMAC, and silently stop syncing.
Is there any way we can estimate the number of Android devices that might be stuck in in this state, either by looking at server logs or from client telemetry?
Flags: needinfo?(nalexander)
Comment 11 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> > Android Sync will download it, fail to verify the HMAC, and silently stop syncing.
>
> Is there any way we can estimate the number of Android devices that might be
> stuck in in this state, either by looking at server logs or from client
> telemetry?
Sadly, we have no client telemetry to tell us anything about this (or any other) situation. Grisha has talked to me about improving this, but it's a long road to pave, due to the architecture of Telemetry in Fennec.
I can't think of inexpensive ways to witness this on the server short of seeing:
- an Android client arriving at meta/global
- witnessing a collection query with a since=t_0 with some timestamp t_0
- witnessing that meta/global has many collections with timestamps t_i > t_0
- witnessing that the Android client never tries to query the collections that it "should"
- concluding that a persistent failure is preventing the Android client from advancing
- hypothesizing that the persistent failure is the recently discovered HMAC failure (and not some other issue, like a corrupt local passwords database, which we do see in the wild)
Flags: needinfo?(nalexander)
Comment 12 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #10)
> Is there any way we can estimate the number of Android devices that might be
> stuck in in this state, either by looking at server logs or from client
> telemetry?
IIUC, migrating those users to a new node should unstick them (in the same way it should stop Desktop constantly retrying those records and reporting an HMAC mismatch in its logs).
Comment 13 • 8 years ago
You could roughly estimate this by finding two or more requests for the same collection (almost certainly history) from the same device with the same `since`.
Those devices are failing to apply downloaded records. Regardless of whether this bug is the cause, a node reassignment would do those users good!
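This heuristic could be sketched roughly as below. The log-record format and device ids are invented for illustration; a real run would parse server access logs.

```python
# A device repeating the same collection fetch with an unchanged `since`
# parameter is failing to apply what it downloads: a healthy device advances
# `since` after each successful sync.
from collections import Counter

# (device_id, collection, since) tuples parsed from access logs -- hypothetical
fetches = [
    ("dev-a", "history", 1488000000.0),
    ("dev-a", "history", 1488000000.0),  # same `since` again: stuck
    ("dev-b", "history", 1488000000.0),
    ("dev-b", "history", 1488000123.0),  # `since` advanced: healthy
]

counts = Counter(fetches)
stuck = {dev for (dev, coll, since), n in counts.items() if n >= 2}
print(sorted(stuck))  # candidates for node reassignment
```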
Comment 14 • 8 years ago
(In reply to Richard Newman [:rnewman] from comment #13)
> You could roughly estimate this by finding two or more requests for the same
> collection (almost certainly history) from the same device with the same
> `since`.
>
> Those devices are failing to apply downloaded records. Regardless of whether
> this bug is the cause, a node reassignment would do those users good!
Gah, of course!
Comment 15 • 8 years ago
(In reply to Nick Alexander :nalexander from comment #9)
> Android Sync will download it, fail to verify the HMAC, and silently stop
> syncing. Android Sync doesn't try to recover at all, and we don't do
> anything to notify the user.
To clarify, Android's HMAC verification failure handling depends on the collection. If errors are encountered in the clients or keys collections, it aborts the whole sync silently. For every other collection, we simply abort that collection but continue on with the others.
Comment 16 • 8 years ago
(In reply to Nick Alexander :nalexander from comment #9)
> grisha, you might consider driving the FxA into the Separated state -- or the NeedsUpdate state -- so that the user at
> least knows something is up. You could add a message to the NeedsUpdate state explaining why they're there, if there's
> isn't one already... follow-up ticket, perhaps.
Does this help the user in any way though? Perhaps they will be more tempted to complain to someone about sync not working. We have terrible error-state reporting; perhaps if/once that's improved, this kind of approach might be less frustrating.
A more immediate benefit is that we might stop wasting the device's resources by syncing when we'll likely fail. However, if we straight up move to a Separated state, the user won't benefit from having sync just "start working" if/once they're node-reassigned, for example. Nor will we benefit from having "unblocked" said user. I guess as always things are multifaceted.
(In reply to Nick Alexander :nalexander from comment #11)
> Sadly, we have no client telemetry to tell us anything about this (or any
> other) situation. Grisha has talked to me about improving this, but it's a
> long road to pave, due to the architecture of Telemetry in Fennec.
While not immediately helpful, Android Sync Ping is high on my priority list. Perhaps it's not as long of a road actually, see Bug 1308337. Either way this needs to be done yesterday O.o
Flags: needinfo?(gkruglov)
Comment 17 • 8 years ago
(In reply to :Grisha Kruglov from comment #16)
> (In reply to Nick Alexander :nalexander from comment #9)
> > grisha, you might consider driving the FxA into the Separated state -- or the NeedsUpdate state -- so that the user at
> > least knows something is up. You could add a message to the NeedsUpdate state explaining why they're there, if there's
> > isn't one already... follow-up ticket, perhaps.
>
> Does this help the user in any way though? Perhaps they will be more tempted
> to complain to someone about sync not working. We have terrible error state
> reporting, perhaps if/once that's improved this kind of approach might be
> less frustrating.
>
> A more immediate benefit is that we might stop wasting device's resources by
> syncing when we'll likely fail. However, if we straight up move to a
> Separated state, user won't benefit from having sync just "start working"
> if/once they're node-reassigned, for example. Nor will we benefit from
> having "unblocked" said user. I guess as always things are multifaceted.
Users can be driven to reset their password, which has the effect of node reassigning them. If the service has to coordinate with the client to node reassign, then the user will have Sync just "start working"... but they won't be able to control any part of that experience. (At least, they won't be _guided_ to control any part of that experience.)
In any case, telemetry will help!
Comment 18 • 8 years ago
(In reply to Nick Alexander :nalexander from comment #17)
> Users can be driven to reset their password, which has the effect of node
> reassigning them.
That would be a very large hammer to use on people who do actually know their current password.
Comment 19 • 8 years ago (Reporter)
The severity of this bug is significantly increased by Bug 1350196 Comment 42, which I'll summarize as "Android devices may have generated weak encryption keys". While it would require a very specific chain of events, IIUC it's theoretically possible that:
* A user with an Android device could have been using weak encryption keys for their sync data due to Bug 1350196.
* Their encrypted password records could have been exposed to other users due to this bug.
* Those other users might be able to exploit the weak keys to break the encryption and access plaintext password data.
I've no sense of how hard that third step might be in practice, but probably not as hard as we'd like.
Comment 20 • 8 years ago (Reporter)
To pick up the thread on this bug: we node-migrated everyone in order to ensure no corrupt or missing data was left in the system. Now we need to fix the code and try to re-deploy the batch upload API.
Comment 21 • 8 years ago (Reporter)
:Natim could you please take a look at this fix?
Attachment #8845287 - Attachment is obsolete: true
Attachment #8855136 - Flags: review?(rhubscher)
Updated • 8 years ago
Attachment #8855136 - Flags: review?(rhubscher) → review+
Comment 22 • 8 years ago (Reporter)
Thanks! Fixed in https://github.com/mozilla-services/server-syncstorage/commit/a9d4614582917260069d65885ca0514f8edaa8e7
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 24 • 8 years ago (Reporter)
This is RESOLVED/FIXED; should we clear the security flag here?
Flags: needinfo?(gguthe)
Comment 25 • 8 years ago
(In reply to Ryan Kelly [:rfkelly] from comment #24)
> This is RESOLVED/FIXED, should we clear the security flag here?
Yep.
Group: cloud-services-security
Flags: needinfo?(gguthe)
Updated • 2 years ago
Product: Cloud Services → Cloud Services Graveyard