Closed Bug 1111579 Opened 10 years ago Closed 9 years ago

Hawk session and FxA account may not be in sync.

Categories

(Hello (Loop) :: Server, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: alexis+bugs, Assigned: alexis+bugs)

References

Details

Attachments

(1 file, 4 obsolete files)

When connected on the loop server using its FxA account, Standard8 found that he wasn't able to make direct calls.

My assumption is that the hawkid isn't associated to a FxA user. If that's the case, then we need to find out what's causing that.
Some quick tests on prod show that we actually have a disconnection between what the client think is true and what the server think is true.

The hawk id Standard8 is using does not have any FxA account associated with it. After creating a new account on a fresh profile, the server has a FxA account tied to the hawk id.

I'm now investigating what could be the root cause here. Since we're using the OAuth flow multiple times, I'm checking if it's possible to have an expired hawk session and then get it back, without setting the right FxA user alongside.
Blocks: 1110309
Alexis -- Standard8 and I were just discussing in irc how critical this bug is turning out to be.  Are you officially taking this bug?  

I ask because it's currently unassigned, and I know you, Remy and Tarek usually assign a bug to yourselves when you actively work on a problem.  

I think this bug may be the most important unresolved server bug at the moment for the Loop client team.

Thanks.
Flags: needinfo?(alexis+bugs)
One of the big reasons we need this fixed is to verify that bug 1110309 is resolved when this server bug is fixed and bug 1102432 is landed.  (Bug 1102432 should land tomorrow.)

Also NOTE: Due to the holidays, we need to get all our client-side bugs checked in by this week if we want them to make it into the Fx 35 release (which goes to release in mid-January).  The last Beta build for Fx35 Desktop is Dec 22.
Yes, I'm actively working on this. I'm trying to find out where the faulty bits are!
Assignee: nobody → alexis+bugs
Flags: needinfo?(alexis+bugs)
After further investigation on my side, it turns out I'm still left without much information as to what could be causing the trouble here. I'm putting this here mainly for reference for later.

First of all, I haven't been able to reproduce the issue. This is all running on my local machine, so there are no problems sharing the hawk session etc. here.

Here is what I've done:

- Connect with one profile (Mr Tee) which is using the c452eac367368f2491181ab5a54ad19dad28266a55f66bf409bab67f1ea50cb4 hawk id;
- In there, connect with a valid FxA account;
- Create a bunch of rooms using this profile, they are stored on the server;
- Start a new profile (Ms Bee) and connect to the same FxA account. I can see all the created rooms without any problem. The FxA hawk id for this profile is 8ebabab8b4f7ee54896b826e8d1784c38c3ad6d19d0f53a0003920f7b705c7eb

Then, to continue experimenting, I've tried to remove the association between Mr Tee and the FxA user. To do that, I've done (The id used in the db is a hmac of the hawk id and a secret-to-the-server salt). I don't know what would be causing this, but it seems the server was in this state when Mark reported the error. In theory, the hawk session and the FxA associated information do have the same TTL and should expire all together. I'm not sure why this wasn't the case for him.

> redis-cli del hawkuser.c11530b6ec2d915b88ce7f146e168fc6c226e3e2aba0387c7b6390e68e51f419
> (integer) 1

So one key had been removed from the db.

If I try to connect back with MrTee hawk session token, I can see that no rooms are associated to my user.

Now, if I login again with this user, I don't see my rooms straight after login. Inspecting the headers, I see the hawk id is 55c955201bf53c45b9dc33338056475d4cb62ac57df5f68f680c1d6c95ae2c91

I create a room and name it "Bananas", the UI shows I'm connected using my FxA user.

After a browser restart I can see the full list of rooms I had before, plus the new "Bananas" one. The hawk id is still 55c955201bf53c45b9dc33338056475d4cb62ac57df5f68f680c1d6c95ae2c91.
This was causing the FxA to be out of sync. As such, all the information that's
stored on behalf of the user was stored with a different key.

All users will need to log-out and log-in again to abtain the FxA profile
information again.
Attachment #8538538 - Flags: review?(mathieu)
Attachment #8538538 - Flags: feedback?(standard8)
Status: NEW → ASSIGNED
I should have said: "all affected users", which I believe is non likely, especially because this touches features that haven't landed into release yet.
I've updated the dev environment if you want to test. https://loop-dev.stage.mozaws.net/
Comment on attachment 8538538 [details] [diff] [review]
Touch FxA when touching the session.

Review of attachment 8538538 [details] [diff] [review]:
-----------------------------------------------------------------

::: test/storage_test.js
@@ +1003,5 @@
> +        });
> +
> +        it("should touch user session and associated account data", function(done) {
> +          // First, setup keys with an expiration in the future (800s)
> +          storage._settings.hawkSessionDuration = 800;

Redundant with beforeEach ?

@@ +1021,5 @@
> +                    expect(data).to.eql(null);
> +                    storage.getHawkUser(idHmac, function(err, data) {
> +                      if (err) throw err;
> +                      expect(data).to.eql(null);
> +                      storage.getHawkUser(idHmac, function(err, data) {

Why getHawkUser twice ?
Attached patch Patch update with fixed tests. (obsolete) — Splinter Review
Attachment #8538538 - Attachment is obsolete: true
Attachment #8538538 - Flags: review?(mathieu)
Attachment #8538538 - Flags: feedback?(standard8)
Attachment #8538576 - Flags: review?(mathieu)
Attachment #8538576 - Flags: feedback?(standard8)
Attachment #8538576 - Attachment is obsolete: true
Attachment #8538576 - Flags: review?(mathieu)
Attachment #8538576 - Flags: feedback?(standard8)
Attachment #8538579 - Flags: review?(mathieu)
Attachment #8538579 - Flags: feedback?(standard8)
(In reply to Alexis Metaireau (:alexis) from comment #7)
> I should have said: "all affected users", which I believe is non likely,
> especially because this touches features that haven't landed into release
> yet.

That's a bit strange, as I've seen this on my production profile. Although I guess you're referring to rooms?

Could this have happened with the call urls setup, or were we behaving differently?
Comment on attachment 8538579 [details] [diff] [review]
Hmm, I'm practicing this git-bz tool fu. I'm not quite there yet. This time should be the right one.

The idea seems reasonable, I've not looked at the code though.
Attachment #8538579 - Flags: feedback?(standard8) → feedback+
It actually seems to be more impacting than I originally tough: I believe this affects all connected FxA users, or will after one month of usage without reconnecting to FxA using the OAuth flow.

When a user is connected with its FxA information, we're using (a derivation of) their id as the primary key to store the associated data (call-urls, rooms, SP urls).

Once we lose this connection to FxA, we are storing and retrieving the information with a different key (a derivation of the hawk id).

It means all the users who have connected to Firefox Accounts will be in a broken state after one month (https://github.com/mozilla-services/loop-server/blob/master/loop/config.js#L280).

Once this hotfix is pushed to prod, users who logout and login again will get back to a normal behavior (losing the data they generated in the interim).

Mitigation plan
===============

Now, I'm not sure as to what the best course of action is. It seems authenticated users will need to register again. We need to actually make them reconnect, and I believe this means having to remove their sessions. Or maybe there is a better way to handle this, Mark?

1. find all users who are in this intermediate state.

   We should be able to know that by finding users who have a
  `hawk.{hawkIdHmac}` and a `userid.{hawkIdHmac}` key but no `hawkuser.{hawkIdHmac}` one.

   I don't know how many users this means on prod.

2. Remove the tokens that are related to them. It will cause them to see a red thing saying they need to login again. Once they're logged-in, everything should be working well again.

I'm NI Standard8 as he might have some more ideas as to how to do the OAuth dance again without troubling too much the user + Adam, Maire and Tarek for awareness.
Flags: needinfo?(tarek)
Flags: needinfo?(standard8)
Flags: needinfo?(mreavy)
Flags: needinfo?(adam)
Okay, I've tagged a hotfix release for what we currently have in production, it's 0.13.3. See the comparison at https://github.com/mozilla-services/loop-server/compare/0.13.2...0.13.3
Blocks: 1113411
OK, we're currently deploying a hotfix on the prod servers. This will stop having new users impacted.

Now to find how many user were impacted so far, and we need to take a decision as to how we make them log in again.
Attached file Counts the number of impacted users. (obsolete) —
Thanks for the heads up.  Looks like the hotfix has gone out already.  Thank you!
Flags: needinfo?(mreavy)
No longer depends on: 1114374
Depends on: 1114374
No longer depends on: 1114374
(In reply to Alexis Metaireau (:alexis) from comment #14)

> 1. find all users who are in this intermediate state.
> 
>    We should be able to know that by finding users who have a
>   `hawk.{hawkIdHmac}` and a `userid.{hawkIdHmac}` key but no
> `hawkuser.{hawkIdHmac}` one.


Okay, so this gets us all the tokens associated with accounts in this state. Is there any way to figure out which URLs they have generated and migrate them so they're not lost? If so, we need to migrate those URLs as part of the account fixup.
Flags: needinfo?(adam)
So, we actually changed a bit the script to be more parallel and the number of impacted users is 6011 over 38198 (FxA users). This is approx 16% of our FxA-connected users.

The script is at https://gist.github.com/ametaireau/2c4697da0750dcc373fe.

Adam, about migration, we're thinking about doing the following:

When users are sending an authenticated request, we can find out if they're impacted or not. In case they are, we can get back their associated email (it's stored somewhere else, but we can only read it when we have their hawk id) and store it in the way we need it.

As a result, users would see some rooms back in their room-list, and some call-urls that stopped working should be working again. It could be a little bit disturbing for them to see these appearing, but I believe that would be the good way forward.

Mark, what are your feelings about that?
Flags: needinfo?(tarek)
I'm tempted to say that as we're still sort-of in beta, and there's not a massive amount of users (compared to potential user-base), then we could - if it works - just remove the hawk ids, and force them to re-log in again.

This could loose some room data, but at least they'd be forced back to a working state with the least code complexity on our side.

If the numbers were different, then yes, I'd probably go for your option more.

We should just see what Adam thinks though.
Flags: needinfo?(standard8) → needinfo?(adam)
Also we have sorted out a way that we can use to fix broken account the next time they use Hello.
I wonder if it is not too risky to remove broken account.
(In reply to Mark Banner (:standard8) from comment #21)
> I'm tempted to say that as we're still sort-of in beta, and there's not a
> massive amount of users (compared to potential user-base), then we could -
> if it works - just remove the hawk ids, and force them to re-log in again.
> 
> This could loose some room data, but at least they'd be forced back to a
> working state with the least code complexity on our side.
> 
> If the numbers were different, then yes, I'd probably go for your option
> more.
> 
> We should just see what Adam thinks though.

I'm hesitant to add more server code to do this. If it were just a database operation that we could run once, that would be fine -- but if we need to check for impact on login, let's just let the lost URLs be lost.
Flags: needinfo?(adam)
(In reply to Adam Roach [:abr] from comment #23)
> (In reply to Mark Banner (:standard8) from comment #21)
> > I'm tempted to say that as we're still sort-of in beta, and there's not a
> > massive amount of users (compared to potential user-base), then we could -
> > if it works - just remove the hawk ids, and force them to re-log in again.
> > 
> > This could loose some room data, but at least they'd be forced back to a
> > working state with the least code complexity on our side.
> > 
> > If the numbers were different, then yes, I'd probably go for your option
> > more.
> > 
> > We should just see what Adam thinks though.
> 
> I'm hesitant to add more server code to do this. If it were just a database
> operation that we could run once, that would be fine -- but if we need to
> check for impact on login, let's just let the lost URLs be lost.

I'm in favor of the simplest approach.  We're actually in beta for Hello Beta.  If some data gets lost as a result of finding and fixing bugs, that's not great, but it's acceptable IMO.

If we do decide to let "lost URLs be lost", I suggest we post somewhere (dev-media, elsewhere?) what's happening to minimize surprise to users.  This should also keep bug reports down, and it would give us a place to refer users for an explanation if they ask, "What happened and why?".
Hi Alexis -- Is this bug just waiting on a review to land?  How soon can this be deployed?  I need a server fix for this to be deployed so I can verify that the client bugs this blocks (Bug 1109102 and Bug 1110309) are resolved.  Fx35 goes to release Jan 13, and the last Beta build is Monday (29th of Dec) and the release candidate is built 5th of Jan.  So I have maybe a week to fix a client-side issue if there is one (due to how the desktop release process works).  Thanks.
Flags: needinfo?(alexis+bugs)
This is already deployed on the server side (per Comment 16). It's not marked as fixed since we still need to clean the database entries that are wrong.
Flags: needinfo?(alexis+bugs)
Attachment #8538579 - Flags: review?(mathieu) → review+
Attachment #8538579 - Attachment is obsolete: true
Attachment #8538858 - Attachment is obsolete: true
Attachment #8544005 - Flags: review?(rhubscher)
Attachment #8544005 - Flags: review?(rhubscher) → review+
https://github.com/mozilla-services/loop-server/commit/1fa0676b3abf41fbabbf1fb57532275b7533d3f2
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: