Closed Bug 571343 Opened 14 years ago Closed 14 years ago

Make user name and email address unique in Tiki

Categories

(support.mozilla.org :: Knowledge Base Software, task, P1)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jsocol, Assigned: paulc)

References

Details

Attachments

(2 files, 2 obsolete files)

The pastebin disappeared, so I'll reiterate: there are over 330 usernames with duplicates in the TikiWiki database. (One of these actually has 19 copies!)

Fortunately, duplicate emails are a much smaller problem: on my database I found zero.
I just saw the first "Multiple Items Returned" errors from production. You cannot be logged in to one of these accounts and use the Kitsune pages. We need to come up with a strategy for dealing with these duplicate accounts and patch Tiki to disallow duplicates.
All those users seem to belong to the live chat helpers category, and not anything else (i.e. contributors, registered).
(In reply to comment #3)
> All those users seem to belong to the live chat helpers category, and not
> anything else (i.e. contributors, registered).

What does that imply? Does it mean we can remove the duplicates?
I added Matthew to the CC list, so he can answer that.
Assignee: nobody → paulc
Severity: normal → critical
Summary: WHY do we allow duplicate usernames? → Make user name and email address unique in Tiki
Some stats... 332 usernames collide, for a total of 736 users that need to be handled somehow...

We need a way to decide which ones to keep as-is, and whether to rename or delete the others.

Note that, because of how tiki works, if we rename users, they will almost certainly lose ownership of their contributions.

I'll post again with an eval of the "damage" - that is, how many forum posts, KB pages, or any other data will be affected (i.e. was created by these users).
(In reply to comment #6)
> Some stats... 332 usernames collide, for a total of 736 users that need to be
> handled somehow...

Don't forget we also need unique email addresses, which was around 1300 collisions a couple days ago. Not sure what the overlap is there.

> We need a way to decide which ones to keep as-is, and whether to rename or
> delete the others.
> 
> Note that, because of how tiki works, if we rename users, they will almost
> certainly lose ownership of their contributions.

We should probably delete the least-recently-used and collapse any contributions as necessary. (When they're keyed on user name, Tiki couldn't tell the difference anyway.)

> I'll post again with an eval of the "damage" - that is, how many forum posts,
> KB pages, or any other data will be affected (i.e. was created by these users).

Awesome, thanks!
More stats...

Duplicate username damage:
* 1 KB page - users are jorge1621 (they all have the same email and consecutive userIds)
* 156 forum threads/posts
* 65 rows from tiki_user watches
* Are there any other important tables that I missed???

Another "interesting" fact:
* 3 users have empty emails (email = '').

This is a "email VS username overlap" done by userId's:
1. Duplicate users by username:           736
2. Duplicate users by email:              2787 (includes the 3 above)
3. Duplicate users by username AND email: 699 (317 distinct pairs)
4. (2) - (1):                             2088
5. (1) - (2):                             37
6. (1) - (3):                             37
7. (3) - (1):                             0
8. (1) intersect (2):                     699
9. (1) intersect (3):                     699
10.(1) intersect (2) intersect (3):       699

So it appears that only 37 of the users with duplicate usernames are actually legitimate distinct accounts, while the remaining 699 have the same email as well. Great!

We could consider merging each group of accounts with the same email into one account and emailing the person with the username that was kept. We could also mention that their username could be changed manually if they would rather have kept a different one.

If we go this route, only people in this group of 37 users will have to have their usernames changed, and I daresay we could do this by hand in the next few days?

Thoughts?
(In reply to comment #8)
> So it appears that only 37 of the users with duplicate usernames are actually
> legitimate distinct accounts, while the remaining 699 have the same email as
> well. Great!
I just realized that might be because changing the email address of one username changes it for all other matching usernames, heh. When one of these users was changing their email, it was changing the email of all other usernames matching. So we can't really tell if these 699 accounts were actually from different people. They probably were...

Nuf said. Argh.
For posterity: looking at those 37 usernames (which do not have the same email), a good percentage of these people probably followed these steps:
1. Register with the wrong email (typo)
2. No email received? Hit back in browser, or re-register within a few minutes
Result:
Same user, different email (typically by one character)

The reason they can register the same user is probably replication lag, since the way TikiWiki enforces uniqueness is in the PHP script only, and not in the database.

I've talked to the dev team and it looks like we'll be going this route:
* Collapse each group of duplicate emails to the MRU account (the most recently used, by login).
* Collapse each group of remaining duplicate usernames (~37 accounts total, per above) to the MRU account.
(In reply to comment #10)
> The reason they can register the same user is probably replication lag, since
> the way TikiWiki enforces uniqueness is in the PHP script only, and not in the
> database.

This gives me hope that adding a UNIQUE constraint on the table will not break too much.
Whew! This script does a lot of things to make usernames and emails unique. (So many, in fact, that I think we should have 2 reviewers.)
* Truncates long usernames
** one of these has been recently active (in 2010) so could be notified
* Registers unmatched Kitsune users (ones who are either mismatched by ID or whose IDs don't exist)
* Then, by email first and username second:
** updates all the tables relying on user ids or usernames (except tiki_forum_reads, which is inaccurate anyway)
** deletes all users by duplicate email first (~1515 deletions for the email, 23 remaining for the usernames)
* Creates and drops some indexes to speed up the process (by a factor of 3-5 on my machine).
* Ensures that the two tables are completely in sync (some emails have been updated in the Tiki table but not the Kitsune table) [1]
* Adds unique keys on the users_users table

One thing it does not do and needs to be done is adding a UNIQUE key on the auth_user.email column - I think we should use migrations and have a django model for that.

[1] Should we keep emails in sync when they change in Tiki? (Separate bug)
Attachment #458899 - Flags: review?(james)
Attachment #458899 - Flags: feedback?
(In reply to comment #12)
> One thing it does not do and needs to be done is adding a UNIQUE key on the
> auth_user.email column - I think we should use migrations and have a django
> model for that.

We can't actually do that, without hacking Django itself we'd break the model validation. Search for information about logging in by email address in Django yields some interesting results, but I think we're stuck either 1) writing our own User class or 2) enforcing uniqueness in code.

> [1] Should we keep emails in sync when they change in Tiki? (Separate bug)

Yes.
(In reply to comment #13)
> (In reply to comment #12)
> We can't actually do that, without hacking Django itself we'd break the model
> validation. Search for information about logging in by email address in Django
> yields some interesting results, but I think we're stuck either 1) writing our
> own User class or 2) enforcing uniqueness in code.
> 
> > [1] Should we keep emails in sync when they change in Tiki? (Separate bug)
> 
> Yes.
Filed bug 580751 for [1]. It seems that, because we enforce email uniqueness in Tiki and keep Django in sync, patching Kitsune code can wait until we drop the users_users table.
Comment on attachment 458899 [details] [diff] [review]
v1, script performing all the database updates and adding unique keys

Do we need to drop the indexes you create? It seems like they might create some speed benefits site-wide.

Also, let me introduce you to my friend, printf (http://us2.php.net/printf):

printf("done in %.2f seconds\n", $local_end - $local_start);

No need to change it now, just for future reference ;)
(In reply to comment #15)
> Comment on attachment 458899 [details] [diff] [review]
> v1, script performing all the database updates and adding unique keys
> 
> Do we need to drop the indexes you create? It seems like they might create some
> speed benefits site-wide.
I don't see why we should have to drop them. I suppose the only downside to keeping them is potential slowdown on INSERTs, but the majority of action should be SELECTs
> 
> Also, let me introduce you to my friend, printf (http://us2.php.net/printf):
I just said hello earlier today. Admittedly, I copied an existing script without bothering to think about it.

Anything else before I file a v1.1? :P
I wonder if we should re-nullify Django passwords at the end, to make sure the correct one gets stored.
(In reply to comment #17)
> I wonder if we should re-nullify Django passwords at the end, to make sure the
> correct one gets stored.
I don't see the harm in that, since logging in through Django is still ways away.
The only catch is that now you have to drop the indexes before starting the script over (if it fails at any point after creating them).
Attachment #458899 - Attachment is obsolete: true
Attachment #459956 - Flags: review?(james)
Attachment #458899 - Flags: review?(james)
Attachment #458899 - Flags: feedback?
Comment on attachment 459956 [details] [diff] [review]
v1.1, makes our friend happy, keeps indexes, and empties Django passwords

This works for me, and I'd like to get more eyes on the result.

File an IT bug to run this on stage after you commit it, and make sure it blocks this. Thanks!
Attachment #459956 - Flags: review?(james) → review+
The only change is on line #620 of the patch file.
Attachment #459956 - Attachment is obsolete: true
Attachment #460352 - Flags: review?(james)
Comment on attachment 460352 [details] [diff] [review]
v1.2, s/user_id/userId for deletion in users_users

OK, hopefully between the two of us we've caught everything. It looks good to me.
Attachment #460352 - Flags: review?(james) → review+
Depends on: 582117
Checked in on trunk, r71401.
Filed bug 582117 to run the script.
Script ran on staging (minor SQL tweak done in bug 582117). Calling this done.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: