Closed Bug 1331197 Opened 7 years ago Closed 7 years ago

Remove unused users from the Django auth_user table on stage/prod

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: emorley)

References

Details

Attachments

(1 file, 1 obsolete file)

Once bug 1319246 & bug 1326205 are fixed, we can expedite the updating of legacy hashed usernames by purging unused users.

Unless a user has classified failures in the last 4 months, or owns Hawk credentials, or has updated/created visibility profiles, then we can just outright delete the user, since there is no useful state stored that needs to be preserved.

The following model fields ForeignKey to the User object:
* BugJobMap.user
* JobNote.user
* Credentials.owner
* JobExclusion.author
* ExclusionProfile.author
* UserExclusionProfile.user
The PerformanceAlert model (in perf/models.py) also references User.
A user has no effective stored state and so can be safely removed, if:
* No other tables reference that user in a foreign key
* The user does not have the is_staff or is_superuser bits set
* The user hasn't been disabled (is_active = 0)

The list of tables that foreignkey to auth_user can be found using:

> SELECT TABLE_NAME, COLUMN_NAME
FROM information_schema.KEY_COLUMN_USAGE
WHERE REFERENCED_TABLE_NAME = 'auth_user'

+ --------------- + ---------------- +
| TABLE_NAME      | COLUMN_NAME      |
+ --------------- + ---------------- +
| auth_user_groups | user_id          |
| auth_user_user_permissions | user_id          |
| bug_job_map     | user_id          |
| credentials     | owner_id         |
| django_admin_log | user_id          |
| exclusion_profile | author_id        |
| job_exclusion   | author_id        |
| job_note        | user_id          |
| performance_alert | classifier_id    |
| user_exclusion_profile | user_id          |
+ --------------- + ---------------- +
10 rows

The SQL to find users that can be safely removed is therefore:

USE treeherder;
SELECT u.id, u.username, u.email
FROM auth_user AS u
WHERE
u.is_superuser = 0 AND
u.is_staff = 0 AND
u.is_active = 1 AND
NOT EXISTS (SELECT user_id FROM auth_user_groups WHERE u.id = user_id) AND
NOT EXISTS (SELECT user_id FROM auth_user_user_permissions WHERE u.id = user_id) AND
NOT EXISTS (SELECT user_id FROM bug_job_map WHERE u.id = user_id) AND
NOT EXISTS (SELECT owner_id FROM credentials WHERE u.id = owner_id) AND
NOT EXISTS (SELECT user_id FROM django_admin_log WHERE u.id = user_id) AND
NOT EXISTS (SELECT author_id FROM exclusion_profile WHERE u.id = author_id) AND
NOT EXISTS (SELECT author_id FROM job_exclusion WHERE u.id = author_id) AND
NOT EXISTS (SELECT user_id FROM job_note WHERE u.id = user_id) AND
NOT EXISTS (SELECT classifier_id FROM performance_alert WHERE u.id = classifier_id) AND
NOT EXISTS (SELECT user_id FROM user_exclusion_profile WHERE u.id = user_id);

I had to use the `NOT EXISTS` pattern (https://explainextended.com/2009/09/18/not-in-vs-not-exists-vs-left-join-is-null-mysql/) since the `NOT IN` version fails if the subquery returns rows with NULL in the foreign key (eg http://blog.9minutesnooze.com/sql-not-in-subquery-null/), and several of the tables use DEFAULT NULL. 

Out of the 1230 users on prod, 1048 can be safely deleted, expediting the point at which we can remove the migration code added in bug 1319246.
Blocks: 1337987
No longer depends on: 1326205
Depends on: 1337999
Attached file SQL to run on prod (obsolete) —
I've tested this on stage and seems to work fine.

The "exclude rows referenced in foreign keys" part is pretty safe, since if the row was still referenced the DELETE would just fail anyway. The main other thing to check is that the state in the auth_user table itself isn't important (ie is_staff etc).

If it helps reviewing, the auth_user table schema is:

CREATE TABLE `auth_user` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `password` varchar(128) COLLATE utf8_bin NOT NULL,
  `last_login` datetime(6) DEFAULT NULL,
  `is_superuser` tinyint(1) NOT NULL,
  `username` varchar(150) COLLATE utf8_bin NOT NULL,
  `first_name` varchar(30) COLLATE utf8_bin NOT NULL,
  `last_name` varchar(30) COLLATE utf8_bin NOT NULL,
  `email` varchar(254) COLLATE utf8_bin NOT NULL,
  `is_staff` tinyint(1) NOT NULL,
  `is_active` tinyint(1) NOT NULL,
  `date_joined` datetime(6) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `username` (`username`)
) ENGINE=InnoDB AUTO_INCREMENT=1052 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

The `first_name` and `last_name` fields are not set for any user. The `password` field is unused (and unset in all but legacy cases, which we want to purge anyway), since we use SSO.
Assignee: nobody → emorley
Status: NEW → ASSIGNED
Attachment #8835553 - Flags: review?(james)
Attachment #8835553 - Flags: review?(james) → review+
Depends on: 1338206
Attached file SQL to run on prod v2
Attachment #8835553 - Attachment is obsolete: true
Run against prod:

966 row(s) affected
0.187 sec
Run against prod again, now that cycle_data has run (so means more users are not referenced by job_notes etc):

13 row(s) affected
0.125 sec
I've run this against prod/stage/dev again, with updated cut-off dates (set to 2 weeks ago).

This left 3 undeletable users who haven't logged in since Feb. Two of them now have a new user_id since they had to switch from their alias email to the LDAP email, so I manually updated the user ID in affected tables (for one it was job_exclusion.author_id and the other performance_alert.classifier_id).

The one remaining user is someone who left Mozilla some months ago, so I updated the exclusion_profile.author_id referencing them to point to Cameron instead (since he is the author for some of the other profiles).

With that, there are no users remaining using the old style usernames.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: