Closed Bug 1171274 Opened 9 years ago Closed 9 years ago

cull the MozReview user database and prevent it from bloating again

Categories

(MozReview Graveyard :: General, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Unassigned)

Details

Attachments

(4 files)

We have tons of bugs around user selection not working. Enough is enough. While there may certainly be bugs in Review Board and MozReview causing issues, we know one thing: our user database is large and most of the users in it will never interact with MozReview.

Here are some stats:

Total users:                 338,290
Users w/ '+' (no IRC nick):  336,748
Users w IRC nick:              1,542
Users submitting requests:       105
Users leaving reviews:           192

Let's say we've had 300 unique users of MozReview to date (which is probably a bit high). That's 0.09% of the total user database. If you throw in people with IRC syntax, that's 0.46%.

The overwhelming majority of users in the database are dormant users and are making it harder to find people in MozReview.

I think we should purge the MozReview database of users that aren't active and we should prevent them from returning. Here's what I'm thinking:

1) Change the user creation code to not import a new user automatically unless they have IRC nick syntax in their Bugzilla account
2) Write a script to delete users that don't have IRC nick syntax *and* have never interacted with MozReview
3) Bugzilla users can still have accounts created by logging in or by pushing changesets

Mozillians without IRC nicks in Bugzilla and who have never used MozReview will not be automatically imported into MozReview. This is unfortunate. The recourse is they log in once and problem solved. I don't think that's a major concern.

If we're worried about pruning the user database too far, perhaps we could ask the BMO people for a dump of active accounts within the past year and we could not delete them. I suspect that may add a few thousand more users, but it will be a far cry from 330,000. Progress is progress.
mozreview: limit which users are auto populated (bug 1171274); r?smacleod, r?mcote

MozReview currently has over 300,000 users and over 99.0% of them are
inactive. This is because MozReview currently populates all Bugzilla
users returned from search results.

Unfortunately, this mountain of inactive users creates chaos. They
take up space in autocomplete results. They conflict with valid
active users. They are a royal pain.

This commit establishes limits on what users are auto imported from
Bugzilla during searches. Instead of importing every user found by a
search, we instead only import users that have IRC nick syntax or
that exactly match the search query. This should drastically reduce
the number of users that are imported from Bugzilla during searches.

This commit is only half the solution to our excessive user database.
The other half is to purge the user database of inactive users. This
will be done in a separate commit.
Attachment #8615005 - Flags: review?(smacleod)
Attachment #8615005 - Flags: review?(mcote)
mozreview: INCOMPLETE ability to prune inactive users (bug 1171274)

We have over 300,000 next to worthless user accounts sitting around
inside MozReview. This commit adds functionality to prune them.

THIS COMMIT IS HALF FINISHED. THE CODE IS UNTESTED.
https://reviewboard.mozilla.org/r/10143/#review8901

smacleod could probably finish this commit in 1/4 of the time it will take me to figure out Django's ORM. Feel free to have it.
https://reviewboard.mozilla.org/r/10143/#review8925

::: pylib/mozreview/mozreview/bugzilla/models.py:125
(Diff revision 1)
> +    for u in User.objects.all():
> +        if u.id in active_uids:
> +            continue
> +
> +        print('will delete user %s' % (u.username))
> +        #BugzillaUserMap.objects.filter(user=u).delete()
> +        #u.delete()

this should work:
```python
BugzillaUserMap.objects.exclude(user_id__in=active_uids).delete()
User.objects.exclude(id__in=active_uids).delete()
```
https://reviewboard.mozilla.org/r/10143/#review8927

::: pylib/mozreview/mozreview/bugzilla/models.py:118
(Diff revision 1)
> +    for rr in ReviewRequest.objects.all():
> +        active_uids.add(rr.submitter.id)
> +        for u in rr.target_people:
> +            active_uids.add(u.id)
> +        for u in rr.participants:
> +            active_uids.add(u.id)

We should do this for both ReviewRequest and ReviewRequestDraft
Comment on attachment 8615005 [details]
MozReview Request: mozreview: limit which users are auto populated (bug 1171274); r?smacleod, r?mcote

https://reviewboard.mozilla.org/r/10141/#review9027

Hm, I'm a bit concerned that we might be breaking functionality that users like.  Do we have any idea how often a user searches for a reviewer by name?  If it's at all frequent, we might want to do what I suggest in https://bugzilla.mozilla.org/show_bug.cgi?id=1160419 instead.  That will return more hits, but it would make autocomplete a lot more useful, and you don't need to know the person's username/nick.

::: pylib/rbbz/rbbz/auth.py:15
(Diff revision 1)
> -from mozreview.models import get_or_create_bugzilla_users
> +from mozreview.bugzilla.models import (
> +    BZ_IRCNICK_RE,
> +    get_or_create_bugzilla_users,
> +)

I believe we normally structure these as

  from foo import (bar,
                   baz)
Attachment #8615005 - Flags: review?(mcote)
https://reviewboard.mozilla.org/r/10141/#review9033

> I believe we normally structure these as
> 
>   from foo import (bar,
>                    baz)

Hm messed up my indentation somehow, but I think you know what I meant.
https://reviewboard.mozilla.org/r/10141/#review9039

I'm fine with adding a time-based component to the importing. **anything** is better than what we have now. And, there should be no harm pruning users that have never used MozReview.
https://reviewboard.mozilla.org/r/10141/#review9297

As I understand the code, searching by name will still work perfectly fine (As long as the user has the ircnick syntax). The only previous use case this will break is finding non convention following BMO users when searching by a name - which I am fine with.
https://reviewboard.mozilla.org/r/10141/#review9299

Ah perhaps I misunderstood then.  If you can still search users with irc nicks by full name, then this is probably fine.
Comment on attachment 8615005 [details]
MozReview Request: mozreview: limit which users are auto populated (bug 1171274); r?smacleod, r?mcote

https://reviewboard.mozilla.org/r/10141/#review9399

This looks fine to me for the most part - I'm curious if this will have a large perf impact though, since we query for users quite a bit.
Attachment #8615005 - Flags: review?(smacleod) → review+
https://reviewboard.mozilla.org/r/10141/#review9027

> Hm messed up my indentation somehow, but I think you know what I meant.

I actually prefer this style over the currently used one (Review Board's convention). Let's switch. Going to drop this.
mozreview: limit which users are auto populated (bug 1171274); r=smacleod

MozReview currently has over 300,000 users and over 99.0% of them are
inactive. This is because MozReview currently populates all Bugzilla
users returned from search results.

Unfortunately, this mountain of inactive users creates chaos. They
take up space in autocomplete results. They conflict with valid
active users. They are a royal pain.

This commit establishes limits on what users are auto imported from
Bugzilla during searches. Instead of importing every user found by a
search, we instead only import users that have IRC nick syntax or
that exactly match the search query. This should drastically reduce
the number of users that are imported from Bugzilla during searches.

This commit is only half the solution to our excessive user database.
The other half is to purge the user database of inactive users. This
will be done in a separate commit.
mozreview: add function to prune inactive users (bug 1171274). r?gps

We have over 300,000 next to worthless user accounts sitting around
inside MozReview. This commit adds functionality to prune them.
Attachment #8627377 - Flags: review?(gps)
Comment on attachment 8627377 [details]
MozReview Request: mozreview: add function to prune inactive users (bug 1171274). r?gps

mozreview: add function to prune inactive users (bug 1171274). r?gps

We have over 300,000 next to worthless user accounts sitting around
inside MozReview. This commit adds functionality to prune them.
Attachment #8627377 - Flags: review?(gps)
Comment on attachment 8627377 [details]
MozReview Request: mozreview: add function to prune inactive users (bug 1171274). r?gps

https://reviewboard.mozilla.org/r/12217/#review10705

::: pylib/mozreview/mozreview/bugzilla/models.py:153
(Diff revision 2)
> +    for u in User.objects.filter(email__in=SPECIAL_EMAILS):
> +        active_uids.add(u.id)
> +
> +    for u in User.objects.filter(username__in=SPECIAL_USERNAMES):
> +        active_uids.add(u.id)
> +
> +    for u in User.objects.all():
> +        if (u.date_joined - u.last_login) > MAX_LOGIN_DIFFERENCE:
> +            active_uids.add(u.id)

I think it would be useful if we printed how many users came from where. That way we know how important each item is to keeping users around.

::: pylib/mozreview/mozreview/bugzilla/models.py:167
(Diff revision 2)
> +        for u in User.objects.filter(id__in=active_uids):

Can we sort this?

::: pylib/mozreview/mozreview/bugzilla/models.py:181
(Diff revision 2)
> +        User.objects.exclude(id__in=active_uids).delete()

I believe there are query length limits in MySQL. Want to take a bet on whether Django's ORM is smart enough to work around it? What I'm saying is be prepared to rewrite this.
Comment on attachment 8627377 [details]
MozReview Request: mozreview: add function to prune inactive users (bug 1171274). r?gps

mozreview: add function to prune inactive users (bug 1171274). r?gps

We have over 300,000 next to worthless user accounts sitting around
inside MozReview. This commit adds functionality to prune them.
Attachment #8627377 - Flags: review?(gps)
Comment on attachment 8627377 [details]
MozReview Request: mozreview: add function to prune inactive users (bug 1171274). r?gps

https://reviewboard.mozilla.org/r/12217/#review10707

::: pylib/mozreview/mozreview/bugzilla/models.py:190
(Diff revisions 2 - 3)
> -        User.objects.exclude(id__in=active_uids).delete()
> +        

Kill the whitespace.
Attachment #8627377 - Flags: review?(gps) → review+
https://hg.mozilla.org/hgcustom/version-control-tools/pushloghtml?changeset=5cbe65b797ec

Leaving this open until we've actually performed the user removal on production.
We pruned the user database a few weeks back and all looks well.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Developer Services → MozReview
You need to log in before you can comment on or make changes to this bug.