Closed Bug 747914 Opened 12 years ago Closed 12 years ago

don't index questions in elastic search where creator is not active

Categories

(support.mozilla.org :: Search, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: willkg, Unassigned)

Details

We shouldn't index questions where the creator is not active.

Two things here:

1. check to see if the link for a question where the creator is not active kicks up a 404 (morbid curiosity)

2. add a get_indexable() to the Question model that excludes questions where the creator .is_active is false
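
A minimal sketch of item 2, written in plain Python rather than Django so it runs standalone. The `User` and `Question` classes are hypothetical stand-ins for the real models; only the names `get_indexable()` and `is_active` come from the description above. In the real model this would presumably be a queryset filter on the creator's `is_active` flag, which is an assumption and has not been checked against the codebase.

```python
from dataclasses import dataclass


@dataclass
class User:
    # Hypothetical stand-in for the real user model.
    username: str
    is_active: bool = True


@dataclass
class Question:
    # Hypothetical stand-in for the real Question model.
    id: int
    title: str
    creator: User

    @staticmethod
    def get_indexable(questions):
        """Return only the questions whose creator is active."""
        return [q for q in questions if q.creator.is_active]


# Made-up sample data, for illustration only.
questions = [
    Question(1, "Firefox crashes on startup", User("alice")),
    Question(2, "Buy cheap watches", User("spammer", is_active=False)),
    Question(3, "How do I sync bookmarks?", User("bob")),
]

indexable_ids = [q.id for q in Question.get_indexable(questions)]
print(indexable_ids)
```

Question 2 is dropped because its creator is inactive; the other two would be passed on to the indexer.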
I checked my db and there are only 200 questions that would get excluded out of like 180,000. So this doesn't affect much and probably should be a low priority issue to work on.
Don't we delete these submissions after a day or two, so it's a much higher ratio?
Ooops, I totally could have checked this myself: nope... but in the past week, it's 14/272 = 5% (Which is pretty significant)
(In reply to [:Cww] from comment #3)
> Ooops, I totally could have checked this myself: nope... but in the past
> week, it's 14/272 = 5% (Which is pretty significant)

I have no idea what you're talking about. Are you replying to the right bug?
14 threads from the last week (out of 272 total new threads) didn't have an active creator, so it's 5% of threads moving forward. (It's not really fair to include all the threads from before we had is_active.)

Not going to fight your decision, but I wanted to correct your metric.
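
For reference, the two ratios being compared in this thread can be worked out as follows. The figures are the ones quoted in the comments above; the variable names are only for illustration.

```python
# Figures quoted in the comments above.
inactive_all_time, total_all_time = 200, 180_000
inactive_last_week, total_last_week = 14, 272

all_time_pct = 100 * inactive_all_time / total_all_time
last_week_pct = 100 * inactive_last_week / total_last_week

print(f"all time:  {all_time_pct:.2f}%")   # 0.11%
print(f"last week: {last_week_pct:.2f}%")  # 5.15%
```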
(In reply to [:Cww] from comment #2)
> Don't we delete these submissions after a day or two so it's a much higher
> ratio.

No.

(In reply to [:Cww] from comment #3)
> Ooops, I totally could have checked this myself: nope... but in the past
> week, it's 14/272 = 5% (Which is pretty significant)

This same percentage was called not-important by Kadir in IRC. Hopefully we can get some agreement on that.
So, I wrote this bug to cover not indexing questions with a creator marked as not active. 

As far as I can tell, the effects of this bug are not noticeable. You won't see these questions show up in search results unless you try really hard and search with the Advanced Search form. This doesn't affect users or contributors.

The only effect this has is on the total amount of time it takes to reindex. Given that it's 200 questions out of 180,000 or thereabouts, that's a pretty minute number and won't affect the total time it takes to reindex by a measurable amount. It takes 11 seconds to index 1000 documents on my machine. It takes slightly longer in production. A full indexing run takes about 40 minutes on my machine and between 50 and 59 minutes in production.

Cww brings up growth over time. Let's assume we increase that total of 200 problem documents by 20 documents every week for the next year (20 * 45). That's another 900 problem documents out of like 200,000 documents. That's 11 seconds or so of indexing time for problem documents a year from now. That's not enough of a difference to be noticeable. I'm pretty sure it would take longer to fix this bug than we'd save in reindexing over the next year.
Whoops--my off-the-cuff math is wrong. 20 * 45 is 900 (not sure why I wrote 45 earlier, but it's late), but we want 20 * 52 = 1040.
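
A rough version of the corrected estimate, assuming the rate of 11 seconds per 1000 documents measured on my machine holds and that indexing time scales linearly with document count (an assumption). The helper `index_seconds` is hypothetical and exists only for this calculation.

```python
SECONDS_PER_1000_DOCS = 11  # rate quoted above for the local machine


def index_seconds(doc_count):
    """Estimated indexing time, assuming time scales linearly with count."""
    return doc_count * SECONDS_PER_1000_DOCS / 1000


current_problem_docs = 200
added_per_week = 20
weeks_per_year = 52

added_in_a_year = added_per_week * weeks_per_year        # 1040
total_in_a_year = current_problem_docs + added_in_a_year  # 1240

print(index_seconds(current_problem_docs))  # about 2.2 seconds today
print(index_seconds(added_in_a_year))       # about 11.4 seconds for a year of growth
print(index_seconds(total_in_a_year))       # about 13.6 seconds in total
```

Either way it is on the order of ten seconds per full reindex, against a run of 40 minutes or more.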
That's fine. I wasn't sure if this covered the reported metrics (where we also talked about this problem) ... anyhow, all I wanted to do was clear up your metric.

(Also, we can consider not indexing old questions and time-limit our search to the last year or last 6 months or something which will shave a lot of time off indexing)
(In reply to [:Cww] from comment #9)
> 
> (Also, we can consider not indexing old questions and time-limit our search
> to the last year or last 6 months or something which will shave a lot of
> time off indexing)

We should discuss that in a different bug.
I think we should nix this bug. It's wrong-headed.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX