Closed Bug 1474755 Opened 3 years ago Closed 3 years ago

Err on the side of autofilling too much rather than too little by reducing the frecency threshold from the mean + one standard deviation to simply the mean

Categories

(Toolkit :: Places, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla63
Tracking Status
firefox62 + fixed
firefox63 --- fixed

People

(Reporter: adw, Assigned: adw)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fxsearch])

Attachments

(2 files, 1 obsolete file)

(In reply to Drew Willcoxon :adw from bug 1464154 comment #57)
> I'm starting to think we should lower the threshold.  It's better to err on
> the side of autofilling too much rather than too little.  We're erring on
> the side of too little right now, it seems clear to me.  That's especially
> frustrating since the previous autofill algorithm filled too much, so we're
> basically regressing people's autofill right now.
> 
> Lower it to maybe just the average frecency, instead of the average + one
> standard deviation.
Blocks: 1464154
Do we have evidence this is an actual problem?
In general I agree with your view that we were autofilling too much before and now we autofill less, but at the same time I think we should not be in a hurry to push this out, because autofilling a lot also has its downsides, and we could already be sitting at a good compromise.
I'd personally wait one release cycle to collect feedback, unless we have obvious evidence this is affecting most users.
Anecdotally, yes: Marco in bug 1464154, Brian in person, and a couple of people on Slack have all commented on it. But in those cases I don't know for sure that lowering the threshold is the best fix for their problems, and I don't know whether bug 1467627 and bug 1469651 helped them.

The mean is still large compared to the long tail, so we wouldn't have the earlier problem where low-frecency sites were autofilled.

I'm also worried about having to answer questions from management about why we regressed autofill, if in fact we get a lot of bad feedback about it on release.

I'll clear the request for now.
Attachment #8991155 - Flags: review?(mak77)
We could try some more testing, or ask people to test internally at Mozilla. Probably best if that happens on beta 62. What do you think? Maybe we could come up with an outline of how to test this out and what to expect.
There is a bias problem here: if something was autofilled before, people may think (by muscle memory) it should still be filled. To reduce that bias we'd need to let the feature bake for a while.
Then we could surely ask people to report any case where
1) they think they use a page often enough to be filled, but it's not
2) they think they didn't visit a page often enough, but it's filled

Actually, this kind of request was posted at least twice in the Firefox desktop Nightly blog posts, but we didn't get many reports (most reports came internally through Slack or IRC, or as directly filed bugs due to broken muscle memory).

It sounds more like something for an A/B experiment; success could be measured by checking how many times:
a) the user types the start of an origin, we don't autofill, and the user then visits a page in that origin
b) the user types the start of an origin, we autofill, and the user goes elsewhere

We could actually already add some kind of probe to measure autofill efficiency based off these.
Component: Autocomplete → Places
Whiteboard: [fxsearch]
(In reply to Marco Bonardo [::mak] from comment #6)
> There is a bias problem here, if something was autofilled before, people may
> think (by muscle memory) it should be filled. For reducing that bias we'd
> need to let the feature bake for a while.

If we used to autofill something and users found it useful enough to expect to keep it working, that doesn't seem like a bad bias but a bar we should meet.
(In reply to Dão Gottwald [::dao] from comment #7)
> If we used to autofill something and users found it useful

We don't know, without proper measurement. The only thing we have is a couple of cases where users found it surprising that a domain was not autofilled, and in at least one of those cases the feature was misused (a keyword or a search alias would have worked much better).
We also used to autofill a lot of stuff that users didn't find useful, and we have various bugs filed about that and users who disabled autofill because "I typed a wrong domain once and it keeps suggesting it to me".

With the data we currently have at hand, we can't make any decision except by guessing based on our experience.
(In reply to Marco Bonardo [::mak] from comment #8)
> (In reply to Dão Gottwald [::dao] from comment #7)
> > If we used to autofill something and users found it useful
> 
> We don't know, without proper measurement. The only thing we have is a
> couple cases where users found surprising that a domain was not autofilled,

We do know from these reports that people found it useful, otherwise they wouldn't complain. This may not be representative, but it's valid anecdotal data that I don't think we should dismiss so easily, especially if we don't have better data.

> and at least in one of those cases the feature was mis-used (a keyword or a
> search alias would have worked much better).

Keywords and search aliases are harder to discover, so we can't really blame users for using a feature we provided them with in a way we may not have anticipated.

FWIW, I wasn't misusing the feature.

> We also used to autofill a lot of stuff that users didn't find useful, and
> we have various bugs filed about that and users who disabled autofill
> because "I typed a wrong domain once and it keeps suggesting it to me".

If you think that other user misused autofill, it's only fair to say that people disabling it because of a wrong suggestion didn't understand autofill either. It's designed to allow false positives: you can basically ignore them and use the awesomebar as you would have otherwise.

> With the data we currently have at hand we can't take any decision, if not
> by guessing based on our experience.

If we can't make a data-based decision either way, that seems like an argument for being conservative.
As a rough idea, I'd add two probes, measured only in profiles where autofill is enabled:
1. number of times the user picks a non-heuristic match whose url starts with the typed text
2. number of times the user picks a non-heuristic match when the heuristic match was autofill

(We already have a count of the autofill picks in URLBAR_SELECTED_RESULT_TYPES.)

Comparing these with the global number of picks could give us some estimate of the problem. The second probe would also give us an estimate whether autofilling domains is actually a good idea, since if most users pick something else when we autofill, then maybe we should autofill something else.
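The comparison described above can be sketched roughly as follows; all the names here are hypothetical helpers for illustration, not real Telemetry probe identifiers:

```python
def autofill_efficiency(picks_autofill, picks_other_when_autofill, total_picks):
    """Rough autofill-efficiency estimates from the proposed probes.

    picks_autofill: times the user picked the autofill heuristic result
    picks_other_when_autofill: times the user picked a non-heuristic match
        while the heuristic match was an autofill
    total_picks: all urlbar result picks
    """
    shown = picks_autofill + picks_other_when_autofill
    return {
        # How often autofill, when offered, was what the user wanted.
        "accept_rate": picks_autofill / shown if shown else 0.0,
        # How much of overall pick activity autofill accounts for.
        "share_of_picks": picks_autofill / total_picks if total_picks else 0.0,
    }
```

If most users pick something else when we autofill, the accept rate would make that visible directly.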
(In reply to Dão Gottwald [::dao] from comment #9)
> We do know from these reports that people found it useful, otherwise they
> wouldn't complain. This may not be representative but it's valid anecdotal
> data that I don't think we should dismiss this easily, especially if we
> don't have better data.

Personally, I've only seen cases where people found the lack of autofill surprising, not cases demonstrating a lack of usefulness.
Drew, being in an office, may have gotten better or more reports.
I also don't think we should dismiss this feedback easily (this bug is a P1); I just think it's still below the emergency threshold, which means we may want to do better measurement before moving.

> Keywords and search aliases are harder to discover, so we can't really blame
> users for using a feature we provided them with in a way we may not have
> anticipated.
> 
> FWIW, I wasn't misusing the feature.

I'm sorry, I don't know about your specific case. I was commenting on the wordreference case, where Marco used to pick a URL from history, edit the word, and submit. In that case something like tab-to-search or a keyword makes far more sense; it's not even clear what autofill would provide for that use case, probably nothing useful.
I shouldn't have said "misuse"; what I really mean is that we provide better features for that use case.

Can you elaborate on your cases where we don't autofill anymore? It could be useful to analyze those by checking stats and actual numbers.
(In reply to Marco Bonardo [::mak] from comment #10)
> As a raw idea, I'd add 2 probes, to be measured only in profiles where
> autofill is enabled:
> 1. number of times the user picks a non-heuristic match whose url starts
> with the typed text
> 2. number of times the user picks a non-heuristic match when the heuristic
> match was autofill

Yes! That's actually really useful to gather data for and analyze after a while in a dashboard.

> Comparing these with the global number of picks could give us some estimate
> of the problem. The second probe would also give us an estimate whether
> autofilling domains is actually a good idea, since if most users pick
> something else when we autofill, then maybe we should autofill something
> else.

I like where this discussion is going :)
Depends on: 1476354
Based on bug 1476354, I'm starting to think our distribution in some cases is not normal but very skewed, and thus we can't just add the stddev to the mean. That would make the threshold not merely too high, but wrong in general.
Maybe we should add the standard error of the mean, rather than the stddev:
(sum / count) + sqrt((squares - ((sum * sum) / count)) / (count * count))
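A quick sketch of what the two candidate thresholds work out to, given the running aggregates the formula above uses (Python purely for illustration; these aren't the real Places identifiers):

```python
import math

def thresholds(frecencies):
    """Compute candidate autofill thresholds from running totals.

    Mirrors the aggregates discussed in this bug (sum, sum of squares,
    count); the identifiers here are illustrative, not the real schema.
    """
    count = len(frecencies)
    total = sum(frecencies)
    squares = sum(f * f for f in frecencies)
    mean = total / count
    # Population variance via the sum-of-squares identity.
    variance = (squares - (total * total) / count) / count
    stddev = math.sqrt(variance)
    # Equals the sqrt((squares - sum^2/count) / count^2) term above.
    stderr = stddev / math.sqrt(count)
    return {
        "mean": mean,
        "mean_plus_stddev": mean + stddev,  # old threshold
        "mean_plus_stderr": mean + stderr,  # proposal in this comment
    }
```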
(In reply to Marco Bonardo [::mak] from comment #13)
> Based on bug 1476354, I'm starting thinking our distribution in some cases
> is not normal, but it's instead very skewed, and thus we can't just sum
> stddev to the average. that would make the threshold not too high, but wrong
> in general.
> Maybe we should sum the standard error of mean, rather than the stddev:
> (sum / count) + sqrt((squares - ((sum * sum) / count)) / (count * count))

I'm not sure that makes sense.  My understanding of the standard error of the mean is that it's useful when you're taking a small sample of a larger population.  Each set of samples you could take would produce a different mean, and the standard error of the mean is a measure of how much your sample mean differs from the population mean, if you could actually compute the entire population's mean.

But here we're not sampling at all; we are actually computing the population mean because we're using the entire population -- frecencies of all moz_places. So if we were to use the mean + standard error of the mean as the threshold, it would end up being simply the mean (which I think we should try, as I said :-)).

Your larger point is that we need to "correct" in some way for distributions that aren't normal.  I'm not sure about that either.  Maybe Dave has thoughts.
Flags: needinfo?(dzeber)
Bug 1476354 has more discussion.
Blocks: 1476354
No longer depends on: 1476354
(In reply to Drew Willcoxon :adw from comment #14)
> I'm not sure that makes sense.  My understanding of the standard error of
> the mean is that it's useful when you're taking a small sample of a larger
> population.

It's small (much smaller than the stddev, clearly) but nonzero. Actually, we are in some sense taking a sample of a larger population: all the frecency values the user *may* have. The frecency values we do have are a sample of the user's browsing.
Probably I'm stretching the concept a bit too much.

> Your larger point is that we need to "correct" in some way for distributions
> that aren't normal.

I don't think we should normalize things, at least for now; that's going to be expensive. I'm just saying we should not apply calculations meant for normal distributions (like mean + stddev).

Anyway, based on the fancy distribution reports I see (:ato also reported a similar thing) and the discussion here, I'm re-evaluating my decision regarding this bug: let's lower the threshold.
I'll file a separate bug for new telemetry probes, which I think will be useful in the future.
Blocks: 1476517
Comment on attachment 8991155 [details]
Bug 1474755 - Err on the side of autofilling too much rather than too little by reducing the frecency threshold from the mean + one standard deviation to simply the mean.

https://reviewboard.mozilla.org/r/256122/#review264598

Potentially we don't need origin_frecency_sum_of_squares anymore, right? I'm OK with keeping it for a while, in case we want to reuse it.
Attachment #8991155 - Flags: review+
Here's my 2c on a couple points:

> our distribution in some cases is not normal, but it's instead very skewed

In general, I expect these count distributions to be non-normal, if not very much so. I would work under that assumption until seeing data proving otherwise.

> thus we can't just sum stddev to the average. that would make the threshold not too high, but wrong in general.

I don't think we really have the concept of a "right" threshold here; rather, we are looking for the most generally useful one. You can always compute the value "mean + stdev" for any given dataset, but if I'm interpreting correctly, you mean that it doesn't have the usual interpretation it would for a normal distribution. For normally distributed data, we know that ~68% of it lies between (mean - stdev) and (mean + stdev), so 84% is less than (mean + stdev) overall. If the distribution is not normal, but rather skewed, there will be a different percentage less than (mean + stdev), but it will be >50% (since skewed implies mean > median, and the median is the 50% mark).

If we are looking for a threshold that is guaranteed to include Y% of the data, that needs to be done by computing percentiles of the dataset (which is likely more computationally demanding and requires sorting). Otherwise, I think it's fine to use (mean + k stdev) as a reasonable intuitive threshold for "some reasonable amount above the mean, which itself is some way out beyond the median/50% mark", but knowing that it doesn't translate to a specific coverage proportion. Instead, we can modify the multiplier k as a tuning mechanism. To get any more precise, we either need to collect some data in order to analyse the actual distributions, or else use a percentile-based threshold, which ensures a coverage proportion but is more intensive to compute.
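The coverage point above can be illustrated with a quick simulation on a skewed sample; the exponential distribution here is just a stand-in for frecency data, not the real thing:

```python
import random
import statistics

random.seed(0)
# A long-tailed sample standing in for frecency values (illustrative only).
data = [random.expovariate(1 / 100) for _ in range(10000)]

mean = statistics.mean(data)
stdev = statistics.pstdev(data)

def coverage(threshold):
    """Fraction of values at or below the threshold."""
    return sum(v <= threshold for v in data) / len(data)

# For a normal distribution, mean + 1*stdev would cover ~84%; on this
# skewed sample the coverage differs, but stays > 50% (mean > median).
cov_mean = coverage(mean)                 # ~63% for an exponential
cov_mean_stdev = coverage(mean + stdev)   # ~86% here, not 84%

# A percentile threshold guarantees a coverage proportion, at the cost
# of sorting the whole dataset.
p84 = sorted(data)[int(0.84 * len(data))]
```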

> Maybe we should sum the standard error of mean, rather than the stddev

As Drew points out, the standard error of the mean quantifies how much the mean itself would vary if we had many samples, and computed the mean from each sample. It is in fact (stdev / sqrt(count)). I don't think using this here would improve on the existing threshold, as we're not sampling, and it is in fact just a multiple of stdev (so it doesn't add any new information).

> So if we were to use the mean + standard error of the mean as the threshold, it would end up being simply the mean

Since stderr of the mean is (stdev / sqrt(count)), it wouldn't actually work out to 0. It would give a number that applies if you were to theoretically have multiple samples. :)
Flags: needinfo?(dzeber)
(In reply to Marco Bonardo [::mak] from comment #16)
> let's lower the threshold.

(In reply to Dave Zeber [:dzeber] from comment #18)
> I don't think we really have the concept of a "right" threshold here;
> rather, we are looking for the most generally useful one.
> 
> ...
> 
> Otherwise, I think it's fine to use (mean + k stdev) as a reasonable
> intuitive threshold ... Instead, we can modify the multiplier k as a
> tuning mechanism.

Marco, before I land this, do you want to go back to my original idea of storing the multiplier in a pref?  We'd be able to easily tweak it in Shield studies if we wanted to.

(In reply to Dave Zeber [:dzeber] from comment #18)
> > So if we were to use the mean + standard error of the mean as the threshold, it would end up being simply the mean
> 
> Since stderr of the mean is (stdev / sqrt(count)), it wouldn't actually work
> out to 0. It would give a number that applies if you were to theoretically
> have multiple samples. :)

OK, thanks for the reply.  My understanding was that you need to multiply the stderr by a correction factor (the finite population correction) that approaches zero once you sample a large portion of the population, or, in our case, the entire population.
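For sampling without replacement from a finite population, that correction factor is sqrt((N - n) / (N - 1)), which vanishes when the sample is the whole population. A tiny sketch (the helper name is hypothetical):

```python
import math

def fpc_stderr(stdev, n, N):
    """Standard error of the mean with the finite population correction.

    stdev: population standard deviation
    n: sample size; N: population size
    """
    return (stdev / math.sqrt(n)) * math.sqrt((N - n) / (N - 1))

# Sampling the entire population (n == N) drives the corrected stderr to 0,
# while a small sample leaves it close to the plain stdev / sqrt(n).
full = fpc_stderr(10.0, 1000, 1000)       # exactly 0.0
partial = fpc_stderr(10.0, 100, 100000)   # ~stdev / sqrt(n) = ~1.0
```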
Flags: needinfo?(mak77)
Needed to update test_autofill_origins.js
Marco, in the interest of saving time I posted a second patch to Phabricator that stores the standard deviation multiplier in prefs.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7973d85b70eb21b28b061653688fda2aee9e11dd
Attachment #8991155 - Attachment is obsolete: true
Flags: needinfo?(mak77)
Comment on attachment 8993189 [details]
Bug 1474755 - Change autofill threshold calculation to multiply the standard deviation by a constant, which is stored in prefs. r?mak

Marco Bonardo [::mak] has approved the revision.

https://phabricator.services.mozilla.com/D2233
Attachment #8993189 - Flags: review+
Ok, let's go for the pref. Since it's hard to guess a value by hand, I'm fine starting from 0.0, we'll have to add telemetry to be able to calculate a better value in the future.

I suggested a couple changes to the patch in Phabricator, and there's a failure in xpcshell-test that should be addressed, apart from that it looks good to me.
nevermind the float comments, let's just fix the xpcshell test.
Pushed by dwillcoxon@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4b7f106772af
Change autofill threshold calculation to multiply the standard deviation by a constant, which is stored in prefs. r=mak
https://hg.mozilla.org/mozilla-central/rev/4b7f106772af
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Attached patch Beta/62 patchSplinter Review
Approval Request Comment
[Feature/Bug causing the regression]:
Autofill improvements in bug 1239708

[User impact if declined]:
Some URLs that the user expects to be autofilled will not be autofilled.  This patch won't fix every single case and it's not intended to, but it should really help in many cases.

[Is this code covered by automated tests?]:
Yes

[Has the fix been verified in Nightly?]:
No

[Needs manual test from QE? If yes, steps to reproduce]:
No -- it doesn't really need it because this is just a simple change to a threshold value, and it's well covered by tests

[List of other uplifts needed for the feature/fix]:
None

[Is the change risky?]:
No

[Why is the change risky/not risky?]:
Small, simple fix that changes a threshold value

[String changes made/needed]:
None
Attachment #8993545 - Flags: approval-mozilla-beta?
Comment on attachment 8993545 [details] [diff] [review]
Beta/62 patch

Trusting y'all's discussion and your judgement here, for uplift to beta 11. 

I also like the idea of adding probes (for 63 anyway) and doing a/b testing for future changes, so that we can have a better idea of whether users are perceiving changes to be improvements or not.
Attachment #8993545 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
See Also: → 1478163
Marking qe-verify- based on comment 29.
Flags: qe-verify-