Bug 1595244 Comment 10 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Hi, all! Here is the first draft of the model. I've highlighted the lines you should lift into the product: https://github.com/mozilla-services/fathom-login-forms/blob/5d02dbbe7c7f2f43dd7f3a39e1ee5e44a43b0011/new-password/rulesets.js#L6-L265.

Obviously, we're beating the pants off what the product is doing now, recall-wise. :-) 2-4% → 88%. Can't very well beat 100% precision, but we're cranking it as close as we can!

Here are the current numbers:

      Training accuracy per tag:  0.98057    95% CI: (0.96919, 0.99194)
                            FPR:  0.00233    95% CI: (0.00000, 0.00688)
                            FNR:  0.07353    95% CI: (0.02966, 0.11740)
                      Precision:  0.99213    Recall: 0.92647

    Validation accuracy per tag:  0.96129    95% CI: (0.93982, 0.98276)
                            FPR:  0.02459    95% CI: (0.00516, 0.04402)
                            FNR:  0.09091    95% CI: (0.02155, 0.16027)
                      Precision:  0.90909    Recall: 0.90909

       Testing accuracy per tag:  0.93333    95% CI: (0.90358, 0.96309)
                            FPR:  0.05000    95% CI: (0.01979, 0.08021)
                            FNR:  0.11429    95% CI: (0.03975, 0.18882)
                      Precision:  0.86111    Recall: 0.88571
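For anyone reading the readouts cold: the four metrics are the standard confusion-matrix quantities, and the CIs are ordinary normal-approximation intervals on a proportion. A quick sketch of the arithmetic (ours, for illustration — not the trainer's actual code):

```javascript
// Standard confusion-matrix metrics, as shown in the readouts above.
// tp/fp/tn/fn are counts of true/false positives/negatives over tags.
function metrics(tp, fp, tn, fn) {
  const total = tp + fp + tn + fn;
  return {
    accuracy: (tp + tn) / total,
    fpr: fp / (fp + tn),        // false-positive rate
    fnr: fn / (fn + tp),        // false-negative rate
    precision: tp / (tp + fp),
    recall: tp / (tp + fn),
  };
}

// Normal-approximation 95% confidence interval for a proportion p
// observed over n trials, clamped to [0, 1]:
function ci95(p, n) {
  const margin = 1.96 * Math.sqrt((p * (1 - p)) / n);
  return [Math.max(0, p - margin), Math.min(1, p + margin)];
}
```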

Validation results are getting decent, at a false-positive rate of 2.5%†. The blind testing set, which I just used for the first time, raises that to 5%, so there's still room to improve. Though, if you wanted to launch today, you could crank the FPR down to 2% by setting the confidence cutoff to 0.8. You'd go from finding 88% of new-password fields to 80%, but that's probably a good tradeoff for this application.

    With 0.8 cutoff:

    Testing accuracy per tag:  0.93333    95% CI: (0.90358, 0.96309)
                         FPR:  0.02000    95% CI: (0.00060, 0.03940)
                         FNR:  0.20000    95% CI: (0.10629, 0.29371)
                   Precision:  0.93333    Recall: 0.80000
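Mechanically, the cutoff is just a threshold on each candidate's confidence; anything scoring below it is ignored. A hypothetical sketch (the `{element, confidence}` shape is ours for illustration, not the ruleset's actual API):

```javascript
// Treat only candidates whose confidence clears the cutoff as
// new-password fields. Raising the cutoff from 0.5 to 0.8 trades
// recall (fields found) for a lower false-positive rate.
function newPasswordFields(scored, cutoff = 0.5) {
  return scored.filter(({ confidence }) => confidence >= cutoff);
}
```

At the default 0.5, a candidate scoring 0.7 is accepted; at 0.8, it's dropped — which is where the recall dip from 88% to 80% comes from.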

There's also another knob, positive-weight, which should do something similar with slightly different tradeoffs. We haven't played with it yet.

Our next big step is to double the size of our corpus. We have the samples ready and will land them on Monday. They concentrate on the 100 domains Firefox users visit most often. Our biggest problem right now is success: we've fit the training corpus as closely as we reasonably can—upward of 98%—so there's not much more to examine to inspire new sources of signal. An infusion of samples should knock that percentage back down and give us grist for the feature mill. And more validation samples will dilute any unrepresentativeness between the training and validation sets, though the two have tracked very closely so far.

We'll keep hammering on this until April, and you can feel free to draw from our improvements or not, as your comfort allows. Otherwise, you can grab them in a later release.

One last thing: we've been tracking performance numbers as we go. Here are the current ones, from a test-corpus run:

    Time per page (ms): 10 ▇█▅▁       61    Average per tag: 4.3

The first part is a histogram showing how long Fathom would take to consider every text-input on a page. This doesn't really apply to you, since you're firing on click. The other number is the interesting one: 4.3ms per input field. We haven't run a profiler yet, nor have we done any feature ablation (which removes rules that don't provably improve our numbers). We plan to do at least the latter before the end.

Let's keep in close touch as you start integration. Have a great weekend!

† Note that the readouts now say FPR rather than the FP I was pasting in earlier comments. We changed our false-positive math to conform to the more common formula, which is more demanding: false positives / (false positives + true negatives). It now represents the proportion of non-new-password fields spuriously identified as new-password fields. Before, it was the proportion of total fields spuriously identified as new-password fields.
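To make the footnote concrete, here are the two definitions side by side (illustrative only):

```javascript
// Old readout: spurious positives as a share of ALL fields.
function oldFpShare(fp, totalFields) {
  return fp / totalFields;
}

// New readout: the standard false-positive rate -- spurious positives as a
// share of only the fields that are NOT new-password fields. The smaller
// denominator makes this the stricter (larger) number.
function fpr(fp, tn) {
  return fp / (fp + tn);
}
```

With, say, 2 spurious hits among 100 fields of which 90 are not new-password fields, the old readout reports 2% while the FPR reports about 2.2%.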
