Bug 1595244 Comment 10 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Hi, all! Here is the first draft of the model. I've highlighted the lines you should lift into the product: https://github.com/mozilla-services/fathom-login-forms/blob/5d02dbbe7c7f2f43dd7f3a39e1ee5e44a43b0011/new-password/rulesets.js#L6-L265.

Obviously, we're beating the pants off what the product is doing now, recall-wise. :-) 2-4% → 88%. Can't very well beat 100% precision, but we're cranking it as close as we can!

Here are the current numbers:

      Training accuracy per tag:  0.98057    95% CI: (0.96919, 0.99194)
                            FPR:  0.00233    95% CI: (0.00000, 0.00688)
                            FNR:  0.07353    95% CI: (0.02966, 0.11740)
                      Precision:  0.99213    Recall: 0.92647

    Validation accuracy per tag:  0.96129    95% CI: (0.93982, 0.98276)
                            FPR:  0.02459    95% CI: (0.00516, 0.04402)
                            FNR:  0.09091    95% CI: (0.02155, 0.16027)
                      Precision:  0.90909    Recall: 0.90909

       Testing accuracy per tag:  0.93333    95% CI: (0.90358, 0.96309)
                            FPR:  0.05000    95% CI: (0.01979, 0.08021)
                            FNR:  0.11429    95% CI: (0.03975, 0.18882)
                      Precision:  0.86111    Recall: 0.88571
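For anyone reading the readouts cold: the four metrics are the standard confusion-matrix quantities, and the CIs are ordinary normal-approximation intervals on a proportion. A quick sketch of the arithmetic (ours, for illustration — not the trainer's actual code):

```javascript
// Standard confusion-matrix metrics, as shown in the readouts above.
// tp/fp/tn/fn are counts of true/false positives/negatives over tags.
function metrics(tp, fp, tn, fn) {
  const total = tp + fp + tn + fn;
  return {
    accuracy: (tp + tn) / total,
    fpr: fp / (fp + tn),        // false-positive rate
    fnr: fn / (fn + tp),        // false-negative rate
    precision: tp / (tp + fp),
    recall: tp / (tp + fn),
  };
}

// Normal-approximation 95% confidence interval for a proportion p
// observed over n trials, clamped to [0, 1]:
function ci95(p, n) {
  const margin = 1.96 * Math.sqrt((p * (1 - p)) / n);
  return [Math.max(0, p - margin), Math.min(1, p + margin)];
}
```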

Validation results are getting decent, at a false-positive rate of 2.5%†. The blind testing set, which I just used for the first time, raises that to 5%, so there's still room to improve. Though, if you wanted to launch today, you could crank the FPR down to 2% by setting the confidence cutoff to 0.8. You'd go from finding 88% of new-password fields to 80%, but that's probably a good tradeoff for this application.

    With 0.8 cutoff:

    Testing accuracy per tag:  0.93333    95% CI: (0.90358, 0.96309)
                         FPR:  0.02000    95% CI: (0.00060, 0.03940)
                         FNR:  0.20000    95% CI: (0.10629, 0.29371)
                   Precision:  0.93333    Recall: 0.80000
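Mechanically, the cutoff is just a threshold on each candidate's confidence; anything scoring below it is ignored. A hypothetical sketch (the `{element, confidence}` shape is ours for illustration, not the ruleset's actual API):

```javascript
// Treat only candidates whose confidence clears the cutoff as
// new-password fields. Raising the cutoff from 0.5 to 0.8 trades
// recall (fields found) for a lower false-positive rate.
function newPasswordFields(scored, cutoff = 0.5) {
  return scored.filter(({ confidence }) => confidence >= cutoff);
}
```

At the default 0.5, a candidate scoring 0.7 is accepted; at 0.8, it's dropped — which is where the recall dip from 88% to 80% comes from.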

There's also another knob, positive-weight, which should do something similar with slightly different tradeoffs. We haven't played with it yet.

Our next big step is to double the size of our corpus. We have the samples ready and will land them on Monday. They concentrate on the 100 domains Firefox users visit most often. Our biggest problem right now is success: we've fit the training corpus as closely as we reasonably can—upward of 98%—so there's not much more to examine to inspire new sources of signal. An infusion of samples should knock that percentage back down and give us grist for the feature mill. And more validation samples will dilute any unrepresentativeness between the training and validation sets, though the two have tracked very closely so far.

We'll keep hammering on this until April, and you can feel free to draw from our improvements or not, as your comfort allows. Otherwise, you can grab them in a later release.

One last thing: we've been tracking performance numbers as we go. Here are the current ones, from a test-corpus run:

    Time per page (ms): 10 ▇█▅▁       61    Average per tag: 4.3

The first part is a histogram showing how long Fathom would take to consider every text-input on a page. This doesn't really apply to you, since you're firing on click. The other number is the interesting one: 4.3ms per input field. We haven't run a profiler yet, nor have we done any feature ablation (which removes rules that don't provably improve our numbers). We plan to do at least the latter before the end.

Let's keep in close touch as you start integration. Have a great weekend!

† Note that the readouts now say FPR rather than the FP I was pasting in earlier comments. We changed our false-positive math to conform to the more common formula, which is more demanding: false positives / (false positives + true negatives). It now represents the proportion of non-new-password fields spuriously identified as new-password fields. Before, it was the proportion of total fields spuriously identified as new-password fields.
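To make the footnote concrete, here are the two definitions side by side (illustrative only):

```javascript
// Old readout: spurious positives as a share of ALL fields.
function oldFpShare(fp, totalFields) {
  return fp / totalFields;
}

// New readout: the standard false-positive rate -- spurious positives as a
// share of only the fields that are NOT new-password fields. The smaller
// denominator makes this the stricter (larger) number.
function fpr(fp, tn) {
  return fp / (fp + tn);
}
```

With, say, 2 spurious hits among 100 fields of which 90 are not new-password fields, the old readout reports 2% while the FPR reports about 2.2%.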
