Use field labels and attributes to determine new-password field types
Categories
(Toolkit :: Password Manager, enhancement, P1)
People
(Reporter: MattN, Assigned: bdanforth)
References
(Depends on 15 open bugs, Blocks 3 open bugs)
Details
(Whiteboard: [passwords:heuristics] [passwords:generation])
Attachments
(2 files)
We can use field labels e.g. via <label>, name/id or just adjacent text like Form Autofill does to help identify new-password fields. Hopefully we can share this code with Form Autofill.
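As a rough illustration of the kind of signals meant here, consider the sketch below; the regex and helper are hypothetical, not Form Autofill's actual code.

```js
// Hypothetical heuristic: look for new-password hints in the field's
// <label> text and its name/id attributes. The regex is illustrative only.
const NEW_PASSWORD_HINTS = /new.?pass(word)?|create.?pass(word)?|register/i;

function looksLikeNewPasswordField(input) {
  // Text of any <label> elements associated with the input.
  const labelText = input.labels
    ? [...input.labels].map(label => label.textContent).join(" ")
    : "";
  // name/id attributes and label text all count as weak signals.
  const haystack = [input.name, input.id, labelText].join(" ");
  return NEW_PASSWORD_HINTS.test(haystack);
}
```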
Reporter | Comment 1•5 years ago
We will try to use Fathom for this and this will be the main bug for that.
Assignee | Comment 2•5 years ago
As I mentioned in our meeting yesterday, the Fathom team asked us for a target false positive rate (i.e. what percentage of pages where we incorrectly show the password generation autocomplete popup is acceptable). I think the answer to this question may be more complex than a single number, but ultimately it's a product question, so I wanted to get your thoughts. Here are some of mine:
- Currently, our popups have 100% accuracy with 0% false positives or false negatives, but the popup only shows up on a very small subset of positive pages because we gate it on the presence of the autocomplete="new-password" attribute and value (see the sketch at the end of this comment).
- Of course, in an ideal world we would want a 0% false positive rate for all pages, but we can't be 100% certain whether or not to display the popup in cases where autocomplete="new-password" is not present on the field, so we have to decide what our risk tolerance is. In general, the higher the acceptable false positive rate, the more often we can surface the feature to users.
- Some options are:
- We could decide we don't want to risk any false positives on some subset of the most popular sites, and:
  - ...prevent Fathom from running on those sites until we are more certain.
  - ...try to hardcode either Fathom or some other solution to guarantee* that they work.
- We could decide that we want to pick a false positive rate irrespective of how popular the site is.
  - This is the approach the Fathom team is currently pursuing in developing their model.
- So my questions are:
- How tolerant would we/users be if we got it wrong?
- Do we/users care more about getting it right on certain pages (e.g. amazon.com) than we do for the long tail of all pages?
- If so, what sites do we care most about?
- For any of these options, it may be a good idea to see what Chrome's and our other major competitors' accuracy and false positive rates are for the same set of sites. Do you think that this is important to find out?
*: Note that pages can change, so we would want to monitor the feature to ensure it continues to work over time.
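For reference, the attribute gate described in the first bullet amounts to a check along these lines; a minimal sketch, not the actual Password Manager code:

```js
// Sketch of the current gating: only offer generation when the site
// explicitly declares a new-password field via autocomplete tokens.
function shouldOfferGeneratedPassword(field) {
  const tokens = (field.getAttribute("autocomplete") || "")
    .toLowerCase()
    .split(/\s+/);
  return tokens.includes("new-password");
}
```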
Comment 3•5 years ago
Sandy, don't feel you need to come up with a number; that's a whole project in itself. But what would be helpful is a general sense of whether it's better to miss a new-password field or falsely identify one. Right now, Firefox misses about 96% of them (on an unweighted sample of the top 151,000 Trexa domains). So one easy call would be to proceed in that direction and lean toward continuing to miss the questionable ones. Cheers!
Assignee | Comment 4•5 years ago
I wanted to summarize the conclusions we reached in our team meeting yesterday. Sandy, can you confirm that this matches your thinking?
- The assumption is that the Fathom model will continually improve, and we can land updates as needed/desired.
- To that end, MattN and I have asked the Fathom team for a false positive rate of 2-3% for this first version.
- We can land use of the model behind a preference.
- Ideally, we would ensure that we can easily disable it on a per-site basis.
- It is unlikely we can do this in 76, but it is tracked in Bug 1620649.
- As part of QA, measure the false positive rate of the Fathom model for ~50 pages in Firefox and for the same password generation feature in Chrome.
Comment 5•5 years ago
Good news! Our latest model has about 2.5% FP and FN rates, so it looks like we can rest easy about that:
FP: 0.02564 FN: 0.02564
(commit 9bb2beb37bc5d5cb47d79ae31e2b29be69fdacce)
Assignee | Comment 6•5 years ago
(In reply to Erik Rose [:erik][:erikrose] from comment #5)
> Good news! Our latest model has about 2.5% FP and FN rates, so it looks like we can rest easy about that:
> FP: 0.02564 FN: 0.02564
> (commit 9bb2beb37bc5d5cb47d79ae31e2b29be69fdacce)
How this number is arrived at is important, so I'd like to add some context here. Looking at that commit, this is the false positive rate for a validation run for a corpus with 74 training pages and 48 validation pages. Is that right?
Comment 7•5 years ago
I count 75 and 48, but close enough. And these are drawn from a uniform random distribution over the top 151,000 Trexa domains, which are themselves designed to be representative of where Firefox users go.
Assignee | Comment 8•5 years ago
Okay thanks. I don’t think we should be overly confident in that number, in part because of the small number of sample pages, but also because as I understand it, a validation run (as opposed to a test run) is still tuning the model based on the validation set. As MattN said in our parallel email discussion:
> To clarify, our current target is 2-3% on a test set our QA team uses, though they will first collect the data on Chrome, and if Chrome is worse than that we may lower our bar accordingly.
Also, as a quick check, I did a test run (via fathom-test) for the model at 9bb2beb37... against all the sample pages in the unused directory (about 67 pages, 50/50 positive and negative samples). While I was able to reproduce the 2.5% FP rate you reported for the validation run, I got an FP rate of 6.3% for the test run. Maybe I am missing something?
Testing accuracy per tag: 0.89723 95% CI: (0.85982, 0.93465)
FP: 0.06324 95% CI: (0.00000, 0.13139)
FN: 0.03953 95% CI: (0.01279, 0.06626)
Precision: 0.70909 Recall: 0.79592
Comment 9•5 years ago
There are a lot of things here. First, the amount of fitting to a validation set varies based on approach. You'll always be fitting to it a little, as long as you use its results as a criterion to inform any of your decisions at all; in our case, whether to keep working. But, mechanically, we use it only to trigger our overfit protection. It's not like we're using some giant AutoML process that's trying 100 different neural nets of different sizes and shapes and picking the one that performs best on the validation set. That would be more like cherry-picking; it would be fitting much more closely to the validation set.
Second, running fathom-test against the unused samples…well…let's just say it would take some time to figure out how to interpret what you did. The current sets are built out of a few distinct distributions, which we've tried to balance proportionally across the sets: 31 domains Vlad collected initially, off the top of his head; a larger number of Trexa-sampled domains; the negative samples; and samples of various languages. It's possible the unused ones vary in some respect from the combination of distributions our active sets comprise. If you really want to do an early test (we aren't done yet, which is why we've taken care to keep ourselves blind to it), run it against the actual test set and just don't tell me how it goes.
An FP rate of 6.3% isn't totally crazy, though. The 95% CI for it reaches almost to that point:
Validation accuracy per tag: 0.94855 95% CI: (0.92400, 0.97311)
FP: 0.02251 95% CI: (0.00000, 0.05829)
FN: 0.02894 95% CI: (0.00795, 0.04993)
Precision: 0.89062 Recall: 0.86364
Until we run out of time, we're going to focus on tamping down the FPs. I spent yesterday redoing all the reporting in fathom-train so it now shows per-tag results, which makes this much easier. (I sat down to work on accuracy and said "I can't see anything! Let me fix that first.") We can also turn two different knobs to automatically trade FPs for FNs; we haven't played with those yet.
Comment 10•5 years ago
Hi, all! Here is the first draft of the model. I've highlighted the lines you should lift into the product: https://github.com/mozilla-services/fathom-login-forms/blob/5d02dbbe7c7f2f43dd7f3a39e1ee5e44a43b0011/new-password/rulesets.js#L6-L265.
Obviously, we're beating the pants off what the product is doing now, recall-wise. :-) 2-4% → 88%. Can't very well beat 100% precision, but we're cranking it as close as we can!
Here are the current numbers:
Training accuracy per tag: 0.98057 95% CI: (0.96919, 0.99194)
FPR: 0.00233 95% CI: (0.00000, 0.00688)
FNR: 0.07353 95% CI: (0.02966, 0.11740)
Precision: 0.99213 Recall: 0.92647
Validation accuracy per tag: 0.96129 95% CI: (0.93982, 0.98276)
FPR: 0.02459 95% CI: (0.00516, 0.04402)
FNR: 0.09091 95% CI: (0.02155, 0.16027)
Precision: 0.90909 Recall: 0.90909
Testing accuracy per tag: 0.93333 95% CI: (0.90358, 0.96309)
FPR: 0.05000 95% CI: (0.01979, 0.08021)
FNR: 0.11429 95% CI: (0.03975, 0.18882)
Precision: 0.86111 Recall: 0.88571
Validation results are getting decent, at a false-positive rate of 2.5%†. The blind testing set, which I just used for the first time, raises that to 5%, so there's still room to improve. Though, if you wanted to launch today, you could crank the FPR down to 2% by setting the confidence cutoff to 0.8. You'd go from finding 88% of new-password fields to 80%, but that's probably a good tradeoff for this application.
With 0.8 cutoff:
Testing accuracy per tag: 0.93333 95% CI: (0.90358, 0.96309)
FPR: 0.02000 95% CI: (0.00060, 0.03940)
FNR: 0.20000 95% CI: (0.10629, 0.29371)
Precision: 0.93333 Recall: 0.80000
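On the consumer side, applying that cutoff is just a comparison against the score Fathom assigns each candidate. A minimal sketch, assuming the trained ruleset from rulesets.js is in scope as `rules`:

```js
// Sketch only: keep Fathom candidates that clear the confidence cutoff.
const CONFIDENCE_CUTOFF = 0.8; // trades recall (88% -> 80%) for FPR (5% -> 2%)

function confidentNewPasswordFields(doc) {
  const fnodes = rules.against(doc).get("new-password");
  return fnodes
    .filter(fnode => fnode.scoreFor("new-password") >= CONFIDENCE_CUTOFF)
    .map(fnode => fnode.element);
}
```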
There's also another knob, called positive-weight, which should do something similar with slightly different tradeoffs. We haven't played with that yet.
Our next big step is to double the size of our corpus. We have the samples ready and will land them on Monday. They concentrate on the 100 domains Firefox users visit most often. Our biggest problem right now is success: we've fit the training corpus as closely as we reasonably can—upward of 98%—so there's not much more to examine to inspire new sources of signal. An infusion of samples should knock that percentage back down and give us grist for the feature mill. And more validation samples will dilute any unrepresentativeness between the training and validation sets, though the two have tracked very closely so far.
We'll keep hammering on this until April, and you can feel free to draw from our improvements or not, as your comfort allows. Otherwise, you can grab them in a later release.
One last thing: we've been tracking performance numbers as we go. Here are the current ones, from a test-corpus run:
Time per page (ms): 10 ▇█▅▁ 61 Average per tag: 4.3
The first part is a histogram showing how long Fathom would take to consider every text-input on a page. This doesn't really apply to you, since you're firing on click. The other number is the interesting one: 4.3ms per input field. We haven't run a profiler yet, nor have we done any feature ablation (which removes rules that don't provably improve our numbers). We plan to do at least the latter before the end.
Let's keep in close touch as you start integration. Have a great weekend!
† Note that the readouts now say FPR rather than the FP I was pasting in earlier comments. We changed our false-positive math to conform to the more common formula, which is more demanding: (false positives / (false positives + true negatives)). Before, it was the proportion of total fields spuriously identified as new-password fields.
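To make the difference between the two readouts concrete, here is a small worked example with hypothetical counts:

```js
// Hypothetical counts, purely to illustrate the two readouts above.
const fp = 2;      // fields wrongly flagged as new-password
const tn = 78;     // negative fields correctly left alone
const total = 100; // all fields scored

const oldReadout = fp / total;  // 0.02: FPs as a share of all fields
const fpr = fp / (fp + tn);     // 0.025: the stricter, standard FPR
```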
Assignee | Comment 11•5 years ago
WIP - A first pass to get everything wired up.
- Add a new module, NewPasswordModel.jsm, which imports the Fathom library from Bug 1618956, and copies over the Fathom model at 5d02dbbe7c... into it.
- There are a bunch of inline TODOs here, as this is a straight copy, aside from fixing eslint errors.
- This module will likely need its own unit tests or even full DOM (i.e. mochitest-plain) tests.
- The module is imported into LoginManagerChild.jsm, which runs the model against the page and returns true if the candidate input element the model found matches the input element under consideration and has a confidence greater than or equal to FATHOM_CONFIDENCE_THRESHOLD (a rough sketch of this wiring follows below).
- This approach will need to be modified to work for new password confirmation fields, as in that case more than one new password field would be on the page.
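In rough outline, that wiring might look like the sketch below. Only the module name, LoginManagerChild.jsm, and FATHOM_CONFIDENCE_THRESHOLD come from the description above; the resource URL, threshold value, and function shape are illustrative.

```js
// Sketch of the integration described above, not the actual patch.
const { NewPasswordModel } = ChromeUtils.import(
  "resource://gre/modules/NewPasswordModel.jsm"
);

const FATHOM_CONFIDENCE_THRESHOLD = 0.75; // illustrative value

function isLikelyNewPasswordField(inputElement) {
  const fnodes = NewPasswordModel.rules
    .against(inputElement.ownerDocument)
    .get("new-password");
  // True only when the model's candidate is this very element and it
  // clears the confidence bar.
  return fnodes.some(
    fnode =>
      fnode.element === inputElement &&
      fnode.scoreFor("new-password") >= FATHOM_CONFIDENCE_THRESHOLD
  );
}
```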
Comment 12•5 years ago
We've upped recall from 80% to 90% while keeping a 2% false-positive rate, this time with the confidence threshold at 0.5 (so remember to update that bit).
Testing accuracy per tag: 0.95955 95% CI: (0.94125, 0.97786)
FPR: 0.02090 95% CI: (0.00558, 0.03621)
FNR: 0.10000 95% CI: (0.04394, 0.15606)
Precision: 0.93396 Recall: 0.90000
F1 Score: 0.91667
Time per page (ms): 5 ▁█▇▃▂▁ 54 Average per tag: 4.7
Just for kicks, if you stuck with the 0.8 threshold, you'd get…
Testing accuracy per tag: 0.94382 95% CI: (0.92243, 0.96522)
FPR: 0.00896 95% CI: (0.00000, 0.01904)
FNR: 0.20000 95% CI: (0.12525, 0.27475)
Precision: 0.96703 Recall: 0.80000
F1 Score: 0.87562
So, back down to the old recall but with a really small FPR. I recommend 0.5 given those figures.
We have a couple more tricks up our sleeves we'll land early next week, which you're welcome to take or leave as your QA plan permits. Happy weekend!
Assignee | Comment 13•5 years ago
WIP
Still need to fix existing broken tests.
Assignee | Comment 14•5 years ago
Reporter | Comment 15•5 years ago
The Windows 10 x64 debug mochitest failure (test-windows10-64/debug-mochitest-plain-e10s-1, M(1)) seems to be intermittent, as it passed the 2nd time. I've re-triggered it to see how frequent it is.
Talos results | Raptor results
As far as I can tell, there are no performance regressions (though my confidence in the results is low), so I think this can land when you're ready, assuming the M1 re-triggers don't fail on that same test.
Comment 16•5 years ago
Comment 17•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ca9302a3bf52
https://hg.mozilla.org/mozilla-central/rev/c72482e89c47
Comment 18•5 years ago
Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.
Comment 19•5 years ago
Here's one more model update, in time for a few days of real-world telemetry. The main attraction is the addition of change-password forms: samples of them and rules to sniff them out. Less syntactically impressive but perhaps more foundational is the switch from evaluating any old text field to assuming the app passes in only hasEverBeenPasswordType ones (see the sketch after the list below). That changes the denominator, so numbers like FPR aren't directly comparable to what we had before. We also roll in several other improvements:
- More signal words from more languages
- Better detection of forgot-password controls
- Detection of "remember me" and newsletter signup controls
- Finding and reading the closest header or header-like element
- Speed optimizations, care of bdanforth. Still working on porting some of those atop the latest changes.
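A tiny sketch of that new calling contract; the property name comes from this comment, while the selector and filtering are illustrative:

```js
// Illustrative only: under the new contract, the app pre-filters to
// fields that have ever been type=password before scoring them, which
// changes the denominator behind numbers like FPR.
const candidates = [...document.querySelectorAll("input")].filter(
  input => input.hasEverBeenPasswordType // privileged flag named above
);
```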
Precision is up 4% and recall 10%:
Testing accuracy per tag: 0.98137 95% CI: (0.96048, 1.00000)
FPR: 0.06818 95% CI: (0.00000, 0.14266)
FNR: 0.00000 95% CI: (0.00000, 0.00000)
Precision: 0.97500 Recall: 1.00000
F1 Score: 0.98734
Time per page (ms): 2 |▁▂▅▆█▂ | 29 Average per tag: 9.3
I was actually surprised to see those numbers, because our validation numbers fell the other way, with a higher precision than recall—which is what we generally have been shooting for:
fathom-train vectors/training.json -a vectors/validation.json -s -i 10000 -l .001 -q -p .25
Training accuracy per tag: 0.98295 95% CI: (0.96943, 0.99648)
FPR: 0.00000 95% CI: (0.00000, 0.00000)
FNR: 0.02597 95% CI: (0.00546, 0.04649)
Precision: 1.00000 Recall: 0.97403
F1 Score: 0.98684
Validation accuracy per tag: 0.94578 95% CI: (0.91134, 0.98023)
FPR: 0.01961 95% CI: (0.00000, 0.05766)
FNR: 0.06957 95% CI: (0.02307, 0.11606)
Precision: 0.99074 Recall: 0.93043
F1 Score: 0.95964
Time per page (ms): 2 | ▂█▄▂ | 37 Average per tag: 8.7
But I'll not complain about a 100% recall measurement, even though it's a cheeky thing to claim in this big, bad universe. I expect the switch is because one sample can make a big difference when you're down to a few mistakes in the corpus: we got only 3 elements wrong in the entirety of the test set. I'm excited to see what the real world has in store!
Reporter | Comment 20•5 years ago
Release Note Request
[Why is this notable]: Password generation is currently only supported on sites that adopted autocomplete="new-password" on their password fields, but now it will "just work" on "many more" websites. We estimate <10% of sites used that attribute, but the list of ones that do is somewhat biased towards more popular ones. We are hoping for at least a doubling of how many times password generation is offered.
[Affects Firefox for Android]: Not yet (maybe in a few months they will get password generation)
[Suggested wording]: Password generation and saving work on many more websites
- The "saving" piece refers to bug 1536728 which doesn't need its own relnote.
[Links (documentation, blog post, etc)]: None yet
Assignee | Comment 21•5 years ago
It may be helpful to update the support page on password generation to reflect this change. See comment #20 for a good description of the change.
My suggestion would be to edit Step 2:
- 2. Hold down the control key while you click on the password field to open the Context Menu.
+ 2. If there is no option presented automatically to generate a password as shown in Step 4, hold down the control key while you click on the password field to open the Context Menu.
Comment 22•5 years ago
Marking this as Verified-fixed as it was part of Fx 76 feature testing, which included verifying this on 60+ sites.