Use field labels and attributes to determine new-password field types
Categories
(Toolkit :: Password Manager, enhancement, P1)
People
(Reporter: MattN, Assigned: bdanforth)
References
(Depends on 15 open bugs, Blocks 3 open bugs)
Details
(Whiteboard: [passwords:heuristics] [passwords:generation])
Attachments
(2 files)
We can use field labels e.g. via <label>, name/id or just adjacent text like Form Autofill does to help identify new-password fields. Hopefully we can share this code with Form Autofill.
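As a rough illustration of the kind of signals meant here, consider the sketch below; the regex and helper are hypothetical, not Form Autofill's actual code.

```js
// Hypothetical heuristic: look for new-password hints in the field's
// <label> text and its name/id attributes. The regex is illustrative only.
const NEW_PASSWORD_HINTS = /new.?pass(word)?|create.?pass(word)?|register/i;

function looksLikeNewPasswordField(input) {
  // Text of any <label> elements associated with the input.
  const labelText = input.labels
    ? [...input.labels].map(label => label.textContent).join(" ")
    : "";
  // name/id attributes and label text all count as weak signals.
  const haystack = [input.name, input.id, labelText].join(" ");
  return NEW_PASSWORD_HINTS.test(haystack);
}
```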
Reporter | Comment 1•5 years ago
We will try to use Fathom for this and this will be the main bug for that.
Assignee | Comment 2•5 years ago
As I mentioned in our meeting yesterday, the Fathom team asked us for a target false positive rate (i.e. what percentage of pages where we incorrectly show the password generation autocomplete popup is acceptable). I think the answer to this question may be more complex than a single number, but ultimately it's a product question, so I wanted to get your thoughts. Here are some of mine:
- Currently, our popups have 100% accuracy with 0% false positives or false negatives, but the popup only shows up on a very small subset of positive pages because we gate it on the presence of the autocomplete="new-password" attribute and value (see the sketch at the end of this comment).
- Of course, in an ideal world we would want a 0% false positive rate for all pages, but we can't be 100% certain whether or not to display the popup in cases where autocomplete="new-password" is not present on the field, so we have to decide what our risk tolerance is. In general, the higher the acceptable false positive rate, the more often we can surface the feature to users.
- Some options are:
- We could decide we don't want to risk any false positives on some subset of the most popular sites, and:
  - ...prevent Fathom from running on those sites until we are more certain.
  - ...try to hardcode either Fathom or some other solution to guarantee* that they work.
- We could decide that we want to pick a false positive rate irrespective of how popular the site is.
  - This is the approach the Fathom team is currently pursuing in developing their model.
- So my questions are:
- How tolerant would we/users be if we got it wrong?
- Do we/users care more about getting it right on certain pages (e.g. amazon.com) than we do for the long tail of all pages?
- If so, what sites do we care most about?
- For any of these options, it may be a good idea to see what Chrome's and our other major competitors' accuracy and false positive rates are for the same set of sites. Do you think that this is important to find out?
*: Note that pages can change, so we would want to monitor the feature to ensure it continues to work over time.
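For reference, the attribute gate described in the first bullet amounts to a check along these lines; a minimal sketch, not the actual Password Manager code:

```js
// Sketch of the current gating: only offer generation when the site
// explicitly declares a new-password field via autocomplete tokens.
function shouldOfferGeneratedPassword(field) {
  const tokens = (field.getAttribute("autocomplete") || "")
    .toLowerCase()
    .split(/\s+/);
  return tokens.includes("new-password");
}
```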
Comment 3•5 years ago
Sandy, don't feel you need to come up with a number; that's a whole project in itself. But what would be helpful is a general sense of whether it's better to miss a new-password field or falsely identify one. Right now, Firefox misses about 96% of them (on an unweighted sample of the top 151,000 Trexa domains). So one easy call would be to proceed in that direction and lean toward continuing to miss the questionable ones. Cheers!
Assignee | Comment 4•5 years ago
I wanted to summarize the conclusions we reached in our team meeting yesterday. Sandy, can you confirm that this matches your thinking?
- The assumption is that the Fathom model will continually improve, and we can land updates as needed/desired.
- To that end, MattN and I have asked the Fathom team for a false positive rate of 2-3% for this first version.
- We can land use of the model behind a preference.
- Ideally, we would ensure that we can easily disable it on a per-site basis.
- It is unlikely we can do this in 76, but it is tracked in Bug 1620649.
- As part of QA, measure the false positive rate of the Fathom model for ~50 pages in Firefox and for the same password generation feature in Chrome.
Comment 5•5 years ago
Good news! Our latest model has about 2.5% FP and FN rates, so it looks like we can rest easy about that:
FP: 0.02564 FN: 0.02564
(commit 9bb2beb37bc5d5cb47d79ae31e2b29be69fdacce)
Assignee | Comment 6•5 years ago
(In reply to Erik Rose [:erik][:erikrose] from comment #5)
> Good news! Our latest model has about 2.5% FP and FN rates, so it looks like we can rest easy about that:
> FP: 0.02564 FN: 0.02564
> (commit 9bb2beb37bc5d5cb47d79ae31e2b29be69fdacce)
How this number is arrived at is important, so I'd like to add some context here. Looking at that commit, this is the false positive rate for a validation run for a corpus with 74 training pages and 48 validation pages. Is that right?
Comment 7•5 years ago
I count 75 and 48, but close enough. And these are drawn from a uniform random distribution over the top 151,000 Trexa domains, which are themselves designed to be representative of where Firefox users go.
Assignee | Comment 8•5 years ago
Okay thanks. I don’t think we should be overly confident in that number, in part because of the small number of sample pages, but also because as I understand it, a validation run (as opposed to a test run) is still tuning the model based on the validation set. As MattN said in our parallel email discussion:
> To clarify, our current target is 2-3% on a test set our QA team uses, though they will first collect the data on Chrome, and if Chrome is worse than that we may lower our bar accordingly.
Also, as a quick check, I did a test run (via fathom-test) for the model at 9bb2beb37... against all the sample pages in the unused directory (about 67 pages, 50/50 positive and negative samples). While I was able to reproduce the 2.5% FP rate you reported for the validation run, I got an FP rate of 6.3% for the test run. Maybe I am missing something?
Testing accuracy per tag: 0.89723 95% CI: (0.85982, 0.93465)
FP: 0.06324 95% CI: (0.00000, 0.13139)
FN: 0.03953 95% CI: (0.01279, 0.06626)
Precision: 0.70909 Recall: 0.79592
Comment 9•5 years ago
There are a lot of things here. First, the amount of fitting to a validation set varies based on approach. You'll always be fitting to it a little, as long as you use its results as a criterion to inform any of your decisions at all; in our case, whether to keep working. But, mechanically, we use it only to trigger our overfit protection. It's not like we're using some giant AutoML process that's trying 100 different neural nets of different sizes and shapes and picking the one that performs best on the validation set. That would be more like cherry-picking; it would be fitting much more closely to the validation set.
Second, running fathom-test against the unused samples…well…let's just say it would take some time to figure out how to interpret what you did. The current sets are built out of a few distinct distributions, which we've tried to balance proportionally across the sets: 31 domains Vlad collected initially, off the top of his head; a larger number of Trexa-sampled domains; the negative samples; and samples of various languages. It's possible the unused ones vary in some respect from the combination of distributions our active sets comprise. If you really want to do an early test (we aren't done yet, which is why we've taken care to keep ourselves blind to it), run it against the actual test set and just don't tell me how it goes.
An FP rate of 6.3% isn't totally crazy, though. The 95% CI for it reaches almost to that point:
Validation accuracy per tag: 0.94855 95% CI: (0.92400, 0.97311)
FP: 0.02251 95% CI: (0.00000, 0.05829)
FN: 0.02894 95% CI: (0.00795, 0.04993)
Precision: 0.89062 Recall: 0.86364
Until we run out of time, we're going to focus on tamping down the FPs. I spent yesterday redoing all the reporting in fathom-train so it now shows per-tag results, which makes this much easier. (I sat down to work on accuracy and said "I can't see anything! Let me fix that first.") We can also turn two different knobs to automatically trade FPs for FNs; we haven't played with those yet.
Comment 10•5 years ago
Hi, all! Here is the first draft of the model. I've highlighted the lines you should lift into the product: https://github.com/mozilla-services/fathom-login-forms/blob/5d02dbbe7c7f2f43dd7f3a39e1ee5e44a43b0011/new-password/rulesets.js#L6-L265.
Obviously, we're beating the pants off what the product is doing now, recall-wise. :-) 2-4% → 88%. Can't very well beat 100% precision, but we're cranking it as close as we can!
Here are the current numbers:
Training accuracy per tag: 0.98057 95% CI: (0.96919, 0.99194)
FPR: 0.00233 95% CI: (0.00000, 0.00688)
FNR: 0.07353 95% CI: (0.02966, 0.11740)
Precision: 0.99213 Recall: 0.92647
Validation accuracy per tag: 0.96129 95% CI: (0.93982, 0.98276)
FPR: 0.02459 95% CI: (0.00516, 0.04402)
FNR: 0.09091 95% CI: (0.02155, 0.16027)
Precision: 0.90909 Recall: 0.90909
Testing accuracy per tag: 0.93333 95% CI: (0.90358, 0.96309)
FPR: 0.05000 95% CI: (0.01979, 0.08021)
FNR: 0.11429 95% CI: (0.03975, 0.18882)
Precision: 0.86111 Recall: 0.88571
Validation results are getting decent, at a false-positive rate of 2.5%†. The blind testing set, which I just used for the first time, raises that to 5%, so there's still room to improve. Though, if you wanted to launch today, you could crank the FPR down to 2% by setting the confidence cutoff to 0.8. You'd go from finding 88% of new-password fields to 80%, but that's probably a good tradeoff for this application.
With 0.8 cutoff:
Testing accuracy per tag: 0.93333 95% CI: (0.90358, 0.96309)
FPR: 0.02000 95% CI: (0.00060, 0.03940)
FNR: 0.20000 95% CI: (0.10629, 0.29371)
Precision: 0.93333 Recall: 0.80000
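On the consumer side, applying that cutoff is just a comparison against the score Fathom assigns each candidate. A minimal sketch, assuming the trained ruleset from rulesets.js is in scope as `rules`:

```js
// Sketch only: keep Fathom candidates that clear the confidence cutoff.
const CONFIDENCE_CUTOFF = 0.8; // trades recall (88% -> 80%) for FPR (5% -> 2%)

function confidentNewPasswordFields(doc) {
  const fnodes = rules.against(doc).get("new-password");
  return fnodes
    .filter(fnode => fnode.scoreFor("new-password") >= CONFIDENCE_CUTOFF)
    .map(fnode => fnode.element);
}
```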
There's also another knob, called positive-weight, which should do something similar with slightly different tradeoffs. We haven't played with that yet.
Our next big step is to double the size of our corpus. We have the samples ready and will land them on Monday. They concentrate on the 100 domains Firefox users visit most often. Our biggest problem right now is success: we've fit the training corpus as closely as we reasonably can—upward of 98%—so there's not much more to examine to inspire new sources of signal. An infusion of samples should knock that percentage back down and give us grist for the feature mill. And more validation samples will dilute any unrepresentativeness between the training and validation sets, though the two have tracked very closely so far.
We'll keep hammering on this until April, and you can feel free to draw from our improvements or not, as your comfort allows. Otherwise, you can grab them in a later release.
One last thing: we've been tracking performance numbers as we go. Here are the current ones, from a test-corpus run:
Time per page (ms): 10 ▇█▅▁ 61 Average per tag: 4.3
The first part is a histogram showing how long Fathom would take to consider every text-input on a page. This doesn't really apply to you, since you're firing on click. The other number is the interesting one: 4.3ms per input field. We haven't run a profiler yet, nor have we done any feature ablation (which removes rules that don't provably improve our numbers). We plan to do at least the latter before the end.
Let's keep in close touch as you start integration. Have a great weekend!
† Note that the readouts now say FPR rather than the FP I was pasting in earlier comments. We changed our false-positive math to conform to the more common formula, which is more demanding: (false positives / (false positives + true negatives)). Before, it was the proportion of total fields spuriously identified as new-password fields.
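To make the difference between the two readouts concrete, here is a small worked example with hypothetical counts:

```js
// Hypothetical counts, purely to illustrate the two readouts above.
const fp = 2;      // fields wrongly flagged as new-password
const tn = 78;     // negative fields correctly left alone
const total = 100; // all fields scored

const oldReadout = fp / total;  // 0.02: FPs as a share of all fields
const fpr = fp / (fp + tn);     // 0.025: the stricter, standard FPR
```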
Assignee | Comment 11•5 years ago
WIP - A first pass to get everything wired up.
- Add a new module, NewPasswordModel.jsm, which imports the Fathom library from Bug 1618956, and copies over the Fathom model at 5d02dbbe7c... into it.
- There are a bunch of inline TODOs here, as this is a straight copy, aside from fixing eslint errors.
- This module will likely need its own unit tests or even full DOM (i.e. mochitest-plain) tests.
- The module is imported into LoginManagerChild.jsm, which runs the model against the page and returns true if the candidate input element the model found matches the input element under consideration and has a confidence greater than or equal to FATHOM_CONFIDENCE_THRESHOLD (a rough sketch of this wiring follows below).
- This approach will need to be modified to work for new password confirmation fields, as in that case more than one new password field would be on the page.
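In rough outline, that wiring might look like the sketch below. Only the module name, LoginManagerChild.jsm, and FATHOM_CONFIDENCE_THRESHOLD come from the description above; the resource URL, threshold value, and function shape are illustrative.

```js
// Sketch of the integration described above, not the actual patch.
const { NewPasswordModel } = ChromeUtils.import(
  "resource://gre/modules/NewPasswordModel.jsm"
);

const FATHOM_CONFIDENCE_THRESHOLD = 0.75; // illustrative value

function isLikelyNewPasswordField(inputElement) {
  const fnodes = NewPasswordModel.rules
    .against(inputElement.ownerDocument)
    .get("new-password");
  // True only when the model's candidate is this very element and it
  // clears the confidence bar.
  return fnodes.some(
    fnode =>
      fnode.element === inputElement &&
      fnode.scoreFor("new-password") >= FATHOM_CONFIDENCE_THRESHOLD
  );
}
```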
Comment 12•5 years ago
We've upped recall from 80% to 90% while keeping a 2% false-positive rate, this time with the confidence threshold at 0.5 (so remember to update that bit).
Testing accuracy per tag: 0.95955 95% CI: (0.94125, 0.97786)
FPR: 0.02090 95% CI: (0.00558, 0.03621)
FNR: 0.10000 95% CI: (0.04394, 0.15606)
Precision: 0.93396 Recall: 0.90000
F1 Score: 0.91667
Time per page (ms): 5 ▁█▇▃▂▁ 54 Average per tag: 4.7
Just for kicks, if you stuck with the 0.8 threshold, you'd get…
Testing accuracy per tag: 0.94382 95% CI: (0.92243, 0.96522)
FPR: 0.00896 95% CI: (0.00000, 0.01904)
FNR: 0.20000 95% CI: (0.12525, 0.27475)
Precision: 0.96703 Recall: 0.80000
F1 Score: 0.87562
So, back down to the old recall but with a really small FPR. I recommend 0.5 given those figures.
We have a couple more tricks up our sleeves we'll land early next week, which you're welcome to take or leave as your QA plan permits. Happy weekend!
Assignee | Comment 13•5 years ago
WIP
Still need to fix existing broken tests.
Assignee | Comment 14•5 years ago
Reporter | Comment 15•5 years ago
The Windows 10 x64 debug mochitest failure (test-windows10-64/debug-mochitest-plain-e10s-1, M(1)) seems to be intermittent, as it passed the 2nd time. I've re-triggered it to see how frequent it is.
Talos results | Raptor results
As far as I can tell, there are no performance regressions (though my confidence in the results is low), so I think this can land when you're ready, assuming the M1 re-triggers don't fail on that same test.
Comment 16•5 years ago
Comment 17•5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ca9302a3bf52
https://hg.mozilla.org/mozilla-central/rev/c72482e89c47
Comment 18•5 years ago
Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.
Comment 19•5 years ago
Here's one more model update, in time for a few days of real-world telemetry. The main attraction is the addition of change-password forms: samples of them and rules to sniff them out. Less syntactically impressive but perhaps more foundational is the switch from evaluating any old text field to assuming the app passes in only hasEverBeenPasswordType ones (see the sketch after the list below). That changes the denominator, so numbers like FPR aren't directly comparable to what we had before. We also roll in several other improvements:
- More signal words from more languages
- Better detection of forgot-password controls
- Detection of "remember me" and newsletter signup controls
- Finding and reading the closest header or header-like element
- Speed optimizations, care of bdanforth. Still working on porting some of those atop the latest changes.
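A tiny sketch of that new calling contract; the property name comes from this comment, while the selector and filtering are illustrative:

```js
// Illustrative only: under the new contract, the app pre-filters to
// fields that have ever been type=password before scoring them, which
// changes the denominator behind numbers like FPR.
const candidates = [...document.querySelectorAll("input")].filter(
  input => input.hasEverBeenPasswordType // privileged flag named above
);
```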
Precision is up 4% and recall 10%:
Testing accuracy per tag: 0.98137 95% CI: (0.96048, 1.00000)
FPR: 0.06818 95% CI: (0.00000, 0.14266)
FNR: 0.00000 95% CI: (0.00000, 0.00000)
Precision: 0.97500 Recall: 1.00000
F1 Score: 0.98734
Time per page (ms): 2 |▁▂▅▆█▂ | 29 Average per tag: 9.3
I was actually surprised to see those numbers, because our validation numbers fell the other way, with a higher precision than recall—which is what we generally have been shooting for:
fathom-train vectors/training.json -a vectors/validation.json -s -i 10000 -l .001 -q -p .25
Training accuracy per tag: 0.98295 95% CI: (0.96943, 0.99648)
FPR: 0.00000 95% CI: (0.00000, 0.00000)
FNR: 0.02597 95% CI: (0.00546, 0.04649)
Precision: 1.00000 Recall: 0.97403
F1 Score: 0.98684
Validation accuracy per tag: 0.94578 95% CI: (0.91134, 0.98023)
FPR: 0.01961 95% CI: (0.00000, 0.05766)
FNR: 0.06957 95% CI: (0.02307, 0.11606)
Precision: 0.99074 Recall: 0.93043
F1 Score: 0.95964
Time per page (ms): 2 | ▂█▄▂ | 37 Average per tag: 8.7
But I'll not complain about a 100% recall measurement, even though it's a cheeky thing to claim in this big, bad universe. I expect the switch is because one sample can make a big difference when you're down to a few mistakes in the corpus: we got only 3 elements wrong in the entirety of the test set. I'm excited to see what the real world has in store!
Reporter | Comment 20•5 years ago
Release Note Request
[Why is this notable]: Password generation is currently only supported on sites that adopted autocomplete="new-password" on their password fields, but now it will "just work" on "many more" websites. We estimate <10% of sites used that attribute, but the list of ones that do is somewhat biased towards more popular ones. We are hoping for at least a doubling of how many times password generation is offered.
[Affects Firefox for Android]: Not yet (maybe in a few months they will get password generation)
[Suggested wording]: Password generation and saving work on many more websites
- The "saving" piece refers to bug 1536728 which doesn't need its own relnote.
[Links (documentation, blog post, etc)]: None yet
Assignee | Comment 21•5 years ago
It may be helpful to update the support page on password generation to reflect this change. See comment #20 for a good description of the change.
My suggestion would be to edit Step 2:
- 2. Hold down the control key while you click on the password field to open the Context Menu.
+ 2. If there is no option presented automatically to generate a password as shown in Step 4, hold down the control key while you click on the password field to open the Context Menu.
Comment 22•5 years ago
Marking this as Verified-fixed as it was part of Fx 76 feature testing, which included verifying this on 60+ sites.