Closed Bug 1726022 Opened 1 month ago Closed 1 month ago

(August 2021 regression?) Firefox password manager is slow to load (22s for 361 passwords on an i7), using 100% CPU and causing laptop fans to spin up

Categories

(NSS :: Libraries, defect, P1)

3.69

Tracking

(firefox-esr78 unaffected, firefox-esr91 unaffected, firefox91 unaffected, firefox92 unaffected, firefox93+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- unaffected
firefox91 --- unaffected
firefox92 --- unaffected
firefox93 + fixed

People

(Reporter: ronan, Assigned: rrelyea)

Details

(Whiteboard: [nss-fx])

Attachments

(2 files)

Steps to reproduce

  1. Open about:logins

Actual results

  • Load takes 22s...
  • ... during which the main firefox-bin process keeps 1 CPU core (out of 4) 100% busy, making the laptop fans spin :-/

Symptoms are very reminiscent of Bug 1606992 from 2 years ago.

Expected results

Opening the password manager is usually instant and doesn't make the fans spin.

This was the case a few days / weeks ago. I'm not sure in which Nightly this started happening, probably somewhere early August.

Debug information

  • Firefox 93.0a1 - 20210816094534 (official binary build from mozilla.org)
  • Arch Linux 64bit
  • 361 passwords
  • Using a master password, whose config didn't change recently (master password wasn't recently deleted/recreated)
  • Passwords are not synced with Firefox Sync, they're local only
  • Feel free to ask for more info

Hi Ronan, thanks for filing this and for the detailed report! I'm able to reproduce this slowdown on both Windows and Linux with around 500 logins on a clean profile.

In order to script 500 logins into a clean instance via browser toolbox:

const { LoginHelper } = ChromeUtils.import(
  "resource://gre/modules/LoginHelper.jsm"
);

var logins = [];
for (let i = 0; i < 500; i++) {
  logins.push({origin: `https://${i}.example.com`, username: `user${i}`, password: `password${i}`});
}
await LoginHelper.maybeImportLogins(logins);

According to this profile I took showing the problem (in StreamTrans #31 just in case I did the permalink incorrectly), it looks like we're spending a lot of time in pk11Decrypt and the various functions it seems to be calling. I'll see if I can find a regression range since I'm not sure what would cause this change.

Status: UNCONFIRMED → NEW
Ever confirmed: true

Going through mozregression, ended up with this pushlog which shows Bug 1724869 might have something to do with this.

:beurdouche, can you provide some insight/redirection into what the issue might be? Thanks!

Flags: needinfo?(bbeurdouche)

From a quick look, it looks like https://phabricator.services.mozilla.com/D120011#change-pmKEWXJhdZTs might be the cause of this, however the change looks legitimate and most of the time is spent in the crypto, which I don't think we can that compress much.

:rrelyea, :mt
Does that look the plausible cause and do you think we might make that check optional or something?

Flags: needinfo?(rrelyea)
Flags: needinfo?(mt)
Flags: needinfo?(bbeurdouche)
Flags: needinfo?(bbeurdouche)

This is a pretty major regression for password manager. Moving component for triage.

Assignee: nobody → nobody
Component: about:logins → Libraries
Product: Firefox → NSS
Version: Firefox 93 → other

Sigh, I don't see how we can make the check optional, but it may be turning up a bug in the fix for Bug 1606992. If it's not caching the right stuff for pba verification (instead of pbe encryption), that could be causing the issue.

Flags: needinfo?(rrelyea)

OK, it looks like the issue is the cache keeps exactly one pbe item for PKCS5v2. We should use 2 items for each operation (keypbe and keypba), so we are trashing our cache each time and loose the consistency of the cache.

We can fix this in the following ways: 1) keep a 2 item first in first out cache for PKCS5v2. or 2) keep a separate pbe and pba cache for PKCS5v2. The relevant code is in lib/softoken/lowpbe.c

bob

Looks like at the level we are at, it's hard to tell if the the requested value is a PBE or PBA, so I just implemented a 2 value cache (generically so we can increase the cache count in the future if we need to. I've testing to make sure I haven't broken anything obvious, then I'll attach a patch for some one else to test to see if it fixes the underlying issue. We'll need an actual regression test case to add to our test suites to make sure we don't regress again in 2-5 years;).

bob

Sounds like Bob reached the same answer I did. A two item cache makes sense.

Flags: needinfo?(mt)

Possible solution based on increasing the number of cache entries used by the PKCS5v2 values.

This patch is tested for correctness, but not fixing the issue. New test cases are needed.

Attachment #9236748 - Attachment is obsolete: true
Attachment #9236748 - Attachment is obsolete: false

[Tracking Requested - why for this release]:

Setting tracking because this is a pretty bad regression from a perceived performance standpoint. For ~400 passwords in my database, it takes something like 10 seconds for the logins to load - and in the meantime, about:logins shows an empty list, which is pretty jarring. I can imagine some users feeling like all of their passwords were suddenly missing if they when to about:logins and didn't wait the full amount of time necessary to complete the load.

(In reply to Mike Conley (:mconley) (:⚙️) from comment #10)

I can imagine some users feeling like all of their passwords were suddenly missing if they when to about:logins and didn't wait the full amount of time necessary to complete the load.

Reporter here. I confirm this impression, this is what I thought initially. Then, hearing my CPU fans go nuts and being a tech literate Nightly user (and having reported an identical bug two years ago 😄), I understood what was going on, but I can certainly imagine less-technical Beta users freaking out thinking all their passwords were lost.

I've checked in my potential fix, bug I'm having a problem reproducing this with just NSS. pwdecrypt is able to decrypt 300 entries in under a second, but I'm not sure that they are getting a good iteration count.

Here's a profile I took today on my Nightly (built on 2021-08-23): https://share.firefox.dev/3sI1HF6

I don't know NSS very well, but it looks to me like the majority of time is being spent in sftkdb_DecryptAttribute, split pretty evenly between sftkdb_VerifyAttribute and nsspkcs5_CipherData.

Does that help narrow it down? I'm also willing to locally build and test diagnostic patches / speculative fixes.

Flags: needinfo?(rrelyea)

I already have a theory on why this happens, and I've checked in a potential fix into the NSS tree already. What I don't have is an nss only test case reproducer so I can verify that it does indeed fix the issue, and so we can make sure that we don't regress again.

Here's the patch I've checked in: https://hg.mozilla.org/projects/nss/rev/fe82761e35aad3a2c1a0a6459b82993a07ac5cc3

Flags: needinfo?(rrelyea)

I will do an uplift of NSS to mozilla-central tomorrow morning to see if it solves the problem.

OK, I'm able to verify my patch. The try builds are running now. Here is the test path:
https://hg.mozilla.org/projects/nss-try/rev/2bc601b2f13deea818e3c446b4f44245eae77739

My issue was 2 fold. 1) I needed to run the tests on the NSS-3.69-BRANCH (my test tree was too old), and
2) The sdr tests needed to run with password protected databases (null pw data bases have a '1' iteration count).

These test verify the already checked in patch, and fail on the NSS-3.69 branch (subsecond runs on the latest branch, 20 second rund on NSS 3.69).
Once the try completes (I want to make sure there aren't any failures on slower platforms). I'll push them to phabricator.

Looking at the test, it seems plausible. I would use date +%s to measure time though and look for a difference of <2. The time utility/builtin doesn't produce a consistent output format in my experience.

This change is on its way to mozilla-central as I landed https://phabricator.services.mozilla.com/D123442.

Assignee: nobody → rrelyea
Severity: -- → S4
Flags: needinfo?(bbeurdouche)
Priority: -- → P1
Whiteboard: [nss-fx]
Version: other → 3.69
Flags: needinfo?(bbeurdouche)

Updated test cases to verify pbe caching fix.
NOTE: putting passwords on databases are key to reproducing the original issue.

I've pushed the test patch to phabricator. It passes on all our platforms now (after dealing with different mac and windows issues). No platform takes more then .45 seconds, and most platforms run in about .2 seconds, so testing for more than zero seems reasonable (I thought maybe I might have to handle a case where it naturally takes 1 second). All three seem to handle time fine. There are special format options for time which I didn't use because I didn't trust them to be universal.

Status: NEW → ASSIGNED

This appears to be fixed for me in the latest Nightly! Ronan, can you confirm on your end as well?

Flags: needinfo?(ronan)

(In reply to Mike Conley (:mconley) (:⚙️) from comment #21)

This appears to be fixed for me in the latest Nightly! Ronan, can you confirm on your end as well?

Fix confirmed in Nightly 93.0a1 (2021-08-24): about:logins now loads instantly for my 361 passwords. Thanks everybody!

Flags: needinfo?(ronan)

Mike, Ronan,
Any chance that one of you has a copy of Fx92 Beta and could confirm that this problem happens there?
The issue was introduced in NSS 3.69 so I am quite sure it is affected.
I think the patch is a good candidate for uplift... Thank you

Flags: needinfo?(ronan)
Flags: needinfo?(mconley)
Severity: S4 → S2

I exported my logins from my Nightly, and imported them into a copy of Beta and made sure I had a primary password set on the Beta build. Upon restarting and opening about:logins on Beta, it took less than a second for my ~400 logins to appear. So it doesn't look to me like Beta is affected with the same issue.

Flags: needinfo?(mconley)

I confirm Mike's finding that 92.0b8 is unaffected.

I did my test differently from Mike, and am sure too I used a primary password: I copied my 93-nightly profile, disconnected from the internet (to avoid possibilities for data corruption through Firefox Sync of my 93-nightly profile) and ran Beta with the -allow-downgrade -P my93profile . about:logins opened instantly. An uplift to Beta 92 doesn't seem needed.

Flags: needinfo?(ronan)

Thank you both ! : )
I confirm that the problem doesn't happen in Fx92 Beta.

I was surprised that this bug didn't appear there but I now understand why:
something weird happen during the uplift of NSS 3.69 to Fx92 while I was away in vacation and the change causing this issue was never actually uplifted to Fx92. Hence Beta 92 is safe...

Thanks both, and thank you Bob for fixing it quickly :) B.

Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Flags: needinfo?(bbeurdouche)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.