Open Bug 1370516 Opened 9 years ago Updated 2 years ago

NSS should be initialized off main thread

Categories

(Core :: Security: PSM, enhancement, P2)

enhancement

Tracking

()

Performance Impact medium

People

(Reporter: florian, Unassigned)

References

(Depends on 2 open bugs, Blocks 6 open bugs)

Details

(Keywords: perf:responsiveness, perf:startup, Whiteboard: [psm-blocked][bhr:nsNSSComponent::InitializeNSS])

In my profiles, I've observed NSS being initialized synchronously on the main thread the first time it is needed. From what I remember, in warm startup profiles, it took ~16ms, but in cold startup profiles it's usually more than 100ms (300ms is not uncommon) due to main thread IO when loading libraries. We initially discussed this in bug 1362364, but the bug got repurposed. For now to improve startup we are working on avoiding NSS initialization before first paint (bug 1362364, bug 1359031, bug 1367450). I think the real solution would be to start initializing NSS on a different thread as early as possible during startup, so that it's ready by the time we may need it. We currently have 2 cases where we would like to use SSL before first paint: - captive portal would like to do an async XHR asap to that we can know if we should avoid restoring the previous session. - bug 808104 would like to speculatively connect to the home page so that we are ready to load it as soon as the XUL window is ready. I think that we would like is to have a promise that's resolved as soon as network is available, so that we can start both of these things asap, but without the risk of triggering main thread IO.
Flags: needinfo?(mcmanus)
Meant to ask a question with the needinfo: Patrick, is this likely to be possible to do in the quantum/photon time frame (ie. for 57)?
Blocks: 808104
Component: Networking → Security: PSM
Flags: needinfo?(mcmanus) → needinfo?(dkeeler)
I'm still investigating this, but I wanted to at least give a response. Unfortunately it's of the form "we don't know yet, but it would be difficult and not without risk". I will update this bug as I determine a more complete answer.
Flags: needinfo?(dkeeler)
Whiteboard: [qf] → [qf:p1]
Depends on: 1372656
Priority: -- → P2
Whiteboard: [qf:p1] → [qf:p1][psm-blocked]
This bug is not going to be fixed in time for 57. I am moving this to P2 for post 57 work.
Whiteboard: [qf:p1][psm-blocked] → [qf:p2][psm-blocked]
Keywords: perf
(In reply to Dana Keeler [:keeler] (use needinfo) from comment #2) > I will update this bug as I determine a more complete > answer. Hey keeler, have you been able to determine a more complete answer?
Flags: needinfo?(dkeeler)
Whiteboard: [qf:p2][psm-blocked] → [qf:p1][psm-blocked]
This would depend on bug 1421084 (if we can even do it, which isn't certain). If/when that work is done, I think the way we would have to do this would be to move the majority of nsNSSComponent::InitializeNSS() inside LoadLoadableRootsTask::Run() (I imagine we would rename that). Then EnsureNSSInitializedChromeOrContent() would have to call BlockUntilLoadableRootsLoaded(). The last piece would be explicitly causing a PSM initialization early in startup after getting the user's profile directory.
Depends on: 1421084
Flags: needinfo?(dkeeler)
Whiteboard: [qf:p1][psm-blocked] → [qf:i60][qf:p1][psm-blocked]
Whiteboard: [qf:i60][qf:p1][psm-blocked] → [qf:f60][qf:p1][psm-blocked]
Whiteboard: [qf:f60][qf:p1][psm-blocked] → [qf:f61][qf:p1][psm-blocked]
Whiteboard: [qf:f61][qf:p1][psm-blocked] → [qf:f61][qf:p1][psm-blocked] [fxperf]
Whiteboard: [qf:f61][qf:p1][psm-blocked] [fxperf] → [qf:f61][qf:p1][psm-blocked][fxperf:p3]
See Also: → 441355
Hey dkeeler, with bug 1421084 fixed, is fixing this bug still as you described in comment 6? Is this still something we can attempt?
Flags: needinfo?(dkeeler)
Whiteboard: [qf:f61][qf:p1][psm-blocked][fxperf:p3] → [qf:f64][qf:p1][psm-blocked][fxperf:p3]
Essentially, yes. The added code from bug 1427248 would have to be handled a bit differently, but that wouldn't be too hard. My main concern here is that currently consumers assume it's safe to call NSS functions if nsINSSComponent has been initialized, but this would no longer be the case. We'd have to sprinkle around calls to `BlockUntilLoadableRootsLoaded()` (or whatever it would be called) in a number of places, and I don't see a way of guaranteeing that new code will do the appropriate waiting. This could lead to a long litany of whack-a-mole crashes where we ship a version and then go "oh, whoops - forgot to wait here... forgot to wait here... forgot to wait here...". I think before we go further with this bug, we need to figure out what exactly is slow about initializing NSS. With the loadable roots loading on another thread, I suspect the long pole is now the user's certificate and key databases. If so, we may be able to load those on the background thread as well (and if we ever got the waiting wrong all that would happen is e.g. some TLS connections may fail and the user can just reload the page, rather than having Firefox crash).
Flags: needinfo?(dkeeler)
Whiteboard: [qf:f64][qf:p1][psm-blocked][fxperf:p3] → [qf:p1:f64][psm-blocked][fxperf:p3]
Blocks: 1445965
Whiteboard: [qf:p1:f64][psm-blocked][fxperf:p3] → [qf:p2:responsiveness][psm-blocked][fxperf]

Marking P2 for investigation because the last comment from keeler is that to move forward we need more information about exactly what is slow about NSS initialization

Whiteboard: [qf:p2:responsiveness][psm-blocked][fxperf] → [qf:p2:responsiveness][psm-blocked][fxperf:p3]

Here are two profiles showing what happens during NSS initialization: https://perfht.ml/2G4F0pl https://perfht.ml/2G4Mw3x

Depends on: 1557378
Blocks: 1538631
See Also: → 1596429
See Also: → 1596430
Whiteboard: [qf:p2:responsiveness][psm-blocked][fxperf:p3] → [qf:p2:responsiveness][psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS]

We're seeing this on Fenix startup profiles, about 100ms (with profiler) on a Moto G5

https://share.firefox.dev/3f8Yeen

Raising the priority of this, this is still showing up frequently on BHR, combined with startup profiles on Fenix this seems like it could be a real win. Florian, if you believe I'm overstating the value here please correct me.

Whiteboard: [qf:p2:responsiveness][psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS] → [qf:p1:responsiveness][psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS]
Performance Impact: --- → P1
Whiteboard: [qf:p1:responsiveness][psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS] → [psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS]

So, on Windows it seems like the primary cost here is just loading two dlls? Have we experimented with just loading those dlls in the background early during startup, as a safer win here?

The Android profile looks to be exhibiting a different problem - if I'm right then maybe we should split this into separate bugs.

(In reply to Doug Thayer [:dthayer] (he/him) from comment #13)

So, on Windows it seems like the primary cost here is just loading two dlls? Have we experimented with just loading those dlls in the background early during startup, as a safer win here?

It looks like there was an experiment along those lines a few years ago: bug 1640087. It looks like you were actually involved - do you know the outcome of that?

Flags: needinfo?(dothayer)

Ack - I simply forgot that these dlls were involved in that experiment. Bug 1640087 was a really weird one. On every machine I could get my hands on with an HDD and not an SSD, it yielded an outstanding startup performance win, typically on the order of 15-25%. However in the wild the performance improvements evaporated, and we observed almost no difference, but with I believe a very very slight win on HDDs, but not enough to be significant. IIRC the modified list of dlls was experimented with locally across a range of machines to find the optimal set.

We never figured out why the performance differences evaporated - we checked and double checked the experiment and couldn't identify any problem. I certainly think it's time we clean up the set of dlls we prefetch, but it's hard to figure out what the proper list should be. I would lean toward going with what our local testing highlighted, even though the experiment flopped. Sorry I can't be more conclusive on this.

Flags: needinfo?(dothayer)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #15)

I certainly think it's time we clean up the set of dlls we prefetch, but it's hard to figure out what the proper list should be.

Couldn't we save during startup the order in which dll files are accessed (we record profiler markers for DLL loads, so it should be possible to also record the information into some other storage), so that at the next startup we pre-read these dll files off main thread? We already do something similar for non-dll files.

Flags: needinfo?(dothayer)

Right, but it's not actually clear to me that we get much out of that. Especially on Windows, where the OS is trying to prefetch files for us anyway.

Revising my recent take on this bug, has anyone actually been able to reproduce this problem on anything other than on one of our spinny disk reference machines? I can't get it to show up to any serious degree in profiling on any of my SSD machines. If it only shows up on slow spinny disk systems, then we need to be explicit that that is who would be affected, and I don't feel confident anymore that A) that hardware is a priority, since it is slowly dying out, or B) that the problem is even very tractable for them.

Flags: needinfo?(dothayer)

(In reply to Doug Thayer [:dthayer] (he/him) from comment #17)

has anyone actually been able to reproduce this problem on anything other than on one of our spinny disk reference machines? I can't get it to show up to any serious degree in profiling on any of my SSD machines.

I haven't tried on machines with SSDs, but that now makes me wonder if we should include the information about whether there's an SSD as an annotation in BHR reports.

A) that hardware is a priority, since it is slowly dying out,

Do we have data about how quickly it's dying (or not dying) for our users? I think we still have lots of Windows 7 users, so that's likely users that haven't updated their machines in a while.

(In reply to Florian Quèze [:florian] from comment #18)

Do we have data about how quickly it's dying (or not dying) for our users?

Answering my own question a bit: https://www.statista.com/statistics/285474/hdds-and-ssds-in-pcs-global-shipments-2012-2017/ shows that in new machines, SSDs became the majority only in 2021, with 52% of disks shipped in 2021 being SSD. It seems Microsoft would like all new Windows 11 PCs to have SSDs in 2023, but it's unclear if it will actually happen as there's some pushback from the industry.

I think we still have lots of Windows 7 users, so that's likely users that haven't updated their machines in a while.

17% of our users are still running Windows 7, according to https://data.firefox.com/dashboard/hardware

It's still just questionable to me that the problem is even tractable for HDD users. Again, Windows is trying to prefetch files off disk anyway, and this behavior varies from OS version to OS version, so we would have to experiment across a range of drives across a range of OS versions and fundamentally we're working with less information and less expertise than the OS has.

The Performance Priority Calculator has determined this bug's performance priority to be P2. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: [x] Windows [x] macOS [x] Linux [x] Android
Impact on browser UI: Causes noticeable startup delay

Bug requires additional experimentation on HDD vs SDD. Clear actionable next steps are requested.

Performance Impact: P1 → P2
Keywords: perfperf:startup
Severity: normal → S3
Whiteboard: [psm-blocked][fxperf:p3][bhr:nsNSSComponent::InitializeNSS] → [psm-blocked][bhr:nsNSSComponent::InitializeNSS]
You need to log in before you can comment on or make changes to this bug.