Closed Bug 1662247 Opened 4 years ago Closed 3 years ago

osclientcerts: slow devices can make firefox startup slow

Categories

(Core :: Security: PSM, defect, P2)

x86_64
Windows 10
defect

Tracking

()

RESOLVED DUPLICATE of bug 1669414

People

(Reporter: bugs, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [psm-backlog])

All HTTP sites work fine. All HTTPS ones time out.
Problem first encountered after updating to nightly mid-last week, and since it was a busy week, just switched to stable. Tried a clean profile today, was still failing. Even after manually enabling enterprise roots. Tried downloading clean nightly, still failed.

Tried a few sites with browser console open - nothing useful in there apart from all the telemetry etc https timing out.

Not sure exactly how to diagnose such issue. JC, do you know of some NSS/PSM changes that happened in the past weeks? What kind of logs would we need to figure out what's wrong?

Component: Networking → Security: PSM
Flags: needinfo?(jjones)

Thanks for the needinfo Valentin -

nemo, just to confirm, this only happens with Nightly 82 and 83? Does it happen with beta (or release) (and could you let me know what version(s) you tested)?

We might need a wireshark capture and key (from SSLKEYLOGFILE) - is that something you'd be able to do?

Flags: needinfo?(jjones) → needinfo?(bugs)

Release is fine. I haven't tested beta, but it is (probably) fine given how recent the error is. Request for Wireshark is not new (corporate mitm was broken in firefox back in january too) I'll try to get on that when I get a few free minutes.

Flags: needinfo?(bugs)
User Story: (updated)

Hm... and a key? I guess it's probably fine... I need to look up what we did last time, if I mailed the mozilla engineer privately. Even if the contents are probably not sensitive I'd rather not attach to the bug if possible...

User Story: (updated)

Thanks, nemo, I remember now. You can email me the key (jc@mozilla.com), or mark the key as private here in the bug and then only core-security-release team can get to it, which is a small number of trusted Mozilla folk. :)

Hey. Haven't had time to do a full bisect or wireshark, but I needed to recover this profile, and since -offline no longer functions with firefox, and the broken HTTPS causes firefox to hang to the point of unresponsiveness on startup, I grabbed a few nightlies.
2020-09-04, 2020-08-23, 2020-08-22, 2020-08-21 (I was a bit optimistic about that week ☺ ) all hang.

2020-08-11 works.

When I have some time I'll download a few more to bisect down that 2020-08-11 - 2020-08-21 range (and do the wireshark)

Aw crud... sorry if I just spawned a wild goose chase. 2020-08-11 is hanging too!

I'd swear it worked. Briefly.

Ok. Will try some really old nightlies. Maybe it's something they changed network config-wise.
Ugh. Guess I'd better try beta too.

Hi nemo, you can use mozregression do to the bisection for you.
Just point it to your profile, and say good/bad for each version and it'll track down the issue to the commit level. It's very convenient.

A few more questions: Do you use a proxy or VPN?
Could you also check in about:config if you have any doh-rollout prefs and also attach the output of about:support? Thanks!

Flags: needinfo?(bugs)

Normally I would use mozregression, but unfortunately this is on a Windows 10 VMware View I VPN into to get work done.

WRT DOH, it is definitely not DOH related since it happens even on internal dev machines on the .local domain when SSL is enabled.
But in any case it happened even with DOH disabled in the clean profile I created, and when accessing sites by IP.
The other weird thing about this is I thought they did not MitM internal dev machines...
BTW, the removal of support for Offline mode... really really annoying.

In the brand new profile that is failing, I didn't see anything in particular in about:support that might interest you. AV is all Windows standard... the NS security stuff looks totally normal - NSPR 4.29beta, NSS 3.57 beta

Since thankfully about:config isn't being blocked yet on desktop (my poor stable android is now on fdroid), I force-enabled enterprise roots (I'm guessing the hanging was preventing it from being set) and restarted. It made no difference whatsoever.

Flags: needinfo?(bugs)

Oh... and just to be clear, on the Windows 10 VM I'm VPN'd into, connecting OUT from firefox proceeds normally apart from the corporate MitM that sometimes (but not always it seems?) somerwhere in the routing that inserts itself into connections to remote sites. There is no other VPNing or proxying, and Firefox Stable works fine so long as enterprise certs are enabled.

Ohhhh. I think I know what happened with the 2020-08-11 that worked, then failed. I neglected to disable autoupdate. Gotta make sure I do that before loading the profile ☺ ☺

I'm going to restore 2020-08-11, disable autoupdate in testing profiles, and verify, but I'm going to stick with the breakage being somewhere in that range. I'd just verified 2020-08-01 was working and was like.... waiiiiitasec, that green ↑ in the corner... I bet that's why 2020-08-11 failed the 2nd time I started it 😃

We still need either a regression range or a packet trace to diagnose this further.

Severity: -- → S4
Flags: needinfo?(bugs)
Priority: -- → P3
Whiteboard: [psm-waiting]

Well, you have a 10 day range at least between 2020-08-11 (works) and 2020-08-21 (fails) - I'll try to set aside some time to download and test few more builds tomorrow to get that down to 1 day and maybe a wireshark. It's been crazy busy lately both at work and with online school.

Flags: needinfo?(bugs)

Ok! Spent a boring half hour at end of work session today downloading builds... Nightly from 2020-08-17 with Build ID 20200817214602 WORKS Nightly from 2020-08-18 with Build ID ?????????????? (locks up, can't get to about:support) FAILS. Next time I have some free time, wireshark... Oh, and since a couple of times I neglected to disable autoupdate while setting up profiles, I can confirm that latest nightly still fails.

Heh. This morning when I connected to the work Windows 10 VM it was finally semi-responsive, so I typed in about:support - which promptly hung again (does everything require an HTTPs connection now?). After checking back in half an hour it had unlocked so I was finally able to get a build ID.

Nightly from 2020-08-17 with Build ID 20200817214602 WORKS

Nightly from 2020-08-18 with Build ID 20200818092452 FAILS

Thank you for chasing this down!

In one of the affected Nightlies, does the behavior change if you go to about:config, change security.osclientcerts.autoload to false, and restart the browser?

Flags: needinfo?(bugs)

Yes!

Flags: needinfo?(bugs)

So... sounds like you have a theory and fact that everything is working now confirms it. Is this flag new? Does this still mean you need wireshark from me? Or... I guess... possibly more detail on the OS certs? Or are we good.

The only NSS/PSM change that landed on the 17th is https://hg.mozilla.org/mozilla-central/rev/7e5f806e44a8d35de67bf4ea3e9060836bcf440d. osclientcerts is only enabled by default on Nightly, which also aligns with your report.

I don't think we need a PCAP any more. Dana probably has a better idea of what, if any, further information would be useful from your environment.

Flags: needinfo?(dkeeler)

Logs from osclientcerts would help. Can you follow the instructions in bug 1629002 comment 16 and attach the output here? Thanks!

Flags: needinfo?(dkeeler) → needinfo?(bugs)

Just FYI, I turned over a log to J.C. on Tuesday. I hope I did it correctly.

Flags: needinfo?(bugs)

You did, sorry, I don't think Dana had a chance to look them over yet. I definitely haven't :(

No worries, just wanted to be sure you got it. And also to clear out the needinfo nag message 😝

Oh, and just so I understand what we have to look forward to... so if I read https://wiki.mozilla.org/Release_Management/Calendar correctly, on 2020-10-20 all the people on release firefox over here will have completely non-functional, frozen browsers?

Unfortunately the log doesn't indicate what may be wrong. You say Firefox hangs - can you use Process Hacker [0] to see what Firefox is doing when it hangs?
When you run Process Hacker, you should see Firefox in the list of running programs. If you double-click on the Firefox process that is hanging, you should see a window with a "Threads" tab. On that tab, if you select the thread that's hanging and double-click it, another window should open up with a stack trace.

Also, this feature is only enabled in Nightly, so this won't be released as-is.

[0] https://processhacker.sourceforge.io/downloads.php

Flags: needinfo?(bugs)

The symptoms (to me) suggest it's just blocking waiting on the remote HTTPS request until it times out. Loading firefox, loading about:support all take a very long time to finally become functional. I assume it's all the background HTTPS requests firefox does when phoning home for statistics, checking for updates, addons to inactivate, whatever. Could be wrong though!

The main UI seems to be behaving identically to what happens if I load an HTTPS site. The tab just sits there for a long long time and then finally fails to connect.

I suspect if I could figure out how to get offline mode to work these days there'd be no more UI hangs.

I will try to see if there's any indication of what Firefox is doing with the tool described. I bet if I breakpoint it'll be in something due to establishing an SSL connection..

Fortunately the workaround described above perfectly fixes the bug. Still worried about others once this makes it to release though. I rather doubt my VM is that different from all the others around here.

Flags: needinfo?(bugs)

Oh, my apologies. That last sentence was pretty important. Thanks, I was a bit worried.

Ok. Will try process hacker tomorrow when I sign back onto the work network.

Hm... looks like they won't let me install process hacker. Gets yanked immediately by, I guess, the antivirus, and don't seem to be allowed to run it either. Getting permissions for this sort of thing often fails and takes forever anyway. I kinda feel like you just want be to do the equivalent of gdb attach to process, and backtrace at attach point. Is there some other way to do that in windows?

DDGing around, I see there is WinDbg - gonna try installing that.

Found a promising thread, some sort of hang in

KERNELBASE!WaitForMultipleObjects
WinSCard!_SendSCardIOCTL
WinSCard!_SendSCardConnect

Soooo, smartcard related. Odd since smartcard seems to work fine in other apps (Outlook, smartcard management).

But I will send all the threads over to you guys.

Summary: Firefox Nightly has been nonfunctional with SSL at work for past week. Corporate MitM bug? → osclientcerts locks up Firefox Nightly since 2020-08-17. possibly smartcard related.

(In reply to nemo from comment #31)

Found a promising thread, some sort of hang in

KERNELBASE!WaitForMultipleObjects
WinSCard!_SendSCardIOCTL
WinSCard!_SendSCardConnect

Thanks! Can you include the full stack?

Flags: needinfo?(bugs)

I emailed it to JC, same as the other debug info you'd requested.

Flags: needinfo?(bugs)

Oh, BTW, I can also email text from the smartcard diagnostic tool if that helps at all.
I also reverified that my cert is accessible in the standard UI but firefox still locks up.

That might be helpful. You can also email me directly instead of going through J.C.

Thanks for the logs. We're determining how to address this.

Priority: P3 → P2
Summary: osclientcerts locks up Firefox Nightly since 2020-08-17. possibly smartcard related. → osclientcerts: slow devices can make firefox startup slow
Whiteboard: [psm-waiting] → [psm-backlog]

Hm. I'm not 100% certain the new title is correct. Once firefox stopped hanging on startup, I was able to trigger a hang by accessing about:support or any https site (http sites were fine).
So it's clearly not just startup. And... I'm not sure it is just a slow device since I'd literally wait for half an hour before getting a usable URL bar. The smartcard is slow, it is not that slow.

But maybe there are several related problems with osclientcerts or recent ssl changes. In any case, turning it off fixed all my issues.

Ooh... Or maybe every single SSL operation was rescanning the card instead of caching?
That sure would explain it 😃 So it's not just one slow-ish card action but like.. hundreds of them. Maybe even stacking.

Bug 1669414 should have improved things somewhat. How does the latest Nightly work for you?

Flags: needinfo?(bugs)

Nightly firefox in a clean profile no longer hangs. security.osclientcerts.autoload set to true.
Don't see any blue flickering lights on smart card to indicate read/access.
So seems to be fixed!

Is there anything else I should check? I admit I didn't really read the other bug too carefully.

Flags: needinfo?(bugs)

Great - thanks! If normal use of the browser works for you, I'm satisfied this bug is fixed. If you do see any issues related to client certificates, feel free to file a new bug in this component.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.