Closed Bug 1462303 Opened Last year Closed Last year

Intermittent SSL_RX_MALFORMED_SERVER_HELLO on HN and YouTube (TLS 1.3)

Categories

(NSS :: Libraries, defect, P1)

3.35
defect

Tracking

(firefox-esr52 unaffected, firefox-esr6060+ fixed, firefox60+ fixed, firefox61+ fixed, firefox62+ fixed)

RESOLVED FIXED
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 60+ fixed
firefox60 + fixed
firefox61 + fixed
firefox62 + fixed

People

(Reporter: ekr, Assigned: mt)

References

Details

Attachments

(1 file)

We are seeing occasional reports of SSL_RX_MALFORMED_SERVER_HELLO on TLS 1.3-hosted sites with Fx60. Here's the available data:

- These sites are running BoringSSL
- It's intermittent: if you reload, it works
- At least one user reports this error when they are pretty sure they are not behind a middlebox. I saw it myself on an airplane, but again, a reload fixed it.
- We didn't get any reports of this with Fx59 even though it had TLS 1.3 on.

There are no super-obvious changes to the NSS code that would cause this, though our current best bet is the error is coming from the session-id comparison code in compat mode:

https://searchfox.org/nss/rev/535129af9c575342f8cd1fb227da7fc5722f3acd/lib/ssl/ssl3con.c#6165

We are trying to get a PCAP, but I think the next step is to run NSS (or better yet, Firefox) in a loop and try to reproduce.
We also need to measure the actual error rate we're getting here, and see if it's worth scaling back TLS 1.3 on 60 while we sort this out.
We're working on a Firefox loop based on the Canary + tcpdump. Assigning to me.
Assignee: nobody → jjones
Priority: -- → P1
ekr found the bug in theory, I reproduced with a test case and there is a fix inbound.  Let's talk about uplift.  It's a small change, but one that will prevent some ugly errors during TLS 1.3 rollout.
https://hg.mozilla.org/projects/nss/rev/43018e031dee1b0c7cac16d3de268de3de6d938d
Assignee: jjones → martin.thomson
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → 3.38
Version: trunk → 3.35
OS: Unspecified → All
Hardware: Unspecified → All
Ok, the case for/against uplift.

This IS user visible.  It will manifest with a scary warning once (for each user and each site).  Reloading the page (or the browser) will make it go away.

It only happens when the site upgrades to TLS 1.3, which we've seen a lot of recently.  It only happens if the user has an open session during the upgrade.

This is a pure interop issue.  There is no security risk here, just an ugly warning.  So...

I would say that we want to uplift to Beta and - provided that release management are ok - wait for the next regular build of 60 to catch the change.  I can start the process of getting the NSS releases ready in case someone wants to move.
(In reply to Eric Rescorla (:ekr) from comment #1)
> We also need to measure the actual error rate we're getting here, and see if
> it's worth scaling back TLS 1.3 on 60 while we sort this out.

Is someone working on getting that data?  Should this block rolling out 60 further in the mean time?
Flags: needinfo?(ekr)
Comment on attachment 8976993 [details]
Bug 1462303 - Allow TLS 1.3 compat mode when attempting to resume TLS 1.2, r?ekr

Tim Taubert [:ttaubert] has approved the revision.

https://phabricator.services.mozilla.com/D1310
Attachment #8976993 - Flags: review+
I am running a Spark job to try to collect this,
Flags: needinfo?(ekr)
(In reply to Eric Rescorla (:ekr) from comment #9)
> I am running a Spark job to try to collect this,

But now that we have diagnosed it, I expect it to be pretty rare and given that we know a reload fixes it, I'm tempted not to block. What do others think?
(In reply to Eric Rescorla (:ekr) from comment #10)
> (In reply to Eric Rescorla (:ekr) from comment #9)
> > I am running a Spark job to try to collect this,
> 
> But now that we have diagnosed it, I expect it to be pretty rare and given
> that we know a reload fixes it, I'm tempted not to block. What do others
> think?

Looking at the patch -- and assuming I understand it correctly -- I don't see a need to block deployment. I vote to continue.
Spark doesn't show that 60 is worse than 59, so I don't think there is a need to throttle 60. I suggest that we do as MT says and trial if for a little while on Beta and then look at promoting to Release.
3.37: https://hg.mozilla.org/projects/nss/rev/2f5e8b4bf89729ea5ee18c089e684d4e8138b557
3.36: https://hg.mozilla.org/projects/nss/rev/ff35a3edaafb1586d839fca8cdcae53dcdf02ee9

I'll need help cutting (NSS) releases.  The process is not nearly automated enough for my liking.
Hey J.C., is there a good telemetry probe that we can monitor on release 60 to get a sense of how severe this issue is? Thanks!
Flags: needinfo?(jjones)
(In reply to Ritu Kothari (:ritu) from comment #15)
> Hey J.C., is there a good telemetry probe that we can monitor on release 60
> to get a sense of how severe this issue is? Thanks!

I believe this is bucket 29 [1] of SSL_HANDSHAKE_RESULT [2]. Currently 0.01% of Nightly 62 errors.

Graph: https://screenshots.firefox.com/ai8yzHlOVbK06VYn/telemetry.mozilla.org

[1] https://searchfox.org/mozilla-central/rev/dc6d85680539cb7e5fe2843d38a30a0581bfefe1/security/nss/lib/ssl/sslerr.h#70
[2] https://mzl.la/2GJfiCM
Flags: needinfo?(jjones)
Do we know what the story was for Release 58 (no TLS 1.3)?
(In reply to Eric Rescorla (:ekr) from comment #17)
> Do we know what the story was for Release 58 (no TLS 1.3)?

No idea. It looks like the background noise for that one is ~1k-4k for most release cycles.

(I'm not seeing Release 58 as an option on t.m.o anymore, but Beta 58 has ~4k errors noted, which is 0%. I'm guessing you're looking at stmo?)
You need to log in before you can comment on or make changes to this bug.