Closed Bug 1556491 Opened 5 months ago Closed 4 months ago

Crash in [@ ssl3_SendClientHello | ssl_BeginClientHandshake | ssl_SecureRecv | ssl_Recv | PSMRecv]

Categories

(Core :: Networking: HTTP, defect, P1, critical)

defect

Tracking

()

VERIFIED FIXED
mozilla69
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 68+ fixed
firefox67 --- wontfix
firefox68 + fixed
firefox69 + fixed

People

(Reporter: mayhemer, Assigned: mayhemer)

References

(Blocks 2 open bugs)

Details

(Keywords: crash, Whiteboard: [necko-triaged][secure-proxy-mvp])

Crash Data

Attachments

(2 files)

This bug is for crash report bp-6584daaa-a299-4054-9b2c-684b90190528.

Top 10 frames of crashing thread:

0 libnss3.dylib ssl3_SendClientHello security/nss/lib/ssl/ssl3con.c:4977
1 libnss3.dylib ssl_BeginClientHandshake security/nss/lib/ssl/sslcon.c:200
2 libnss3.dylib ssl_SecureRecv security/nss/lib/ssl/sslsecur.c:41
3 libnss3.dylib ssl_Recv security/nss/lib/ssl/sslsock.c:2940
4 XUL PSMRecv security/manager/ssl/nsNSSIOLayer.cpp:1365
5 XUL mozilla::net::nsSocketTransport::IsAlive netwerk/base/nsSocketTransport2.cpp:2671
6 XUL non-virtual thunk to mozilla::net::nsSocketTransport::IsAlive netwerk/base/nsSocketTransport2.cpp
7 XUL mozilla::net::nsHttpConnection::CanReuse netwerk/protocol/http/nsHttpConnection.cpp:1117
8 XUL mozilla::net::nsHttpConnection::StartSpdy netwerk/protocol/http/nsHttpConnection.cpp:345
9 XUL mozilla::net::nsHttpConnection::EnsureNPNComplete netwerk/protocol/http/nsHttpConnection.cpp

Crashing here:
mozilla-beta: security/nss/lib/ssl/ssl3con.c@6ec3cb391f03a3d02c47e3cb4525fe9eb6abd242 (annotated)

This was reported by Andrew Sun in bug 1556005.

It's a relatively low rate crash. Filing in PSM, but this is more likely an NSS crash, please move at will.

Blocking on secure proxy because this was reproduced while using an h2 proxy.

Dana, can you please look at this or delegate to someone who can help diagnosing this? It's blocking the secure proxy project (can be reproduced when using an h2 proxy and doing a lot of traffic, see bug 1556005 for details)

Thanks.

Flags: needinfo?(dkeeler)
Priority: -- → P1

(moving to NSS and resetting priority so it gets triaged)

Assignee: nobody → nobody
Component: Security: PSM → Libraries
Flags: needinfo?(dkeeler)
Priority: P1 → --
Product: Core → NSS
QA Contact: jjones
Version: Trunk → other

My current theory here is that this is an interaction between the initialization routines in the stack and the way that necko uses the poorly-supported SSL_OptionSet(mFd, SSL_SECURITY, false) option for connections via proxies. It's an interesting condition that might require some interactions between what necko does and what NSS does.

Starting at the beginning, ss->ssl3.cwSpec is the set of keys that NSS is using to encrypt packets it sends. NSS starts with a special "null" value for this, which has a value where macDef->mac == ssl_mac_null. This is the common case. The check that crashes only exists because of renegotiation, where the value of cwSpec is set to the final value from the session that is being renegotiated.

I carefully checked all the paths where macDef (or cwSpec) might be left set to NULL. There are no cases where a socket does not have a value for cwSpec, so that is not the issue.

NSS does in one case fail to set the macDef value (ss->ssl3.cwSpec->macDef). In TLS 1.3 we only use an AEAD, so we didn't bother setting the macDef field. We could fix that and set the value to ssl_mac_aead. That would probably be a mistake, because it would paper over the issue that we're seeing here. Because if we are seeing this issue, it is because the handshake is being started twice.

What I think has happened is that somehow necko (or PSM) is calling into NSS and getting it to fire off a ClientHello. That ClientHello is going out and 0-RTT is being attempted. What will happen when 0-RTT is attempted is that NSS will switch cwSpec to a TLS 1.3 value (without macDef set). A second call to SSL_ResetHandshake() will reset most of the NSS state, but not the important bits. Critically, it resets ss->handshake to point to ssl_BeginClientHandshake, which causes a second call to ssl3_SendClientHello. (I don't know why the stack doesn't show ssl_Do1stHandshake, but I guess it is inlined or something.)

The second maybe-bug here is that ssl3_SendClientHello doesn't check the handshake state (ss->ssl3.hs.ws) before it starts. We should totally do that, but that's not the real problem - calling code should be calling SSL_ResetHandshake after the handshake starts. In any case, it hits this check and crashes because the current write spec is a TLS 1.3 spec that doesn't have macDef set.

I'll take a look at nsNSSIOLayer and the use of that in necko, but I think that we have a problem in there somewhere. That code is quite complex, so I might need to ask for help.

See Also: → 1559125

Martin, thanks a lot for all these details! So, to sum - the issue is that we call SSL_ResetHandshake on the same ssl socket twice, right?

Based on that, Andrew, as you are able to reproduce this, would you be willing to run the same test under the same setup with logging turned on? See HTTP logging - Mozilla | MDN.

Please adjust the list of modules as: sync,timestamp,rotate:200,nsHttp:5,pipnss:5.

One problem is that sync log writes will change timing and may cause the crash be no longer reproducible... But if it reproduces, we will likely find the cause quickly.

Thanks!

Flags: needinfo?(asun)

(In reply to Honza Bambas (:mayhemer) from comment #5)

I managed to reproduce a crash with logging enabled.

Logfile: https://drive.google.com/file/d/1h4nnbWWPH2EuKGz1g0kxLcfGMkFuotuT/view?usp=sharing
Crash: https://crash-stats.mozilla.org/report/index/4694ffb2-4163-4e6f-a4da-c459c0190613

Warning: the uncompressed logfile is nearly a gigabyte.

Flags: needinfo?(asun)

Thanks, are you sure you have added (or turned on) also the "sync" module (w/o a number) to the list? I can't see the last expected line on the crashing thread in the log, that suggests it's incomplete.

Priority: -- → P2

(In reply to Honza Bambas (:mayhemer) from comment #7)

Thanks, are you sure you have added (or turned on) also the "sync" module (w/o a number) to the list? I can't see the last expected line on the crashing thread in the log, that suggests it's incomplete.

I used the list of modules in your post above, which includes sync. It's interesting that the logfile was cut off during the crash.

Priority: P2 → P1
Whiteboard: [necko-triaged][secure-proxy-mvp]

Looking again to the log I can see the following:

  • a channel for https://b.thumbs.redditmedia.com/0PgZl68jAxA6T1BH6uvUQ5Bz1F1GrrJLCL8oi2Gz0Ak.png is racing network with cache
  • network is triggered TriggerNetworkWithDelay [this=0x13889d000, delay=0]
  • this creates a transaction
  • we are creating a new tunnel for it (new tunneling stream + spdy connect trans on the session to the proxy is created); the transaction is immediate temporarily activated on the connection to the proxy
  • in parallel, the cache in the channel wins and the channel cancels the transaction (main thread)
  • we get 200 from the channel and create a new 'intermediate' nsHttpConnection under the spdy connect trans, tls filter trans is created
    • as part of the above operation we also call nsHttpConnectionMgr::AddTransaction which schedules ProcessNewTransaction call on the socket thread
  • now, nsHttpConnectionMgr::OnMsgCancelTransaction arrives to the socket thread and closes the transaction and the connection (code 804b0002); we also kill the the filter trans and the ssl socket D/pipnss [0x13c34fdc0] Shutting down socket
  • then ProcessNewTransaction is called for the same transaction which has the intermediate connection set as sticky connection
  • and then following happens:
2019-06-13 14:41:43.065455 UTC - [Parent 94398: Socket Thread]: E/nsHttp nsHttpConnection::SetupSSL 0x13e111c00 caps=0x25 TS......[tlsflags0x00000000]b.thumbs.redditmedia.com:443 (https:127.0.0.1:7999)[:]
2019-06-13 14:41:43.065505 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::InitSSLParams [this=0x13e111c00] connectingToProxy=0
2019-06-13 14:41:43.065580 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::SetupNPNList 0x13e111c00 0
2019-06-13 14:41:43.065627 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::StartShortLivedTCPKeepalives[0x13e111c00] idle time[10s].
2019-06-13 14:41:43.065640 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::Activate [0x13e111c00] StartShortLivedTCPKeepalives failed rv[0x80004001]
2019-06-13 14:41:43.065654 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable [this=0x13e111c00] host=b.thumbs.redditmedia.com
2019-06-13 14:41:43.065668 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::GetSecurityInfo trans=0x13dee5400 tlsfilter=0x0 socket=0x13bcbd940
2019-06-13 14:41:43.065678 UTC - [Parent 94398: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=0x13dee5400 status=804b000c progress=0]
2019-06-13 14:41:43.065724 UTC - [Parent 94398: Socket Thread]: E/nsHttp nsHttpConnection::EnsureNPNComplete 0x13e111c00 [TS......[tlsflags0x00000000]b.thumbs.redditmedia.com:443 (https:127.0.0.1:7999)[:]] negotiated to 'h2'
2019-06-13 14:41:43.065770 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::EnsureNPNComplete [this=0x13e111c00] early data not accepted
2019-06-13 14:41:43.065777 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::StartSpdy [this=0x13e111c00, mDid0RTTSpdy=0]
2019-06-13 14:41:43.065 ⁃ Http2Session ⁃ 150511800 ⁃ created ⁃ key=n/a
2019-06-13 14:41:43.065788 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session::Http2Session 0x150511800 serial=0x74
Http2Session @150511800 --> nsHttpConnection @13e111c00
2019-06-13 14:41:43.065800 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session::SendHello 0x150511800
2019-06-13 14:41:43.065824 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Magic Connection Header]
2019-06-13 14:41:43.065844 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Generate Settings]
2019-06-13 14:41:43.065863 UTC - [Parent 94398: Socket Thread]: I/nsHttp Session Window increase at start of session 0x150511800 12517377
2019-06-13 14:41:43.065868 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Session Window Bump ]
2019-06-13 14:41:43.065882 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0x3 depends on 0x0 weight 200 for leader class
2019-06-13 14:41:43.065914 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.065948 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0x5 depends on 0x0 weight 100 for other class
2019-06-13 14:41:43.065954 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.065967 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0x7 depends on 0x0 weight 0 for background class
2019-06-13 14:41:43.065972 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.065989 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0x9 depends on 0x7 weight 0 for speculative class
2019-06-13 14:41:43.066000 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.066040 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0xB depends on 0x3 weight 0 for follower class
2019-06-13 14:41:43.066056 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.066081 UTC - [Parent 94398: Socket Thread]: I/nsHttp Http2Session 0x150511800 generate Priority Frame 0xD depends on 0x0 weight 240 for urgentStart class
2019-06-13 14:41:43.066086 UTC - [Parent 94398: Socket Thread]: V/nsHttp Http2Session::LogIO 0x150511800 stream=0x0 id=0x0 [Priority dep node]
2019-06-13 14:41:43.066100 UTC - [Parent 94398: Socket Thread]: V/nsHttp nsHttpConnection::TTL: 0x13e111c00 b.thumbs.redditmedia.com idle 0 timeout 5000
2019-06-13 14:41:43.066151 UTC - [Parent 94398: Socket Thread]: E/nsHttp nsHttpConnection::SetupSSL 0x13e111c00 caps=0x25 TS......[tlsflags0x00000000]b.thumbs.redditmedia.com:443 (https:127.0.0.1:7999)[:]
  • and then we apparently crash

Note that I have a local log for very likely the same crash where rcwn is not involved (the channel simply gets cancelled at the 'right' time).


I think the solution here is to check if the transaction or the intermediate connection is closed when we do either nsHttpConnectionMgr::ProcessNewTransaction or nsHttpConnection::Activate.

Assignee: nobody → honzab.moz
Status: NEW → ASSIGNED
Component: Libraries → Networking: HTTP
Product: NSS → Core
QA Contact: jjones
Version: other → unspecified

The real problem here is that the transaction has the connection assigned as sticky conn but the connection doesn't have hold ref back to the transaction in its mTransaction.

Hence, in nsHttpConnectionMgr::OnMsgCancelTransaction we call conn->CloseTransaction(trans, NS_ERROR_*) which doesn't call Close on the transaction. Then in nsHttpConnectionMgr::ProcessNewTransaction we don't early return. We set the sticky conn on the transaction here.

So, looking at this assertion suggests that we want to add another block here to also close the proxied transaction.

Hi Andrew, I think I have a fix for this crash, but it's extra hard for me to reproduce it. There is again a test build: target.dmg. Can you please test it for us as before? It does NOT contain the fix for bug 1556005, so it should behave as originally.

Thanks!

Flags: needinfo?(asun)

I was unable to reproduce the crash on this build.

However, I was also unable to reproduce the crash on the latest Nightly build. There may have been some changes that changed timings, etc. that made it harder to reproduce.

I think it's fixed?

Flags: needinfo?(asun)
Pushed by honzab.moz@firemni.cz:
https://hg.mozilla.org/integration/autoland/rev/cc8fb7381daf
Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana

Comment on attachment 9074356 [details]
Bug 1556491 - Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana!

Beta/Release Uplift Approval Request

  • User impact if declined: Possible null-deref crash when using HTTP/2 secure proxy.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Low - medium. Only reason I'm a bit careful here is that the crash rate is low and I was never able to explicitly confirm that when we hit this code path we gracefully continue. This was externally confirmed fixed by Andrew Sun from Cloudflare, so I'm fairly confident we are safe to take this one.
  • String changes made/needed:
Attachment #9074356 - Flags: approval-mozilla-beta?

(In reply to Andrew Sun from comment #13)

I was unable to reproduce the crash on this build.

However, I was also unable to reproduce the crash on the latest Nightly build. There may have been some changes that changed timings, etc. that made it harder to reproduce.

I think it's fixed?

Thanks Andrew. If everything else with the patch works as before, we are safe to take this patch then. I'm not aware of any change that would influence timing that much to not being able to reproduce the crash, but as I found, the timing is very tight to trigger this crash.

Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla69

Comment on attachment 9074356 [details]
Bug 1556491 - Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana!

Comments 13 and 15 are not filling me with a huge amount of confidence about taking this in a release candidate, I'd rather let this bake for a while...

Attachment #9074356 - Flags: approval-mozilla-beta? → approval-mozilla-beta-

(In reply to Julien Cristau [:jcristau] from comment #18)

Comment on attachment 9074356 [details]
Bug 1556491 - Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana!

Comments 13 and 15 are not filling me with a huge amount of confidence about taking this in a release candidate, I'd rather let this bake for a while...

Hi Julien, how long would you like this to bake for?

Julien, this is probably the most important fix we need for secure-proxy on 68. It's fairly simple, just hard to confirm it actually works. I will try to reproduce and check we do the right thing when we hit the new code path.

Please reconsider release uplift soon.

Flags: needinfo?(jcristau)

You can request release uplift, and we can consider this after it's been in beta for a week or two and after you have validated a set of changes that secure proxy needs on top of 68.

Flags: needinfo?(jcristau)

Bugbug thinks this bug is a regression, but please revert this change in case of error.

Keywords: regression
Keywords: regression

Managed to reproduce this one and the fix is not correct.

Details: we call nsHttpConnection::Close before its mTLSFilter even knows (via TLSFilterTransaction::SetProxiedTransaction) about the target transaction (mTLSFilter->Transaction() is null). Hence the assertion at [1] will fail in debug builds when this crash occurs!

[1] https://searchfox.org/mozilla-central/rev/11712bd3ce7454923e5931fa92eaf9c01ef35a0a/netwerk/protocol/http/nsHttpConnection.cpp#1797-1798

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED
Pushed by honzab.moz@firemni.cz:
https://hg.mozilla.org/integration/autoland/rev/1a5bc9c6cfcd
Always close http transactions from nsHttpConnection::CloseTransaction even when the connection or its filter does not refer any, r=dragana

Comment on attachment 9074356 [details]
Bug 1556491 - Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana!

Beta/Release Uplift Approval Request

  • User impact if declined: Possible null-deref crash when using HTTP/2 secure proxy.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): I was able to verify the fix is correct by hitting the code path.
  • String changes made/needed:
Attachment #9074356 - Flags: approval-mozilla-release?
Attachment #9076268 - Flags: approval-mozilla-release?
Status: ASSIGNED → RESOLVED
Closed: 4 months ago4 months ago
Resolution: --- → FIXED
Blocks: 1565518

Using FF68 I saw that fast.com fails after a few seconds- when it easily completed when the proxy was off. Speedtest.net failed with a socket error during the download test. Testing was based on bug 1556005.

If these are not valid ways of verifying the bug, please add repro steps and expected results.

(In reply to Rebecca Billings [:rbillings] from comment #28)

Using FF68 I saw that fast.com fails after a few seconds- when it easily completed when the proxy was off. Speedtest.net failed with a socket error during the download test. Testing was based on bug 1556005.

"Based on" means? There has never been a patch landed for that bug.

You have to test with the following builds:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=256193339&revision=95268ddd3e04fc5d7259fd38b9d1db1b00f0f9f0

If these are not valid ways of verifying the bug, please add repro steps and expected results.

This is not easy to reproduce as well, but a good way is to repeatedly reload (F5, Alt-D+enter) a more complicated page like reddit or a facebook profile page (have lot of resources).

You will likely sooner or later crash w/o this patch and not w/ it.

(In reply to Honza Bambas (:mayhemer) from comment #29)

(In reply to Rebecca Billings [:rbillings] from comment #28)

Using FF68 I saw that fast.com fails after a few seconds- when it easily completed when the proxy was off. Speedtest.net failed with a socket error during the download test. Testing was based on bug 1556005.

"Based on" means?

Ah, based on the symptoms, right? OK, then it was the right approach.

There has never been a patch landed for that bug.

You have to test with the following builds:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=256193339&revision=95268ddd3e04fc5d7259fd38b9d1db1b00f0f9f0

You probably have a reference to the right build already.

If these are not valid ways of verifying the bug, please add repro steps and expected results.

This is not easy to reproduce as well, but a good way is to repeatedly reload (F5, Alt-D+enter) a more complicated page like reddit or a facebook profile page (have lot of resources).

You will likely sooner or later crash w/o this patch and not w/ it.

This mostly stands - the crash happens when a request is cancelled at a certain moment to cause a race. Reloads during a load in progress produce such cancellations.

(In reply to Rebecca Billings [:rbillings] from comment #28)

Using FF68 I saw that fast.com fails after a few seconds- when it easily completed when the proxy was off. Speedtest.net failed with a socket error during the download test.

These symptoms (general brokenness) is visible when any of 1563538, 1563695 is missing.

note that the try build ref of off 68 I'm providing in comment 29 has preferences for each of the fixes, if that makes it easier.

During testing, Firefox did not crash while browsing, downloading or doing any other action with the Secure Proxy installed and enabled. Tested on Linux, Mac, and Windows.

My results matched Bogdan's, marking as Verified.

Status: RESOLVED → VERIFIED

Comment on attachment 9074356 [details]
Bug 1556491 - Close nsHttpConnection.mTLSFilter when transaction of that connection is being closed to correctly carry cancellation status and prevent calls on a closed ssl socket, r=dragana!

fix a crash when using http2 proxy. needed for the secure-proxy project, approved for 68.0.1 and 68.1esr

Attachment #9074356 - Flags: approval-mozilla-release?
Attachment #9074356 - Flags: approval-mozilla-release+
Attachment #9074356 - Flags: approval-mozilla-esr68+
Attachment #9076268 - Flags: approval-mozilla-release?
Attachment #9076268 - Flags: approval-mozilla-release+
Attachment #9076268 - Flags: approval-mozilla-esr68+

Per discussion with jcristau, we're uplifting this to 68.0.1esr also to maintain parity with the non-ESR 68.0.1 release and hopefully avoid some confusion.

Regressions: 1572790
Blocks: 1579064
You need to log in before you can comment on or make changes to this bug.