Open Bug 1428901 Opened 2 years ago Updated 5 months ago

Persist TLS session tickets across browsing sessions

Categories

(Core :: Networking, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: u408661, Unassigned)

References

Details

(Keywords: sec-other, Whiteboard: [necko-triaged])

Attachments

(2 files)

Right now, our tls tickets are only used in one browser session. That's not optimal, we should keep them on disk and re-use them after the browser restarts (of course clearing out old ones, ones we've found to not work, etc). This is the bug for the necko portion of that work.
Adding EKR so it’s on his radar (if it wasn’t already)

Note: Doing so can come with non-trivial security risk, including the ability to decrypt past traffic a user has sent if you are able to obtain their device. Chrome and Android did some early experiments with this, and in addition to not providing a meaningful speed benefit (disk seek versus network), determined the security risks to non-forward secure connections would be catastrophic, and thus removed the code.

This also can affect the developer expectations (for example, client cert auth and removed smartcards).

While TLS1.3 improves this situation, due to how the secrets are ratcheted, care should be taken here, both to quantify the performance gains (if any, in practice, particularly for desktop users) and to evaluate the security risks.

Just choking on from the Peanut Gallery - where we evaluated this several times and ended up rejecting it each time.
Thanks for looping me in, Ryan

1. There appear to be some inconsistency in the measurements: in conversations with Facebook, they say that there is actually a pretty big performance boost, so minimally we'd like to get some measurements.

2. As you say, the situation isn't great with TLS 1.3, so we should limit this to TLS 1.3 and above. I hadn't thought to do that, so thanks for pointing it out.

3. WRT to client cert, is it enough to just not export the ticket if client auth was used, or are there other concerns?
1. It's likely the difference between mobile (which traditionally has fast disk, slow network) and desktop (which traditionally has slow disk, fast network). It also likely relates to the eviction strategies of mobile apps (frequently evicted under pressure) versus desktop apps (typically long-running), and the foreground/background state. We've seen similar things when Chrome's network stack is embedded in devices. We also see (on Android) high certificate verification latency, as it's done in-process versus iPhone's out-of-process daemon, and these get mistaken as SSL/TLS costs. I'm also not sure if Facebook is reverifying certificates on resumption - we know of others who don't, whereas Chrome and Firefox do.

Of course, you'll want to do your own measurements to reflect your user population and distributions, but I wanted to encourage a metrics-first measurement. For example, it may be reasonable to experiment by figuring out what your serialized session size would be, and serializing that to disk (as dummy data), racing disk loads with session resumption, and seeing how your distributions look. This would allow you to get a sense of what your users disk speeds are (if you were to block on the load) versus the network speed. You may also want to think about how this would affect/impact multiple parallel TLS session negotiations - whether the multiple connects would end up blocking on the disk load, and how that'd affect your metrics. The one and only time I hosted an intern (with Wan-Teh), they looked at the performance tradeoffs between artificially holding back TLS client hellos to improve our session hit rate, and we determined that, save for 2G/slow 3G, the CPU/network for our user population and doing full TLS handshakes outperformed improve resumption hit rates.

2. Great. My understanding was TLS 1.3 makes this not-unreasonable, but I hadn't kept up to all the ratcheting to know if there's any risks.

3. No, I think that's fine. It's a difference in implementation that Firefox is somewhat unique in - Chrome, Safari, and Edge allow TLS session resumption even if the smart card has since been evicted, whereas Firefox kicks off a thread-per-module to poll and, if a card is ejected that held a client key, the TLS session is closed (within a few seconds, given the device polling). Given that, I wanted to make sure it was on the radar that, at least given existing Firefox/NSS behaviour, they present an extra 'surprise' worth thinking about in the design/implementation.
MT, you will want to see tjhis.
Flags: needinfo?(martin.thomson)
Ack.  This all seems reasonable.  We might make the client authentication thing part of bug 1399439, but I would keep the TLS 1.3 thing in Firefox rather than in NSS.  That's a policy that is specific to the using application.
Flags: needinfo?(martin.thomson)
I just landed a new version of NSS on inbound that has the session cache API.
MT, shall I open a bug to add the client authentication things?
Please do.  I don't think that we should turn this feature on until we have the agreed safeguards.
Changing priorities mean I may not be working on this after all.
Assignee: hurley → nobody
So Honza is going to take this, and it's going to be part of a larger effort to store a number of per-origin and/or per-server attributes in a centralized way, across session restarts.  (Honza: we should probably open a bug for the overall project and make it block this?)

Marion, I know your folks are similarly centralizing DOM storage.  I'm guessing we should make sure that we use the same database engine under the covers (I forget the name of the key-value DB you're using--refresh my memory?).  And perhaps there's a chance to share some data in a common per-origin DB here...

The list of things we've got to store per-origin (so far):

Server's Alt-Svcs settings
Whether to use HSTS
Session tickets
Protocol version (ALPN)
IPv4/v6 preference
Lists of hosts that can be coalesced on same connection
Whether domain is on TRR blacklist
Whether domain server speaks Websockets over H/2
Assignee: nobody → honzab.moz
Flags: needinfo?(mdaly)
(In reply to Jason Duell [:jduell] (needinfo me) from comment #9)
> So Honza is going to take this, and it's going to be part of a larger effort
> to store a number of per-origin and/or per-server attributes in a
> centralized way, across session restarts.  (Honza: we should probably open a
> bug for the overall project and make it block this?)
> 
> Marion, I know your folks are similarly centralizing DOM storage.  I'm
> guessing we should make sure that we use the same database engine under the
> covers (I forget the name of the key-value DB you're using--refresh my
> memory?).  And perhaps there's a chance to share some data in a common
> per-origin DB here...
> 
> The list of things we've got to store per-origin (so far):
> 
> Server's Alt-Svcs settings
> Whether to use HSTS
> Session tickets
> Protocol version (ALPN)
> IPv4/v6 preference
> Lists of hosts that can be coalesced on same connection
> Whether domain is on TRR blacklist
> Whether domain server speaks Websockets over H/2

This is something we need to coordinate on. We want to centralize not just DOM storage but storage generally ideally. We are still developing ideas around this.
Flags: needinfo?(mdaly)
Marion--is lmdb in the tree already?  Do we have a preferred API (xpcom or otherwise) for accessing it via C++?  If there's any specific developers who already know some ins and outs of lmdb, we'd love to have a chance to talk to them.

It sounds like the plan here may be for us to implement a lmdb database to store the per-origin stuff, and then we can fold it into a larger storage service/infrastructure as that effort become ready?   I guess it'll depend on time lines (we'd like to do this work in Q2 or maybe Q3 at latest).
Flags: needinfo?(mdaly)
No, not from our team anyway. We're still in the midst of reworking localstorage only atm. 

The original plan was SQL4 with LSM however, that's not possible now. We are seeking alternatives. LMDB is ideal but we haven't explored it yet.
Flags: needinfo?(mdaly)
Depends on: rkv
Honza--FYI: this doc has some good info on lmdb and our storage plans generally:

https://docs.google.com/document/d/15oX1y9emV_RdMnOg8JZVPVjORmsQX6mK3-cssFF0N6M/edit#heading=h.ppxyqi4pyngg
Assignee: honzab.moz → michal.novotny
Depends on: 1489945
Attached patch wip patchSplinter Review
This is a work in progress patch that doesn't store the tokens to disk yet. As proposed in bug 1489945, I disabled TLS 1.3 and it doesn't assert anymore. But it doesn't work correctly. When the token is reused the connection sometimes fails with error SSL_ERROR_RX_UNEXPECTED_CHANGE_CIPHER which comes from https://searchfox.org/mozilla-central/rev/a0333927deabfe980094a14d0549b589f34cbe49/security/nss/lib/ssl/ssl3con.c#3169.

#0  0x00007f2074f87c0b in ssl3_HandleChangeCipherSpecs (ss=0x7f2037652000, buf=0x7f2037652290)
    at /mnt/work/opt/moz/hg-inbound-3/security/nss/lib/ssl/ssl3con.c:3169
#1  0x00007f2074f877e2 in ssl3_HandleNonApplicationData (ss=0x7f2037652000, rType=ssl_ct_change_cipher_spec, epoch=0, seqNum=1, databuf=0x7f2037652290) at /mnt/work/opt/moz/hg-inbound-3/security/nss/lib/ssl/ssl3con.c:12339
#2  0x00007f2074f89039 in ssl3_HandleRecord (ss=0x7f2037652000, cText=0x7f20524fdc28)
    at /mnt/work/opt/moz/hg-inbound-3/security/nss/lib/ssl/ssl3con.c:12627
#3  0x00007f2074f9c9ab in ssl3_GatherCompleteHandshake (ss=0x7f2037652000, flags=0)
    at /mnt/work/opt/moz/hg-inbound-3/security/nss/lib/ssl/ssl3gthr.c:527
#4  0x00007f2074fa8983 in SSL_ForceHandshake (fd=0x7f20485e8bb0)
    at /mnt/work/opt/moz/hg-inbound-3/security/nss/lib/ssl/sslsecur.c:399
#5  0x00007f2064334c17 in nsNSSSocketInfo::DriveHandshake() (this=0x7f2036c7b930)
    at /mnt/work/opt/moz/hg-inbound-3/security/manager/ssl/nsNSSIOLayer.cpp:433
#6  0x00007f205de16e97 in mozilla::net::nsHttpConnection::EnsureNPNComplete(nsresult&, unsigned int&) (this=0x7f2036cbf800, aOut0RTTWriteHandshakeValue=@0x7f20524fe0c4: nsresult::NS_OK, aOut0RTTBytesWritten=@0x7f20524fe0c0: 0)
    at /mnt/work/opt/moz/hg-inbound-3/netwerk/protocol/http/nsHttpConnection.cpp:510
#7  0x00007f205de18207 in mozilla::net::nsHttpConnection::OnSocketWritable() (this=0x7f2036cbf800)
    at /mnt/work/opt/moz/hg-inbound-3/netwerk/protocol/http/nsHttpConnection.cpp:1866


Martin, do you have any idea what's happening here?
Flags: needinfo?(martin.thomson)
Can you enable SSLTRACE (SSLTRACE=100 will get everything) and send me logs from the failing session?
Flags: needinfo?(martin.thomson)
Attached file SSLTRACE
Flags: needinfo?(martin.thomson)
Comment on attachment 9010234 [details] [diff] [review]
wip patch

Review of attachment 9010234 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/nsSocketTransport2.cpp
@@ +1570,5 @@
>                          "started [this=%p]\n", this));
>          }
>      }
>  
> +   if (usingSSL && HostAttributesStorage::IsSSLTokenCachingEnabled()) {

It seems that this feature, as implemented in this patch, is going to poke a hole into our existing privacy protections, by allowing a third-party channel for example to store a TLS resumption token where our cookie policy disallows it.

If I understand the setup here correctly, in order to address this, you should probably add a check to nsContentPolicy::StorageAllowedForChannel() here.  Also, I'd appreciate if you could add some tests for it under toolkit/components/antitracking (when the final patch is ready!).  I can help with writing the tests, of course, but I don't know how to test the TLS resumption part.
(In reply to :Ehsan Akhgari from comment #18)
> It seems that this feature, as implemented in this patch, is going to poke a
> hole into our existing privacy protections, by allowing a third-party
> channel for example to store a TLS resumption token where our cookie policy
> disallows it.

But this is problem even without this patch, isn't it? The resumption tokens are now cached internally in NSS, but the cache isn't persistent across restart.
Also note that socket transports will soon only be living on the isolated socket process.  Any persistence will have to be proxied to the parent process with r/w access to the profile.  Not sure how complicated it would be to pass data for nsContentPolicy::StorageAllowedForChannel down to the socket process also because I'm not sure how much TLS sockets are isolated by 3rd-partiness - can be shared between channels with different storage allowance.  But it may more be just about dis/allowing reuse of the ticket as a possible cookie, decision made on the parent processed and passed down via channel/transaction<-->IPC/socket creation logic.
(In reply to Michal Novotny (:michal) from comment #19)
> (In reply to :Ehsan Akhgari from comment #18)
> > It seems that this feature, as implemented in this patch, is going to poke a
> > hole into our existing privacy protections, by allowing a third-party
> > channel for example to store a TLS resumption token where our cookie policy
> > disallows it.
> 
> But this is problem even without this patch, isn't it? The resumption tokens
> are now cached internally in NSS, but the cache isn't persistent across
> restart.

Yes indeed it is.  I'm suggesting that we should fix it here since now we'll be controlling the cache at the channel level.  (I don't know how we can achieve this level of control at the NSS level...)

(In reply to Honza Bambas (:mayhemer) from comment #20)
> Also note that socket transports will soon only be living on the isolated
> socket process.  Any persistence will have to be proxied to the parent
> process with r/w access to the profile.  Not sure how complicated it would
> be to pass data for nsContentPolicy::StorageAllowedForChannel down to the
> socket process also because I'm not sure how much TLS sockets are isolated
> by 3rd-partiness - can be shared between channels with different storage
> allowance.  But it may more be just about dis/allowing reuse of the ticket
> as a possible cookie, decision made on the parent processed and passed down
> via channel/transaction<-->IPC/socket creation logic.

Good point.  My suggestion is to perform the check at the Necko level.  Not sure if it is all of Necko that's being moved to another process or the socket access at the level below nsIChannel.  But I'll answer the question in general terms, hopefully it's helpful.

For the privacy checks, we generally need to know a few bits of information:

  * The channel's third-partiness status, including things like the top-level window URI.
  * The channel's tracking status.
  * The permission manager for checking several permissions.
  * The channel's current URI (or final URI if known at the time).
  * The active preferences.
  * Some principals indicating information about the envi ronment the load occurred in, stored on the loadinfo object.
 
Currently, we ensure that our privacy checks work both in the parent and in the content process, so the nsContentUtils::IsStorageAllowedForFoo() functions can be called in either process.  Some of the above information is mirrored into the process we need through the services we use.  For example, the permission manager makes the necessary permissions visible in the content process at the right times.  For some other parts, we manually mirror the required information in the right process.  For example, the principals on the loadinfo object are mirrored into the parent process manually from the content process.  We can probably mirror the loadinfo object into the socket process, if needed (and I suspect that is needed for other reasons as well.)  We should also make sure the third-party utils service works in that process, which again should probably happen for other reasons too.

In general this seems doable with some work.  But that being said, we can also perform our checks e.g. in the parent process before sending requests to the socket process.  Not sure which way makes more sense given the current design at hand.  Happy to help answer more questions about this.
After discussing with MT, marking security until we sort out.
Group: core-security
Group: core-security → network-core-security
This may be a long shot, but I'm wondering if a useful path forward here would be to examine the performance concerns Ryan brought up in comment 1.  I.e., if "load from disk then send to a different process" winds up being slow enough that caching isn't a win, then that would make moot the rest of the brainstorming we're doing here about privacy checks, etc.

However, since we can't cache TLS 1.3 yet (bug 1489945), we'd only be able to test TLS 1.2 (which we're not planning to ship, per comment 2).  So this experiment would only make sense if we expect TLS 1.2 timings to be similar to 1.3 ones for caching.  (Do they store similar amounts of stuff?  Any other reason perf is likely to differ?)

No guarantee that even this plan would work (we've got a couple other crashes--comment 14 and comment 16, which I think we should split out into new bugs at this point).  But figured I'd explore it.

Ekr: does this sound worth doing to you?

Michal: does it sound like a reasonable amount of effort?
Flags: needinfo?(ekr)
Flags: needinfo?(michal.novotny)
Oh, I now remember there's a (not yet reviewed) patch for the TLS 1.3 issue, so we might be able to investigate performance with 1.3 after all.
Yes, I think we should take some measurements once this works.
Flags: needinfo?(ekr)
Depends on: 1493769
Depends on: 1493771
No longer depends on: 1493771
Ekr: What meaning does this have as a security bug? It's not a vulnerability in current products, and this bug isn't about a vulnerability but rather a feature request that might introduce one. Not sure who we're protecting by hiding the bug or what's the appropriate rating to give it to get it off our triage list.
Flags: needinfo?(ekr)
sec-other?
Flags: needinfo?(ekr)
Depends on: 1499732
Depends on: 1499737
We have fixes in NSS trunk for the harder issues now.  Clearing my n-i.
Flags: needinfo?(martin.thomson)
JC -- when will the fixes in NSS trunk land in Nightly? Or have they?
Flags: needinfo?(jjones)
(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #29)
> JC -- when will the fixes in NSS trunk land in Nightly? Or have they?

NSS trunk syncs to Nightly every 1-3 days. Nightly is up-to-date currently.
Flags: needinfo?(jjones)
Keywords: sec-other

:michal is this something we could get completed for RC fx67 (in about two weeks)?

Not for 67. 2 parts are still missing. Storing/reading the data to/from disk and telemetry to measure performance gain/hit.

Flags: needinfo?(michal.novotny)
Depends on: 1546975

deprioritized; no plan to implement this in the immediately future.

Assignee: michal.novotny → nobody
Priority: P2 → P3

:mt says we can unhide this now.

Group: network-core-security
You need to log in before you can comment on or make changes to this bug.