Closed Bug 707275 Opened 8 years ago Closed 6 years ago

Implement SSL certificate, cipher suite, and version negotiation telemetry

Categories

(Core :: Security: PSM, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla28
Tracking Status
firefox26 --- fixed
firefox27 --- fixed
firefox28 --- fixed
b2g-v1.2 --- fixed

People

(Reporter: dchanm+bugzilla, Assigned: briansmith)

References

()

Details

(Whiteboard: [qa-])

Attachments

(2 files, 13 obsolete files)

10.79 KB, patch
keeler
: review+
Details | Diff | Splinter Review
12.15 KB, patch
keeler
: review+
Details | Diff | Splinter Review
Attached patch implement telemetry v1 (obsolete) — Splinter Review
This bug will be used for tracking the status of implementing telemetry collection of SSL certificates and associated errors.

Documentation is in progress
Attached patch telemetry patch v2 (obsolete) — Splinter Review
fixed bitrot and moved error probe into PSM as suggested by bsmith
Attachment #578680 - Attachment is obsolete: true
Attached patch telemetry patch v3 (obsolete) — Splinter Review
* explicitly count number of good/bad ssl requests
* handle revoked certificates
* log TLS intolerant sites
Attachment #579477 - Attachment is obsolete: true
Comment on attachment 580224 [details] [diff] [review]
telemetry patch v3

The feature page has been updated to explain the reasoning for collecting this data.
Attachment #580224 - Flags: review?(bsmith)
Comment on attachment 580224 [details] [diff] [review]
telemetry patch v3

Review of attachment 580224 [details] [diff] [review]:
-----------------------------------------------------------------

Please submit your patches with showfunc = true.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +485,5 @@
>  
> +  if (rv != SECSuccess) {
> +    PRErrorCode error = PR_GetError();
> +    if (IS_SSL_ERROR(error)) {
> +      mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_ERROR,

You cannot call Accumulate here or in other places that execute on the cert verification threads. The reason is that Accumulate must always be called on the same thread for a given probe, since it isn't thread-safe (unless this changed recently). But, this code executes on an arbitrary thread from a thread pool.

If you put the Accumulate call in SetCertVerificationResult, you can detect an error overridden by the cert override service by looking at mSSLStatus->mHaveCertErrorBits, IIRC. SetCertVerificationResult always executes on the socket transport thread, so it avoids the above problem.

If it is helpful for you, you can add an originalCertError parameter to SetCertVerificationResult that you would record in your probe.

Also, certificate errors are not (always) SSL errors; they are usually SEC errors (IS_SEC_ERROR). And, not every error returned by PSM_SSL_PKIX_AuthCertificate is one of those two kinds.

If possible, it would be great to be able to record every single possible error in its own bucket.

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +982,5 @@
> +        version = 2;
> +      } else if (scinfo.protocolVersion == SSL_LIBRARY_VERSION_3_0) {
> +        version = 3;
> +      } else if (scinfo.protocolVersion == SSL_LIBRARY_VERSION_3_1_TLS) {
> +        version = 4;

I'd rather not enumerate every possible version here. Let's just use the scinfo.protocolVersion - SSL_LIBRARY_VERSION_3_0 as the bucket.

Note that we do not support versions less than SSL 3.0 anymore.

@@ +990,5 @@
> +
> +      /*
> +       * sslproto.h defines ciphersuites as a PRUint16
> +       * convert these values to an index so we don't need sparse
> +       * telemetry buckets

What is wrong with sparse telemetry buckets? I think sparseness is OK and clearer than this code.

Our analysis tools should (but maybe can't yet) collapse sparse buckets into something presentable.

(This goes for the SSL version too.)

@@ +997,5 @@
> +        if (SSL_ImplementedCiphers[i] == scinfo.cipherSuite)
> +          break;
> +      }
> +
> +#ifndef NSS_ENABLE_ECC

I do not think NSS_ENABLE_ECC is defined/undefined for PSM; only for NSS.

@@ +1007,5 @@
> +      i += 256;
> +#endif
> +
> +      /* TODO: do we want to create an extra bucket and clamp values > x */
> +      mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_VERSION,

Put this accumulate near the code that does the calculation of "version".

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +3596,5 @@
> +  PRUint32 flags = collected_errors << 1;
> +  if (collected_errors & nsICertOverrideService::ERROR_UNTRUSTED) {
> +    bool isSelfSigned; 
> +    // TODO: ignore errors here?
> +    nssCert->GetIsSelfSigned(&isSelfSigned);

Instead of doing this, we should record telemetry for every error code encountered in the verify log. (i.e. each iteration through the loop that processes the verify log should have a call to Accumulate()).

@@ +3885,5 @@
>      // hellos, it is more likely that we will get a reasonable error code
>      // on our single retry attempt.
> +
> +    // log connections to intolerant sites
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_INTOLERANT, 1);

Need to think about this. I am not sure there is a great way to meaninfully count TLS intolerance.
Attachment #580224 - Flags: review?(bsmith) → review-
Attached patch telemetry patch v4 (obsolete) — Splinter Review
> Threading issues
The logging has been moved out of SSLServerCertVerification threads. Telemetry is now logged in SetCertVerificationResult and HandleBadCertificate in nsNSSIOLayer.cpp .

> SSL Version enumeration
Changed to use SSL version as bucket as suggested by bsmith

> Ciphersuites
The funky index and lookup for ciphersuites has been removed for the most part. There are separate buckets for each of the types of ciphersuites, but converting back to the sslproto.h #defines is straightforward. There are 3 histograms now to account for each range

> HandleBadCertificate logging
Telemetry has been moved inside the for-loop / switch. 

> misc
Changed patch to use showfunc = 1
added more comments
Selfsigned certificate isn't logged anymore
Attachment #582895 - Flags: review?(bsmith)
Attachment #580224 - Attachment is obsolete: true
Comment on attachment 582895 [details] [diff] [review]
telemetry patch v4

Review of attachment 582895 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +979,5 @@
> +      // check for SSLv3 < 0xFF
> +      // ECC ciphersuites 0xC000
> +      // then SSLv2 / informational
> +      // the else clause should not come up much in practice
> +      // experimental and SSLv2 ciphersuites

See http://tools.ietf.org/html/rfc5246#section-12.

Use the names:
   SSL_CIPHERSUITE_STANDARD,
   SSL_CIPHERSUITE_SPEC_REQUIRED,
   SSL_CIPHERSUITE_PRIVATE_USE.

Include the above link in the documentation for these.

@@ +980,5 @@
> +      // ECC ciphersuites 0xC000
> +      // then SSLv2 / informational
> +      // the else clause should not come up much in practice
> +      // experimental and SSLv2 ciphersuites
> +      if (scinfo.cipherSuite < 0xFF) {

< min(0xC000, base::Histogram::kBucketCount_MAX)

@@ +983,5 @@
> +      // experimental and SSLv2 ciphersuites
> +      if (scinfo.cipherSuite < 0xFF) {
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_CIPHERSUITE,
> +                                       scinfo.cipherSuite);
> +      } else if ((scinfo.cipherSuite & 0xC000) == 0xC000) {

< min(0xFF00, 0xC000 + base::Histogram::kBucketCount_MAX)

@@ +985,5 @@
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_CIPHERSUITE,
> +                                       scinfo.cipherSuite);
> +      } else if ((scinfo.cipherSuite & 0xC000) == 0xC000) {
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_CIPHERSUITE_C0,
> +                                       scinfo.cipherSuite & 0xFF);

scinfo.cipherSuite - 0xC000

@@ +988,5 @@
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_CIPHERSUITE_C0,
> +                                       scinfo.cipherSuite & 0xFF);
> +      } else {
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_CIPHERSUITE_OTHER,
> +                                       scinfo.cipherSuite & 0xFF);

I suggest that we just make SSL_CIPHERSUITE_OTHER a boolean probe, as it is *extremely* unlikely to be used.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +113,5 @@
>  
> +// normalize each type of error
> +// otherwise we would need more than kBucket_Count_MAX
> +// defined in base/histogram.h
> +#define LOG_SSL_TELEMETRY(error) \

This should be a function, not a macro, especially because it evaluates its argument multiple times.

@@ +1005,5 @@
> +  // we will end up logging (0 - PR_NSPR_ERROR_BASE)
> +  // which accumulates in the largest bucket
> +  // The largest bucket will represent a value outside of
> +  // the NSPR range for now. (PR_MAX_ERROR)
> +  LOG_SSL_TELEMETRY(errorCode);

This is a confusing to me. I think it it better to have a separate bucket for successful certificate validations (errorCode == 0 means success.)

Use the word "record" or something else, instead of "log" since "log" has a different specific meaning.

@@ +3601,5 @@
> +
> +    // log error after switch since if SECFailure is returned
> +    // then SSLServerCertVerificationResult will be dispatched
> +    // and the error will be handled in SetCertVerificationResult
> +    LOG_SSL_TELEMETRY(i_node->error);

ditto "log" -> something else.

You need to use different telemetry probes here, otherwise these errors will get double-counted. (We are probably going to call SetCertVerificationResult with one of the values of i_node->error).

It would be extremely helpful to record the error code AND whether or not the error was overridden AND whether the default implementation of the cert override service was used. This would allow us to establish "what are the most likely reasons for users to override cert errors" and "what errors do users not override." The easiest way to do this would be to define two sets of probes "overridden" and "not overridden" and add the error to the correct set depending on the results of the override processing below. (So, you would have to loop through the error log again below.)

@@ +3896,5 @@
>      // hellos, it is more likely that we will get a reasonable error code
>      // on our single retry attempt.
> +
> +    // log connections to intolerant sites
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_INTOLERANT, 1);

Use the name SSL_TLS_1_0_INTOLERANT.
Attachment #582895 - Flags: review?(bsmith) → review-
Whiteboard: [privacy:sstamm]
Haven't had time to work on this. I'll un bitrot my current patch and attach it to the bug
Assignee: dchan+bugzilla → nobody
I am trying to work on this now and started off with unbitrotting the patch. Some questions/comments (I am new to the Firefox, so be nice and feel free to add as much context/info as you feel: I don't understand the code much)

@bsmith: I have tried to adopt the changes you suggested. Will be great if you can take a look. I haven't changed PRIVATE_SPEC to a boolean probe: can you expand further on the value of making it a boolean?

I didn't quite understand why you need the bucketcount_max in the code. I have modified it to match the spec and changed the definitions in the histograms file to match the spec. 

Further, I haven't unbitrotted the changes for SSL_ERROR and SSL_ERROR_SEC: the current code doesn't quite match what dchan worked on.  One question I wanted to ask was why is the logging for these two errors not done in SSLServerCertVerification.cpp? That seems like a good place to put it. Am I missing something?
Attached patch Unbitrot attempt #1 (obsolete) — Splinter Review
Attachment #625309 - Attachment description: SSL Errors not being logged right now. → Unbitrot attempt #1
the bucket count for SSL_CIPHERSUITE_STANDARD needs to be reduced, to be under kbucketcount_max; any ideas on what is a good bucket count? What are all the possible values seen? I am sure there aren't 0xC000 standardized cipher suites.
(In reply to dev from comment #10)
> Created attachment 625309 [details] [diff] [review]
> Unbitrot attempt #1

Is this ready for review or si it just an unrotted patch ?
It doesn't implement all the features, but for the subset it does implement, it is ready for review.
Ok. This version should log every issue. There is still one outstanding issue regarding the histogram bucket sizes that I need help with.
Attachment #625309 - Attachment is obsolete: true
Looking at http://dxr.lanedo.com/mozilla-central/security/nss/lib/ssl/sslproto.h.html , it seesm the SSL_CIPHERSUITE_STANDARD can be divided into the < 0x00FF and > 0xC000 ; with most of the remaining space empty. Does it make sense to break it up into two histograms for these two? SSL_CIPHERSUITE_STANDARD SSL_CIPHERSUITE_INFORMATIONAL_STANDARD ?
Attachment #626073 - Flags: review?(bsmith)
Attached patch All SSL Telemetry implemented (obsolete) — Splinter Review
Switched to using HISTOGRAM_ENUMERATED_VALUES because of the bug, and also added a comment to the protocol list about adding new ciphers. 

Since the minimum for the histogram is a 1, I had to add a bunch of +1s to the logging.
Assignee: nobody → dev.akhawe
Attachment #582895 - Attachment is obsolete: true
Attachment #626073 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #626073 - Flags: review?(bsmith)
Attachment #632854 - Flags: review?
Attachment #632854 - Flags: review? → review?(bsmith)
minor bugfix
Attachment #632854 - Attachment is obsolete: true
Attachment #632854 - Flags: review?(bsmith)
Attachment #632918 - Flags: review?(bsmith)
Attachment #632854 - Flags: review?(bsmith)
Comment on attachment 632918 [details] [diff] [review]
Implemented all telemetry based on the original patch from davidchan

requesting review from Honza, as bsmith is out all week.
Attachment #632918 - Flags: review?(bsmith) → review?(honzab.moz)
Devdatta, how urgent is this review?  I think it is wiser to wait Brian's review.  I've never seen this bug before and there seems to already be a lot of discussion between you and bsmith.  But if this is urgently needed, I can take a look on Monday.
Makes sense. Brian suggested I ping you, but there are lots of random little issues that it makes sense that Brian be the one to review.
Attachment #632918 - Flags: review?(honzab.moz) → review?(bsmith)
Comment on attachment 632918 [details] [diff] [review]
Implemented all telemetry based on the original patch from davidchan

Review of attachment 632918 [details] [diff] [review]:
-----------------------------------------------------------------

Adding r?nfroyd for the changes to TelemetryHistograms.h. If the error-related histograms are going to use too much memory then we should re-evaluate whether we need the data to be that fine-grained.

Please re-request review with the new patch.

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +923,5 @@
> +      SECKEYPublicKey *pubKey = CERT_ExtractPublicKey(serverCert);
> +      if (pubKey) {
> +        if (rsaKey == SECKEY_GetPublicKeyType(pubKey)) {
> +          mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_KEYMOD,
> +              SECKEY_PublicKeyStrengthInBits(pubKey) - 1023);

AFAICT, this information is available from the structure returned from SSL_GetChannelInfo in authKeyBits, so we should just use that.

@@ +971,5 @@
> +    if (SECSuccess == SSL_GetChannelInfo(fd, &scinfo, sizeof(scinfo))) {
> +      // SSLv2 is no longer supported
> +      // scinfo.protocolVersion should always be >= 0x0300
> +
> +      if(scinfo.protocolVersion == 0x0300 || scinfo.protocolVersion == 0x0301){

if (scinfo.protocolVersion >= 0x0300)

@@ +983,5 @@
> +      // then SSLv2 / informational
> +      // the else clause should not come up much in practice
> +      // experimental and SSLv2 ciphersuites
> +      // http://tools.ietf.org/html/rfc5246#section-12
> +      if (scinfo.cipherSuite <= 0x0096) {

All of these ranges will fit within the range of a single probe. It is much easier to actually use this telemetry if it is all combined into one probe, because our telemetry website doesn't allow us to combine probes in any useful way. So, please combine these all into one probe.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +440,5 @@
>  
>    mCertVerificationEnded = PR_IntervalNow();
>  
> +
> +

Remove this whitespace change.

@@ +462,5 @@
>    mCertVerificationState = after_cert_verification;
> +  
> +  if (IS_SSL_ERROR(errorCode)) { 
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_ERROR, 
> +        errorCode - SSL_ERROR_BASE + 1); 

These must be range-checked to make sure the calculated bucket is within the range of the probe. We can use a MOZ_STATIC_ASSERT to make this check.

You can #define a name for the maximum values of these probes in TelementryHistograms.h

::: security/nss/lib/ssl/sslproto.h
@@ +106,5 @@
>  #define SSL_EN_DES_192_EDE3_CBC_WITH_MD5	0xFF07
>  
> +/* SSL v3 Cipher Suites 
> + * Any addition to this list must be accompanied
> + * by changes to the bucket size in TelemetryHistograms.h

Do not modify this file, as it is part of NSS which isn't supposed to know anything about telemetry.

::: toolkit/components/telemetry/TelemetryHistograms.h
@@ +261,5 @@
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_STANDARD, 0x0097, "SSL negotiated ciphersuite Standard (count)")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_SPEC_REQUIRED, 0x0019, "SSL negotiated ciphersuite SPEC Required (count)")
> +HISTOGRAM_BOOLEAN(SSL_CIPHERSUITE_PRIVATE_USE, "SSL negotiated ciphersuite Private Use (count)")
> +
> +HISTOGRAM_ENUMERATED_VALUES(SSL_KEYMOD, 3074, "Public key modulus (bits)")

IMO, what we really need to know are:

   * Is the key x bits, x-1 bits, or less than x bits
   * what is x?

for x in { 1024,
           1280, 1408, 1536, 1664, 1792, 1920, 
           2048, 3072, 4096, 5120, 6144, 7168,
           8192, 9216, 10240, 11264, 12288 }

(That is, every 1024 bits, except more fine-grained for 1024-2048 bits.)

I would rather see two probes that collect these values, instead of a large probe that collects the data in a more fine-grained fashion. This would make analysis of the data easier given our current tools. And, it would be more memory-efficient.

@@ +264,5 @@
> +
> +HISTOGRAM_ENUMERATED_VALUES(SSL_KEYMOD, 3074, "Public key modulus (bits)")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR, 1001, "SSL Errors")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR_SEC, 1001, "SEC Errors")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR_NSPR, 129, "NSPR Errors")

Make sure there's enough headroom to accommodate ~100 or so new error codes, and make sure we statically check the range.
Attachment #632918 - Flags: review?(bsmith) → review?(nfroyd)
If we add new error codes, won't we need a new histogram? We can't reuse the existing histograms right?
(In reply to Devdatta Akhawe from comment #22)
> If we add new error codes, won't we need a new histogram? We can't reuse the
> existing histograms right?

That's correct.  But you can do something like (Brian's suggestion from comment 21):

#define TELEMETRY_SSL_ERROR_NSPR_BUCKETS 250
HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR_NSPR, TELEMETRY_SSL_ERROR_NSPR_BUCKETS, "NSPR Errors");

and then statically assert somewhere that TELEMETRY_SSL_ERROR_NSPR_BUCKETS is more than the max NSPR error code you'd like to record.  That way you don't have to change the histogram every time NSPR adds a new error code.  You have to change it eventually, of course, but hopefully not for a good long while.
Comment on attachment 632918 [details] [diff] [review]
Implemented all telemetry based on the original patch from davidchan

Review of attachment 632918 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +973,5 @@
> +      // scinfo.protocolVersion should always be >= 0x0300
> +
> +      if(scinfo.protocolVersion == 0x0300 || scinfo.protocolVersion == 0x0301){
> +      mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_VERSION,
> +          scinfo.protocolVersion - SSL_LIBRARY_VERSION_3_0 + 1);

No need for the +1.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +462,5 @@
>    mCertVerificationState = after_cert_verification;
> +  
> +  if (IS_SSL_ERROR(errorCode)) { 
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_ERROR, 
> +        errorCode - SSL_ERROR_BASE + 1); 

No need to +1 here and elsewhere; we can handle values of 0 just fine in Telemetry.

Note that this can be just Telemetry::Accumulate.  Here and elsewhere.

@@ +2414,4 @@
>    key = nsDependentCString(host) + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port);
>  
>    if (nsSSLIOLayerHelpers::isKnownAsIntolerantSite(key)) {
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_TLS_1_0_INTOLERANT,1);

You should move this check out of the |if|.  The way it's written here means that you only ever accumulate true values into the histogram, which won't be all that helpful for data analysis.

Also, nit: space after comma.

::: toolkit/components/telemetry/TelemetryHistograms.h
@@ +254,5 @@
> +
> +/*
> + * SSL/TLS Telemetry
> + */
> +HISTOGRAM(SSL_TLS_1_0_INTOLERANT, 0, 1, 2, BOOLEAN, "SSL/TLS Intolerant Sites (count)")

Please use HISTOGRAM_BOOLEAN here.  The description seems a little strange, perhaps "Whether an https site was SSL/TLS Intolerant" instead?  I know very little about the guts of SSL/TLS, so feel free to modify to your liking.

@@ +259,5 @@
> +HISTOGRAM_ENUMERATED_VALUES(SSL_VERSION, 2, "SSL/TLS Version")
> +
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_STANDARD, 0x0097, "SSL negotiated ciphersuite Standard (count)")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_SPEC_REQUIRED, 0x0019, "SSL negotiated ciphersuite SPEC Required (count)")
> +HISTOGRAM_BOOLEAN(SSL_CIPHERSUITE_PRIVATE_USE, "SSL negotiated ciphersuite Private Use (count)")

Are these really counts?  Just looking at them it doesn't look like it.  Please fix the description strings.

@@ +261,5 @@
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_STANDARD, 0x0097, "SSL negotiated ciphersuite Standard (count)")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_CIPHERSUITE_SPEC_REQUIRED, 0x0019, "SSL negotiated ciphersuite SPEC Required (count)")
> +HISTOGRAM_BOOLEAN(SSL_CIPHERSUITE_PRIVATE_USE, "SSL negotiated ciphersuite Private Use (count)")
> +
> +HISTOGRAM_ENUMERATED_VALUES(SSL_KEYMOD, 3074, "Public key modulus (bits)")

To add to what Brian said: Unfortunately, we don't have a good way of collecting data over sparse sets of values like "# of bits in public key moduli" directly.  What we usually do in cases like this is map the sparse data to a dense set and collect a HISTOGRAM_ENUMERATED_VALUES.  So something like:

HISTOGRAM_ENUMERATED_VALUES(SSL_KEYMOD, 18, "Public key modulus (bits)")

with 0 representing 1024, 1 representing 1280, etc. etc.

Naming nit: telemetry probes tend to have fairly descriptive names, so something like SSL_PUBLIC_KEY_MODULUS_N_BITS or similar would be better.

@@ +263,5 @@
> +HISTOGRAM_BOOLEAN(SSL_CIPHERSUITE_PRIVATE_USE, "SSL negotiated ciphersuite Private Use (count)")
> +
> +HISTOGRAM_ENUMERATED_VALUES(SSL_KEYMOD, 3074, "Public key modulus (bits)")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR, 1001, "SSL Errors")
> +HISTOGRAM_ENUMERATED_VALUES(SSL_ERROR_SEC, 1001, "SEC Errors")

These upper values are about 5-10x bigger than the current range of errors they're using to record. That may be a tad excessive, depending on how aggressively NSS adds error codes. I will leave the final call to Brian, since he knows more about NSS and its development than I do, but I'd cut these down to ~300 or so.

Again, slightly more descriptive names here and for SSL_ERROR_NSPR would be better. Also the description strings could be, well, more descriptive. :)
Attachment #632918 - Flags: review?(nfroyd) → feedback+
(In reply to Nathan Froyd (:froydnj) from comment #23)
> (In reply to Devdatta Akhawe from comment #22)
> > If we add new error codes, won't we need a new histogram? We can't reuse the
> > existing histograms right?
> 
> That's correct.  But you can do something like (Brian's suggestion from
> comment 21):

My concern was different. Lets say right now the error code we have is ERROR_FOOBAR, and tomorrow we decide to separate it out into 2 error codes ERROR_FOO and ERROR_BAR. While we deploy this, there will be a period of time in which some users have upgraded, some who haven't. Some users who will report to us "we saw ERROR_FOOBAR" and some who will report to us "I saw ERROR_FOO". Getting correct data out of this will be impossible, right?

The trick of leaving extra space in the histogram works if the new values are completely unrelated. But in the case of SSL errors, the new values would, if I am not wrong, be values caused by being more specific for existing errors. In such a scenario, leaving extra space doesn't work, right?
(In reply to Devdatta Akhawe from comment #25)
> My concern was different. Lets say right now the error code we have is
> ERROR_FOOBAR, and tomorrow we decide to separate it out into 2 error codes
> ERROR_FOO and ERROR_BAR. While we deploy this, there will be a period of
> time in which some users have upgraded, some who haven't. Some users who
> will report to us "we saw ERROR_FOOBAR" and some who will report to us "I
> saw ERROR_FOO". Getting correct data out of this will be impossible, right?

You're right: the scenario you describe does cause problems, no way around it.  You can separate out histogram information by build id in the telemetry website (dashboard), so you could look at all data before the change landed and after the change landed.  And if you want to do analysis on the raw data, you can certainly ask the Metrics team to divide up thee data as you wish beforehand.

> The trick of leaving extra space in the histogram works if the new values
> are completely unrelated. But in the case of SSL errors, the new values
> would, if I am not wrong, be values caused by being more specific for
> existing errors. In such a scenario, leaving extra space doesn't work, right?

Presumably you'd need some extra space to record ERROR_BAR once you split it out; the extra space would save you from having to immediately change the histogram definition.  So extra space is at least partially helpful.

As only an outside observer, it seems that NSS changes slowly enough that recycled error codes won't be a problem in the best case and a minor problem in the worst case.
Assignee: dev.akhawe+mozilla → nobody
Assignee: nobody → mhamrick
Component: Telemetry → Security: PSM
Product: Toolkit → Core
Summary: Implement SSL certificate telemetry → Implement SSL certificate and cipher suite telemetry
Whiteboard: [privacy:sstamm]
Target Milestone: --- → mozilla25
It is important to collect at least the cipher suite telemetry and uplift it to mozilla-aurora and mozilla-beta ASAP so that we can start deciding whether we can turn off some never-used-in-practice cipher suites to deal with the compatibility issues mentioned in bug 861266.
Priority: -- → P1
Attachment #632854 - Flags: review?(brian)
Comment on attachment 632918 [details] [diff] [review]
Implemented all telemetry based on the original patch from davidchan

Review of attachment 632918 [details] [diff] [review]:
-----------------------------------------------------------------

I already reviewed this version of the patch but somehow my r- didn't come through.
Attachment #632918 - Flags: review-
(In reply to Brian Smith (:briansmith), was bsmith@mozilla.com (:bsmith) from comment #27)
> It is important to collect at least the cipher suite telemetry and uplift it
> to mozilla-aurora and mozilla-beta ASAP so that we can start deciding
> whether we can turn off some never-used-in-practice cipher suites to deal
> with the compatibility issues mentioned in bug 861266.

I think we can get some stats from Qualys SSL Labs on this.

They don't have publicly available cipher suite lists, but I imagine they collect at least the number of sites supporting each one.

https://www.ssllabs.com/index.html
https://www.trustworthyinternet.org/ssl-pulse/
I second the Qualys suggestion. FWIW, we at Berkeley also shared some data with Brian, collected via the ICSI notary http://notary.icsi.berkeley.edu/

I think the data collected by Qualys/ICSI is more useful than telemetry. Telemetry is biased towards the ciphers currently implemented by Firefox, and will not tell us which sites would continue working (or stop working) with a different ciphersuite. Finally, I suspect that per-connection telemetry will end up just telling you what Facebook, Google, and Twitter use.
Attached patch gather TLS Telemetry (obsolete) — Splinter Review
turns out it was slightly more than just rebasing. SSL_VERSION not needed since we're already collecting SSL_HANDSHAKE_VERSION.
Attachment #632918 - Attachment is obsolete: true
Attachment #782765 - Flags: review?(brian)
Comment on attachment 782765 [details] [diff] [review]
gather TLS Telemetry

Review of attachment 782765 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +1057,5 @@
>    }
>  
>    ScopedCERTCertificate serverCert(SSL_PeerCertificate(fd));
>  
> +  if (serverCert) {

You can assume that SSL_PeerCertificate will always succeed at this point in the handshake.

@@ +1058,5 @@
>  
>    ScopedCERTCertificate serverCert(SSL_PeerCertificate(fd));
>  
> +  if (serverCert) {
> +    nsRefPtr<nsNSSCertificate> nssc = nsNSSCertificate::Create(serverCert);

1. In PSM, please use RefPtr<> instead of nsRefPtr. Sorry for the confusion.

2. You do not use this nssc variable for anything so you can just remove it.

@@ +1067,5 @@
> +        mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_KEYMOD,
> +          SECKEY_PublicKeyStrengthInBits(pubKey));
> +      }
> +
> +      SECKEY_DestroyPublicKey(pubKey);

1. Prefer to use the ScopedXXX smart pointers (in this case, ScopedSECKEYPublicKey) from ScopedNSSTypes.h over bare pointers.

2. Since this patch was originally written, there was a change to use SSL_GetChannelInfo. You can use the result of the call to SSL_GetChannelInfo to get the certificate key bits: channelInfo.authKeyBits. That would be more efficient.

@@ +1137,5 @@
> +
> +      if (cipherInfo.cipherSuite <= 0x0096) {
> +        mozilla::Telemetry::Accumulate(
> +          mozilla::Telemetry::SSL_CIPHERSUITE_STANDARD,
> +          cipherInfo.cipherSuite + 1 );

Please address the previous review comments in this bug. In particular, there should be no "+ 1" on any of these telemetry calls.

nit: remove the extra space before the parenthesis.

@@ +1142,5 @@
> +      } else if (cipherInfo.cipherSuite > 0xC000 &&
> +                 cipherInfo.cipherSuite <= 0xC019) {
> +        mozilla::Telemetry::Accumulate(
> +          mozilla::Telemetry::SSL_CIPHERSUITE_SPEC_REQUIRED,
> +          cipherInfo.cipherSuite - 0xC000 );

nit: remove the extra space before the parenthesis.

@@ +1144,5 @@
> +        mozilla::Telemetry::Accumulate(
> +          mozilla::Telemetry::SSL_CIPHERSUITE_SPEC_REQUIRED,
> +          cipherInfo.cipherSuite - 0xC000 );
> +      } else {
> +        //Not sure why we don't just use a boolean here, but whatever.

Please remove this comment. I think there is no way to use a boolean in Telemetry.

@@ +1146,5 @@
> +          cipherInfo.cipherSuite - 0xC000 );
> +      } else {
> +        //Not sure why we don't just use a boolean here, but whatever.
> +        mozilla::Telemetry::Accumulate(
> +          mozilla::Telemetry::SSL_CIPHERSUITE_PRIVATE_USE,1);

nit: add a space before the 1.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +550,5 @@
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_ERROR_SEC, 
> +                                   errorCode - SEC_ERROR_BASE + 1 ); 
> +  } else { 
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_ERROR_NSPR,
> +                                   errorCode - PR_NSPR_ERROR_BASE + 1); 

Besides removing all the "+ 1", please remove the trailing whitespace. (You can see it in splinter, or you can see it if you enable the "color" hg extension when you view your patches through hg.)

@@ +2627,5 @@
>    nsAutoCString key;
>    key = nsDependentCString(host) + NS_LITERAL_CSTRING(":") + nsPrintfCString("%d", port);
>  
>    if (infoObject->SharedState().IOLayerHelpers().isKnownAsIntolerantSite(key)) {
> +    mozilla::Telemetry::Accumulate(mozilla::Telemetry::SSL_TLS_1_0_INTOLERANT,1);

1. nit: space before the 1.

2. This will need to be redone on top of your patch for bug 839310.

::: toolkit/components/telemetry/Histograms.json
@@ +896,5 @@
>      "n_values": 16,
>      "description": "SSL Handshake Key Exchange Algorithm (null=0, rsa=1, dh=2, fortezza=3, ecdh=4)"
>    },
> +  "SSL_TLS_1_0_INTOLERANT": {
> +    "description":"SSL/TLS Intolerant Sites (count)",

Nit: space after the colon (here and elsewhere).

I suspect that this probe will need to be redone to take into consideration your patch for bug 839310.

@@ +901,5 @@
> +    "kind": "boolean"
> +  },
> +  "SSL_CIPHERSUITE_STANDARD": {
> +    "description":"SSL negotiated ciphersuite Standard (count)",
> +    "n_values": 16,

this value of n_values seems too low. Shouldn't it be 0x0096?

@@ +906,5 @@
> +    "kind":"enumerated"
> +  },
> +  "SSL_CIPHERSUITE_SPEC_REQUIRED": {
> +    "description":"SSL negotiated ciphersuite SPEC Required (count)",
> +    "n_values": 16,

Similarly, it seems like this n_values should be larger to reflect the range of possible values.

@@ +911,5 @@
> +    "kind":"enumerated"
> +  },
> +  "SSL_CIPHERSUITE_PRIVATE_USE": {
> +    "description":"SSL negotiated ciphersuite Private Use (count)",
> +    "n_values": 16,

It seems like this one only needs a range of "1".

@@ +918,5 @@
> +  "SSL_KEYMOD": {
> +    "description": "Public key modulus (bits)",
> +    "kind": "linear",
> +    "high": "3072",
> +    "low": "1024",

Please see the previous discussion in the bug about why this doesn't work well.

I suggest factoring out the SSL_KEYMOD part into a separate patch.

@@ +933,5 @@
> +    "kind":"enumerated"
> +  },
> +  "SSL_ERROR_NSPR": {
> +    "description":"NSPR Errors",
> +    "n_values": 16,

Ditto the whitespace and range issues.
Attachment #782765 - Flags: review?(brian) → review-
Assignee: mhamrick → nobody
No longer blocks: 861266
This did not make version 25 and is assigned to nobody. 

Friendly reminder: When this lands, check with the outcome of Bug 742500 and set a far-future release for expiration (like Firefox 35 or 40, or never expire).
Status: ASSIGNED → NEW
Target Milestone: mozilla25 → ---
Assignee: nobody → brian
Status: NEW → ASSIGNED
Summary: Implement SSL certificate and cipher suite telemetry → Implement SSL certificate, cipher suite, and version negotiation telemetry
Target Milestone: --- → mozilla28
https://tbpl.mozilla.org/?tree=Try&rev=6690e77bed6c

BAD_RECORD_MAC intolerance:
      https://www.web-secured.com/
      https://nmusd-ca.schoolloop.com/

EOF/RST intolerance:
      https://employment.en-japan.com
      https://employment.en-japan.com/signup/
      https://www.wakasa.jp/shopping/cart/
      https://www.billpaysite.com/
      https://www.basingstokeleisure.com/

Examples for cipher suite coverage:
      The above URLs
      https://google.com
      https://twitter.com
      https://facebook.com

I recommend you load all the above URLs with telemetry enabled and then go to about:telemetry to view the histograms, to verify that things seem to be counted sensibly.
Attachment #782765 - Attachment is obsolete: true
Attachment #829986 - Flags: review?(dkeeler)
Attachment #829983 - Attachment description: add-telemetry-for-tls-intolerance.patch → Part 1: add-telemetry-for-tls-intolerance.patch
I forgot that 768 and 1536 are also interesting key sizes, for DHE.
Attachment #829986 - Attachment is obsolete: true
Attachment #829986 - Flags: review?(dkeeler)
Attachment #829990 - Flags: review?(dkeeler)
Attachment #829990 - Attachment is obsolete: true
Attachment #829990 - Flags: review?(dkeeler)
Attachment #829991 - Flags: review?(dkeeler)
Try run looks good: https://tbpl.mozilla.org/?tree=Try&rev=be0cfeb436a6

Note that I added the tests for bug 936818 to the try run, to help verify that my refactoring didn't break the fallback logic. But, I worry that bug 936818 will not land for a while due to the issues with the review of the NSS bug it depends on, so we shouldn't get landing these changes on landing bug 936818.

Note that I'm intending to uplift the patches in this bug to mozilla-aurora and mozilla-beta ASAP so that we can start the collecting numbers ahead of the TLS 1.2 and cipher suite changes.
Comment on attachment 829983 [details] [diff] [review]
Part 1: add-telemetry-for-tls-intolerance.patch

Review of attachment 829983 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits addressed
(also, fyi, the comment for this patch seems to have been copied over from the second patch)

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +914,5 @@
>  
> +bool
> +retryDueToTLSIntolerance(PRErrorCode err, nsNSSSocketInfo* socketInfo)
> +{
> +  SSLVersionRange range = socketInfo->GetTLSVersionRange();

nit: move this closer to where it is used

@@ +920,5 @@
> +  // This function is supposed to decide which error codes should
> +  // be used to conclude server is TLS intolerant.
> +  // Note this only happens during the initial SSL handshake.
> +
> +  unsigned int reason;

uint32_t

@@ +1005,5 @@
>                         PRFileDesc* ssl_layer_fd,
>                         nsNSSSocketInfo *socketInfo)
>  {
> +  PRErrorCode err = PR_GetError();
> +  

nit: whitespace
Attachment #829983 - Flags: review?(dkeeler) → review+
Comment on attachment 829991 [details] [diff] [review]
Part 2 [v3]: add-telemetry-for-cipher-suites-and-key-sizes.patch

Review of attachment 829991 [details] [diff] [review]:
-----------------------------------------------------------------

I tried this out on the sites you indicated, and it seems to do what we expect (although some of those sites aren't TLS intolerant anymore, it looks like).
One thing that surprised me, though, was that telemetry seems to reset across application restarts. Looks like it's a feature of telemetry, though, not this implementation.
r=me with nits addressed.

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +1033,5 @@
> +                     : bits <  3072 ? 13 : bits ==  3072 ? 14
> +                     : bits <  4096 ? 15 : bits ==  4096 ? 16
> +                     : bits <  8192 ? 17 : bits ==  8192 ? 18
> +                     : bits < 16384 ? 19 : bits == 16384 ? 20
> +                     : 0;

I think it's a little strange that the > 16384 bucket is mapped to 0, but I imagine it will be a long time before we see those kinds of keys, if at all (in fact, are keys limited to that size?)

@@ +1168,5 @@
>      unsigned int versionEnum = channelInfo.protocolVersion & 0xFF;
>      Telemetry::Accumulate(Telemetry::SSL_HANDSHAKE_VERSION, versionEnum);
>      infoObject->SetNegotiatedTLSVersion(channelInfo.protocolVersion);
>  
> +    unsigned int csEnum;

uint32_t
Also, maybe csProbeValue or csBucket something? I don't think csEnum describes what this variable represents.

::: security/manager/ssl/src/nsNSSComponent.cpp
@@ +830,5 @@
>    long id;
>    bool enabledByDefault;
>  } CipherPref;
>  
> +// Update the switch statement in HandshakeCallback in nsNSSCallBacks when you

nsNSSCallbacks
Attachment #829991 - Flags: review?(dkeeler) → review+
Comment on attachment 829991 [details] [diff] [review]
Part 2 [v3]: add-telemetry-for-cipher-suites-and-key-sizes.patch

Review of attachment 829991 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the review.

::: security/manager/ssl/src/nsNSSCallbacks.cpp
@@ +1033,5 @@
> +                     : bits <  3072 ? 13 : bits ==  3072 ? 14
> +                     : bits <  4096 ? 15 : bits ==  4096 ? 16
> +                     : bits <  8192 ? 17 : bits ==  8192 ? 18
> +                     : bits < 16384 ? 19 : bits == 16384 ? 20
> +                     : 0;

IIUC, in telemetry the value zero is used for "out of range value" which is what >16384 would be. NSS limits RSA, DH, and DSA keys to 16,384 (or maybe even less).
I think we should be a little more careful to make sure we set the correct error code. Please review this change. (Hopefully the Bugzilla interdiff tool will work for this.)
Attachment #829983 - Attachment is obsolete: true
Attachment #830531 - Flags: review?(dkeeler)
(In reply to Brian Smith (:briansmith, was :bsmith; NEEDINFO me if you want a response) from comment #42)
> (Hopefully the Bugzilla interdiff tool will work for this.)

It does:
https://bugzilla.mozilla.org/attachment.cgi?oldid=829983&action=interdiff&newid=830531&headers=1
Comment on attachment 830531 [details] [diff] [review]
Part 1: add-telemetry-for-tls-intolerance.patch

Review of attachment 830531 [details] [diff] [review]:
-----------------------------------------------------------------

I don't understand these changes. What is the problem, and how does this fix it?

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +1033,3 @@
>      if (handleHandshakeResultNow) {
>        if (PR_WOULD_BLOCK_ERROR == err) {
> +        PR_SetError(err, 0);

But if this already was the error, why do we need to set it again?

@@ +1081,5 @@
>      socketInfo->SetHandshakeNotPending();
>    }
>    
> +  if (bytesTransfered < 0) {
> +    // restore old error code

I think this comment is a bit misleading - we might be setting the error code to PR_CONNECT_RESET_ERROR, when it wasn't this before. In fact, that seems to be the entire point, right? We're only setting the error to PR_CONNECT_RESET_ERROR in certain cases, even if we want to retry?
Attachment #830531 - Flags: review?(dkeeler) → review-
(In reply to David Keeler (:keeler) from comment #44)
> I don't understand these changes. What is the problem, and how does this fix
> it?

With NSPR error code handling, you should assume by default that any function you call may have called PR_SetError to change the error code reported by PR_GetError. There is an exception for "destructor-like" functions like CERT_DestroyCertificate.

In particular, this function should assume that socketInfo->IsHandshakePending() may have called PR_SetError().

> ::: security/manager/ssl/src/nsNSSIOLayer.cpp
> @@ +1033,3 @@
> >      if (handleHandshakeResultNow) {
> >        if (PR_WOULD_BLOCK_ERROR == err) {
> > +        PR_SetError(err, 0);
> 
> But if this already was the error, why do we need to set it again?

Because socketInfo->IsHandshakePending() may have changed it, in theory. Defensive programming here.

> @@ +1081,5 @@
> >      socketInfo->SetHandshakeNotPending();
> >    }
> >    
> > +  if (bytesTransfered < 0) {
> > +    // restore old error code
> 
> I think this comment is a bit misleading - we might be setting the error
> code to PR_CONNECT_RESET_ERROR, when it wasn't this before. In fact, that
> seems to be the entire point, right? We're only setting the error to
> PR_CONNECT_RESET_ERROR in certain cases, even if we want to retry?

I agree. I will remove the comment in the next revision.
Attachment #830531 - Attachment is obsolete: true
Attachment #831276 - Flags: review?(dkeeler)
Comment on attachment 831276 [details] [diff] [review]
add-telemetry-for-tls-intolerance.patch

Review of attachment 831276 [details] [diff] [review]:
-----------------------------------------------------------------

Ok - I believe I understand this patch now, and it looks correct.

::: security/manager/ssl/src/nsNSSIOLayer.cpp
@@ +1033,3 @@
>      if (handleHandshakeResultNow) {
>        if (PR_WOULD_BLOCK_ERROR == err) {
> +        PR_SetError(err, 0);

Ok - makes sense.
Attachment #831276 - Flags: review?(dkeeler) → review+
Comment on attachment 829991 [details] [diff] [review]
Part 2 [v3]: add-telemetry-for-cipher-suites-and-key-sizes.patch

[Approval Request Comment]
User impact if declined: Higher risk of regressions caused by TLS-related changes coming in Firefox 27 and 28.
Testing completed (on m-c, etc.): Landed on mozilla-central yesterday.
Risk to taking this patch (and alternatives if risky): Very low risk. This is just adding telemetry, with a small amount of code reorganization. It isn't changing the semantics of anything.
String or IDL/UUID changes made by this patch: None
Attachment #829991 - Flags: approval-mozilla-beta?
Attachment #829991 - Flags: approval-mozilla-aurora?
Comment on attachment 831276 [details] [diff] [review]
add-telemetry-for-tls-intolerance.patch

[Approval Request Comment]
See comment 50.
Attachment #831276 - Flags: approval-mozilla-beta?
Attachment #831276 - Flags: approval-mozilla-aurora?
Comment on attachment 829991 [details] [diff] [review]
Part 2 [v3]: add-telemetry-for-cipher-suites-and-key-sizes.patch

Let's get these into tomorrow's beta.
Attachment #829991 - Flags: approval-mozilla-beta?
Attachment #829991 - Flags: approval-mozilla-beta+
Attachment #829991 - Flags: approval-mozilla-aurora?
Attachment #829991 - Flags: approval-mozilla-aurora+
Attachment #831276 - Flags: approval-mozilla-beta?
Attachment #831276 - Flags: approval-mozilla-beta+
Attachment #831276 - Flags: approval-mozilla-aurora?
Attachment #831276 - Flags: approval-mozilla-aurora+
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.