Closed Bug 799267 Opened 12 years ago Closed 11 years ago

Add telemetry for CertVerification

Categories

(Core :: Security: PSM, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla20

People

(Reporter: cviecco, Assigned: cviecco)

Details

Attachments

(2 files, 10 obsolete files)

11.14 KB, patch
briansmith
: review+
Details | Diff | Splinter Review
11.16 KB, patch
Details | Diff | Splinter Review
We need to add telemetry to PSM in order to determine effects of PKIX enabling
Assignee: nobody → cviecco
Attachment #669311 - Flags: review?(bsmith)
Comment on attachment 669311 [details] [diff] [review]
implementation v1 (time to check + is libpkix enabled)

Review of attachment 669311 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +631,5 @@
>  }
>  
>  SECStatus
>  PSM_SSL_PKIX_AuthCertificate(CERTCertificate *peerCert, void * pinarg,
>                               const char * hostname)

We cannot use telemetry in this function without locks because it may execute on any of 5 threads in the pool, and telemetry is not thread-safe. We need to record beginTime before we dispatch to the thread pool.

::: toolkit/components/telemetry/Histograms.json
@@ +2315,5 @@
> +  "SECURITY_NSS_VERIFICATION_FIRST_CONNECTION" : {
> +    "kind": "exponential",
> +    "high": "10000",
> +    "n_buckets": 50,
> +    "description": "Time spent on NSS doing a successful verification for a SSL connection  (ms)"

I think you expect that we can look at the data two-dimensionally across two different probes, but we cannot. We need to have a problem "time when libpkix is enabled" and another one "time when libpkix is disabled."

Sadly, 10,000ms may not be enough time. Please increase it to 100,000ms.

"FIRST_CONNECTION" is a bad name because it doesn't consider the fact that we do certificate validation during renegotiation, and that we sometimes do full handshakes even after the first connection (e.g. if the server doesn't support session resumption and/or if the session has expired).

I recommend using the prefix SSL_CERT, and naming the probes SSL_CERT_VALIDATION_TIME_LIBPKIX and SSL_CERT_VALIDATION_TIME_CLASSIC.
Attachment #669311 - Flags: review?(bsmith) → review-
Comment on attachment 669311 [details] [diff] [review]
implementation v1 (time to check + is libpkix enabled)

Review of attachment 669311 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +631,5 @@
>  }
>  
>  SECStatus
>  PSM_SSL_PKIX_AuthCertificate(CERTCertificate *peerCert, void * pinarg,
>                               const char * hostname)

On the mutex OK.
If we change the begin time, then shall we include the comparison to the domain name too?

::: toolkit/components/telemetry/Histograms.json
@@ +2315,5 @@
> +  "SECURITY_NSS_VERIFICATION_FIRST_CONNECTION" : {
> +    "kind": "exponential",
> +    "high": "10000",
> +    "n_buckets": 50,
> +    "description": "Time spent on NSS doing a successful verification for a SSL connection  (ms)"

Ok Agreed
Comment on attachment 669311 [details] [diff] [review]
implementation v1 (time to check + is libpkix enabled)

Review of attachment 669311 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +631,5 @@
>  }
>  
>  SECStatus
>  PSM_SSL_PKIX_AuthCertificate(CERTCertificate *peerCert, void * pinarg,
>                               const char * hostname)

It is OK to start measuring right before the call to SSLServerCertVerificationJob::Dispatch. You can stop measuring in SetCertVerificationResult.

It would be good if you could add a third dimension to the timing: whether or not we had a failure that caused us to take the error path that checks for overrides, etc. So, we'd have these probes:

Certificate validated successfully using classic
Certificate validated successfully using libpkix
Certificate (initially) failed to validate using classic
Certificate (initially) failed to validate using libpkix.

My hypothesis is that the error cases take significantly longer than the non-error cases. We are not really concerned about how slow the error cases are.
Attachment #669311 - Attachment is obsolete: true
Attachment #669716 - Attachment is obsolete: true
Attachment #669722 - Flags: review?(bsmith)
Comment on attachment 669722 [details] [diff] [review]
implementaiton v2.1 (separate histograms + mutex)

Did not notice a comment sent from brian regaring when to stop measuring. Need a new revision of this.
Attachment #669722 - Flags: review?(bsmith)
I have the 'feeling' that we should move the failure code to the certerrorrunnable. But this one works as advertised
Attachment #669722 - Attachment is obsolete: true
Attachment #669769 - Flags: review?(bsmith)
Comment on attachment 669769 [details] [diff] [review]
implementaiton v3.0 (separate histograms + mutex)

Review of attachment 669769 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +111,5 @@
>  #include "nsComponentManagerUtils.h"
>  #include "nsServiceManagerUtils.h"
>  #include "nsIConsoleService.h"
>  #include "PSMRunnable.h"
> +#include <prlock.h>

why?

@@ +135,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static mozilla::Mutex SSLVerificationTelemetryMutex("SSLVerificationMutex");

We cannot create a static mutex like this because we must avoid static initializers. See the comment 4 lines above.

@@ +1044,5 @@
>      // Reset the error code here so we can detect if AuthCertificate fails to
>      // set the error code if/when it fails.
>      PR_SetError(0, 0); 
>      SECStatus rv = AuthCertificate(mInfoObject, mCert);
> +    uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());

Instead of calculating the interval yourself, use Telemetry::AccumulateTimeDelta, to avoid the messy casting and math.

@@ +1057,3 @@
>        nsRefPtr<SSLServerCertVerificationResult> restart 
>          = new SSLServerCertVerificationResult(mInfoObject, 0);
>        restart->Dispatch();

Why not pass the Telemetry::ID and interval to the SSLServerCertVerificationResult so that we can avoid the mutex altogether, by calling Telemetry::Accumulate in nsNSSSocketInfo::SetCertVerificationResult, which is always guaranteed to run on the same thread? I'm a little nervous about the performance ramifications of adding a lock here just for telemetry, especially when we're trying to make this code as fast as possible.

@@ +1065,5 @@
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX, interval);
> +      }
> +      else{
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC, interval);

Here, it is OK to acquire a mutex if it helps, because this is the error path, where performance isn't as much of a factor.

::: toolkit/components/telemetry/Histograms.json
@@ +2311,5 @@
> +  "SSL_SUCCESFUL_CERT_VALIDATION_TIME_LIBPKIX" : {
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 50,
> +    "description": "Time spent on a successful cert verification using LIBPKIX SSL (ms)"

s/using LIBPKIX SSL/in libpix mode/.

@@ +2317,5 @@
> +  "SSL_SUCCESFUL_CERT_VALIDATION_TIME_CLASSIC" : {
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 50,
> +    "description": "Time spent on a successful cert  verification using 'Classic 'SSL (ms)"

s/using 'Classic 'SSL/in classic mode/

@@ +2323,5 @@
> +  "SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX" : {
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 50,
> +    "description": "Time spent on an initially failed cert verification using LIBPKIX SSL (ms)"

ditto.

@@ +2329,5 @@
> +  "SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC" : {
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 50,
> +    "description": "Time spent on an initially failed cert  verification using 'Classic 'SSL (ms)"

ditto.
Attachment #669769 - Flags: review?(bsmith) → review+
Comment on attachment 669769 [details] [diff] [review]
implementaiton v3.0 (separate histograms + mutex)

Review of attachment 669769 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry. Getting close though!
Attachment #669769 - Flags: review+ → review-
Comment on attachment 669769 [details] [diff] [review]
implementaiton v3.0 (separate histograms + mutex)

Review of attachment 669769 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +111,5 @@
>  #include "nsComponentManagerUtils.h"
>  #include "nsServiceManagerUtils.h"
>  #include "nsIConsoleService.h"
>  #include "PSMRunnable.h"
> +#include <prlock.h>

Because I forgot to remove it.... removed now.

@@ +135,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static mozilla::Mutex SSLVerificationTelemetryMutex("SSLVerificationMutex");

Agreed. Moved init to InitializeSSLServerCertVerificationThreads.

@@ +1044,5 @@
>      // Reset the error code here so we can detect if AuthCertificate fails to
>      // set the error code if/when it fails.
>      PR_SetError(0, 0); 
>      SECStatus rv = AuthCertificate(mInfoObject, mCert);
> +    uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());

Agreed.

@@ +1057,3 @@
>        nsRefPtr<SSLServerCertVerificationResult> restart 
>          = new SSLServerCertVerificationResult(mInfoObject, 0);
>        restart->Dispatch();

This is a lock on a very small piece of code. I dont expect much contention. 
And we are already doing a bunch of crypto and well implemented locks without contention should take in the order of 10's of nanosecs (a <20 asm instructions).

Also it seems very messy to use locks sometimes yes and sometimes not. The mutex is much more clear (I think). But it can be done without locks by doing the accumulate in the socket transport thread.
Attachment #669769 - Attachment is obsolete: true
Attachment #670926 - Flags: review?(bsmith)
Comment on attachment 670926 [details] [diff] [review]
implementaiton v4.0 mutex only for error case.

Review of attachment 670926 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +134,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static mozilla::Mutex *SSLVerificationTelemetryMutex = NULL;

1. Instead of static, put it in the unnamed namespace two lines before.

2. Use the gXXX naming convention from the Mozilla Coding Style guide (gCertVerficationTelemetryMutex).

3. s/NULL/nullptr/.

i.e. make it look like line 135.

@@ +627,5 @@
>    : mFdForLogging(fdForLogging)
>    , mInfoObject(infoObject)
>    , mCert(CERT_DupCertificate(cert))
>  {
> +   mJobStartTime=TimeStamp::Now();

Add a comment or an assertion about being on the socket transport service thread.

@@ +1044,5 @@
>      // Reset the error code here so we can detect if AuthCertificate fails to
>      // set the error code if/when it fails.
>      PR_SetError(0, 0); 
>      SECStatus rv = AuthCertificate(mInfoObject, mCert);
> +    uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());

When a function calls SECStatus, you must be very careful about calling any function between then and the time you call PR_GetError. You must either avoid calling any other functions, or you must call PR_GetError() and save the result into a variable for later use.

@@ +1049,2 @@
>      if (rv == SECSuccess) {
> +      MutexAutoLock telemetryMutex(*SSLVerificationTelemetryMutex);

Either the patch description ("implementaiton v4.0 mutex only for error case") is wrong, or this code is wrong.

Note that you could just pass mJobStartTime, TimeStamp::Now(), and nsNSSComponent::globalConstFlagUsePKIXVerification to the SSLServerCertVerificationResult constructor, and then call Telemetry::AccumulateTimeDelta from SSLServerCertVerificationResult::Run, without using the lock.

Also, notice the code formatting.

@@ +1058,5 @@
>          = new SSLServerCertVerificationResult(mInfoObject, 0);
>        restart->Dispatch();
>        return NS_OK;
>      }
> +    //The telemetry is on its own block to limit the size of the lock contention

This comment is unnecessary because we all know this code pattern very well.

::: toolkit/components/telemetry/Histograms.json
@@ +2317,5 @@
> +  "SSL_SUCCESFUL_CERT_VALIDATION_TIME_CLASSIC" : {
> +    "kind": "exponential",
> +    "high": "100000",
> +    "n_buckets": 50,
> +    "description": "Time spent on a successful cert  verification in classic mode (ms)"

s/  / /
Attachment #670926 - Flags: review?(bsmith) → review-
Comment on attachment 670971 [details] [diff] [review]
implementaiton v4.1 (the real one) mutex only for error case.

Review of attachment 670971 [details] [diff] [review]:
-----------------------------------------------------------------

Getting close now.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +134,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static mozilla::Mutex *SSLVerificationTelemetryMutex = NULL;

See previous comment about this.

@@ +621,5 @@
>  
>    const void * const mFdForLogging;
>    const nsRefPtr<TransportSecurityInfo> mInfoObject;
>    CERTCertificate * const mCert;
> +  TimeStamp mJobStartTime;

const

@@ +629,5 @@
>      const void * fdForLogging, TransportSecurityInfo * infoObject,
>      CERTCertificate * cert)
>    : mFdForLogging(fdForLogging)
>    , mInfoObject(infoObject)
>    , mCert(CERT_DupCertificate(cert))

, mJobStartTime(TimeStamp::Now())

@@ +1048,5 @@
>      // Reset the error code here so we can detect if AuthCertificate fails to
>      // set the error code if/when it fails.
>      PR_SetError(0, 0); 
>      SECStatus rv = AuthCertificate(mInfoObject, mCert);
> +    uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());

See previous comment about executing code between a function that returns SECStatus and the call to PR_GetError().

@@ +1063,4 @@
>        restart->Dispatch();
>        return NS_OK;
>      }
> +    //The telemetry is on its own block to limit the size of the lock contention

Comment is unnecessary. However, a blank line is necessary.

@@ +1064,5 @@
>        return NS_OK;
>      }
> +    //The telemetry is on its own block to limit the size of the lock contention
> +    {
> +

This blank line should be removed.

@@ +1067,5 @@
> +    {
> +
> +      MutexAutoLock telemetryMutex(*SSLVerificationTelemetryMutex);
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX, interval);

Fix line lengths to be ~80 characters.

@@ +1070,5 @@
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX, interval);
> +      }
> +      else{
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC, interval);

Same. Line is too long, so side-by-size diff wraps unless I shrink the font too much.

@@ +1245,5 @@
>  
>  SSLServerCertVerificationResult::SSLServerCertVerificationResult(
>          TransportSecurityInfo * infoObject, PRErrorCode errorCode,
> +        SSLErrorMessageType errorMessageType,
> +        mozilla::Telemetry::ID telemetryID , uint32_t telemetryValue )

Do not use the mozilla:: prefix except inside header files.

@@ +1271,5 @@
>  NS_IMETHODIMP
>  SSLServerCertVerificationResult::Run()
>  {
>    // TODO: Assert that we're on the socket transport thread
> +  if(mTelemetryID != Telemetry::HistogramCount){

whitespace.

This condition is always true, isn't it?
Attachment #670971 - Flags: review-
Also, explain in a comment that you take the lock in the error case because the performance doesn't matter and because the error case's control flow is too complicated to bother avoiding the lock in the same way we avoid it for the non-error case.
Comment on attachment 670971 [details] [diff] [review]
implementaiton v4.1 (the real one) mutex only for error case.

Review of attachment 670971 [details] [diff] [review]:
-----------------------------------------------------------------

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +134,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static mozilla::Mutex *SSLVerificationTelemetryMutex = NULL;

I dont see the difference, it also a pointer to a null value (resolved at compile time). There is no static intializer/destructor.

and even static to avoid polluting the global namespace.

@@ +621,5 @@
>  
>    const void * const mFdForLogging;
>    const nsRefPtr<TransportSecurityInfo> mInfoObject;
>    CERTCertificate * const mCert;
> +  TimeStamp mJobStartTime;

agreed

@@ +629,5 @@
>      const void * fdForLogging, TransportSecurityInfo * infoObject,
>      CERTCertificate * cert)
>    : mFdForLogging(fdForLogging)
>    , mInfoObject(infoObject)
>    , mCert(CERT_DupCertificate(cert))

agreed

@@ +1048,5 @@
>      // Reset the error code here so we can detect if AuthCertificate fails to
>      // set the error code if/when it fails.
>      PR_SetError(0, 0); 
>      SECStatus rv = AuthCertificate(mInfoObject, mCert);
> +    uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());

thanks fixed

@@ +1063,4 @@
>        restart->Dispatch();
>        return NS_OK;
>      }
> +    //The telemetry is on its own block to limit the size of the lock contention

removed

@@ +1064,5 @@
>        return NS_OK;
>      }
> +    //The telemetry is on its own block to limit the size of the lock contention
> +    {
> +

removed

@@ +1067,5 @@
> +    {
> +
> +      MutexAutoLock telemetryMutex(*SSLVerificationTelemetryMutex);
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX, interval);

split now 88 chars...

@@ +1070,5 @@
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX, interval);
> +      }
> +      else{
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC, interval);

ditto (88 chars)

@@ +1245,5 @@
>  
>  SSLServerCertVerificationResult::SSLServerCertVerificationResult(
>          TransportSecurityInfo * infoObject, PRErrorCode errorCode,
> +        SSLErrorMessageType errorMessageType,
> +        mozilla::Telemetry::ID telemetryID , uint32_t telemetryValue )

done

@@ +1271,5 @@
>  NS_IMETHODIMP
>  SSLServerCertVerificationResult::Run()
>  {
>    // TODO: Assert that we're on the socket transport thread
> +  if(mTelemetryID != Telemetry::HistogramCount){

it should be, the condition could be removed, but I am taking defensive measures here.
Attached patch implementation v5 (obsolete) — Splinter Review
Attachment #670926 - Attachment is obsolete: true
> We cannot use telemetry in this function without locks because it may
> execute on any of 5 threads in the pool, and telemetry is not thread-safe.

I was under the impression that due to its non-thread-safe nature, accumulate needs to run from the main thread. I.e., is it enough to make sure that accumulate calls for a particular probe do not race (which is what your locking does), even if it might race with other probes (which the patch allows)?
Attached patch implementation v5.1 (obsolete) — Splinter Review
Attachment #686722 - Attachment is obsolete: true
Attached patch implementation v5.2 (obsolete) — Splinter Review
Attachment #686788 - Attachment is obsolete: true
Comment on attachment 686795 [details] [diff] [review]
implementation v5.2

Review of attachment 686795 [details] [diff] [review]:
-----------------------------------------------------------------

The only known issue/discussion is the static mutex pointer if it should be static or be part of the unnamed space in that file.
Attachment #686795 - Flags: review?(bsmith)
(In reply to Devdatta Akhawe [:devd] from comment #19)
> > We cannot use telemetry in this function without locks because it may
> > execute on any of 5 threads in the pool, and telemetry is not thread-safe.
> 
> I was under the impression that due to its non-thread-safe nature,
> accumulate needs to run from the main thread. I.e., is it enough to make
> sure that accumulate calls for a particular probe do not race (which is what
> your locking does), even if it might race with other probes (which the patch
> allows)?

Currently* doing a per probe lock is fine. The telemetry probe is a static array:
https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#460
of histogram pointers. The potential race is the binding of the pointer to a new created histogram 
https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp#486

That is why is is safe to do a per probe mutex.
Comment on attachment 686795 [details] [diff] [review]
implementation v5.2

Review of attachment 686795 [details] [diff] [review]:
-----------------------------------------------------------------

Hi Camilo. I think it's better to make it part of the unnamed namespace to be consistent with the other code in the file.

Also, please add a comment explaining why we created this new mutex instead of using the other mutex we already have.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +109,1 @@
>  #include "mozilla/Assertions.h"

When possible, try to keep includes in alphabetical order

@@ +135,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static Mutex *SSLVerificationTelemetryMutex = NULL;

Please see previous review comments about this; it is good to be consistent with the existing code. Also, please add this comment here:

// We avoid using a mutex for the success case to avoid
// lock-related performance issues. However, we do use
// a lock in the error case to simplify the code, since
// performance in the error case is not important.
Mutex *gSSLVerificationErrorTelemetryMutex = nullptr;

(in the unnamed namespace above).

@@ +1070,4 @@
>        RefPtr<SSLServerCertVerificationResult> restart(
> +        new SSLServerCertVerificationResult(mInfoObject,
> +                                           0,
> +                                           PlainErrorMessage,

One reason for defaulting the errorMessageType argument is to make it clearer that the non-error case (0) is not an error.

I suggest that you move the telemetry ID and telemetry interval parameters to be between the infoObject and error code arguments (i.e. make them the second and third arguments).

@@ +1083,5 @@
> +    {
> +      uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());
> +      MutexAutoLock telemetryMutex(*SSLVerificationTelemetryMutex);
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,

Use AccumulateInterval instead of inlining it into this function. Then you can remove the "Note: the interval is not calculated once as PR_GETError MUST be called" comment.

@@ +1087,5 @@
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,
> +                              interval);
> +      }
> +      else{
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC,

ditto.

@@ +1306,5 @@
>  
>  SSLServerCertVerificationResult::SSLServerCertVerificationResult(
>          TransportSecurityInfo * infoObject, PRErrorCode errorCode,
> +        SSLErrorMessageType errorMessageType,
> +        Telemetry::ID telemetryID , uint32_t telemetryValue )

whitespace between "ID" and "," and before ")"

@@ +1316,1 @@
>  {

// We accumulate telemetry for (only) successful validations 
// on the main thread to avoid adversely affecting performance
// by acquiring the mutex that we use when accumulating the 
// telemetry for unsuccessful validations. Unsuccessful 
// validations times are accumulated elsewhere.
MOZ_ASSERT(errorCode != 0 || telemetryID == Telemetry::HistogramCount);
Attachment #686795 - Flags: review?(bsmith) → review-
(In reply to Camilo Viecco (:cviecco) from comment #23)
> (In reply to Devdatta Akhawe [:devd] from comment #19)
> > > We cannot use telemetry in this function without locks because it may
> > > execute on any of 5 threads in the pool, and telemetry is not thread-safe.
> > 
> > I was under the impression that due to its non-thread-safe nature,
> > accumulate needs to run from the main thread. I.e., is it enough to make
> > sure that accumulate calls for a particular probe do not race (which is what
> > your locking does), even if it might race with other probes (which the patch
> > allows)?
> 
> Currently* doing a per probe lock is fine. The telemetry probe is a static
> array:
> https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> Telemetry.cpp#460
> of histogram pointers. The potential race is the binding of the pointer to a
> new created histogram 
> https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> Telemetry.cpp#486

FWIW, there's locking done further down so you don't have to worry about this sort of thing:

https://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/histogram.cc#1154

Do you really expect cert verification to take nearly 2 minutes?  The maximums on the histograms you're adding suggest as much.
Comment on attachment 686795 [details] [diff] [review]
implementation v5.2

Review of attachment 686795 [details] [diff] [review]:
-----------------------------------------------------------------

thank you. Aggreed on alsmot everything. Watch for version 6.x

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +109,1 @@
>  #include "mozilla/Assertions.h"

Done

@@ +135,5 @@
>  // do not use a nsCOMPtr to avoid static initializer/destructor
>  nsIThreadPool * gCertVerificationThreadPool = nullptr;
>  } // unnamed namespace
>  
> +static Mutex *SSLVerificationTelemetryMutex = NULL;

Agreed. Done

@@ +1070,4 @@
>        RefPtr<SSLServerCertVerificationResult> restart(
> +        new SSLServerCertVerificationResult(mInfoObject,
> +                                           0,
> +                                           PlainErrorMessage,

I see. I did something similar: 
put the telemetry params as third and fourth arguments, to make it clear on several failure sections that there was no telemetry in place.

@@ +1083,5 @@
> +    {
> +      uint32_t interval = (uint32_t) ((TimeStamp::Now() - mJobStartTime).ToMilliseconds());
> +      MutexAutoLock telemetryMutex(*SSLVerificationTelemetryMutex);
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,

Done

@@ +1087,5 @@
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,
> +                              interval);
> +      }
> +      else{
> +        Telemetry::Accumulate(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC,

Done

@@ +1306,5 @@
>  
>  SSLServerCertVerificationResult::SSLServerCertVerificationResult(
>          TransportSecurityInfo * infoObject, PRErrorCode errorCode,
> +        SSLErrorMessageType errorMessageType,
> +        Telemetry::ID telemetryID , uint32_t telemetryValue )

Done

@@ +1316,1 @@
>  {

Done
(In reply to Nathan Froyd (:froydnj) from comment #25)
> (In reply to Camilo Viecco (:cviecco) from comment #23)
> > (In reply to Devdatta Akhawe [:devd] from comment #19)
> > > > We cannot use telemetry in this function without locks because it may
> > > > execute on any of 5 threads in the pool, and telemetry is not thread-safe.
> > > 
> > > I was under the impression that due to its non-thread-safe nature,
> > > accumulate needs to run from the main thread. I.e., is it enough to make
> > > sure that accumulate calls for a particular probe do not race (which is what
> > > your locking does), even if it might race with other probes (which the patch
> > > allows)?
> > 
> > Currently* doing a per probe lock is fine. The telemetry probe is a static
> > array:
> > https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> > Telemetry.cpp#460
> > of histogram pointers. The potential race is the binding of the pointer to a
> > new created histogram 
> > https://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/
> > Telemetry.cpp#486
> 
> FWIW, there's locking done further down so you don't have to worry about
> this sort of thing:
> 
> https://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/
> histogram.cc#1154
> 
> Do you really expect cert verification to take nearly 2 minutes?  The
> maximums on the histograms you're adding suggest as much.

Thank you

I would hope that it would take less than 1 sec, but since there can be multiple OCSP requests for a resource it would depend on the timeouts for these connections. Since the tail of this probe will look similar to the SSL_TIME_UNTIL_READY and given that the TIME_UNTIL ready max is near 51k I am adjusting this to 60K as limit.
Attached patch implementation v6 (obsolete) — Splinter Review
Attachment #670971 - Attachment is obsolete: true
Attachment #691476 - Flags: review?(bsmith)
Attachment #691476 - Attachment is obsolete: true
Attachment #691476 - Flags: review?(bsmith)
Comment on attachment 691560 [details] [diff] [review]
implementation v6.1

the assert condition was wrong. Now we actually run with errors.
Attachment #691560 - Flags: review?(bsmith)
Comment on attachment 691560 [details] [diff] [review]
implementation v6.1

Review of attachment 691560 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits addressed.

::: security/manager/ssl/src/SSLServerCertVerification.cpp
@@ +137,5 @@
> +
> +// We avoid using a mutex for the success case to avoid
> +// lock-related performance issues. However, we do use
> +// a lock in the error case to simplify the code, since
> +// performance in the error case is not important.

line length (too short).

@@ +239,5 @@
>    const PRErrorCode mErrorCode;
>    const SSLErrorMessageType mErrorMessageType;
> +  const Telemetry::ID mTelemetryID;
> +  const uint32_t mTelemetryValue;
> +

remove blank line.

@@ +1064,5 @@
>        RefPtr<SSLServerCertVerificationResult> restart(
> +        new SSLServerCertVerificationResult(mInfoObject,
> +                                           0,
> +                                           telemetryID,
> +                                           interval));

indention.

Nit: I don't think each parameter needs its own line

@@ +1076,5 @@
> +    {
> +      MutexAutoLock telemetryMutex(*gSSLVerificationTelemetryMutex);
> +      if(nsNSSComponent::globalConstFlagUsePKIXVerification){
> +        Telemetry::AccumulateTimeDelta(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,
> +                                       mJobStartTime,TimeStamp::Now());

line length.

@@ +1079,5 @@
> +        Telemetry::AccumulateTimeDelta(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_LIBPKIX,
> +                                       mJobStartTime,TimeStamp::Now());
> +      }
> +      else{
> +        Telemetry::AccumulateTimeDelta(Telemetry::SSL_INITIAL_FAILED_CERT_VALIDATION_TIME_CLASSIC,

line length.

@@ +1308,5 @@
> +// We accumulate telemetry for (only) successful validations
> +// on the main thread to avoid adversely affecting performance
> +// by acquiring the mutex that we use when accumulating the
> +// telemetry for unsuccessful validations. Unsuccessful
> +// validations times are accumulated elsewhere.

line length (too short).
Attachment #691560 - Flags: review?(bsmith) → review+
Attachment #686795 - Attachment is obsolete: true
Keeping r+ from bsmith (now checking for spaces)
My bad, used try to test for compilation(for all platforms) but only did full unit testing on Linux.
Green build, pushing post festivites.
https://tbpl.mozilla.org/?tree=Try&rev=3a7ae907fdd3
https://hg.mozilla.org/mozilla-central/rev/da2b24b1fc7c
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: