Closed Bug 429794 Opened 12 years ago Closed 11 years ago

Lk (trace malloc leaks) test results have become very unstable since 2008-04-08 or so

Categories

(Core :: Security: PSM, defect, major)

defect
Not set
major

Tracking

()

RESOLVED FIXED

People

(Reporter: reed, Assigned: kaie)

References

()

Details

(Keywords: fixed1.9.0.2, memory-leak, regression)

Attachments

(3 files)

The Lk (trace malloc leaks) test has become very unstable lately since about 2008-04-08 or 2008-04-09. It's very obvious from the graphs that some patch that landed introduced a major problem that is still causing issues.

http://build-graphs.mozilla.org/graph/query.cgi?tbox=fxdbug-linux-tbox.build.mozilla.org&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1

http://build-graphs.mozilla.org/graph/query.cgi?tbox=fxdbug-win32-tb&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1

http://build-graphs.mozilla.org/graph/query.cgi?tbox=bm-xserve11.build.mozilla.org&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1

Depending on when you look at the graphs above, you may need to increase the number of days so that you can see data from before 2008-04-08 / 2008-04-09 or so.
Flags: blocking1.9?
If we could identify the actual issue that caused the instability, that issue *might* block (assuming we find it's really serious).  But, until we have more info, I don't think this would hold back the release.
Flags: blocking1.9? → blocking1.9-
Can you at least get somebody to look at this? This looks pretty severe.
Flags: blocking1.9- → blocking1.9?
Why didn't we notice this before now?
(In reply to comment #3)
> Why didn't we notice this before now?

Probably because people don't look at these graphs a lot? Dunno. We've had perf regressions before that nobody noticed for weeks...

I noticed this instability earlier than I filed this bug, but I had been too busy to file it due to school stuff.
Ok.  I'll get someone to look at it, but I still don't think it'll stop the release.  Marking 1.9.0.x+.
Flags: wanted1.9.0.x+
Flags: blocking1.9? → blocking1.9-
Reed can you get us an exact regression range.  DBaron can you take a look?
Assignee: nobody → dbaron
Flags: blocking1.9? → blocking1.9+
cc'ing stuart and rob
It would help if tinderbox showed the leak diffs; that used to work, but it doesn't seem to show up anymore.
And, for what it's worth, the regression range on Windows seems most likely to be http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=AviaryBranchTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-04-08+21%3A32&maxdate=2008-04-08+22%3A14&cvsroot=%2Fcvsroot if we assume that the cause of the occasionally-much-higher is the same as the ~5K regression that started around the same time.  Unfortunately there were no checkins in that window.
Actually, tinderbox shows the leak diffs on Linux and Mac, just doesn't on Windows.  (I filed bug 432338 for that.)

I looked through two jumps up on Mac that were about ~65K, and they were both NSS_related:

      72759 malloc
        72800 PR_Malloc
          70414 PL_ArenaAllocate
            68343 PORT_ArenaAlloc_Util
              57988 PORT_ArenaZAlloc_Util
                33136 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                  33136 CERT_EncodeInfoAccessExtension
                    33136 CERT_EncodeInfoAccessExtension
                      33136 CERT_EncodeInfoAccessExtension
                        28994 PK11_FindCertByIssuerAndSN
                          28994 CERT_FindCertByIssuerAndSN
                            28994 XRE_GetFileFromPath
                              28994 PR_CallOnce
                                28994 XRE_GetFileFromPath
                        2071 CERT_NewTempCertificate
                          2071 CERT_ImportCerts
                            2071 CERT_DestroyOCSPResponse
                              2071 CERT_VerifyOCSPResponseSignature
                                2071 CERT_GetOCSPAuthorityInfoAccessLocation
                        2071 CERT_EncodeInfoAccessExtension
                          2071 CERT_EncodeInfoAccessExtension
                            2071 CERT_EncodeInfoAccessExtension
                              2071 CERT_EncodeInfoAccessExtension
                                2071 CERT_FindCertIssuer
                12426 CERT_DerNameToAscii
                  12426 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                    12426 CERT_EncodeInfoAccessExtension
                      12426 CERT_EncodeInfoAccessExtension
                        12426 CERT_EncodeInfoAccessExtension
                          12426 PK11_FindCertByIssuerAndSN
                            12426 CERT_FindCertByIssuerAndSN
                              12426 XRE_GetFileFromPath
                                12426 PR_CallOnce
                10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2940
                  10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2EB8
                    10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2656
                      8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2AF6
                        8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2D61
                          8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+25AD
                            8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2F03
                              8284 SEC_QuickDERDecodeItem_Util
                                8284 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                      2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2D2C
                        2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+25AD
                          2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2F03
                            2071 SEC_QuickDERDecodeItem_Util
                              2071 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                                2071 CERT_EncodeInfoAccessExtension
                2071 CERT_UncacheCRL
                  2071 CERT_GetCertificateNames
                    2071 CERT_FindCertIssuer
                      2071 CERT_FindCertIssuer
                        2071 CERT_FindCertIssuer
                          2071 CERT_VerifyCert
                            2071 CERT_VerifyCertNow
                              2071 SSL_AuthCertificate
                                2071 XRE_GetFileFromPath
              6213 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                6213 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                  6213 CERT_EncodeInfoAccessExtension
                    6213 CERT_EncodeInfoAccessExtension
                      6213 CERT_EncodeInfoAccessExtension
                        6213 PK11_FindCertByIssuerAndSN
                          6213 CERT_FindCertByIssuerAndSN
                            6213 XRE_GetFileFromPath
                              6213 PR_CallOnce
                                6213 XRE_GetFileFromPath
              2071 SECITEM_ArenaDupItem_Util
                2071 SEC_RegisterDefaultHttpClient
                  2071 CERT_ClearOCSPCache
                    2071 CERT_ClearOCSPCache
                      2071 CERT_GetOCSPAuthorityInfoAccessLocation
                        2071 CERT_GetOCSPAuthorityInfoAccessLocation
                          2071 CERT_VerifyCert
                            2071 CERT_VerifyCertNow
                              2071 SSL_AuthCertificate
                                2071 XRE_GetFileFromPath
              2071 CERT_DerNameToAscii
                2071 NSS_Get_CERT_SequenceOfCertExtensionTemplate
                  2071 CERT_EncodeInfoAccessExtension
                    2071 CERT_EncodeInfoAccessExtension
                      2071 CERT_EncodeInfoAccessExtension
                        2071 CERT_EncodeInfoAccessExtension
                          2071 CERT_EncodeInfoAccessExtension
                            2071 CERT_EncodeInfoAccessExtension
                              2071 CERT_EncodeInfoAccessExtension
                                2071 CERT_FindCertIssuer
            2071 PORT_ArenaMark_Util
              2071 PK11_DestroyTokenObject
                2071 PK11_ReadRawAttribute
                  2071 nss_DumpCertificateCacheInfo
                    2071 CERT_EncodeInfoAccessExtension
                      2071 CERT_EncodeInfoAccessExtension
                        2071 CERT_EncodeInfoAccessExtension
                          2071 CERT_EncodeInfoAccessExtension
                            2071 CERT_EncodeInfoAccessExtension
                              2071 PK11_FindCertByIssuerAndSN
                                2071 CERT_FindCertByIssuerAndSN
          2286 PORT_Alloc_Util
            2286 SECITEM_AllocItem_Util
              2286 NSSBase64_DecodeBuffer_Util
                2286 ATOB_ConvertAsciiToItem_Util
                  2286 ATOB_ConvertAsciiToItem
                    2286 XRE_GetFileFromPath
                      2286 PR_CallOnce
                        2286 XRE_GetFileFromPath
                          2286 XRE_GetFileFromPath
                            2286 XRE_GetFileFromPath
                              2286 XRE_GetFileFromPath
                                2286 XRE_GetFileFromPath

-- from MacOSX Darwin 8.8.4 bm-xserve11 Depend Debug + Leak Test on 2008/04/30 08:40:00  

There were also two larger jumps I found on the morning of 4/30, and they were Java-related:

        98400 NS_UTF16ToCString_P
          98400 NS_UTF16ToCString_P
            98400 NS_UTF16ToCString_P
              98400 NS_UTF16ToCString_P
                98400 NS_UTF16ToCString_P
                  98400 JNIEnv_::ThrowNew(_jclass*, char const*)
                    98400 JNIEnv_::ThrowNew(_jclass*, char const*)
                      98400 JNIEnv_::ThrowNew(_jclass*, char const*)
                        98400 NS_GetComponentRegistrar_P
                          98400 NS_GetComponentRegistrar_P
                            98400 GetSecurityContext(JNIEnv_*, nsISecurityContext**)
                              98400 NS_GetComponentRegistrar_P
                                98400 PR_Select

-- from MacOSX Darwin 8.8.4 bm-xserve11 Depend Debug + Leak Test on 2008/04/30 04:40:00

unfortunately we don't really have good stack information.
On Linux, there were two spikes up to 1.22MB/1.23MB from the 1.15MB baseline on the morning of May 4.  They were also mostly NSS-related.

Those two extra-large Mac spikes (that showed the JNI stuff in the stack) on the morning of April 30 were a different size from all of the other spikes, so I'll say those are a different problem that's not covered by this bug, and this bug covers the instability due to varying amounts of leaks from NSS code.
Assignee: dbaron → kengert
Component: General → Security: PSM
QA Contact: general → psm
For the record, I think the most likely cause of the regression is bug 406755, but only because it's the only NSS/PSM-related patch that landed in the first few hours before this started.
Decided not to block on this at the triage during the Firefox meeting.

Kai: see comment 13 - any thoughts?
Flags: blocking1.9+ → blocking1.9-
Could bug 426681 have caused this? There are often breaks of several hours between the spikes, and that was a big change to NSS just a few hours before the spikes started.
Reed Wrote:
> The Lk (trace malloc leaks) test has become very unstable 

What does that mean?  Crashing?  
Huge variations from run to tun?
You quoted bug 426681 as a potential cause.

(a) Would you be able to revert the change from that bug for the purpose of confirming it has caused or not caused the regression?

(b) Or would running your tests require that we revert code on trunk?
(In reply to comment #16)
> Reed Wrote:
> > The Lk (trace malloc leaks) test has become very unstable 
> 
> What does that mean?  Crashing?  
> Huge variations from run to tun?

The latter.  See graph links in comment 0 (although you need to increase the "days" field to see the regression range at this point).
(In reply to comment #17)
> You quoted bug 426681 as a potential cause.
> 
> (a) Would you be able to revert the change from that bug for the purpose of
> confirming it has caused or not caused the regression?
> 
> (b) Or would running your tests require that we revert code on trunk?

I could try to revert the change (on my local machine), but I can't check if it's the cause for this bug. I neither know the setup for the test, nor can I even compile a trace-malloc enabled build.
David Baron mentioned that some spikes were related to NSS and he suspected bug 406755. I just mentioned bug 426681, because I thought he might have overlooked it.
(In reply to comment #14)
> Decided not to block on this at the triage during the Firefox meeting.
> 
> Kai: see comment 13 - any thoughts?

I don't see how it would have introduced a new leak.
With the patch we are now checking certs for EV status more often.
But we had already done so before.
The patch simply changes the point of time where the verification is done (earlier) and the frequency.

So, yes, if there are leaks related to EV verification (libpkix based certificate policy verification), then that leak would now show up more often.

I can't recommend backing out 406755, because that would have other side effects.

I could offer to write a simple patch, which disables EV verification completely.

Question is, how should we test patches?

Which of the following should we do?

1) test patches locally, which requires to set up the leak building and logfile analysis locally? Is that easy or a lot of work?

2) apply the patch to cvs, so we can use automated builds and their infrastructure to see the effects?


And if we decide to do 2), should we do:

2a) test things now, before the release

2b) accept the leaks for the release, do not disturb the release process, and test after RC1 or even after final?
Bug 406755 caused us to do EV verification more often.
We can't easily revert it to the original state, without seeing other side effects.

This test patch 1 completely disables EV verification.
If bug 406755 was the culprit, applying this patch would avoid the leaks.
Note that after bug 426681 got resolved, we moved to an even newer tag of NSS already.

This patch could be used to revert NSS to the older snapshot.


It would be good to see test results for both test patches applied independently (not at the same time). Let's try patch 1 first, as this is what dbaron considers to be more likely.
The recognizable NSS stacks in comment 11 appear to be allocations of parts 
of a complex object that holds an NSS CERTCertificate.  If the caller that 
asked NSS to allocate a CERTCertificate then leaked the handle to that 
CERTCertificate object, all the items allocated within that object would 
also be leaked.  

So, I went looking for a caller of CERT_FindCertByIssuerAndSN that was itself
called through PR_CallOnce, and found one (shown here in the opposite order
of that used in comment 11). 

AuthCertificateCallback
  nsNSSCertificate::GetIsExtendedValidation
    nsNSSCertificate::getValidEVOidTag
      nsNSSCertificate::hasValidEVOidTag
        nsNSSComponent::EnsureIdentityInfoLoaded()
          PR_CallOnce
            nsNSSComponent::IdentityInfoInit()
              CERT_FindCertByIssuerAndSN(nsnull, &ias);

Then I looked for certificate leaks in those functions.  
I think there may be a couple of leaks here:
http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/manager/ssl/src/nsIdentityChecking.cpp&rev=1.20&mark=505,515-517#483
Appears that the temp_ev object is leaked, rather than deleted, in both places.  But neither of those places appears to leak an NSS certificate object.
Still looking.
(In reply to comment #23)
> I think there may be a couple of leaks here:
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/manager/ssl/src/nsIdentityChecking.cpp&rev=1.20&mark=505,515-517#483
> Appears that the temp_ev object is leaked, rather than deleted, in both places.

This function is code only active (I hope) in debug code.

It's wrapped in #ifdef PSM_ENABLE_TEST_EV_ROOTS, which is supposed to be undefined in optimized builds.
ok, I didn't know if that symbol was defined or not.

In security/manager/ssl/src/nsIdentityChecking.cpp
in nsNSSCertificate::hasValidEVOidTag() we see
   
877   CERTCertList *rootList = getRootsForOid(oid_tag);
   ...
924   cvin[2].value.pointer.chain = rootList;
   ...
934   rv = CERT_PKIXVerifyCert(mCert, certificateUsageSSLServer,
935                            cvin, cvout, nsnull);

Appears that the CertList object is leaked.
However, that object doesn't hold any (counted) cert references, so 
I think this leak does not account for leaked CERTCertificate references.
Attached patch Patch v3Splinter Review
Thanks to nelson for finding the leak mentioned in comment 25 !
It's a small one. This should fix it.
Don't know if it's the real fix for this bug.
Attachment #320094 - Flags: review?(nelson)
Comment on attachment 320094 [details] [diff] [review]
Patch v3

I believe this is correct.
I doubt that it's the cause of this bug though.
Attachment #320094 - Flags: review?(nelson) → review+
Attachment #320094 - Flags: approval1.9?
Comment on attachment 320094 [details] [diff] [review]
Patch v3

we are closed down for showstoppers. We can get this in 3.0.1 if needed.
Attachment #320094 - Flags: approval1.9? → approval1.9-
Keywords: checkin-needed
fixed on trunk
Status: NEW → RESOLVED
Closed: 11 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Attachment #320094 - Flags: approval1.9.0.2?
Comment on attachment 320094 [details] [diff] [review]
Patch v3

Approved for 1.9.0.2. Please land in CVS. a=ss
Attachment #320094 - Flags: approval1.9.0.2? → approval1.9.0.2+
Keywords: fixed1.9.0.2
Keywords: fixed1.9.0.2
checked in to cvs for 1.9.0.2
Keywords: fixed1.9.0.2
You need to log in before you can comment on or make changes to this bug.