Closed Bug 429794 Opened 12 years ago Closed 11 years ago
Lk (trace malloc leaks) test results have become very unstable since 2008-04-08 or so
980 bytes, patch
892 bytes, patch
1.36 KB, patch
The Lk (trace malloc leaks) test has become very unstable since about 2008-04-08 or 2008-04-09. It's obvious from the graphs that some patch that landed introduced a major problem that is still causing issues:

http://build-graphs.mozilla.org/graph/query.cgi?tbox=fxdbug-linux-tbox.build.mozilla.org&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1
http://build-graphs.mozilla.org/graph/query.cgi?tbox=fxdbug-win32-tb&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1
http://build-graphs.mozilla.org/graph/query.cgi?tbox=bm-xserve11.build.mozilla.org&testname=trace_malloc_leaks&units=bytes&autoscale=1&days=14&avg=1

Depending on when you look at the graphs above, you may need to increase the number of days so that you can see data from before 2008-04-08 / 2008-04-09.
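For what it's worth, one way to make "very unstable" concrete is to flag runs whose leak total deviates far from the series median. A minimal sketch (all numbers are illustrative, not actual tinderbox data; `find_spikes` is a hypothetical helper, not part of the graph server):

```python
def find_spikes(series, threshold=3.0):
    """Flag runs whose deviation from the median exceeds `threshold`
    times the median absolute deviation (MAD) of the series."""
    s = sorted(series)
    median = s[len(s) // 2]
    mad = sorted(abs(x - median) for x in series)[len(series) // 2]
    if mad == 0:
        return []
    return [i for i, x in enumerate(series)
            if abs(x - median) / mad > threshold]

# A steady ~1.15 MB baseline with two spikes (made-up values shaped
# like the behavior described in this bug).
runs = [1150000, 1151000, 1149000, 1220000, 1150000, 1230000, 1150000]
print(find_spikes(runs))  # → [3, 5], the indices of the spiking runs
```

A stable test would produce an empty list here; the two flagged runs are exactly the kind of jumps visible on the graphs linked above.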
If we could identify the actual issue that caused the instability, that issue *might* block (assuming we find it's really serious). But, until we have more info, I don't think this would hold back the release.
Flags: blocking1.9? → blocking1.9-
Can you at least get somebody to look at this? This looks pretty severe.
Flags: blocking1.9- → blocking1.9?
Why didn't we notice this before now?
(In reply to comment #3)
> Why didn't we notice this before now?

Probably because people don't look at these graphs a lot? Dunno. We've had perf regressions before that nobody noticed for weeks... I noticed this instability earlier than I filed this bug, but I had been too busy with school to file it.
Ok. I'll get someone to look at it, but I still don't think it'll stop the release. Marking 1.9.0.x+.
This has gotten much worse now, too. See:

http://build-graphs.mozilla.org/graph/query.cgi?tbox=fxdbug-win32-tb&testname=trace_malloc_leaks&autoscale=1&size=&units=bytes&ltype=&points=&showpoint=&avg=1&days=32

Re-noming.
Flags: blocking1.9- → blocking1.9?
Reed, can you get us an exact regression range? DBaron, can you take a look?
Assignee: nobody → dbaron
Flags: blocking1.9? → blocking1.9+
cc'ing stuart and rob
It would help if tinderbox showed the leak diffs; that used to work, but they don't seem to show up anymore.
And, for what it's worth, the regression range on Windows seems most likely to be:

http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=AviaryBranchTinderbox&branch=HEAD&branchtype=match&dir=&file=&filetype=match&who=&whotype=match&sortby=Date&hours=2&date=explicit&mindate=2008-04-08+21%3A32&maxdate=2008-04-08+22%3A14&cvsroot=%2Fcvsroot

if we assume that the cause of the occasionally-much-higher results is the same as the ~5K regression that started around the same time. Unfortunately, there were no checkins in that window.
Actually, tinderbox shows the leak diffs on Linux and Mac, it just doesn't on Windows. (I filed bug 432338 for that.) I looked through two jumps up on Mac that were about 65K, and they were both NSS-related:

72759 malloc
72800 PR_Malloc
70414 PL_ArenaAllocate
68343 PORT_ArenaAlloc_Util
57988 PORT_ArenaZAlloc_Util
33136 NSS_Get_CERT_SequenceOfCertExtensionTemplate
33136 CERT_EncodeInfoAccessExtension
33136 CERT_EncodeInfoAccessExtension
33136 CERT_EncodeInfoAccessExtension
28994 PK11_FindCertByIssuerAndSN
28994 CERT_FindCertByIssuerAndSN
28994 XRE_GetFileFromPath
28994 PR_CallOnce
28994 XRE_GetFileFromPath
2071 CERT_NewTempCertificate
2071 CERT_ImportCerts
2071 CERT_DestroyOCSPResponse
2071 CERT_VerifyOCSPResponseSignature
2071 CERT_GetOCSPAuthorityInfoAccessLocation
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_FindCertIssuer
12426 CERT_DerNameToAscii
12426 NSS_Get_CERT_SequenceOfCertExtensionTemplate
12426 CERT_EncodeInfoAccessExtension
12426 CERT_EncodeInfoAccessExtension
12426 CERT_EncodeInfoAccessExtension
12426 PK11_FindCertByIssuerAndSN
12426 CERT_FindCertByIssuerAndSN
12426 XRE_GetFileFromPath
12426 PR_CallOnce
10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2940
10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2EB8
10355 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2656
8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2AF6
8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2D61
8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+25AD
8284 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2F03
8284 SEC_QuickDERDecodeItem_Util
8284 NSS_Get_CERT_SequenceOfCertExtensionTemplate
2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2D2C
2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+25AD
2071 /builds/tinderbox/Fx-Trunk-test_mem/Darwin_8.8.4_Depend/mozilla/../build/dist/bin/libnssutil3.dylib+2F03
2071 SEC_QuickDERDecodeItem_Util
2071 NSS_Get_CERT_SequenceOfCertExtensionTemplate
2071 CERT_EncodeInfoAccessExtension
2071 CERT_UncacheCRL
2071 CERT_GetCertificateNames
2071 CERT_FindCertIssuer
2071 CERT_FindCertIssuer
2071 CERT_FindCertIssuer
2071 CERT_VerifyCert
2071 CERT_VerifyCertNow
2071 SSL_AuthCertificate
2071 XRE_GetFileFromPath
6213 NSS_Get_CERT_SequenceOfCertExtensionTemplate
6213 NSS_Get_CERT_SequenceOfCertExtensionTemplate
6213 CERT_EncodeInfoAccessExtension
6213 CERT_EncodeInfoAccessExtension
6213 CERT_EncodeInfoAccessExtension
6213 PK11_FindCertByIssuerAndSN
6213 CERT_FindCertByIssuerAndSN
6213 XRE_GetFileFromPath
6213 PR_CallOnce
6213 XRE_GetFileFromPath
2071 SECITEM_ArenaDupItem_Util
2071 SEC_RegisterDefaultHttpClient
2071 CERT_ClearOCSPCache
2071 CERT_ClearOCSPCache
2071 CERT_GetOCSPAuthorityInfoAccessLocation
2071 CERT_GetOCSPAuthorityInfoAccessLocation
2071 CERT_VerifyCert
2071 CERT_VerifyCertNow
2071 SSL_AuthCertificate
2071 XRE_GetFileFromPath
2071 CERT_DerNameToAscii
2071 NSS_Get_CERT_SequenceOfCertExtensionTemplate
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_FindCertIssuer
2071 PORT_ArenaMark_Util
2071 PK11_DestroyTokenObject
2071 PK11_ReadRawAttribute
2071 nss_DumpCertificateCacheInfo
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 CERT_EncodeInfoAccessExtension
2071 PK11_FindCertByIssuerAndSN
2071 CERT_FindCertByIssuerAndSN
2286 PORT_Alloc_Util
2286 SECITEM_AllocItem_Util
2286 NSSBase64_DecodeBuffer_Util
2286 ATOB_ConvertAsciiToItem_Util
2286 ATOB_ConvertAsciiToItem
2286 XRE_GetFileFromPath
2286 PR_CallOnce
2286 XRE_GetFileFromPath
2286 XRE_GetFileFromPath
2286 XRE_GetFileFromPath
2286 XRE_GetFileFromPath
2286 XRE_GetFileFromPath

-- from MacOSX Darwin 8.8.4 bm-xserve11 Depend Debug + Leak Test on 2008/04/30 08:40:00

There were also two larger jumps I found on the morning of 4/30, and they were Java-related:

98400 NS_UTF16ToCString_P
98400 NS_UTF16ToCString_P
98400 NS_UTF16ToCString_P
98400 NS_UTF16ToCString_P
98400 NS_UTF16ToCString_P
98400 JNIEnv_::ThrowNew(_jclass*, char const*)
98400 JNIEnv_::ThrowNew(_jclass*, char const*)
98400 JNIEnv_::ThrowNew(_jclass*, char const*)
98400 NS_GetComponentRegistrar_P
98400 NS_GetComponentRegistrar_P
98400 GetSecurityContext(JNIEnv_*, nsISecurityContext**)
98400 NS_GetComponentRegistrar_P
98400 PR_Select

-- from MacOSX Darwin 8.8.4 bm-xserve11 Depend Debug + Leak Test on 2008/04/30 04:40:00

Unfortunately, we don't really have good stack information.
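Leak diffs like these are easier to compare across runs if the per-site byte counts are aggregated by symbol. A small sketch of that idea (`aggregate_leak_diff` is a hypothetical helper, not part of the tinderbox tooling; the naive regex assumes symbols contain no whitespace, so C++ signatures like `JNIEnv_::ThrowNew(_jclass*, char const*)` would need smarter splitting):

```python
import re
from collections import Counter

def aggregate_leak_diff(text):
    """Sum the byte counts reported next to each symbol in a
    trace-malloc leak diff flattened into 'SIZE SYMBOL' pairs."""
    totals = Counter()
    for size, symbol in re.findall(r'(\d+)\s+(\S+)', text):
        totals[symbol] += int(size)
    return totals

# A tiny excerpt in the same shape as the diff above.
diff = ("33136 CERT_EncodeInfoAccessExtension "
        "28994 PK11_FindCertByIssuerAndSN "
        "2071 CERT_EncodeInfoAccessExtension")
print(aggregate_leak_diff(diff).most_common(1))
# → [('CERT_EncodeInfoAccessExtension', 35207)]
```

Sorting the aggregate by total bytes makes the dominant leak family (here the CERT_EncodeInfoAccessExtension/NSS cluster) stand out immediately.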
On Linux, there were two spikes up to 1.22MB/1.23MB from the 1.15MB baseline on the morning of May 4. They were also mostly NSS-related. Those two extra-large Mac spikes (that showed the JNI stuff in the stack) on the morning of April 30 were a different size from all of the other spikes, so I'll say those are a different problem that's not covered by this bug, and this bug covers the instability due to varying amounts of leaks from NSS code.
Assignee: dbaron → kengert
Component: General → Security: PSM
QA Contact: general → psm
For the record, I think the most likely cause of the regression is bug 406755, but only because it's the only NSS/PSM-related patch that landed in the first few hours before this started.
Decided not to block on this at the triage during the Firefox meeting. Kai: see comment 13 - any thoughts?
Flags: blocking1.9+ → blocking1.9-
Could bug 426681 have caused this? There are often breaks of several hours between the spikes, and that was a big change to NSS just a few hours before the spikes started.
Reed wrote:
> The Lk (trace malloc leaks) test has become very unstable

What does that mean? Crashing? Huge variations from run to run?
You quoted bug 426681 as a potential cause.

(a) Would you be able to revert the change from that bug for the purpose of confirming it has caused or not caused the regression?

(b) Or would running your tests require that we revert code on trunk?
(In reply to comment #16)
> > The Lk (trace malloc leaks) test has become very unstable
> What does that mean? Crashing? Huge variations from run to run?

The latter. See the graph links in comment 0 (although you need to increase the "days" field to see the regression range at this point).
(In reply to comment #17)
> (a) Would you be able to revert the change from that bug for the purpose of
> confirming it has caused or not caused the regression?
> (b) Or would running your tests require that we revert code on trunk?

I could try to revert the change (on my local machine), but I can't check whether it's the cause of this bug: I neither know the setup for the test, nor can I even compile a trace-malloc-enabled build. David Baron mentioned that some spikes were related to NSS and suspected bug 406755. I just mentioned bug 426681 because I thought he might have overlooked it.
(In reply to comment #14)
> Decided not to block on this at the triage during the Firefox meeting.
> Kai: see comment 13 - any thoughts?

I don't see how it would have introduced a new leak. With the patch we now check certs for EV status more often, but we had already done so before. The patch simply changes the point in time when the verification is done (earlier) and the frequency. So, yes, if there are leaks related to EV verification (libpkix-based certificate policy verification), those leaks would now show up more often.

I can't recommend backing out bug 406755, because that would have other side effects. I could offer to write a simple patch which disables EV verification completely.

The question is, how should we test patches? Which of the following should we do?

1) Test patches locally, which requires setting up the leak building and logfile analysis locally. Is that easy or a lot of work?

2) Apply the patch to CVS, so we can use the automated builds and their infrastructure to see the effects.

And if we decide to do 2), should we:

2a) test things now, before the release, or
2b) accept the leaks for the release, not disturb the release process, and test after RC1 or even after final?
Bug 406755 caused us to do EV verification more often. We can't easily revert it to the original state, without seeing other side effects. This test patch 1 completely disables EV verification. If bug 406755 was the culprit, applying this patch would avoid the leaks.
Note that after bug 426681 got resolved, we already moved to an even newer tag of NSS. This patch could be used to revert NSS to the older snapshot. It would be good to see test results with each of the two test patches applied independently (not at the same time). Let's try patch 1 first, as that is what dbaron considers more likely.
The recognizable NSS stacks in comment 11 appear to be allocations of parts of a complex object that holds an NSS CERTCertificate. If the caller that asked NSS to allocate a CERTCertificate then leaked the handle to that CERTCertificate object, all the items allocated within that object would also be leaked. So, I went looking for a caller of CERT_FindCertByIssuerAndSN that was itself called through PR_CallOnce, and found one (shown here in the opposite order of that used in comment 11):

AuthCertificateCallback
nsNSSCertificate::GetIsExtendedValidation
nsNSSCertificate::getValidEVOidTag
nsNSSCertificate::hasValidEVOidTag
nsNSSComponent::EnsureIdentityInfoLoaded()
PR_CallOnce
nsNSSComponent::IdentityInfoInit()
CERT_FindCertByIssuerAndSN(nsnull, &ias);

Then I looked for certificate leaks in those functions. I think there may be a couple of leaks here:

http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/manager/ssl/src/nsIdentityChecking.cpp&rev=1.20&mark=505,515-517#483

It appears that the temp_ev object is leaked, rather than deleted, in both places. But neither of those places appears to leak an NSS certificate object. Still looking.
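To illustrate the point that one leaked handle shows up as many leaked allocation sites: NSS allocates a certificate's sub-objects out of an arena owned by the object, so every sub-allocation stays live until the handle is released. A toy model of that ownership structure (plain Python, purely illustrative, not NSS's actual implementation):

```python
class Arena:
    """Toy stand-in for an NSS-style arena: sub-allocations accumulate
    and are freed only when the owning object is released, much like
    PL_ArenaAllocate feeds a CERTCertificate's arena."""
    def __init__(self):
        self.allocations = []

    def alloc(self, site):
        # Record which allocation site asked for memory.
        self.allocations.append(site)

    def release(self):
        # Releasing the owner frees every sub-allocation at once.
        self.allocations.clear()

# Simulate decoding one certificate: several sub-allocations land in
# its arena, each tagged with an allocation site from the leak diff.
cert_arena = Arena()
for site in ["PORT_ArenaAlloc_Util",
             "SEC_QuickDERDecodeItem_Util",
             "CERT_EncodeInfoAccessExtension"]:
    cert_arena.alloc(site)

# If the caller drops its handle without releasing, ALL of these sites
# appear in the leak diff -- one leaked handle, many leaked allocations.
print(len(cert_arena.allocations))  # → 3
```

This is why the diffs in comment 11 show whole families of NSS allocation sites rising and falling together: they are sub-allocations of the same leaked object, not independent leaks.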
(In reply to comment #23)
> I think there may be a couple of leaks here:
> http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/security/manager/ssl/src/nsIdentityChecking.cpp&rev=1.20&mark=505,515-517#483
> Appears that the temp_ev object is leaked, rather than deleted, in both places.

This function is (I hope) only active in debug code. It's wrapped in #ifdef PSM_ENABLE_TEST_EV_ROOTS, which is supposed to be undefined in optimized builds.
OK, I didn't know whether that symbol was defined or not. In security/manager/ssl/src/nsIdentityChecking.cpp, in nsNSSCertificate::hasValidEVOidTag(), we see:

877   CERTCertList *rootList = getRootsForOid(oid_tag);
...
924   cvin.value.pointer.chain = rootList;
...
934   rv = CERT_PKIXVerifyCert(mCert, certificateUsageSSLServer,
935                            cvin, cvout, nsnull);

It appears that the CertList object is leaked. However, that object doesn't hold any (counted) cert references, so I think this leak does not account for the leaked CERTCertificate references.
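The rootList problem has the classic allocate-then-early-return shape: in the C code, the fix is to make sure CERT_DestroyCertList runs on every exit path from the function. A language-neutral sketch of the leak and its fix (hypothetical stand-in names, not the actual PSM code):

```python
class CertList:
    """Stand-in for an NSS CERTCertList (illustrative only).
    Class counters let us detect whether every created list
    was also destroyed."""
    created = 0
    destroyed = 0

    def __init__(self):
        CertList.created += 1

    def destroy(self):
        CertList.destroyed += 1

def verify_with_roots_leaky(fail_early):
    roots = CertList()      # like getRootsForOid(oid_tag)
    if fail_early:
        return False        # BUG: roots is never destroyed on this path
    roots.destroy()
    return True

def verify_with_roots_fixed(fail_early):
    roots = CertList()
    try:
        if fail_early:
            return False
        return True
    finally:
        roots.destroy()     # runs on every exit path, like a single
                            # cleanup label before return in C

verify_with_roots_leaky(True)
verify_with_roots_fixed(True)
print(CertList.created - CertList.destroyed)  # → 1 (the leaked list)
```

The leaky variant loses one list per failing call; the fixed variant's cleanup runs unconditionally, which is the property any patch for this spot needs.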
Thanks to nelson for finding the leak mentioned in comment 25! It's a small one. This should fix it. I don't know whether it's the real fix for this bug.
Attachment #320094 - Flags: review?(nelson)
Comment on attachment 320094 [details] [diff] [review] Patch v3

I believe this is correct. I doubt that it's the cause of this bug, though.
Attachment #320094 - Flags: review?(nelson) → review+
Comment on attachment 320094 [details] [diff] [review] Patch v3

We are closed down for showstoppers. We can get this in 3.0.1 if needed.
Attachment #320094 - Flags: approval1.9? → approval1.9-
fixed on trunk
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment on attachment 320094 [details] [diff] [review] Patch v3

Approved for 1.9.0.1. Please land in CVS. a=ss
Attachment #320094 - Flags: approval1.9.0.1? → approval1.9.0.1+
Checked in to CVS for 1.9.0.1.