Intermittent cors/basic.htm,test_shared_all.py TestSharedUnits.test_units | application crashed [@ mozilla::psm::RootCABinNumber(SECItemStr const *)]

RESOLVED FIXED in Firefox 41

Status

()

Core
Security: PSM
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: philor, Assigned: njn)

Tracking

({crash, intermittent-failure})

Trunk
mozilla41
x86
Windows 7
crash, intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox39 unaffected, firefox40 unaffected, firefox41 fixed, firefox-esr31 unaffected, firefox-esr38 unaffected)

Details

(crash signature)

Comment hidden (empty)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Bug 1162691 and bug 1164373 were the last two patches to touch the files in question.
Component: Marionette → Security: PSM
Flags: needinfo?(n.nethercote)
Keywords: crash
Product: Testing → Core
Summary: Intermittent test_shared_all.py TestSharedUnits.test_units | application crashed [@ mozilla::psm::RootCABinNumber(SECItemStr const *)] → Intermittent cors/basic.htm,test_shared_all.py TestSharedUnits.test_units | application crashed [@ mozilla::psm::RootCABinNumber(SECItemStr const *)]
Comment hidden (Treeherder Robot)
I'm not totally sure how to read the stack (if it's giving line numbers, that's not clear to me). We can see the crash address is 0x4, all of the crashes are on 32-bit windows builds. In this case 0x4 matches |offsetof(struct PRLogModuleInfo, level)| [1].

My best guess is we're passing a null log module to PR_LOG_TEST [2], which is odd, but I guess points to njn's change.

For reference the stack looks like this:

03:18:26    ERROR -  PROCESS-CRASH | test_shared_all.py TestSharedUnits.test_units | application crashed [@ mozilla::psm::RootCABinNumber(SECItemStr const *)]
03:18:26     INFO -  Crash dump filename: c:\users\cltbld\appdata\local\temp\tmpuoqn4t.mozrunner\minidumps\29a0b414-3ba4-43f4-a359-4e51b2cf83c4.dmp
03:18:26     INFO -  Operating system: Windows NT
03:18:26     INFO -                    6.1.7601 Service Pack 1
03:18:26     INFO -  CPU: x86
03:18:26     INFO -       GenuineIntel family 6 model 30 stepping 5
03:18:26     INFO -       8 CPUs
03:18:26     INFO -  Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
03:18:26     INFO -  Crash address: 0x4
03:18:26     INFO -  Thread 53 (crashed)
03:18:26     INFO -   0  xul.dll!mozilla::psm::RootCABinNumber(SECItemStr const *) [RootCertificateTelemetryUtils.cpp:16e7f374a5a2 : 70 + 0x5]
03:18:26     INFO -      eip = 0x665d2f92   esp = 0x17b0f738   ebp = 0x17b0f790   ebx = 0x16ff7b40
03:18:26     INFO -      esi = 0x000003db   edi = 0x18190bb0   eax = 0x00000000   ecx = 0x00000000
03:18:26     INFO -      edx = 0x005002b0   efl = 0x00010202
03:18:26     INFO -      Found by: given as instruction pointer in context
03:18:26     INFO -   1  xul.dll!mozilla::psm::AccumulateTelemetryForRootCA(mozilla::Telemetry::ID,CERTCertificateStr const *) [RootCertificateTelemetryUtils.cpp:16e7f374a5a2 : 94 + 0x7]
03:18:26     INFO -      eip = 0x665d092b   esp = 0x17b0f798   ebp = 0x17b0f7c8
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   2  xul.dll!mozilla::psm::`anonymous namespace'::AuthCertificate(mozilla::psm::CertVerifier &,mozilla::psm::TransportSecurityInfo *,CERTCertificateStr *,mozilla::TypeSpecificScopedPointer<CERTCertListStr> &,SECItemStr *,unsigned int,mozilla::pkix::Time) [SSLServerCertVerification.cpp:16e7f374a5a2 : 1199 + 0x8]
03:18:26     INFO -      eip = 0x665d8753   esp = 0x17b0f7a0   ebp = 0x17b0f7c8
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   3  xul.dll!mozilla::psm::`anonymous namespace'::SSLServerCertVerificationJob::Run() [SSLServerCertVerification.cpp:16e7f374a5a2 : 1327 + 0x1d]
03:18:26     INFO -      eip = 0x665df3f2   esp = 0x17b0f7d0   ebp = 0x17b0f854
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   4  xul.dll!nsThreadPool::Run() [nsThreadPool.cpp:16e7f374a5a2 : 226 + 0x5]
03:18:26     INFO -      eip = 0x654a4566   esp = 0x17b0f85c   ebp = 0x00617880
03:18:26     INFO -      Found by: call frame info with scanning
03:18:26     INFO -   5  xul.dll!nsThread::ProcessNextEvent(bool,bool *) [nsThread.cpp:16e7f374a5a2 : 866 + 0x5]
03:18:26     INFO -      eip = 0x65351a83   esp = 0x17b0f88c   ebp = 0x00617880
03:18:26     INFO -      Found by: stack scanning
03:18:26     INFO -   6  xul.dll!NS_ProcessNextEvent(nsIThread *,bool) [nsThreadUtils.cpp:16e7f374a5a2 : 265 + 0xa]
03:18:26     INFO -      eip = 0x65352ebe   esp = 0x17b0f994   ebp = 0x17b0f9a8
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   7  xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate *) [MessagePump.cpp:16e7f374a5a2 : 355 + 0x9]
03:18:26     INFO -      eip = 0x653545f5   esp = 0x17b0f9b0   ebp = 0x17b0f9cc
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   8  xul.dll!MessageLoop::RunHandler() [message_loop.cc:16e7f374a5a2 : 226 + 0x7]
03:18:26     INFO -      eip = 0x6534fb88   esp = 0x17b0f9d4   ebp = 0x17b0fa04
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -   9  xul.dll!MessageLoop::Run() [message_loop.cc:16e7f374a5a2 : 200 + 0x6]
03:18:26     INFO -      eip = 0x6534fd1a   esp = 0x17b0fa0c   ebp = 0x17b0fa24
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -  10  xul.dll!nsThread::ThreadFunc(void *) [nsThread.cpp:16e7f374a5a2 : 362 + 0x6]
03:18:26     INFO -      eip = 0x652461a7   esp = 0x17b0fa2c   ebp = 0x17b0fa3c
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -  11  nss3.dll!_PR_NativeRunThread [pruthr.c:16e7f374a5a2 : 397 + 0x5]
03:18:26     INFO -      eip = 0x68b31f5c   esp = 0x17b0fa44   ebp = 0x17b0fa58
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -  12  nss3.dll!pr_root [w95thred.c:16e7f374a5a2 : 90 + 0x9]
03:18:26     INFO -      eip = 0x68b3143a   esp = 0x17b0fa60   ebp = 0x17b0fa64
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -  13  msvcr120.dll + 0x2c01c
03:18:26     INFO -      eip = 0x6976c01d   esp = 0x17b0fa6c   ebp = 0x17b0fa9c
03:18:26     INFO -      Found by: call frame info
03:18:26     INFO -  14  msvcr120.dll + 0x2c000
03:18:26     INFO -      eip = 0x6976c001   esp = 0x17b0faa4   ebp = 0x17b0faa8
03:18:26     INFO -      Found by: previous frame's frame pointer
03:18:26     INFO -  15  kernel32.dll + 0x53c44
03:18:26     INFO -      eip = 0x76c63c45   esp = 0x17b0fab0   ebp = 0x17b0fab4
03:18:26     INFO -      Found by: previous frame's frame pointer
03:18:26     INFO -  16  ntdll.dll + 0x637f4
03:18:26     INFO -      eip = 0x774f37f5   esp = 0x17b0fabc   ebp = 0x17b0faf4
03:18:26     INFO -      Found by: previous frame's frame pointer
03:18:26     INFO -  17  ntdll.dll + 0x637c7
03:18:26     INFO -      eip = 0x774f37c8   esp = 0x17b0fafc   ebp = 0x17b0fb0c
03:18:26     INFO -      Found by: previous frame's frame pointer

[1] https://hg.mozilla.org/mozilla-central/annotate/f578b845c4b8/nsprpub/pr/include/prlog.h#l126
[2] https://hg.mozilla.org/mozilla-central/annotate/f578b845c4b8/nsprpub/pr/include/prlog.h#l166
(Assignee)

Comment 7

3 years ago
PR_NewLogModule() can return NULL, but it looks like most callers of that function do not allow for that. My patch changed things so that one of these calls occurred later during execution instead of at startup. My best guess is that the delayed call triggered an OOM, and thus |sPublicKeyPinningTelemetryLog| was NULL when deref'd.

So this is my fault in the sense that I stepped into the booby trap laid by countless programmers over numerous years before me.

The simple fix would be just to back my patch out; it wasn't a big deal. A better fix would be to change PR_LOG_TEST from this:

> #define PR_LOG_TEST(_module,_level) \
>     ((_module)->level >= (_level))

to this:

> #define PR_LOG_TEST(_module,_level) \
>     ((_module) && (_module)->level >= (_level))

because there are countless other places in the code where we currently aren't safe and this change would fix all of them. (Some of them trigger a non-fatal NS_ASSERTION on failure, but continue executing anyway. Hilarious.)

Unfortunately, it's a change in NSPR, which is always a pain. erahm, maybe this could be motivation for you to do that s/PR_LOG/MOZ_LOG/ change so that this could be done in non-NSPR code :)
Flags: needinfo?(n.nethercote)
(Assignee)

Comment 8

3 years ago
I'll back out my patch once the trees reopen. I filed bug 1166127 for general failure to account for  PR_NewLogModule() failing.
(Assignee)

Updated

3 years ago
Depends on: 1164373
Comment hidden (Treeherder Robot)
Comment hidden (Treeherder Robot)
Based on the minidumps, I think this was an issue with thread-unsafe statics, not OOM. (Though the fact that PR_NewLogModule can fail is still true)
This was briefly a topcrash on nightly. Adding the signature so people can see that it's fixed.
Crash Signature: [@ mozilla::psm::RootCABinNumber(SECItemStr const*) ]
Many of the nightly crashes were on startup. That further points to thread safety rather than OOM.
(Assignee)

Comment 14

3 years ago
This responsible patch was backed out:
https://hg.mozilla.org/integration/mozilla-inbound/rev/eca0c05f29cd
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Assignee: nobody → n.nethercote
status-firefox39: --- → unaffected
status-firefox40: --- → unaffected
status-firefox41: affected → fixed
status-firefox-esr31: --- → unaffected
status-firefox-esr38: --- → unaffected
Target Milestone: --- → mozilla41
(In reply to David Major [:dmajor] from comment #11)
> Based on the minidumps, I think this was an issue with thread-unsafe
> statics, not OOM. (Though the fact that PR_NewLogModule can fail is still
> true)

Non-global static variables are required to be threadsafe in C++11.  I know they are in Unix-land.  The last time I looked at disassembly from MSVC, I was fairly sure that such variables were initialized in a threadsafe fashion...
This code can claim that the pointer exists when it doesn't yet:

xul!mozilla::psm::PublicKeyPinningTelemetryLog [c:\builds\moz2_slave\m-cen-w32-ntly-000000000000000\build\src\security\manager\boot\src\rootcertificatetelemetryutils.cpp @ 24]:
611a9c36 a10845ea61      mov     eax,dword ptr [xul!$S2 (61ea4508)]
611a9c3b a801            test    al,1
611a9c3d 751a            jne     xul!mozilla::psm::PublicKeyPinningTelemetryLog+0x23 (611a9c59)
611a9c3f 83c801          or      eax,1
611a9c42 68a812cf61      push    offset xul!`string' (61cf12a8)
611a9c47 a30845ea61      mov     dword ptr [xul!$S2 (61ea4508)],eax
611a9c4c ff15684c6861    call    dword ptr [xul!_imp__PR_NewLogModule (61684c68)]
611a9c52 59              pop     ecx
611a9c53 a31045ea61      mov     dword ptr [xul!sPublicKeyPinningTelemetryLog (61ea4510)],eax
611a9c58 c3              ret
611a9c59 a11045ea61      mov     eax,dword ptr [xul!sPublicKeyPinningTelemetryLog (61ea4510)]
611a9c5e c3              ret
I wonder if this issue is fixed in 2015 (or perhaps one of the 2013 updates).
(In reply to Eric Rahm [:erahm] from comment #17)
> I wonder if this issue is fixed in 2015 (or perhaps one of the 2013 updates).

This ought to be fixed in 2015 ("magic statics"): http://blogs.msdn.com/b/vcblog/archive/2015/04/29/c-11-14-17-features-in-vs-2015-rc.aspx
You need to log in before you can comment on or make changes to this bug.