Closed Bug 1544511 Opened 6 years ago Closed 3 years ago

Crash in [@ nsNSSComponent::nsNSSComponent]

Categories

(Core :: Security: PSM, defect, P1)

x86
Windows 7
defect

Tracking

RESOLVED WONTFIX
Tracking Status
firefox67 + wontfix
firefox68 --- wontfix

People

(Reporter: marcia, Assigned: keeler)

Details

(Keywords: crash, regression, Whiteboard: [psm-assigned], qa-not-actionable)

Attachments

(2 files)

This bug is for crash report bp-6fb4247d-62da-4d7b-8785-116d40190415.

Seen while looking at nightly crash stats, but visible in 67 beta as well: https://bit.ly/2IxblWd

This signature has been seen before in Bug 1301407, which was duped to another bug. Looks as if this started in 67.0b3 and has continued through the betas. There are more crashes than installs. No comments.

Some correlations in beta:

(100.0% in signature vs 00.36% overall) moz_crash_reason = MOZ_RELEASE_ASSERT(NS_IsMainThread())
(62.88% in signature vs 09.56% overall) Module "ntdsapi.dll" = true [94.92% vs 16.83% if platform_version = 6.1.7601 Service Pack 1]
(66.67% in signature vs 26.96% overall) Module "apphelp.dll" = true [100.0% vs 29.80% if platform_version = 6.1.7601 Service Pack 1]
(96.21% in signature vs 17.51% overall) Module "fastprox.dll" = true [96.43% vs 23.47% if platform_version = 10.0.17134]
(96.21% in signature vs 17.73% overall) Module "wbemsvc.dll" = true [96.43% vs 23.56% if platform_version = 10.0.17134]
(96.21% in signature vs 18.09% overall) Module "wbemcomn.dll" = true [96.43% vs 23.86% if platform_version = 10.0.17134]
(96.21% in signature vs 18.19% overall) Module "wbemprox.dll" = true [96.43% vs 23.88% if platform_version = 10.0.17134]
(99.24% in signature vs 47.96% overall) reason = EXCEPTION_BREAKPOINT
(99.24% in signature vs 37.43% overall) Module "winsta.dll" = true [100.0% vs 54.04% if platform_version = 10.0.17134]
(72.73% in signature vs 28.43% overall) Module "urlmon.dll" = true [100.0% vs 55.06% if platform_version = 10.0.17134]
(43.94% in signature vs 86.79% overall) Module "softokn3.dll" = true [37.21% vs 89.06% if platform_pretty_version = Windows 7]
(99.24% in signature vs 46.16% overall) Module "psapi.dll" = true [100.0% vs 59.50% if platform_version = 10.0.17134]
(74.24% in signature vs 29.66% overall) contains_memory_report = null
(49.24% in signature vs 94.59% overall) Addon "webcompat-reporter@mozilla.org" = true
(49.24% in signature vs 94.53% overall) Addon "fxmonitor@mozilla.org" = true
(50.00% in signature vs 94.63% overall) Addon "formautofill@mozilla.org" = true
(50.00% in signature vs 94.63% overall) Addon "webcompat@mozilla.org" = true
(50.00% in signature vs 94.39% overall) Addon "screenshots@mozilla.org" = true

Top 10 frames of crashing thread:

0 xul.dll nsNSSComponent::nsNSSComponent security/manager/ssl/nsNSSComponent.cpp:208
1 xul.dll mozilla::xpcom::CreateInstanceImpl xpcom/components/StaticComponents.cpp:10546
2 xul.dll nsresult nsComponentManagerImpl::GetServiceLocked xpcom/components/nsComponentManager.cpp:1387
3 xul.dll nsComponentManagerImpl::GetServiceByContractID xpcom/components/nsComponentManager.cpp:1574
4 xul.dll static bool mozilla::net::CanEnableSpeculativeConnect netwerk/protocol/http/nsHttpHandler.cpp:2349
5 xul.dll nsresult mozilla::detail::RunnableFunction<`lambda at z:/task_1555273210/build/src/netwerk/protocol/http/nsHttpHandler.cpp:2390:61'>::Run xpcom/threads/nsThreadUtils.h:562
6 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1180
7 xul.dll NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:486
8 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:303
9 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:308

Dana, could we get an assignee to this bug? The spike on beta is noticeable. Thanks.

Flags: needinfo?(dkeeler)

This is an assertion failure: MOZ_RELEASE_ASSERT(NS_IsMainThread())

It's crashing on "MOZ_RELEASE_ASSERT(NS_IsMainThread());" so the problem is the caller, not NSS itself.

https://hg.mozilla.org/releases/mozilla-beta/annotate/13b7d4d4df59f7f2e50f66644daf655bad40bec8/security/manager/ssl/nsNSSComponent.cpp#l208

On the stack is the background thread check added in Bug 1435141 in Fx67
https://hg.mozilla.org/releases/mozilla-beta/annotate/13b7d4d4df59f7f2e50f66644daf655bad40bec8/netwerk/protocol/http/nsHttpHandler.cpp#l2346

Either that was always wrong, or something shifted around to change the relative order of that check and NSS startup. Valentin is out, so adding a ni? for mayhemer.

Flags: needinfo?(honzab.moz)

That should get called through nsHttpHandler::MaybeEnableSpeculativeConnect(), which calls net_EnsurePSMInit() on the main thread [0]. The problem is that net_EnsurePSMInit() doesn't actually enforce that nsINSSComponent is successfully initialized in non-debug builds [1]. If this first call fails on the main thread, subsequent calls off the main thread will hit this release assertion. Note that if we can't successfully initialize nsINSSComponent, something is very wrong and the browser probably shouldn't try to continue.

[0] https://searchfox.org/mozilla-beta/rev/22be965751d52a56f4e6920d58283152e0d8bec0/netwerk/protocol/http/nsHttpHandler.cpp#2384
[1] https://searchfox.org/mozilla-beta/rev/22be965751d52a56f4e6920d58283152e0d8bec0/netwerk/base/nsNetUtil.cpp#2440
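The sequence described in the comment above can be modeled in a few lines of C++. This is a hypothetical sketch, not Firefox code: `FakeNSSComponent`, `GetNSSService`, `EnsurePSMInitNonDebug`, and the `gInitShouldFail` switch are illustrative stand-ins for nsNSSComponent, the XPCOM service getter, a non-debug net_EnsurePSMInit(), and whatever makes NSS initialization fail on the affected machines. The main-thread check is passed in as a flag rather than queried.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical model of the failure mode; none of these names are Firefox APIs.
enum class ServiceResult { Ok, InitFailed, WouldReleaseAssert };

// Simulates an environment in which NSS initialization fails (assumption).
static bool gInitShouldFail = true;

struct FakeNSSComponent {
  // Stands in for the component's NSS initialization.
  bool Init() const { return !gInitShouldFail; }
};

// The service is cached only after Init() succeeds.
static std::unique_ptr<FakeNSSComponent> gService;

// Models the lazy service getter. The real constructor contains
// MOZ_RELEASE_ASSERT(NS_IsMainThread()); this model reports
// WouldReleaseAssert instead of crashing.
ServiceResult GetNSSService(bool onMainThread) {
  if (gService) {
    return ServiceResult::Ok;  // already initialized: any thread may use it
  }
  if (!onMainThread) {
    return ServiceResult::WouldReleaseAssert;
  }
  auto candidate = std::make_unique<FakeNSSComponent>();
  if (!candidate->Init()) {
    return ServiceResult::InitFailed;  // nothing is cached
  }
  gService = std::move(candidate);
  return ServiceResult::Ok;
}

// Models the non-debug behavior: the result is not enforced.
void EnsurePSMInitNonDebug() {
  ServiceResult result = GetNSSService(/* onMainThread = */ true);
  (void)result;  // a debug build would assert here; release builds move on
}

// Usage: the first main-thread attempt fails silently, so the next
// off-main-thread request reaches the main-thread-only constructor path.
void DemonstrateFailureMode() {
  EnsurePSMInitNonDebug();
  assert(GetNSSService(/* onMainThread = */ false) ==
         ServiceResult::WouldReleaseAssert);
}
```

In this model, if the main-thread initialization had succeeded, later background callers would get the cached instance and never reach the assertion, which is why this signature hides where initialization actually failed.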

Flags: needinfo?(dkeeler)

Agree with Dana. W/o NSS Fx can't live.

Flags: needinfo?(honzab.moz)

Dana, I understand we should rightly crash if NSS isn’t available. Any ideas on why we could be landing here? If there’s nothing we can do here, should we close it as environmental issue?

Flags: needinfo?(dkeeler)

Probably the best course of action is to land some diagnostic assertions to try and see where in NSS initialization Firefox is failing. It may be something we can fix.

Assignee: nobody → dkeeler
Flags: needinfo?(dkeeler)
Priority: -- → P1
Whiteboard: [psm-assigned]
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/2e4a7bcc1a95 add some diagnostic assertions to nsNSSComponent::InitializeNSS to see why it's failing r=KevinJacobs

Crashes as a result of 2e4a7bcc1a95 indicate that InitializeNSSWithFallbacks is
failing. Hopefully this will give us more information as to why.

Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/814e0d966842 add some diagnostic crashes to InitializeNSSWithFallbacks to see why it's failing r=KevinJacobs
Flags: needinfo?(dkeeler)
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/5f58e2f5d1f7 add some diagnostic crashes to InitializeNSSWithFallbacks to see why it's failing r=KevinJacobs

OK, I'm pretty sure my try push now actually covers the build that was failing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ef76b0933815e697101e70840ce17bcdf7ffbe9c

Flags: needinfo?(dkeeler)
Pushed by dkeeler@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/b2ddb30a2d07 add some diagnostic crashes to InitializeNSSWithFallbacks to see why it's failing r=KevinJacobs

So far I've found one report that indicates we're getting SEC_ERROR_PKCS11_DEVICE_ERROR upon calling the last-resort NSS_NoDB_Init: https://crash-stats.mozilla.com/report/index/db6dbe89-2ebf-4a0b-9f9c-6eb5f0190506
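The diagnostic crashes are meant to tell the fallback steps apart, so that a report like this one says which attempt failed and with what error. A minimal sketch of that bookkeeping, assuming the chain is a list of attempts tried in order and ending in a no-database initialization such as NSS_NoDB_Init. `InitAttempt`, `InitOutcome`, `InitializeWithFallbacks`, the step names, and the numeric error codes are hypothetical, and each attempt is a plain callback rather than a real NSS call.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// One initialization strategy (hypothetical). Returns 0 on success,
// otherwise an error code.
struct InitAttempt {
  std::string name;
  std::function<int()> run;
};

// Which step ran last and what it returned, so a diagnostic crash
// can include both in its reason.
struct InitOutcome {
  bool ok = false;
  std::string lastStep;
  int lastError = 0;
};

// Tries each attempt in order and stops at the first success.
InitOutcome InitializeWithFallbacks(const std::vector<InitAttempt>& attempts) {
  InitOutcome outcome;
  for (const InitAttempt& attempt : attempts) {
    int error = attempt.run();
    outcome.lastStep = attempt.name;
    outcome.lastError = error;
    if (error == 0) {
      outcome.ok = true;
      return outcome;
    }
  }
  return outcome;  // every attempt failed; lastStep names the final one
}

// Usage: every attempt fails, ending with the last-resort no-DB attempt.
// The error codes 1, 2, 3 are placeholders, not real NSS error values.
void DemonstrateAllFail() {
  std::vector<InitAttempt> attempts = {
      {"read-write profile DB", [] { return 1; }},
      {"read-only profile DB", [] { return 2; }},
      {"no DB (last resort)", [] { return 3; }},
  };
  InitOutcome outcome = InitializeWithFallbacks(attempts);
  assert(!outcome.ok);
  assert(outcome.lastStep == "no DB (last resort)");
  assert(outcome.lastError == 3);
}
```

In this model, a diagnostic crash would put `lastStep` and `lastError` in its reason string, which is the kind of detail that lets a report point at the last-resort step rather than just "initialization failed".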

Dana, should this be kept open? This is tracking for 67.

Flags: needinfo?(dkeeler)

Those patches don't fix the bug, they just give us more information. At this point, we're still trying to figure out why this is happening.

Flags: needinfo?(dkeeler)

Setting as fix-optional for 67: we are not going to have a fix in time for the release, but if we get a fix after the release and the patch is safe to uplift, I would probably take it in a dot release.

Hi Dana, are there any more diagnostics or info for this one?

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #19)

So far I've found one report that indicates we're getting SEC_ERROR_PKCS11_DEVICE_ERROR upon calling the last-resort NSS_NoDB_Init: https://crash-stats.mozilla.com/report/index/db6dbe89-2ebf-4a0b-9f9c-6eb5f0190506

Kai, is there a reason you can think of why calling NSS_NoDB_Init would result in the error SEC_ERROR_PKCS11_DEVICE_ERROR?

Flags: needinfo?(dkeeler) → needinfo?(kaie)

Does NSS_NoDB_Init() actually still work? Would it make sense to test whether Firefox starts up if you disable the regular calls to initialize NSS in PSM and go straight to the last-resort code?

The error might originate from CKR_DEVICE_ERROR. Some of the PKCS #11 modules are "software devices" from NSS's point of view; maybe NSS has trouble loading them?

IIUC these crashes were first seen with Firefox 67, so maybe they're a side effect of changes in NSS 3.43?
Looking at https://developer.mozilla.org/en-US/docs/Mozilla/Projects/NSS/NSS_3.43_release_notes there isn't anything obvious.

Flags: needinfo?(kaie)

The leave-open keyword is there and there has been no activity for 6 months.
:keeler, maybe it's time to close this bug?

Flags: needinfo?(dkeeler)

No, this is still happening. It's unclear how to proceed, unfortunately.

Flags: needinfo?(dkeeler)

This is 100% reproducible when kernel FIPS mode is enabled with fips=1 on the kernel command line.
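For anyone checking whether a machine is in this configuration: on Linux, the boot parameters are visible in /proc/cmdline, and a kernel with FIPS support reports the active mode in /proc/sys/crypto/fips_enabled. A small sketch for checking both; `CmdlineRequestsFips` and `ReadKernelFipsEnabled` are illustrative helpers, and they only detect the setting, not why NSS initialization fails under it.

```cpp
#include <cassert>
#include <fstream>
#include <optional>
#include <sstream>
#include <string>

// Returns true if the kernel command line (the contents of /proc/cmdline)
// contains the exact token "fips=1".
bool CmdlineRequestsFips(const std::string& cmdline) {
  std::istringstream tokens(cmdline);
  std::string token;
  while (tokens >> token) {
    if (token == "fips=1") {
      return true;
    }
  }
  return false;
}

// Reads the kernel's FIPS flag. Returns std::nullopt if the file cannot be
// read, e.g. on a kernel built without FIPS support.
std::optional<bool> ReadKernelFipsEnabled(
    const std::string& path = "/proc/sys/crypto/fips_enabled") {
  std::ifstream in(path);
  int value = 0;
  if (!(in >> value)) {
    return std::nullopt;
  }
  return value != 0;
}

// Usage: parse sample command lines instead of the live /proc/cmdline.
void DemonstrateCmdlineCheck() {
  assert(CmdlineRequestsFips("BOOT_IMAGE=/vmlinuz ro quiet fips=1"));
  assert(!CmdlineRequestsFips("BOOT_IMAGE=/vmlinuz ro quiet"));
}
```

Checking both is useful because the command-line token only shows what was requested at boot, while fips_enabled shows what the running kernel actually reports.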

Are you using Firefox downloaded from Mozilla or re-packaged by your linux distro?

Flags: needinfo?(doncuppjr)

These are the official binaries from mozilla.org.
I originally discovered this when upgrading from 60ESR to 68.4.1
I bisected up from 60, until I ran into 67. I then worked with Aryx on irc to get a crash report submitted.
https://crash-stats.mozilla.org/report/index/f81c27b4-00da-437e-8041-2715e0200116

Flags: needinfo?(doncuppjr)

(In reply to doncuppjr from comment #28)

This is 100% reproducible when kernel FIPS mode is enabled with fips=1 on the kernel command line.

FIPS issues are probably interesting to RH people, cc'ing.

The leave-open keyword is there and there has been no activity for 6 months.
:keeler, maybe it's time to close this bug?

Flags: needinfo?(dkeeler)

No, this is still an issue. We just don't know how to move forward here.

Flags: needinfo?(dkeeler)

The leave-open keyword is there and there has been no activity for 6 months.
:keeler, maybe it's time to close this bug?

Flags: needinfo?(dkeeler)

We still don't really have an answer for this. Since this particular crash signature essentially masks where the real problem occurred, perhaps our best approach for now would be to change nsNSSComponent's initialization to crash immediately upon encountering an issue, rather than returning an error and then having some other thread try later, which hits the "must be on the main thread" release assertion.
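A minimal sketch of the fail-fast idea proposed above, assuming initialization can be expressed as a list of steps that each return an error code. `InitStep`, `InitializeOrCrash`, and `AbortWithReason` are hypothetical; in Firefox the crash would presumably be a MOZ_CRASH-style call with the reason attached, while here `std::abort()` is the default and a caller can inject a handler that only records the reason.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <functional>
#include <string>
#include <vector>

// One initialization step (hypothetical): 0 on success, error code otherwise.
struct InitStep {
  std::string name;
  std::function<int()> run;
};

using CrashHandler = std::function<void(const std::string& reason)>;

// Default handler: report the reason and terminate immediately.
[[noreturn]] void AbortWithReason(const std::string& reason) {
  std::fprintf(stderr, "fatal: %s\n", reason.c_str());
  std::abort();
}

// Runs every step in order. On the first failure it calls the crash handler
// with a reason naming the step and error, instead of returning an error
// for a later caller (possibly on another thread) to trip over.
void InitializeOrCrash(const std::vector<InitStep>& steps,
                       const CrashHandler& crash = AbortWithReason) {
  for (const InitStep& step : steps) {
    int error = step.run();
    if (error != 0) {
      crash("init step '" + step.name + "' failed with error " +
            std::to_string(error));
      return;  // only reached if an injected handler does not terminate
    }
  }
}

// Usage: record the reason instead of aborting, to see what would be reported.
void DemonstrateFailFast() {
  std::vector<InitStep> steps = {
      {"load modules", [] { return 0; }},
      {"open certificate DB", [] { return 7; }},  // 7: placeholder error code
  };
  std::string reason;
  InitializeOrCrash(steps, [&reason](const std::string& r) { reason = r; });
  assert(reason == "init step 'open certificate DB' failed with error 7");
}
```

The design choice is only about where the failure surfaces: the reason names the failing step on the thread that hit it, instead of leaving an error for a later caller on another thread to turn into the main-thread assertion seen in this bug.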

Flags: needinfo?(dkeeler)

It's no longer an issue for us, as we have moved to a newer ESR release.

Crash Signature: [@ nsNSSComponent::nsNSSComponent] → [@ nsNSSComponent::nsNSSComponent] [@ nsNSSComponent::InitializeNSS] [@ nsNSSComponent::Init]

This crash also happens on the Mac, though only in small numbers. And recently a crash report showed up with interesting mac_crash_info:

bp-34fe2d10-fd78-42ea-a143-c130f0210610

    {
      "num_records": 1,
      "records": [
        {
          "message": "\u043d#\u0012\u0001",
          "module": "/System/Library/PrivateFrameworks/CoreUI.framework/Versions/A/CoreUI"
        }
      ]
    }

Yes, that "message" does need interpretation. Later I'll try to provide it.

(Following up on comment #37)

In the CoreUI private framework, it's an internal _CUILog() method that writes to the "message" field in "mac crash info". I used a disassembler to look through many of that framework's calls to _CUILog(), and none of them comes close to matching the output here. "mac crash info" string fields (like "message") are UTF-8 strings. I suspect this one is in a non-Roman script, and Socorro's stackwalker is unable to interpret it.

So this "message" is useless here.
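For reference, the escaped string shown in the report can be turned back into the UTF-8 bytes the field presumably held. A small sketch with hand-written helpers (`EncodeUtf8` and `HexBytes` are illustrative, not part of any crash-reporting tool): U+043D is CYRILLIC SMALL LETTER EN, and the whole message comes out as the five bytes d0 bd 23 12 01.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Encodes Unicode code points as UTF-8 (no validation of surrogates).
std::string EncodeUtf8(const std::vector<char32_t>& codePoints) {
  std::string out;
  for (char32_t cp : codePoints) {
    if (cp < 0x80) {
      out += static_cast<char>(cp);
    } else if (cp < 0x800) {
      out += static_cast<char>(0xC0 | (cp >> 6));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
      out += static_cast<char>(0xE0 | (cp >> 12));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
      out += static_cast<char>(0xF0 | (cp >> 18));
      out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
      out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
      out += static_cast<char>(0x80 | (cp & 0x3F));
    }
  }
  return out;
}

// Formats bytes as space-separated lowercase hex, e.g. "d0 bd".
std::string HexBytes(const std::string& bytes) {
  std::string out;
  char buf[4];
  for (unsigned char b : bytes) {
    std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
    if (!out.empty()) out += ' ';
    out += buf;
  }
  return out;
}

// Usage: the code points shown in the report: \u043d, '#', \u0012, \u0001.
void DemonstrateMessageBytes() {
  const std::string bytes = EncodeUtf8({0x043D, U'#', 0x12, 0x01});
  assert(HexBytes(bytes) == "d0 bd 23 12 01");
}
```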

Whiteboard: [psm-assigned] → [psm-assigned], qa-not-actionable
Severity: critical → S3

The leave-open keyword is there and there has been no activity for 6 months.
:keeler, maybe it's time to close this bug?

Flags: needinfo?(dkeeler)

Sure - I guess we really don't have anything we can do here.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(dkeeler)
Resolution: --- → WONTFIX