Closed Bug 521849 Opened 15 years ago Closed 14 years ago

SSL broken, may crash [@ ntdll.dll@0x22272] & [@ RtlEnterCriticalSection ] when FIPS softtoken checksum verification fails.

Categories

(Core :: Security: PSM, defect, P1)

x86
Windows 7
defect

Tracking

()

VERIFIED FIXED
mozilla2.0b8
Tracking Status
blocking2.0 --- final+

People

(Reporter: lenjrjr, Assigned: KaiE)

References

(Depends on 1 open bug, )

Details

(Keywords: crash, topcrash, Whiteboard: [psm-fatal])

Crash Data

Attachments

(1 file, 6 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.0 (KHTML, like Gecko) Chrome/3.0.195.25 Safari/532.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.4) Gecko/20091007 Firefox/3.5.4 (.NET CLR 3.5.30729)

When Starting the newest build of firefox you first get the following error.

Could not initialize the application's security componet. The most likely cause is problems with files in your application's profile directory.  Please check that this directory has no read/write restrictions...

after you dismiss the alert, any https url will result in a Secure Connection Failed message in the browser.
 
This seems to be a problem with this build only going back to older versions fixes the problem




Reproducible: Always

Steps to Reproduce:
1. open firefox confirm your receive alert dialog box
2. 
3.
Actual Results:  
notice any sites that require ssl do not work

get error code (Error code: ssl_error_ssl_disabled)
Do you use software on your system that could prevent FF from working like Zonealarm Forcefield ?

Did you already tried to reinstall Firefox ?
Dupe of bug 521878? Are you in FIPS mode?
Matti:  Nope I dont' have any software like zonealarm or any network monitoring software on my system.

Justin:  I think it is a dupe of 521878 but I have not yet confirmed the work around... I will try it and update this bug

Thanks.
Confirming this on Win7 x64 with Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a1pre) Gecko/20091120 Minefield/3.7a1pre (.NET CLR 3.5.30729) and NSS 3.12.4..

Please fix this, because it just sucks to update each day and then revert back to the build from 2009-11-15 (which was the last one without this bug)!

OT: Oh boy, I never thought that I'd need IE to submit a FF-bug.. :(
I'd suggest uninstalling and reinstalling from a fresh, current installer download.

I'm running current trunk (1.9.3) and release (Firefox 3.5) builds on my Windows 7 x64 box, and haven't hit this error. [The box is new as of a week or two ago.]
Also confirmed with win7 x64. Bug happened since the build from Tuesday I think, checked with every release made after that day. Oh and I'm using fips.

   "Justin Dolske [:dolske]      2009-11-20 15:31:49 PST

   I'd suggest uninstalling and reinstalling from a fresh, current installer
   download."

Unfortunately this didn't help either.
I've now encountered the same problem in the newest d2d build from Bas Schouten. After some testing it seems that there is a regression/bug in the fips module.

When you update with fips active you get an error with every encrypted page, additionally you can't disable fips and the master password (it doesn't recognize a password has been set). Also when you look into your saved passwords the list is empty.

Updating with fips disabled everything seems normal (master password, password list), but you can't activate fips in the options.
Hmm, FIPS seems to be the key.

On a current trunk build, I can't enable FIPS at all. The button does nothing, but I get an message in the error console:

NS_ERROR_FAILURE in chrome://pippki/content/device_manager.js line 545, calling nsIPKCS11ModuleDB.toggleFIPSMode.

I can enable FIPS by switching to Firefox 3.5, but then I crash on startup running trunk! eg:

bp-82ff7947-84e4-4fbb-9c6f-3b8fb2091213
bp-7a96dd58-b8c8-4767-880b-b59f42091213
bp-d352bef2-ddc8-4fbe-abb7-b62152091213 (breakpad seemed to choke on this one)

A Firefox 1.9.2 nightly seems to work fine.
Assignee: nobody → kaie
Status: UNCONFIRMED → NEW
blocking2.0: --- → ?
Component: Security → Security: PSM
Ever confirmed: true
Product: Firefox → Core
QA Contact: firefox → psm
Version: unspecified → Trunk
Summary: SSL seems broken on windows 7 → SSL seems broken on windows 7, involves FIPS
Signature	ntdll.dll@0x22272
UUID	82ff7947-84e4-4fbb-9c6f-3b8fb2091213
Time 	2009-12-13 20:27:00.170814
Uptime	1
Last Crash	12 seconds before submission
Product	Firefox
Version	3.7a1pre
Build ID	20091213045416
Branch	1.9.3
OS	Windows NT
OS Version	6.1.7600
CPU	x86
CPU Info	GenuineIntel family 6 model 26 stepping 5
Crash Reason	EXCEPTION_ACCESS_VIOLATION
Crash Address	0x20
User Comments	FIPS! (dolske)
Processor Notes 	
Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 	ntdll.dll 	ntdll.dll@0x22272 	
1 	nspr4.dll 	PR_Lock 	nsprpub/pr/src/threads/combined/prulock.c:233
2 	xul.dll 	nsAutoLock::nsAutoLock 	obj-firefox/dist/include/nsAutoLock.h:219
3 	xul.dll 	nsSSLIOLayerHelpers::isKnownAsIntolerantSite 	security/manager/ssl/src/nsNSSIOLayer.cpp:2230
4 	xul.dll 	nsSSLIOLayerSetOptions 	security/manager/ssl/src/nsNSSIOLayer.cpp:3501
5 	xul.dll 	nsNSSShutDownPreventionLock::~nsNSSShutDownPreventionLock 	security/manager/ssl/src/nsNSSShutDown.cpp:472
6 	xul.dll 	nsSSLIOLayerNewSocket 	security/manager/ssl/src/nsNSSIOLayer.cpp:2250

Roughly this should be a null lock.
Summary: SSL seems broken on windows 7, involves FIPS → SSL seems broken on windows 7, involves FIPS PR_Lock/nsSSLIOLayerHelpers::isKnownAsIntolerantSite
Attached patch guard against failed init (obsolete) — Splinter Review
This doesn't force the init, and it doesn't cause fips to magically work, but it should cause the crash to go away (https will not magically work with this, but we should take it).
Attachment #417437 - Flags: review?(kaie)
Confirmed on Win7 x64 with 3.6b5 - this did not happen with 3.6b4 and IMHO should be fixed _immediately_ with the planned release date in mind!

TIA.
Martin: were you crashing or was SSL just not working? It WFM on 3.6, but not 3.7.

I'm going to nom for blocking 1.9.2 and flag for qawanted to get this on the radar. If there's a latent FIPS+Win7 bug, we should at least verify with a few more system if it's a 3.6 issue or not.
Flags: blocking1.9.2?
Keywords: qawanted
Justin, I don't fully get what you want me to do.. What is "WFM"?

a) It happens on my gaming machine, running the latest nightly (i.e. 3.7pre1, but not with the build from 2009-11-15).
b) It happens on my company's notebook, running 3.6b5, but not with 3.6b4.

Though, both machines run Win7 x64 (one Ultimate, one Enterprise) and regarding b), Firefox keeps crashing as soon as it is launched..

HTH,
 Martin
WFM = works for me. I wasn't able to reproduce either of the problems described here (broken SSL, crashes) with 3.6, only 3.7.

If you're seeing crashes with 3.6 it would be good to get the crash IDs (as in comment 8). Try running a 3.5 build and going about:crashes, and paste the IDs here.
Thanks for explaining me :)

Regarding the crashes, I copied 3.6b4 which WFM and here are the two crash reports (I clicked on them, since I was connected to my company and therefore they weren't submitted up to now..

http://crash-stats.mozilla.com/report/index/bp-e32f6b4e-76ee-4096-826c-f17f62091220
http://crash-stats.mozilla.com/report/index/bp-b18b4fb0-8e39-4321-9202-58bd12091220

Both are quite identically, but it might help and so I submitted them both.
So, the 3.6 reports from comment 15 look similar to the 3.7 reports from comment 8, except that they go through ssl_DupSocket() instead of nsSSLIOLayerHelpers::isKnownAsIntolerantSite(). I'm not sure if I entirely believe that stack frame, though, because I don't see where the caller can actually call ssl_DupSocket. But in any case, I suspect timeless's patch would fix both cases, since it's checking for the condition in an earlier frame.
Leonard - do you have administrator rights on that machine?

I don't think we took a NSS/PSM change on mozilla-1.9.2 between beta 4 and beta 5, so that's an odd regression range. Kai?
I don't think we'll block on this, though the patch looks safe enough to take in a security release if it's needed.
Flags: blocking1.9.2? → blocking1.9.2-
Whiteboard: [3.6.x]
* Clean install + new profile of 3.6 Beta 4: Can enable FIPS, Google login page (SSL) works fine.

* Clean install + new profile of 3.6 Beta 5: Can *not* enable FIPS, error console shows "NS_ERROR_FAILURE in chrome://pippki/content/device_manager.js line 545, calling nsIPKCS11ModuleDB.toggleFIPSMode."

* Clean install + new profile of 3.5.6: Can enable FIPS, Google login page OK.
  ...Installed 3.6 Beta 5: Got "Could not initialize the application's security component" popup on start. Can't load any SSL pages.

* Clean install + new profile of 3.5.6 (again), enabled FIPS: OK
  ...Installed 3.6 Beta 4: SSL pages work OK.

So, it sure looks like we regressed things between B4 and B5.

Renomming, since in addition to largely breaking the browser these users won't get automatic updates (so a 3.6.1 fix doesn't help unless they download it manually). I'm concerned about how much beta tester coverage we have for this scenario.

It would be greatly helpful if Kaie could chime in on what the root cause here might be; I'm not sure if timeless's patch is just wallpapering the problem.


Crash stats show we are already getting some crash reports for this bug. Some, but not all, of these stacks look like those reported above:

http://crash-stats.mozilla.com/query/query?product=Firefox&version=ALL%3AALL&platform=windows&date=&range_value=2&range_unit=weeks&query_search=signature&query_type=startswith&query=ntdll.dll%400x22272&do_query=1

Interestingly, some of these (that involve NSS) are on the 3.5 branch.
Flags: blocking1.9.2- → blocking1.9.2?
Summary: SSL seems broken on windows 7, involves FIPS PR_Lock/nsSSLIOLayerHelpers::isKnownAsIntolerantSite → SSL broken on windows 7, involves FIPS PR_Lock/nsSSLIOLayerHelpers::isKnownAsIntolerantSite, sometimes crash [@ ntdll.dll@0x22272]
Whiteboard: [3.6.x]
It would also be quite helpful if QA could narrow down the regression range here...
It does not happen on the nightly from 12-02-05, but it happens on 12-03-05. I'll add links in a bit.
Ted: see comment 19 and forward - any chance this is you from bug 484799?

romaxa/kaie: how about bug 528184 which made a change to nsNSSIOLayer?
Flags: blocking1.9.2? → blocking1.9.2+
Priority: -- → P1
beltzner: bug 528184 wouldn't do this, if it did, then people could revert to before that fix and crash there, but there's no indication of that.

it's trivial for rebase on a dll to cause the signature to fail which would cause fips to not init. so yes, bug 484799 should have 'caused' the signature failure and the rest of the chain reaction.

can we please try my fix and then deal w/ fixing rebase to happen in such a way that a signature is regenerated?
We backed out bug 484799 on 1.9.2, FWIW, so I wouldn't expect that to be a problem in 3.6b5. It was never shipped in a release build from 1.9.2, since it also conveniently broke release automation.
timeless, I'm asking kaie for a review as hard as I know how to do in bugzilla; if there's another potential reviewer, please cc at will. have you tried your fix locally with justin's str in comment 19 to see if it fixes the problem?
Did some more testing, with my profile FIPS enabled per comment 19 STR...

Latest trunk nightly still crashes. Latest 1.9.2 nightlies (both .zip and .exe) work fine. So, this might not be a 3.6 blocker -- but I want to understand what's happening here before recommending we not block.

Interestingly, my trunk debug build works fine. But if I do a "make package" and run it from the resulting zip, I crash on start. The string being passed to isKnownAsIntolerantSite() is "api-secure.recaptcha.net:443"... Which happens to be for the captchs on our our minefield first-run page (http://www.mozilla.org/projects/minefield/). If I set my homepage to blank and do some startup-fu to avoid triggering the updated-build page, I don't crash -- but SSL is broken.

So, I suspect there are two related things going on here... (1) a startup page with SSL is causing NSS/PSM to start doing stuff before its locks are setup, causing a crash and (2) something is breaking SSL with FIPS. Maybe #2 sets up the condition that allows #1 to happen.
Indeed, BLAPI_SHVerify() fails to verify nssdbm3.dll [DSA_VerifyDigest fails], which seems to be expected from bug 484799 (seems we don't regenerate the NSS .chk files after rebasing).

I think that 484799 *did* ship with Beta 5... B5 was tagged on December 4th @13:16:09, and stuart reverted it on December 4th @16:27:05 -- 3 hours after tagging.

The comments on https://wiki.mozilla.org/Releases/Firefox_3.6b5/BuildNotes makes it sound like build/QA found problems with partial updates due to the rebasing, but we decided to proceed as-is anyway (ie, no relbranch backout).

So I don't think we need to block on this, nor is there a pressing need to take the the crash-fix patch here (afaik, the NSS libraries / checksums should always be valid unless someone's mucked with their bits or got a build we broke, ala 484799).

Clearing blocking flag.
Flags: blocking1.9.2+
Summary: SSL broken on windows 7, involves FIPS PR_Lock/nsSSLIOLayerHelpers::isKnownAsIntolerantSite, sometimes crash [@ ntdll.dll@0x22272] → SSL broken, may crash [@ ntdll.dll@0x22272], when FIPS softtoken checksum verification fails.
Depends on: 484799
I'll back bug 484799 out on trunk to fix the remaining issues there when I get a chance. (or someone else can do so if they have time before I do.)
Severity: major → critical
Thanks for the investigation, Dolske. Much appreciated.
(In reply to comment #19)
> * Clean install + new profile of 3.6 Beta 5: Can *not* enable FIPS, error
> console shows "NS_ERROR_FAILURE in chrome://pippki/content/device_manager.js
> line 545, calling nsIPKCS11ModuleDB.toggleFIPSMode."
> 
> * Clean install + new profile of 3.5.6: Can enable FIPS, Google login page OK.
>   ...Installed 3.6 Beta 5: Got "Could not initialize the application's security
> component" popup on start. Can't load any SSL pages.

One data point.  I'm using:
Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6

I am unable to reproduce these FIPS problems with this build using a fresh profile. 

Are other people able to reproduce?
Using Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6 (installed RC2 yesterday) this does not happen.

However, using latest trunk this _does_ happen!
timeless' patch is simply wallpapering.

We must ensure that no SSL code gets executed prior to PSM module init.

The XPCOM loader code of PSM tries to ensure that any access to the first instance provided by PSM triggers the required init. We must find out why this no longer works.

We need the full stack to the code that triggers the first SSL connection.
Is the crash limited to x86_64, or has anyone seen it on 32 bit, too?

I would have to set up a Win7 on a x86_64 hardware and get an MS C++ debugger installed, in order to debug myself, which seems quite a bit of work.

If anyone of you has such an environment and can reproduce the crash, please set breapoints in each of the following functions:

- nsNSSComponent::Init
- nsNSSComponent::ShutdownNSS
- nsSSLIOLayerHelpers::Init
- nsSSLIOLayerNewSocket
- nsSSLIOLayerAddToSocket
- nsSSLIOLayerHelpers::isKnownAsIntolerantSite

What you should see is:

- nsNSSComponent::Init gets called

I'd like to see the full stack (please attach or mail to me) when you first arrive at nsNSSComponent::Init.

- it proceeds and calls nsSSLIOLayerHelpers::Init and successfully initializes the mutex 

- you shouldn't arrive at nsNSSComponent::ShutdownNSS prior to app exit (or, another call to above Init functions must follow. if it does, please send stacks for these additional calls, too.)

- as soon as the app attempts to use SSL, it will arrive at nsSSLIOLayerNewSocket or nsSSLIOLayerAddToSocket. When you arrive at this breakpoint, please inspect global variable nsSSLIOLayerHelpers::mutex - it should be initialized, not null

Thanks in advance.
On Linux trunk I confirmed using gdb that I get the order of expected events, even when I set the startup page to a https page.
Isn't this duplicate of bug 522041?
(In reply to comment #31)

> One data point.  I'm using:
> Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2) Gecko/20100115
> Firefox/3.6

The patch that was triggering this problem (see comment 28) was backed out after Beta 5 shipped. So current Firefox 3.6 builds, like you're running, would work fine.

I suspect the problem could be reproduced on new builds simply by corrupting the .chk files, so that the checksum fails. Dunno if this is somehow Win7/x64 related.
Thanks Justin, I'm able to reproduce on Linux after corrupting the .chk files.

Some things go wrong very early during PSM/NSS init:
- some code requests a cryptoHash object
- PSM's factory constructor triggers NSS init
- during NSS init we fail to init r/w, fail to init r/o
  and show that error dialog
- unfortunately as soon as we show the error dialog,
  things go parallel
- we get multiple
  ###!!! ASSERTION: Recursive GetService!: 'Error', 
  file mozilla/xpcom/components/nsComponentManager.cpp, line 2211
- cause is that in parallel, we attempt to construct an SSL socket
- again, this goes through PSM's factory constructor
- however, the logic in
  PRBool EnsureNSSInitialized(EnsureNSSOperator op)
  {
    ...
      case nssEnsure:

  allows it to proceed, even though it's still loading.
- that's what causes the crash, an attempt to use PSM's data structures
  although PSM hasn't completed init.

I think my first proposal is to change function
   EnsureNSSInitialized
and protect it against the race.

I propose:
- the first call into it triggers loading (as of today)
- parallel executions shall wait until PSM init completes or fails.

I see another detail which we should fix:
If PSM fails both r/w and r/o attempts, PSM performs a no-database-init of the NSS library, but then PSM remembers a false state.
PSM decides it's "not initialized", although NSS has been initialized.
Consequently, when XPCOM environment shuts down the failed PSM instance, PSM will not shutdown NSS. And when the XPCOM environment attempts to perform another init, it will succeed, even though it won't be FIPS mode, but a no-database session. We should take this crashing opportunity to fix this, too.
@ Bob Relyea

Is it possible for the application to distinguish between
  "NSS init failed because of FIPS constraints"
and
  "NSS init failed because of something else"
?
I have an experimental patch that prevents this crash.
If NSS init fails, it prevents all PSM factory constructors from succeeding.

Unfortunately, this opens multiple questions.

The first thing I learn, when PSM is disabled completely, we won't crash, but one can't do much anyway.
For example, it's impossible to load any web pages. This is probably caused by the failure of the url-classifier service (which seems to depend on PSM).

Is it necessary to disable PSM completely? If in FIPS mode, then yes it is. We must not proceed and must not allow any kind of SSL functionality to work.

As of today, I don't see a way for PSM to distinguish an NSS init failure caused by FIPS constraints, or other failures, e.g. "read access denied to the NSS database file".

I conclude, at least for now, we must go the path to "disable PSM completely".


However, if it were possible to distinguish between FIPS-failure and other NSS init failure, then, in case of failures unrelated to FIPS, we could allow PSM to init in some minimal functional mode, that doesn't access any profile databases. In this mode, at least basic web site accessing would work. While testing this today, I was able to visit secure sites, but I got the broken indicator (because we lack some NSS functionality etc.). Maybe that's better than nothing. Maybe it's not worth it because it creates additional confusion for support if users report such reduced/failing behavior.


I also saw problems when displaying the error message, we shortly crashed afterwards, even with PSM init code fixed. Problem is we run into recursive GetService calls, and get a crash at a random point of time. I conclude we must stop using the dialog and log the error message to the error console.


I propose to attach a patch that prevents the crash, logs to error console instead of showing the dialog, and always disallows the minimal SSL mode after NSS init failures, because it might have been caused by FIPS constraints and we must play it safe.
Kai, for the crash fix (and confirmation of your analyzes) please see bug 522041 comment 33.
(In reply to comment #42)
> *** Bug 522041 has been marked as a duplicate of this bug. ***

Applying in-litmus? to this bug since it was requested on the duped bug
Flags: in-litmus?
Kai, did you manage to make some progress on fixing this bug?

I don't want to boss you around, because there's surely enough stuff going on in your real-life.
Though, I also don't want to "downgrade" my security just to be able to use the latest Nightly and the bugfixes it may include - so I'm stuck with an old Nightly until this one gets fixed..

TIA,
 Martin
Attached patch turn sync alert to async (obsolete) — Splinter Review
Patch from bug 517938.  This makes the dialog display asynchronously preventing instantiations of other classes during nss component initiation.

This is not a final fix, this a quick and safe workaround.
Attachment #426828 - Flags: review?(kaie)
Thank you very much for your contribution! :)
Summary: SSL broken, may crash [@ ntdll.dll@0x22272], when FIPS softtoken checksum verification fails. → SSL broken, may crash [@ ntdll.dll@0x22272] & [@ RtlEnterCriticalSection ] when FIPS softtoken checksum verification fails.
(In reply to comment #38)
> - we get multiple
>   ###!!! ASSERTION: Recursive GetService!: 'Error', 
>   file mozilla/xpcom/components/nsComponentManager.cpp, line 2211
> - cause is that in parallel, we attempt to construct an SSL socket
> - again, this goes through PSM's factory constructor
> - however, the logic in
>   PRBool EnsureNSSInitialized(EnsureNSSOperator op)
>   {
>     ...
>       case nssEnsure:
> 
>   allows it to proceed, even though it's still loading.
> - that's what causes the crash, an attempt to use PSM's data structures
>   although PSM hasn't completed init.
> 
> I think my first proposal is to change function
>    EnsureNSSInitialized
> and protect it against the race.
> 
> I propose:
> - the first call into it triggers loading (as of today)
> - parallel executions shall wait until PSM init completes or fails.
> 


Agree.  I can see a fix now.  It is quit simple to change EnsureNSSInitialized function.

Problem is in the 'loading' boolean.  It must be changed to 'loadingThread' set to PR_GetCurrentThread() on the first load (if null).  If the same thread re-enters, let it pass (return true) as we do now.  If another thread enters 'nssEnsure', then try to instantiate the service on that thread.  That will enter the xpcom manager monitor and will let the other thread(s) wait until we exit (successfully or not) the first instance of the service constructor.  Quit easy, I just don't know how to synchronize the new loadingThread global...
Attached patch synchronized nss ensure (obsolete) — Splinter Review
Tested patch.  All threads waits for the constructor to finish while preventing its re-entrance on the first invoking thread.

I found another possible problem with nss init: we do not ensure the nss component is first created on the main thread.  EnsureNSSInitialized should do sync proxy object creation to the main thread.
Assignee: kaie → honzab.moz
Attachment #417437 - Attachment is obsolete: true
Attachment #426828 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #445821 - Flags: review?(kaie)
Attachment #417437 - Flags: review?(kaie)
Attachment #426828 - Flags: review?(kaie)
Whiteboard: [psm-fatal]
blocking2.0: ? → final+
Wan-Teh, would you be willing to review this patch?
I believe we can see the following state, i.e. stack frame:

--- highest stack frame ---
your new code: { case nssEnsure: if (state == NSS_STATE_LOADING && loadingThread == PR_GetCurrentThread()) return PR_TRUE; }
psm: ensure nss init already(nssEnsure)
psm: construct ssl socket object                   ----> higher psm stack frame
request a psm/nss ssl socket
still in nsNSSComponent constructor, showing error dialog, spinning event loop
psm: problems during init, create error dialog
psm: ensure nss init(nssLoading)
psm: construct object nsNSSComponent
psm: ensure nss init(nssEnsure) => not yet
psm: construct hash object                         ----> lower psm stack frame
request psm hash object
--- lowest stack frame ---


In this scenario, your new code returns PR_TRUE, which means "init has completed, it's fine to construct the object", but the original init is still pending (and is failing).

Is there a mistake in my thinking?
You are right.  When we are still loading (are in the constructor) and we are re-entered on the same thread, we have to check for IsNSSInitialized.  If it is, return OK, if it is not, return ERROR.
Comment on attachment 445821 [details] [diff] [review]
synchronized nss ensure

r- because of the previous comments.
Attachment #445821 - Flags: review?(kaie) → review-
This seems to be yet more complicated:

- if we are not able to initialize NSS we still allow nsNSSComponent to instantiate as well as anything that depends on it
- after we introduced IsNSSInitialized, only the first component that demands NSS, will be prevented to instantiate itself ; other will pass even there is no NSS as we no longer check for IsNSSInitialized flag

Seems like we leave nsNSSCompoenent be instantiated even there is no NSS purposely.  So, I'll change the patch to allow other components be instantiated only when NSS has been setup, check that always.  Is that correct?
With that approach we will get a bad error message when loading https page.  Normally we would get "Secure Connection Failed, ssl_error_ssl_disabled", with the patch I work on we get "Unexpected response from server" even we didn't any request, actually.  This is caused by missing SSL socket provider.
Attached patch v2 (obsolete) — Splinter Review
- introduced new load state and operator: nss-down
- this state is set instantly after we fail to init nss
- this state is left set even after we instantiate nss component
- when this state is set, no component other than nss component is able to instantiate
- re-entrance of the nss component constructor is allowed on the same thread only (as in the previous version)
- on a different thread we do_GetService (as in the previous version)
- initiation success is checked through the load state == NSS_COMPONENT_UP and not anymore by IsNSSInitialized flag (load state has the exactly same meaning now)
- nss module now allows a custom error to return when a component failed to instantiate because of failed nss initiation
- for socket providers we return NS_ERROR_SSL_DISABLED error
- this error is then handled in SocketTransport ; this leads to a correct error message displayed to the user
- when nss init fails, with this patch we get empty lists of certificates and security devices, what is IMO pretty correct
- we log to the console (but I'm not sure of the code, I cannot find the message displayed anywhere in the current Minefield.. old good error console is gone..)
Attachment #445821 - Attachment is obsolete: true
Attachment #480749 - Flags: review?(kaie)
OS: Windows 7 → Windows XP
OS: Windows XP → Windows 7
Attached patch v2.1 (obsolete) — Splinter Review
This one is simpler, what has remained:
- introduced new load state and operator: nss-down
- this state is set instantly after we fail to init nss
- this state is left set even after we instantiate nss component
- when this state is set, no component other than nss component is able to
instantiate
- re-entrance of the nss component constructor is allowed on the same thread
only (as in the previous version)
- on a different thread we do_GetService (as in the previous version)
- initiation success is checked through the load state == PSM_UP and
not anymore by IsNSSInitialized flag (load state has the exactly same meaning
now)
- when nss init fails, with this patch we get empty lists of certificates and
security devices, what is IMO pretty correct
- we log to the console (but I'm not sure of the code, I cannot find the
message displayed anywhere in the current Minefield.. old good error console is
gone..)

v2 -> v2.1:
- we allow creating of SSL and TLS providers when there is no NSS but we do not allow them to create a socket ; this displays the correct error to user and removes the huge change to nsNSSModule from the v2 patch
Attachment #480749 - Attachment is obsolete: true
Attachment #481075 - Flags: review?(kaie)
Attachment #480749 - Flags: review?(kaie)
Blocks: 602081
Attached patch v2.2 (obsolete) — Splinter Review
Honza gave me this merged patch today
I don't understand why patch v2.2 is better, in the scenario described in comment 52.

On the positive side, you have disabled the error dialog box that reports "unable to init NSS", as a consquence, the scenario from comment 52 is now less likely.

However, maybe you haven't solved the core problem yet.
Is there any other code path where PSM init can block and respin the event loop?

If there is, based on the example from comment 52, the following will happen in the "higher psm stack frame":
- loadState is still PSM_LOADING
- the higher stack calls Ensure...(nssEnsure)
- your code is:
    case nssEnsure:
      ..
      case PSM_LOADING:
        ..
        if (loadingThread == PR_GetCurrentThread())
          return PR_TRUE;
(In reply to comment #60)
> the "higher psm stack frame":
> - loadState is still PSM_LOADING

It's no longer PSM_LOADING, it is now NSS_DOWN_PSM_UP and we return PR_FALSE in that case not allowing to instantiate.  This is valid for example from comment 52.

I haven't found a code that would spin the loop before we know the NSS init state.  I'll recheck this ones again to be 100% sure.
On many places in nsNSSComponent::InitializeNSS I call EnsureNSSInitialized(nssNSSDown) that turns the state to NSS_DOWN_PSM_UP.  That always happens before we output the warning.
For the last week, this is number #1 crash on the trunk.
Keywords: topcrash
Honza, maybe your code is right. The trouble is, this code is becoming very complex, with many many states. Maybe we should avoid this additional complexity.

I'm now favoring a much simpler approach.

Maybe it's better to give up finding a very smart and complete solution, and try to find a simple patch that will result in all PSM code to be unavailable, as soon as we detect the failure.

I now believe the trouble is that we bring up the error dialog, before we remember the failure.

As soon as we discover that we can't init NSS, we could set a "panic flag", that won't ever be changed again until application restart.
As soon as the "panic flag" is set, any other attempt to run PSM code should fail.

This panic flag should be set before we bring up the error dialog, making sure the NSS panic state is known at the time things go parallel.
Here is an alternative approach.


Ok, here is an updated test case, which reliable crashes for me on Linux, and which is fixed using this very simply patch.


- use older ff to set up a profile for testing this patch (call it FIPS ?)
- use prefs/advanced/encryption/sec. devices and enable fips mode
- quit
- go to your binary FF 4 directory and corrupt file libfreebl3.chk
  (move original away, use any other small file, and copy it to this name)
- start firefox 4 using this command line:
    ./firefox -P FIPS -no-remote http://kuix.de http://kuix.de

On startup, I get the warning dialog box, then ff exits with a crash.


With this patch, I get the warning dialog box (wanted), Firefox starts up and works. No crash for me.
Well, you can't do much with this Firefox instance, because as soon as you try to open any web page, you get an error on the console, because the url-classifier-service doesn't work (expected with no NSS available).
However, URLs like about:config and file:/// work fine.

So, in my opinion, this could be a minimal patch that fixes the crash globally, and makes sure that PSM is completely disabled.
Attachment #488979 - Flags: review?(honzab.moz)
True is that I am not able to understand my own patch after few weeks.  What you suggest is working, clean and simple.

But we have few other problems we should probably fix in followup bugs:
- are we sure we don't want to allow PSM at all when NSS init fails?
- when another thread is demanding PSM, it may pass through because of the 'loading' flag and access NSS before it has been really initialized
- a subject for discussion: should we allow URL classifier bypass checks when there is no PSM?  but as I heard there are movements to do not make it dependent on the PSM at all (by embedding FreeBL) so we are probably OK for future
Comment on attachment 488979 [details] [diff] [review]
v3, alternative approach

I cannot force my Fx4 trunk build to crash using your STR in comment 65, however this is very clean patch.

r=honzab

Thanks Kai.
Attachment #481075 - Attachment is obsolete: true
Attachment #481075 - Flags: review?(kaie)
Attachment #486733 - Attachment is obsolete: true
Comment on attachment 488979 [details] [diff] [review]
v3, alternative approach

Forgot to set the flag correctly.
Attachment #488979 - Flags: review?(honzab.moz) → review+
Assignee: honzab.moz → kaie
fixed
http://hg.mozilla.org/mozilla-central/rev/60fa0e6a4365
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #63)
> For the last week, this is number #1 crash on the trunk.

The RtlEnterCriticalSection crashes you're looking at are mostly not this bug. The bulk (495) are plugin crashes and those are something else.

http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b8pre&range_value=1&range_unit=weeks&query_search=signature&query_type=exact&query=RtlEnterCriticalSection&process_type=plugin&do_query=1

Of the 117 that are browser crashes...

http://crash-stats.mozilla.com/query/query?product=Firefox&version=Firefox%3A4.0b8pre&range_value=1&range_unit=weeks&query_search=signature&query_type=exact&query=RtlEnterCriticalSection&process_type=browser&do_query=1

...the graph shows a huge jump on builds from 11/1 through 11/5. During that bulge, and not before or after, a lot look like
bp-8f82ff40-c410-422f-8ecf-325bf2101105 (scanning pushloghtml I don't see any likely fixes on 11/5 that made this go away). Those aren't this bug either.

There are several other random other stacks under RtlEnterCriticalSection (js-gc, layout, etc).

The ones related to this bug look like bp-d9ccad01-72dd-4afb-9533-de2952101107 which are coming in at a fairly slow and steady rate. Enough that it will be clear whether or not this patch fixes them, but nowhere near enough to account for topcrash status.
Dan, thanks a lot for the great explanation.

I'm glad to hear that the number of crashes related to this bug is small.
(If it had been higher, we would have had to investigate why so many people crashed with nonworking fips mode...)
setting milestone to b8 (I believe it landed after b7 got tagged)
Target Milestone: --- → mozilla2.0b8
Litmus testcase added:
https://litmus.mozilla.org/show_test.cgi?id=13884

Can someone please review this test? Thanks.

Verified FIXED using this test on Minefield 4.0b8pre 2010-12-02.
Status: RESOLVED → VERIFIED
Flags: in-litmus? → in-litmus+
Crash Signature: [@ ntdll.dll@0x22272] [@ RtlEnterCriticalSection ]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: