Changes from bug 107491 lead to tons of thread-safety asserts

RESOLVED FIXED

Status

()

defect
P1
normal
RESOLVED FIXED
13 years ago
12 years ago

People

(Reporter: bzbarsky, Assigned: benjamin)

Tracking

Trunk
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9 +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(5 attachments, 1 obsolete attachment)

STEPS TO REPRODUCE:

1)  Compile Seamonkey (don't know how to trigger this code in Firefox, though I'm sure there's a way... in any case, the app is not relevant here).
2)  Type "84582" in the URL bar
3)  Hit enter.

EXPECTED RESULTS:  Error page

ACTUAL RESULT: Tons of thread-safety asserts.

For the first assert:

###!!! ASSERTION: Using observer service off the main thread!: 'Error', file ../../../mozilla/xpcom/ds/nsObserverService.cpp, line 130

the issue is that nsSocketTransport::InitiateSocket ends up calling ErrorAccordingToNSPR from a non-main thread (perfectly OK).  But ErrorAccordingToNSPR now gets the nsINSSErrorService, which leads to PSM component init.  And that code assumes it's on the main thread.  In particular, nsNSSComponent::RegisterObservers is called by nsNSSComponent::Init.

Then this same code calls nsNSSComponent::GetPIPNSSBundleString, which touches the chrome registry from off-thread, and leads to:

###!!! ASSERTION: wrong thread: 'NS_IsMainThread()', file ../../../../mozilla/netwerk/base/src/nsIOService.cpp, line 488

as we try to open channels on a background thread to load stringbundles.  And so forth.

The only approach I see here offhand is to create an nsIRunnable impl that makes sure PSM init happens and to sync-proxy that Run() method to the main thread...
Flags: blocking1.9?
The solution is to use a sync-proxy to get the service.
Assignee: kengert → nobody
Component: Security: PSM → Networking
Flags: blocking1.9? → blocking1.9+
OS: Linux → All
QA Contact: psm → networking
Hardware: PC → All
Assignee: nobody → dcamp
Priority: -- → P1
Benjamin can you help out here?
Assignee: dcamp → benjamin
Benjamin, thanks a lot for working on this bug and your idea with the runnable.

Question regarding 
  nsresult temprv = NS_DispatchToMainThread();
If temprv indicates whether the dispatching, starting the runner worked, then I'm ok with adding the NS_WARNING.


GetXPCOMFromNSSError gets called without knowing whether it's an NSS error.
If GetXPCOMFromNSSError returns a failure (indicating it's unable to convert), that's not a failure for function ErrorAccordingToNSPR. It's just supposed to mean "ok, it's not an NSS error, so let's use the default error code set in ErrorAccordingToNSPR".

So instead of the current
  try to convert, if it worked, then set rv to the conversion result
you do
  try to convert, it the conversion fails, the converter will update
  rv to NS_ERROR_FAILURE. This duplicates ErrorAccordingToNSPR default
  error code in the runnable.


Maybe that's not a big deal, but if we wanted to stay closer to the current behavior and keep control over the default value in ErrorAccordingToNSPR, we could change the patch to only update rv if the runnable indicates a successful conversion.

Benjamin, are you ok with this modified patch?
How about this? It avoids the extra member and IMO is a little easier to read.
Attachment #288998 - Flags: review?(kengert)
Yes, I'm now happy with NSSErrorRunnable and ErrorAccordingToNSPR.

I went on to review your other changes in the file, your attempt to move the call to ErrorAccordingToNSPR out of the locks.

I think in nsSocketInputStream::Available you want to revert the if test.
I think you want
    if (n < 0)   ###
       mCondition = ErrorAccordingToNSPR(code);

But I'm worried, because you'll update mCondition without holding the lock.
I don't know if mCondition is supposed to be protected by the lock, but it shoulds like it should?


This all brings me to a completely different proposal.


I think GetXPCOMFromNSSError is really a trivial function.
While it needs to have access to NSS headers, and therefore must reside in PSM, it doesn't need access to any state (besides the parameters).

I understand the problem comes from running through the NSS init service (which registers observers).
I think it should be fine to implement GetXPCOMFromNSSError without triggering NSS init.

At least the following functions can be decoupled from nsNSSComponent:
  nsINSSErrorsService::isNSSErrorCode
  nsINSSErrorsService::getXPCOMFromNSSError
  nsINSSErrorsService::getErrorClass

We could also decouple
  nsINSSErrorsService::getErrorMessage
although that has some more dependencies, but it should be ok to have another string bundle object.

Let me try something.
This no longer requires init of NSS/PSM component when calling ErrorAccordingToNSPR.

Maybe the original test case still triggers the assertion, we'll have to try that.

If it does, the assertions might also get triggered by docshell when it attempts to obtain an error message... Not sure if that code is reached in your test case. If it does, and we must still isolate it, I think the problem is now much more simple. We could use Benjamin's approach only for getting the error string in docshell. This would not require changing the scope of locks. I hope.
The rule (though it may be unwritten) is that you must not call any XPCOM method while holding a lock. If this were just a direct call into the NSS error provider I might not worry about it... but in this case we're calling into the component manager first, which does locking which could potentially deadlock.

If we want a solution like this I think we at least need to cache the NSS error provider so that we don't have to enter the component manager. :-(
Just wanted to check in on this - we are comming up on M10/b2 and it would be great to wrap this up before then...
Comment on attachment 288687 [details] [diff] [review]
Move ErrorAccordingToNSPR out of locks and sync-proxy to the main thread, rev. 1

I don't know whether rev. 3 is acceptable or not... it sets the status outside of the lock, which may not be acceptable. If that's the case, I think we're going to have to cache the NSS error reporter object somewhere.
Attachment #288687 - Attachment is obsolete: true
Attachment #288687 - Flags: review?(cbiesinger)
re mCondition. I quick look at the code shows that there are tons of places mCondition is set and reverenced without lock protection. (OnSocketReady() RecoverFromError(), OnSocketEvent() OnMsgInputClosed(), OnMsgOutputClosed()).

The only race I see is the potential for a partial setting of mCondition (that is the write to mCondition isn't atomic). That could be handled by using a local to read ErrorAccordingToNSPR(), and protecting the write with a lock or an PR_AtomicWrite() call. This is only useful if all the reference calls are protected by locks or atomicRead calls as well, however.

32-bit writes are atomic on all processors that I'm aware of, so I think we're safe in that regard.
Attachment #288998 - Flags: review?(dcamp)
I want to take a step back.
I think we are over-engineering here.

This whole problem only exists because mozilla/netwerk is unaware of 4 numeric constants, which are defined in mozilla/security/nss:

#define SEC_ERROR_BASE                          (-0x2000)
#define SEC_ERROR_LIMIT                         (SEC_ERROR_BASE + 1000)
#define SSL_ERROR_BASE                          (-0x3000)
#define SSL_ERROR_LIMIT                         (SSL_ERROR_BASE + 1000)


In my very first proposed patch, I duplicated the numerical constants into function ErrorAccordingToNSPR. That was seen as "not elegant".

In a second step I had proposed to use a mechanism at the NSPR level, where we could register the error code ranges used by NSS/libSSL. But that was seen as unwanted, because NSPR and NSS are separate...


I think if the lack of 4 numeric constants leads to threading, locking, and order-of-init problems, I think we have gone too far.

Let's find a way to make mozilla/netwerk or mozilla/nsprpub aware of the range of error codes that are used by NSS and libSSL, and we're done.
(In reply to comment #14)
>
> Let's find a way to make mozilla/netwerk or mozilla/nsprpub aware of the range
> of error codes that are used by NSS and libSSL, and we're done.

... without using XPCOM, compile time knowledge ...
We could hard code the constants into mozilla/netwerk/base/public/nsINSSErrorsService.idl

We could make mozilla/netwerk use these constants directly.


In order to ensure the duplicated constants will not get out of synch between netwerk and security, we could do a run time check. When PSM inits, it could compare the values available in nsINSSErrorsService.idl with the values defined by NSS.

If PSM discovers a difference, it will assert.
This will assure that we update the netwerk code should the values used by NSS ever change.
Comment on attachment 288998 [details] [diff] [review]
Proposed rev. 3

That sounds like a great solution to me... can you make a patch or should I?
Attachment #288998 - Flags: review?(kengert)
Attachment #288998 - Flags: review?(dcamp)
Posted patch Patch v5Splinter Review
Bob, Benjamin, what do you think?
Attachment #291440 - Flags: superreview?(benjamin)
Attachment #291440 - Flags: review?(rrelyea)
Comment on attachment 291440 [details] [diff] [review]
Patch v5

I think you should use PR_STATIC_ASSERT instead of NS_ASSERTION... sr=me with that change http://mxr.mozilla.org/mozilla/source/nsprpub/pr/include/prlog.h#259
Attachment #291440 - Flags: superreview?(benjamin) → superreview+
Comment on attachment 291440 [details] [diff] [review]
Patch v5

+isNSSErrorCode(PRErrorCode code)

uppercase start letter please

and can you linewrap the lines after the && operators?

+#define NSS_NSPRCODE_TO_XPCOMCODE(nspr_code) \

how about making this an inline function?
Posted patch Patch v6Splinter Review
Addressed comments from Benjamin and Biesi.
Carrying forward sr=benjamin
Attachment #291445 - Flags: superreview+
Attachment #291445 - Flags: review?(rrelyea)
Comment on attachment 291445 [details] [diff] [review]
Patch v6

r=wtc.

It seems better to do the PR_STATIC ASSERT the first thing in
nsNSSComponent::InitializeNSS.
Attachment #291445 - Flags: review+
Comment on attachment 291445 [details] [diff] [review]
Patch v6

r+, with some notes:

NOTE: These error limits are compile time in both cases. Basically it means PSM and XPCOM match, which is fine. If it were a real runtime check with NSS, I would have to give an r-.

The primary reason for non-approval of a true run-time check would be PSM should be able to run with new versions of NSS installed. On some platforms NSS is updated independently of mozilla, it shouldn't fail to start simply because a newer version of NSS is installed.

Since translation of the NSS errors into strings is done by PSM, the important value here isn't what errors NSS can return, but which NSS errors PSM understands, so a compile type check in PSM is sufficient, thus the r+.

bob
Attachment #291445 - Flags: review?(rrelyea) → review+
Is this going to land by tonight in time for the B2 freeze?
Comment on attachment 291445 [details] [diff] [review]
Patch v6

Bob, just to clarify -- SEC_ERROR_LIMIT means the upper end of the
range of error codes that NSS has reserved.  It doesn't mean the
current maximum NSS SEC error code.  So SEC_ERROR_LIMIT is unlikely
to change.  Right now SEC_ERROR_LIMIT equals SEC_ERROR_BASE + 1000.
fixed
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
The 4 constants shown in comment 14 were defined prior to 1996, and have
never changed since then.  I think you could copy them into whatever other
file (idl?) needs to know them.  I wouldn't expend too much effort trying 
to detect a change in those values.  I predict they will never change in 
the lifetime of NSS.
Comment on attachment 291440 [details] [diff] [review]
Patch v5

remove obsolte review request.
Attachment #291440 - Flags: review?(rrelyea)
You need to log in before you can comment on or make changes to this bug.