Open Bug 430917 Opened 17 years ago Updated 3 years ago

Better pthread error checking in optimized builds

Categories

(NSPR :: NSPR, defect)

All
SunOS
defect

Tracking

(Not tracked)

People

(Reporter: julien.pierre, Unassigned)

Details

Attachments

(2 files)

We recently had an escalation where the problem turned out to be a broken pthread system library. We discovered that NSPR asserts on many pthread errors, but completely ignores them in the optimized build, even if there is an opportunity to take some action. The specific issue we ran into was with a pthread_mutex_init failure as part of in PR_NewLock . Since we run with pthreads on most platforms nowadays, I think there is some value in improving the error checking in optimized builds, so that we can diagnose this sort of problem earlier.
In particular, in ptsync.c, in non-debug builds, - PR_NewLock ignores the return value from _PT_PTHREAD_MUTEX_INIT (which is essentially pthread_mutex_init) - PR_Lock ignores the return value from pthread_mutex_lock - PR_Unlock ignores the return value from pthread_mutex_unlock These pthread_mutex calls can legitimately fail with various errors, such as EAGAIN (EINTR), etc. This is also an issue at higher layers. In prinit.c _PR_InitStuff and _PR_InitCallOnce ignore failures of PR_NewLock In prenv.c, _PR_InitEnv ignores failure of PR_NewLock, PR_GetEnv and PR_SetEnv ignore failures of PR_Lock and PR_Unlock There are many more examples. These above are what we ran into.
Version: other → 4.7
Error handling: We should handle the failure of pthread_mutex_init in PR_NewLock and the failure of pthread_mutex_unlock in PR_Unlock. But it doesn't make sense to handle pthread_mutex_lock failure in PR_Lock because PR_Lock returns void. Assertions: - Remove the assertion in PR_NewLock. - Keep the assertion in PR_Lock. - Although PR_Unlock can report failure, we should keep the assertion in PR_Unlock because most callers of PR_Unlock don't check its return value. Error codes: pthread_mutex_lock and pthread_mutex_unlock should not return EINTR and EAGAIN: http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
This bug is open-ended, so we should take a focused approach and target the functions for which error handling is most important. I didn't fix PR_Unlock because we also need to fix pt_PostNotifie.
Attachment #317952 - Flags: review?(julien.pierre.boogz)
(In reply to comment #2) > it doesn't make sense to handle pthread_mutex_lock failure in PR_Lock > because PR_Lock returns void. And that's a defect, frankly. > Error codes: > > pthread_mutex_lock and pthread_mutex_unlock should not return > EINTR and EAGAIN: > http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html The page you cite says that it CAN return EAGAIN, and gives a reason. The reason given sounds like an implementation limitation, but the fact that it's documented means that it clearly can happen in some implementations. The page says that it should not return EINTR. Now, IIRC, on some platforms, EAGAIN and EINTR are the same value and indistinguishable. (My memory may be faulty about this) If that is so, then the fact that EAGAIN explicitly is allowed, means we cannot ignore EINTR when it has the same value as EAGAIN.
Initialization failures (including failures of implicit initialization) must be detected and reported. Reporting a failure of PR_NewLock does little good unless it has an impact on initialization, and an initialization failure has impact on its caller. As for this bug being "open ended", I disagree. There are a finite number of functions in NSPR, a finite number of functions that call implicit initialization, and a finite number of functions calls by NSPR initialization. When NSPR initialization doesn't ignore failures of the functions it calls, and reports failure to its caller, and the functions that call implicit initialization don't ignore its failure, then the problem is fixed, IMO. It's big, yes, but not open ended. Now, if you prefer, I can file bugs against all those components individually. We could have a hierarchy of bugs to match the hierarchy which is the call graph of functions that call, or are called by, NSPR initialization. We need to do more than merely assert in PR_Lock. Continuing on without taking some action to either call attention to the failure or to crash only serves to make the caller fail in some way that is difficult to diagnose, and will probably lead to more accusations against NSPR. I think we should retry those operations that can fail with EAGAIN. other failures should perhaps lead to a call to abort(). If we're going to fail, IMO, we do ourselves the least damage (politically) by calling attention to the underlying problem as quickly as possible. In the case of the NewLock failure, because the error was ignored, a great deal of time (months) was spent diagnosing the problem, and at the end of that time, there was a great deal of ill will accumulated towards NSPR. Had NSPR acted to bring attention to the underlying failure quickly, the issue could have been quickly deflected to its source, without the ill will that it (and we) now suffer.
Wan-Teh, Just to be clear, I'm not advocating that we remove any assertions that are already present. These should stay for debug builds, in case all the callers don't check the values - and we know they don't always check. I'm advocating that we add error checking code in optimized builds, so that those callers that do check get the opportunity to take the appropriate action.
It's EWOULDBLOCK that may have the same value as EAGAIN. EINTR is different from EAGAIN. NSPR doesn't use recursive pthread mutexes, so the EAGAIN error in the man page doesn't apply to NSPR. I understand what you want to do about those assertions. I only removed assertions from PR_NewXXX because the functions now return a NULL pointer when the pthread_xxx_init functions fail, and our callers will crash if they don't check the return value. I didn't return the assertions from PR_Lock and PR_Unlock.
Comment on attachment 317952 [details] [diff] [review] Fix PR_NewLock, PR_NewCondVar, PR_NewMonitor, and PR_Lock This patch removes assertions in debug builds in several cases : 1) In pt_PostNotifies : the assert of _PT_PTHREAD_MUTEX_INIT return status is gone . is gone and needs to be restored. I would suggest PR_ASSERT(0) before your PR_Free(lock); 2) Same problem in pt_PostNotifyToCvar for the return value of _PT_PTHREAD_COND_INIT . Please add PR_ASSERT(0) before PR_Free(cv) . 3) Same problem in PR_NewMonitor with _PT_PTHREAD_MUTEX_INIT . Please restore the assertion before the new if statement . Same thing for _PT_PTHREAD_COND_INIT . I only checked the content of this patch. I didn't check the code for completion. There are probably other pthread calls that need more checks in the optimized build.
Attachment #317952 - Flags: review?(julien.pierre.boogz) → review-
BTW , thanks for taking a crack at this, Wan-Teh. I meant for this bug to be on my plate as I assigned it to myself.
pthread_mutex_init may fail because we're out of memory or system resources: http://www.opengroup.org/onlinepubs/007908775/xsh/pthread_mutex_init.html So asserting that pthread_mutex_init doesn't fail is like asserting malloc doesn't fail. We don't assert malloc doesn't fail, so we shouldn't assert that pthread_mutex_init doesn't fail. The original assertions were there because we're too lazy to write proper error handling code. We don't need both.
in reply to comment 10, The fact that we hide failures from applications that use NSPR, rather than calling attention to those failures, means that people who support and/or sustain applications that use NSPR may spend a lot of time (months) trying to find failures that were hidden by NSPR. Those people would rather that NSPR calls attention to these failures than hide them. The failures will cause the application to fail, and they'd prefer that the cause of the failure be made obvious rather than be inobvious.
Re: comment 10 and 11, Regarding the fact that we don't assert on out of memory errors, I think this is debatable behavior. I would prefer if we did. It is not likely that most applications can safely recover from such errors. Usually, this error will just cause some subsequent failure later on, and should be considered fatal. So, more time will be spent figuring out exactly when the application ran out of memory. It might as well be detected at the earliest possible time. Usually, debug NSPR builds only get installed for a reason - because somebody is trying to diagnose a problem. So, I would be in favor of having that assert as well.
I have been taking a closer look at this problem. Unfortunately, there are many functions returning void in NSPR. This makes it very hard to detect any problem during NSPR initialization in optimized builds, especially for the implicit initialization case. The functions that return void that I see as a direct problem for this bug are : _PR_ImplicitInitialization _PR_InitStuff _PR_InitLocks These prototypes are private and can be changed, but it is significantly more work than just checking the pthread system call return values. To do this correctly will have an impact on source code for many non-pthread platforms.
For correctness, a large number of other initialization functions should not be returning void. Basically, everything called from _PR_InitStuff . Many errors can occur in them during initialization, but are never checked. Some of these problems could be pthread errors and go undetected. Here is the list of functions that should not return void. PR_MD_EARLY_INIT _PR_InitAtomics _PR_InitSegs _PR_InitStacks _PR_InitTPD _PR_InitEnv _PR_InitLayerCache _PR_InitClock _PR_InitGarbageCollector _PR_InitThreads _PR_InitCPUs _PR_InitMem _PR_MD_FINAL_INIT
Also : _PR_InitCMon _PR_InitIO _PR_InitNet _PR_InitLog _PR_InitLinker _PR_InitCallOnce _PR_InitDtoa _PR_InitTime _PR_InitMW _PR_InitRWLocks nspr_InitializePRErrorTable
This patch tries to specfically address the issues from comment 13 . While it is large, it is a simple and repetitive patch and should not be hard to review.
Attachment #342381 - Flags: review?
Attachment #342381 - Flags: review? → review?(nbolyard)
Attachment #342381 - Flags: review?(nelson) → review-
Comment on attachment 342381 [details] [diff] [review] Propagate errors up the stack This patch does one thing many times, and a few other things. The one thing that it does many times is to replace one line with a set of 3 lines, as shown immediately below. - if (!_pr_initialized) _PR_ImplicitInitialization(); + PRStatus initStatus = PR_SUCCESS; + if (!_pr_initialized) initStatus = _PR_ImplicitInitialization(); + if (PR_SUCCESS != initStatus) return NULL; All those replacements are identical except for the value returned. In most cases, it is NULL or PR_FAILURE, but in some cases it is another value, such as 0, -1, PR_INVALID_IO_LAYER, and others. My first observation is that those 3 lines are just CRYING out to be replaced with a macro, something like: #define PR_INIT_OR_FAIL(rv) \ PRStatus initStatus = PR_SUCCESS; \ if (!_pr_initialized) initStatus = _PR_ImplicitInitialization(); \ if (PR_SUCCESS != initStatus) return rv invoked as PR_INIT_OR_FAIL(NULL); or PR_INIT_OR_FAIL(PR_FAILURE); Please use a macro like that, rather than repeating those 3 lines everywhere. This has the benefit that if we ever again decide to change these tests, we can change them all by changing one macro. Aside from that desire to see them all turned into macros, most the changes that return NULL, PR_FAILURE or PR_INVALID_IO_LAYER appear to be correct. If they were macros already, I would give most of those changes r+. However, some of those cases have the problem that they introduce a line that defines a new variable AFTER some other executable code has been executed. While some very modern c compilers allow this, some older ones that we support do not (e.g. AIX, IINM). For example, in file pr/src/io/prio.c, and in file pr/src/pthreads/ptio.c, those 3 lines are inserted after a PR_ASSERT call. The patch for the file pr/src/md/unix/uxwrap.c inserts those lines in the middle of a function, after numerous other executable lines. There may be other examples of this. I didn't try to find them all. Among the other cases, where the value being returned is not one of NULL, PR_FAILURE, or PR_INVALID_IO_LAYER, there are a number of problems. The most common problem is that, in most cases, the value being returned does not signify failure. Many of these functions can legitimately return all possible 32-bit values, and no value unambiguously means that a failure occurred. Functions for which this is the problem (not necessarily a complete list) include: - _MD_AtomicSet() in pr/src/md/beos/bmisc.c [the patch returns NULL, even though the type of the return value is PRInt32. But there is no "failure" value for that function. It always returns the value that it replaced. - in file pr/src/misc/pratom.c, the functions _PR_MD_ATOMIC_INCREMENT _PR_MD_ATOMIC_ADD, _PR_MD_ATOMIC_DECREMENT, _PR_MD_ATOMIC_SET, this patch makes those functions return 0 if initialization fails, but 0 is a perfectly legitimate return value, not indicating failure. - in file pr/src/misc/prdtoa.c, the patch makes function PR_strtod return 0, but 0 does not indicate failure for that function. - the change to PR_Init in prinit.c is so pointless that I would say it is wrong. - the patch to file pr/src/misc/prinrval.c causes PR_IntervalNow to return PR_INTERVAL_NO_WAIT, and causes PR_TicksPerSecond to return PR_INTERVAL_MIN, but those values do not indicate failure. They are perfect allowable values, and presumably there is at least one system on which PR_INTERVAL_MIN is the most correct value for the success case. - The patch to file pr/src/threads/prcthr.c makes PR_SetCPUAffinityMask return 0, but 0 is the only value it ever returns. Also in that file, the changes to PR_SetThreadGCAble, PR_ClearThreadGCAble, and PR_GetThreadScope do not cause those functions to return failure values. In file pr/src/misc/prinit.c, the change to function _PR_InitStuff will cause it to return PR_SUCCESS if it is called one one thread while it is already in the process of executing on another thread, even if that other thread subsequently fails. I think the patch to function _PR_ImplicitInitialization is incomplete. The patch makes it always return PR_SUCCESS, even if either of the functions it calls returns PR_FAILURE. It pays no attention to the values returned by the functions it calls. If the purpose of this patch is to propagate failures up the stack, then all the callers to _PR_InitStuff should pay attention to the return value, and that function should likewise pay attention to the values returned by the functions it calls. It presently ignores the return values of all but one of the functions it calls. (Perhaps this patch was intended to be the first of several?) REgarding the public functions that are defined as returning void, I think what you want to do is to define new functions that do the same things, but return a PR_STATUS, and then re-implement the void functions to call the function that returns a status, but then ignore the return value, and return. - All the functions that return PR_STATUS should set an error code before returning PR_FAILURE. Any function that is patched to return PR_Status instead of void should follow this rule. A single example of a function that is changed, but does not follow this rule is _PR_InitLocks(). (hmm, having said that, is it possible to set an NSPR return value if NSPR initialization fails? I don't know. :( ) - pr/src/md/unix/uxwrap.c contains implementations of the Unix functions poll and select. Those functions do have well defined return values that unambiguously means failure, and this patch returns those values. However, it should set the value in errno before returning -1, because that is how those functions are defined.
Please write the implicit initialization call like this, without using the initStatus variable: if (!_pr_initialized && _PR_ImplicitInitialization() != PR_SUCCESS) return NULL; You can turn this into a macro as Nelson suggested.
Nelson, Re: comment 17 1) I have no problem turning these statements into macros. You didn't need to spend as much time saying that. 2) I compiled the patch as-is with our current AIX compiler, and there was no error, nor even a warning. I didn't try every single platform, but I think the compilers have finally modernized. That said, I will still reimplement the logic as a macro without the intermediate variable as suggested by Wan-Teh, since I don't think that variable is needed for anything. 3) Regarding functions that don't have explicit failure return values, what would you suggest we do ? An assertion of debug builds would seem reasonable at least. However, this bug is specifically about optimized builds. 4) Re: the change to PR_Init, you are right that it doesn't accomplish anything as written. I think I meant to add an assertion in there. I think it is very unfortunate that PR_Init is declared as returning void, and publicly exported. I am not sure what to do for the optimized build case, which is the subject of this bug. 5) You are right, the patch to PR_ImplicitInitialization is incomplete. I meant to make it check the return status of _PR_InitStuff . 6) regarding _PR_InitStuff being called from multiple threads, concurrent initialization of NSPR is specifically unsupported, and this patch does not attempt to change that behavior. 7) The scope of the patch, as indicated when I was submitted it, was only to propagate errors within the functions listed in comment 13 . That's why I did not attempt to change the return status of the other functions. I am well aware of the other issues, as you can read in comment 14 and 15 . This is a lot more work and it may be preferable to do it in phases. Those functions that do not relate to POSIX errors may be best treated in other bugzilla bugs or RFEs. 8) I will take your advice as far as adding new public functions that don't return void. I don't think it's possible to set an error code if NSPR initialization fails because we need access to thread private data. 9) I will take a look at setting errno for those and poll and select cases. I wasn't sure if there was a portable way to do that.
Assignee: julien.pierre.boogz → wtc

The bug assignee is inactive on Bugzilla, so the assignee is being reset.

Assignee: wtc → nobody
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: