Open
Bug 430917
Opened 17 years ago
Updated 3 years ago
Better pthread error checking in optimized builds
Categories
(NSPR :: NSPR, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: julien.pierre, Unassigned)
Details
Attachments
(2 files)
4.37 KB,
patch
|
julien.pierre
:
review-
|
Details | Diff | Splinter Review |
54.75 KB,
patch
|
nelson
:
review-
|
Details | Diff | Splinter Review |
We recently had an escalation where the problem turned out to be a broken pthread system library. We discovered that NSPR asserts on many pthread errors, but completely ignores them in the optimized build, even if there is an opportunity to take some action. The specific issue we ran into was with a pthread_mutex_init failure as part of in PR_NewLock .
Since we run with pthreads on most platforms nowadays, I think there is some value in improving the error checking in optimized builds, so that we can diagnose this sort of problem earlier.
Comment 1•17 years ago
|
||
In particular, in ptsync.c, in non-debug builds,
- PR_NewLock ignores the return value from _PT_PTHREAD_MUTEX_INIT
(which is essentially pthread_mutex_init)
- PR_Lock ignores the return value from pthread_mutex_lock
- PR_Unlock ignores the return value from pthread_mutex_unlock
These pthread_mutex calls can legitimately fail with various errors,
such as EAGAIN (EINTR), etc.
This is also an issue at higher layers. In prinit.c
_PR_InitStuff and _PR_InitCallOnce ignore failures of PR_NewLock
In prenv.c, _PR_InitEnv ignores failure of PR_NewLock,
PR_GetEnv and PR_SetEnv ignore failures of PR_Lock and PR_Unlock
There are many more examples. These above are what we ran into.
Version: other → 4.7
Comment 2•17 years ago
|
||
Error handling:
We should handle the failure of pthread_mutex_init in PR_NewLock
and the failure of pthread_mutex_unlock in PR_Unlock. But it
doesn't make sense to handle pthread_mutex_lock failure in PR_Lock
because PR_Lock returns void.
Assertions:
- Remove the assertion in PR_NewLock.
- Keep the assertion in PR_Lock.
- Although PR_Unlock can report failure, we should keep the
assertion in PR_Unlock because most callers of PR_Unlock
don't check its return value.
Error codes:
pthread_mutex_lock and pthread_mutex_unlock should not return
EINTR and EAGAIN:
http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
Comment 3•17 years ago
|
||
This bug is open-ended, so we should take a focused approach
and target the functions for which error handling is most
important.
I didn't fix PR_Unlock because we also need to fix pt_PostNotifie.
Attachment #317952 -
Flags: review?(julien.pierre.boogz)
Comment 4•17 years ago
|
||
(In reply to comment #2)
> it doesn't make sense to handle pthread_mutex_lock failure in PR_Lock
> because PR_Lock returns void.
And that's a defect, frankly.
> Error codes:
>
> pthread_mutex_lock and pthread_mutex_unlock should not return
> EINTR and EAGAIN:
> http://www.opengroup.org/onlinepubs/009695399/functions/pthread_mutex_lock.html
The page you cite says that it CAN return EAGAIN, and gives a reason.
The reason given sounds like an implementation limitation, but the fact that
it's documented means that it clearly can happen in some implementations.
The page says that it should not return EINTR.
Now, IIRC, on some platforms, EAGAIN and EINTR are the same value and
indistinguishable. (My memory may be faulty about this)
If that is so, then the fact that EAGAIN explicitly is allowed, means
we cannot ignore EINTR when it has the same value as EAGAIN.
Comment 5•17 years ago
|
||
Initialization failures (including failures of implicit initialization)
must be detected and reported.
Reporting a failure of PR_NewLock does little good unless it has an impact
on initialization, and an initialization failure has impact on its caller.
As for this bug being "open ended", I disagree. There are a finite number
of functions in NSPR, a finite number of functions that call implicit
initialization, and a finite number of functions calls by NSPR initialization.
When NSPR initialization doesn't ignore failures of the functions it calls,
and reports failure to its caller, and the functions that call implicit
initialization don't ignore its failure, then the problem is fixed, IMO.
It's big, yes, but not open ended.
Now, if you prefer, I can file bugs against all those components individually.
We could have a hierarchy of bugs to match the hierarchy which is the call
graph of functions that call, or are called by, NSPR initialization.
We need to do more than merely assert in PR_Lock. Continuing on without
taking some action to either call attention to the failure or to crash
only serves to make the caller fail in some way that is difficult to
diagnose, and will probably lead to more accusations against NSPR.
I think we should retry those operations that can fail with EAGAIN.
other failures should perhaps lead to a call to abort().
If we're going to fail, IMO, we do ourselves the least damage (politically)
by calling attention to the underlying problem as quickly as possible.
In the case of the NewLock failure, because the error was ignored, a great
deal of time (months) was spent diagnosing the problem, and at the end of
that time, there was a great deal of ill will accumulated towards NSPR.
Had NSPR acted to bring attention to the underlying failure quickly, the
issue could have been quickly deflected to its source, without the ill will
that it (and we) now suffer.
Reporter | ||
Comment 6•17 years ago
|
||
Wan-Teh,
Just to be clear, I'm not advocating that we remove any assertions that are already present. These should stay for debug builds, in case all the callers don't check the values - and we know they don't always check.
I'm advocating that we add error checking code in optimized builds, so that those callers that do check get the opportunity to take the appropriate action.
Comment 7•17 years ago
|
||
It's EWOULDBLOCK that may have the same value as EAGAIN.
EINTR is different from EAGAIN.
NSPR doesn't use recursive pthread mutexes, so the EAGAIN
error in the man page doesn't apply to NSPR.
I understand what you want to do about those assertions.
I only removed assertions from PR_NewXXX because the functions
now return a NULL pointer when the pthread_xxx_init functions fail,
and our callers will crash if they don't check the return value. I didn't
return the assertions from PR_Lock and PR_Unlock.
Reporter | ||
Comment 8•17 years ago
|
||
Comment on attachment 317952 [details] [diff] [review]
Fix PR_NewLock, PR_NewCondVar, PR_NewMonitor, and PR_Lock
This patch removes assertions in debug builds in several cases :
1) In pt_PostNotifies : the assert of _PT_PTHREAD_MUTEX_INIT return status is gone . is gone and needs to be restored. I would suggest PR_ASSERT(0) before your PR_Free(lock);
2) Same problem in pt_PostNotifyToCvar for the return value of _PT_PTHREAD_COND_INIT . Please add PR_ASSERT(0) before PR_Free(cv) .
3) Same problem in PR_NewMonitor with _PT_PTHREAD_MUTEX_INIT . Please restore the assertion before the new if statement .
Same thing for _PT_PTHREAD_COND_INIT .
I only checked the content of this patch. I didn't check the code for completion. There are probably other pthread calls that need more checks in the optimized build.
Attachment #317952 -
Flags: review?(julien.pierre.boogz) → review-
Reporter | ||
Comment 9•17 years ago
|
||
BTW , thanks for taking a crack at this, Wan-Teh. I meant for this bug to be on my plate as I assigned it to myself.
Comment 10•17 years ago
|
||
pthread_mutex_init may fail because we're out of memory or system resources:
http://www.opengroup.org/onlinepubs/007908775/xsh/pthread_mutex_init.html
So asserting that pthread_mutex_init doesn't fail is like asserting malloc doesn't
fail. We don't assert malloc doesn't fail, so we shouldn't assert that
pthread_mutex_init doesn't fail.
The original assertions were there because we're too lazy to write proper error
handling code. We don't need both.
Comment 11•17 years ago
|
||
in reply to comment 10,
The fact that we hide failures from applications that use NSPR, rather
than calling attention to those failures, means that people who
support and/or sustain applications that use NSPR may spend a lot of time
(months) trying to find failures that were hidden by NSPR. Those people
would rather that NSPR calls attention to these failures than hide them.
The failures will cause the application to fail, and they'd prefer that
the cause of the failure be made obvious rather than be inobvious.
Reporter | ||
Comment 12•17 years ago
|
||
Re: comment 10 and 11,
Regarding the fact that we don't assert on out of memory errors, I think this is debatable behavior. I would prefer if we did. It is not likely that most applications can safely recover from such errors. Usually, this error will just cause some subsequent failure later on, and should be considered fatal. So, more time will be spent figuring out exactly when the application ran out of memory. It might as well be detected at the earliest possible time. Usually, debug NSPR builds only get installed for a reason - because somebody is trying to diagnose a problem. So, I would be in favor of having that assert as well.
Reporter | ||
Comment 13•17 years ago
|
||
I have been taking a closer look at this problem.
Unfortunately, there are many functions returning void in NSPR. This makes it very hard to detect any problem during NSPR initialization in optimized builds, especially for the implicit initialization case.
The functions that return void that I see as a direct problem for this bug are :
_PR_ImplicitInitialization
_PR_InitStuff
_PR_InitLocks
These prototypes are private and can be changed, but it is significantly more work than just checking the pthread system call return values.
To do this correctly will have an impact on source code for many non-pthread platforms.
Reporter | ||
Comment 14•17 years ago
|
||
For correctness, a large number of other initialization functions should not be returning void. Basically, everything called from _PR_InitStuff .
Many errors can occur in them during initialization, but are never checked. Some of these problems could be pthread errors and go undetected. Here is the list of functions that should not return void.
PR_MD_EARLY_INIT
_PR_InitAtomics
_PR_InitSegs
_PR_InitStacks
_PR_InitTPD
_PR_InitEnv
_PR_InitLayerCache
_PR_InitClock
_PR_InitGarbageCollector
_PR_InitThreads
_PR_InitCPUs
_PR_InitMem
_PR_MD_FINAL_INIT
Reporter | ||
Comment 15•17 years ago
|
||
Also :
_PR_InitCMon
_PR_InitIO
_PR_InitNet
_PR_InitLog
_PR_InitLinker
_PR_InitCallOnce
_PR_InitDtoa
_PR_InitTime
_PR_InitMW
_PR_InitRWLocks
nspr_InitializePRErrorTable
Reporter | ||
Comment 16•17 years ago
|
||
This patch tries to specfically address the issues from comment 13 . While it is large, it is a simple and repetitive patch and should not be hard to review.
Attachment #342381 -
Flags: review?
Reporter | ||
Updated•17 years ago
|
Attachment #342381 -
Flags: review? → review?(nbolyard)
Updated•17 years ago
|
Attachment #342381 -
Flags: review?(nelson) → review-
Comment 17•17 years ago
|
||
Comment on attachment 342381 [details] [diff] [review]
Propagate errors up the stack
This patch does one thing many times, and a few other things.
The one thing that it does many times is to replace one
line with a set of 3 lines, as shown immediately below.
- if (!_pr_initialized) _PR_ImplicitInitialization();
+ PRStatus initStatus = PR_SUCCESS;
+ if (!_pr_initialized) initStatus = _PR_ImplicitInitialization();
+ if (PR_SUCCESS != initStatus) return NULL;
All those replacements are identical except for the value returned.
In most cases, it is NULL or PR_FAILURE, but in some cases it is
another value, such as 0, -1, PR_INVALID_IO_LAYER, and others.
My first observation is that those 3 lines are just CRYING out to
be replaced with a macro, something like:
#define PR_INIT_OR_FAIL(rv) \
PRStatus initStatus = PR_SUCCESS; \
if (!_pr_initialized) initStatus = _PR_ImplicitInitialization(); \
if (PR_SUCCESS != initStatus) return rv
invoked as
PR_INIT_OR_FAIL(NULL);
or
PR_INIT_OR_FAIL(PR_FAILURE);
Please use a macro like that, rather than repeating those 3 lines
everywhere. This has the benefit that if we ever again decide to
change these tests, we can change them all by changing one macro.
Aside from that desire to see them all turned into macros, most
the changes that return NULL, PR_FAILURE or PR_INVALID_IO_LAYER
appear to be correct. If they were macros already, I would give
most of those changes r+.
However, some of those cases have the problem that they introduce
a line that defines a new variable AFTER some other executable code
has been executed. While some very modern c compilers allow this,
some older ones that we support do not (e.g. AIX, IINM). For example,
in file pr/src/io/prio.c, and in file pr/src/pthreads/ptio.c, those
3 lines are inserted after a PR_ASSERT call.
The patch for the file pr/src/md/unix/uxwrap.c inserts those lines
in the middle of a function, after numerous other executable lines.
There may be other examples of this. I didn't try to find them all.
Among the other cases, where the value being returned is not one of
NULL, PR_FAILURE, or PR_INVALID_IO_LAYER, there are a number of
problems. The most common problem is that, in most cases, the value
being returned does not signify failure. Many of these functions
can legitimately return all possible 32-bit values, and no value
unambiguously means that a failure occurred. Functions for which
this is the problem (not necessarily a complete list) include:
- _MD_AtomicSet() in pr/src/md/beos/bmisc.c [the patch returns NULL,
even though the type of the return value is PRInt32. But there is
no "failure" value for that function. It always returns the value
that it replaced.
- in file pr/src/misc/pratom.c, the functions _PR_MD_ATOMIC_INCREMENT
_PR_MD_ATOMIC_ADD, _PR_MD_ATOMIC_DECREMENT, _PR_MD_ATOMIC_SET,
this patch makes those functions return 0 if initialization fails,
but 0 is a perfectly legitimate return value, not indicating failure.
- in file pr/src/misc/prdtoa.c, the patch makes function PR_strtod
return 0, but 0 does not indicate failure for that function.
- the change to PR_Init in prinit.c is so pointless that I would say
it is wrong.
- the patch to file pr/src/misc/prinrval.c causes PR_IntervalNow to
return PR_INTERVAL_NO_WAIT, and causes PR_TicksPerSecond to return
PR_INTERVAL_MIN, but those values do not indicate failure. They
are perfect allowable values, and presumably there is at least one
system on which PR_INTERVAL_MIN is the most correct value for the
success case.
- The patch to file pr/src/threads/prcthr.c makes PR_SetCPUAffinityMask
return 0, but 0 is the only value it ever returns. Also in that file,
the changes to PR_SetThreadGCAble, PR_ClearThreadGCAble, and
PR_GetThreadScope do not cause those functions to return failure values.
In file pr/src/misc/prinit.c, the change to function _PR_InitStuff will
cause it to return PR_SUCCESS if it is called one one thread while it
is already in the process of executing on another thread, even if that
other thread subsequently fails.
I think the patch to function _PR_ImplicitInitialization is incomplete.
The patch makes it always return PR_SUCCESS, even if either of the
functions it calls returns PR_FAILURE. It pays no attention to the
values returned by the functions it calls.
If the purpose of this patch is to propagate failures up the stack, then
all the callers to _PR_InitStuff should pay attention to the return value,
and that function should likewise pay attention to the values returned
by the functions it calls. It presently ignores the return values of
all but one of the functions it calls. (Perhaps this patch was intended
to be the first of several?)
REgarding the public functions that are defined as returning void, I think
what you want to do is to define new functions that do the same things,
but return a PR_STATUS, and then re-implement the void functions to call
the function that returns a status, but then ignore the return value,
and return.
- All the functions that return PR_STATUS should set an error code before
returning PR_FAILURE. Any function that is patched to return PR_Status
instead of void should follow this rule. A single example of a function
that is changed, but does not follow this rule is _PR_InitLocks().
(hmm, having said that, is it possible to set an NSPR return value if
NSPR initialization fails? I don't know. :( )
- pr/src/md/unix/uxwrap.c contains implementations of the Unix functions
poll and select. Those functions do have well defined return values that
unambiguously means failure, and this patch returns those values. However,
it should set the value in errno before returning -1, because that is how
those functions are defined.
Comment 18•17 years ago
|
||
Please write the implicit initialization call like this, without using
the initStatus variable:
if (!_pr_initialized && _PR_ImplicitInitialization() != PR_SUCCESS)
return NULL;
You can turn this into a macro as Nelson suggested.
Reporter | ||
Comment 19•17 years ago
|
||
Nelson,
Re: comment 17
1) I have no problem turning these statements into macros. You didn't need to spend as much time saying that.
2) I compiled the patch as-is with our current AIX compiler, and there was no error, nor even a warning. I didn't try every single platform, but I think the compilers have finally modernized.
That said, I will still reimplement the logic as a macro without the intermediate variable as suggested by Wan-Teh, since I don't think that variable is needed for anything.
3) Regarding functions that don't have explicit failure return values, what would you suggest we do ? An assertion of debug builds would seem reasonable at least. However, this bug is specifically about optimized builds.
4) Re: the change to PR_Init, you are right that it doesn't accomplish anything as written. I think I meant to add an assertion in there. I think it is very unfortunate that PR_Init is declared as returning void, and publicly exported. I am not sure what to do for the optimized build case, which is the subject of this bug.
5) You are right, the patch to PR_ImplicitInitialization is incomplete. I meant to make it check the return status of _PR_InitStuff .
6) regarding _PR_InitStuff being called from multiple threads, concurrent initialization of NSPR is specifically unsupported, and this patch does not attempt to change that behavior.
7) The scope of the patch, as indicated when I was submitted it, was only to propagate errors within the functions listed in comment 13 . That's why I did not attempt to change the return status of the other functions. I am well aware of the other issues, as you can read in comment 14 and 15 . This is a lot more work and it may be preferable to do it in phases. Those functions that do not relate to POSIX errors may be best treated in other bugzilla bugs or RFEs.
8) I will take your advice as far as adding new public functions that don't return void. I don't think it's possible to set an error code if NSPR initialization fails because we need access to thread private data.
9) I will take a look at setting errno for those and poll and select cases. I wasn't sure if there was a portable way to do that.
Updated•16 years ago
|
Assignee: julien.pierre.boogz → wtc
Comment 20•3 years ago
|
||
The bug assignee is inactive on Bugzilla, so the assignee is being reset.
Assignee: wtc → nobody
Updated•3 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•