Closed Bug 212082 Opened 22 years ago Closed 21 years ago

nsCOMPtr<T> breaks C++ aliasing rules when nsCOMPtr_base is used

Categories

(Core :: XPCOM, defect)

PowerPC
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED

People

(Reporter: Franz.Sirl-kernel, Assigned: scc)

References

()

Details

Attachments

(2 files)

User-Agent: Mozilla/5.0 (compatible; Konqueror/3.1; Linux; X11; ppc) Build Identifier: As detailed in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 mozilla-1.4 is miscompiled with gcc-3.3+ (at least), leading to a segfault at startup. The reason are aliasing problems in the nsCOMPtr<> code. Reproducible: Always Steps to Reproduce: 1. 2. 3.
Severity: blocker → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Please add a short summary from the very long GCC bugzilla bug..
->XPCOM
Assignee: jdunn → dougt
Component: ImageLib → XPCOM
QA Contact: tpreston → scc
scc owns nsCOMPtr
Assignee: dougt → scc
It doesn't look too hard to fix this in the case where NSCAP_FEATURE_DEBUG_PTR_TYPES is not defined (non-DEBUG builds) and NSCAP_FEATURE_INLINE_STARTASSIGNMENT is not defined (non-VC++ builds), which is th case that really matters. Fixing it in general is harder without un-optimizing all the carefully hand-optimized code here.
Actually, never mind. I don't think I understand the problem, because I don't see where we're *using* differently-typed pointers to the same object in that case (although I can see how it can happen in other cases).
Er, no, I do understand it. I was reading |ifndef| as |ifdef| in too many places. The case I was thinking about above was really the case where NSCAP_FEATURE_DEBUG_PTR_TYPES is defined. The problem is that we're using nsCOMPtr_base here. We could defined NSCAP_FEATURE_DEBUG_PTR_TYPES in more builds to solve this problem, although it might need a different name. NSCAP_FEATURE_INLINE_STARTASSIGNMENT actually isn't relevant thanks to the fact that "The C/C++ standards say that casting "X**" to "Y**" is OK -- the problem is *using* that pointer." <URL: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376#c26 >
Summary: nsCOMPtr<> not aliasing safe → nsCOMPtr<T> breaks C++ aliasing rules when nsCOMPtr_base is used
In short: the C and C++ aliasing rules specify that (except for rather special circumstances which are not relevant here), objects can only be accessed through their real type. If this is not the case, the compiler may assume that subsequent read/write statements are independent if they access objects of different types and may then reorder them as it deems fit because they are independent. Example: if A and B are types that are not related to each other, then A *a; B *b; *a = some_a_value; *b = some_b_value; may be reordered so that *b is assigned first, if the compiler knows that the assignment has no side effects. The problem is that if you have been cheating, and b is actually pointing to an object of type A, and if that object happens to be the object pointed to by a as well, then the compiler reordering these statements will generate code that does not do what you expect it to do. (Note: you've been cheating to the compiler, it's not the compiler's fault which is keeping strictly to the language rules.) Sometimes the effects are subtler than above, with the compiler doing other optimizations based on information it thinks it has (this is likely the case for this report). In C++, cheating about the actual type also frequently has the unpleasant side effect that you make yourself dependent on the actual layout of an object, including things like where the vtable is located. In this particular case for mozilla, the code looks roughly like so: ----------------------------------- struct A { virtual int Setit(int k) = 0; }; struct B : A { int Setit(int k) { int i; i = k; } }; A* a; void ** begin_assign() { return reinterpret_cast< void **> (&a); } int main(int argc, char** argv) { B** ppB = reinterpret_cast<B**> (begin_assign()); *ppB = new B(); a->Setit(0); return 0; } ---------------------------------- At the site of the new B(), the compiler thinks it writes the result into a B*. In the next line, an A* is accessed. These types are distinct, so the compiler may choose to shift these statements around independently of each other when optimizing. I don't know whether it does that here in this case, but the point simply is: the language says it is allowed to, the code is invalid and may produce results that you did not expect, including a crash (which is what happens here). The gcc bug report has more information about how the original mozilla code looked like. The whole upshot is: don't try to cheat to the compiler about the true types of variables. Casting from A* to void* to B* is bound to get you into trouble. If there is absolutely no way to avoid void* and reinterpret_casts, then you must make sure that you cast back to the actual type (from which you casted to void*) when you access a variable. Last note -- why does this only happen on PPC here: the ways in which a compiler can optimize depend on machine features. On x86, for example, instruction reordering is often impossible because of the lack of registers. This may or may not be the case on PPC. So it's not surprising that you may get a crash on one platform and not the other. Regards Wolfgang
Attached patch possible patchSplinter Review
This isn't necessarily the way we'd want to fix this, but it should at least test my understanding of the problem here. Reporter: Any chance you could try this patch and see if it fixes the problem?
That fix would work, because the debugging stuff it adds is probably not helpful under gdb anyway :)
*** Bug 213309 has been marked as a duplicate of this bug. ***
So has anyone had a chance to try the patch in comment 8?
Oh, actually, bug 213309 comment 6 says it works. I just wasn't cc:ed on that bug (and thus asked for comments here).
Comment on attachment 127272 [details] [diff] [review] possible patch So, scc, any thoughts on this approach?
Attachment #127272 - Flags: review?(scc)
Comment on attachment 127272 [details] [diff] [review] possible patch r=scc. what a shame though. I'm still thinking about how to really solve this. In the meanwhile, this will have to do.
Attachment #127272 - Flags: review?(scc) → review+
Attachment #127272 - Flags: superreview?(bryner) → superreview+
seems like this broke Ports, mingw sludge is red.
This submit has increased the binary size by about 1.4 megs. See the graph on the 'brad' tinderbox.
Re comment 15: I think that's unrelated, and the tinderbox configuration was changed between builds, although I'm not sure. I suspect it's the error already mentioned in bug 217009 (see bug 217009 comment 7, etc.).
The only configuration change that was made to sludge was to enable BloatTests. The camino tinderboxes are broken as well, though based upon what happened with monkey, that may be a dependency issue.
What I propose in bug 220291 might fix the mingw bustage, but I'm not sure if that will happen soon enough.
Well, inlining actually makes the codesize problem worse, on gcc 3.2 anyway (luna tinderbox): Z:19.97MB Zdiff:+417812 (+775259/-357447) mZ:12.65MB mZdiff:+291428 (+464791/-173363)
This fixed my optimized build with Mozilla-1.5 and GCC-3.3.2. My browser started crashing when I compiled with "-O3 -march=athlon-xp -pipe" instead of "-O2 -march=i686 -pipe". Then I applied the patch mentioned here and now it works. Just so you know. /Andreas
Ok, so I'd like to come up with another way to address this problem so we can get back that 1.4 MB of code size. There are 3 options I've been looking at: 1. Make this all aliasing-safe. This isn't straightforward, since in order to get the benefit of nsCOMPtr_base, mRawPtr needs to be part of the (non-templated) base class. One idea I've looked at is doing something like this: union { nsISupports *i; void *t; } mRawPtr; and treating mRawPtr.t as a T*, from within nsCOMPtr<T>. nsCOMPtr_base would use mRawPtr.i to manipulate the pointer as a nsISupports*. I'm not completely sure if this addresses the problem -- what happens if I return reinterpet_cast<T**>(&mRawPtr.t) from a function and it's then assigned to via a T** pointer? Is that ok as long as we only ever use mRawPtr.t as a T**? (you can't dereference a void pointer, anyway) I think there may be other subtle things that aren't aliasing-safe that we're being saved from because they aren't inlined. For example, looking at the way our QueryInterface implementation works, a pointer of arbitrary type is cast to a void** and passed to QI, which then stores through that pointer without casting it back to the original type. Is this also a violation? I'm not saying we have to fix everything at once if we go down this path, but it could turn out to be a substantial amount of work (especially to not break frozen interfaces), and I'm hesitant to do that without having some idea of the payoff (see option #3 below). 2. Turn nsCOMPtr_base back on, after fixing this particular manifestation of the problem in nsGIFDecoder (I'd like to change all that code to use nsCOMArray anyway, which will avoid this problem). If we do this, we may want to make a general guideline that you should not pass getter_AddRefs() to an inline function. The downside is that we depend on gcc optimizer behavior that may change. 3. Turn nsCOMPtr_base back on and use -fno-strict-aliasing. Do we know for sure what sort of speedup we get from using -fstrict-aliasing? I'll run some tests when I have time.
> If we do this, we may want to make a general guideline that you should not > pass getter_AddRefs() to an inline function. The compiler can inline at will, and in CVS gcc its even able to inline a function whose definition is after the use of that funciton inthe file (with -funit-at-a-time, which may or may not be on by default) > 3. Turn nsCOMPtr_base back on and use -fno-strict-aliasing. Do we know for sure > what sort of speedup we get from using -fstrict-aliasing? I'll run some tests > when I have time. You need a recent (>3.3) version to test on, since earlier versions didn't do aliasing in c++. Even curent versions don't quite do everything, I believe.
The imgContainerGIF cleanup I mentioned is bug 224621 (no direct dependency).
This patch re-enables nsCOMPtr_base for gcc 3, and adds a may_alias attribute on mRawPtr. I tested this with a simplified testcase on Mac OS 10.3, gcc 3.3 (where I was able to reproduce a crash resulting from strict aliasing) and found that this has the desired effect. The attribute is recognized by gcc 3.3 and later (which happen to be the same versions that caused crashes), and causes the compiler to assume that mRawPtr may alias any type. I tried a few different ways of actually fixing nsCOMPtr to not break the aliasing rules, but couldn't find a way to do so without increasing code size on platforms other than gcc 3.x. If anyone can offer up a solution, please do so, but keep in mind: - In order to be a T* (or a union containing T*), mRawPtr must be on nsCOMPtr<T>, not nsCOMPtr_base. - The best I came up with was to make nsCOMPtr_base no longer be a base class, but instead be a "helper" class with static functions that would operate on an nsISupports** (or nsISupports*&). That worked but still increased code size over the current approach.
Attachment #135380 - Flags: superreview?(dbaron)
Attachment #135380 - Flags: review?(darin)
Comment on attachment 135380 [details] [diff] [review] use may_alias attribute for mRawPtr sr=dbaron. (Is the attribute really unavailable in versions lower than 3.3?)
Attachment #135380 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 135380 [details] [diff] [review] use may_alias attribute for mRawPtr Yeah, the attribute was just added in 3.3. Earlier versions will ignore it but give a warning (which I didn't really want to introduce for every source file).
Comment on attachment 135380 [details] [diff] [review] use may_alias attribute for mRawPtr r=darin
Attachment #135380 - Flags: review?(darin) → review+
Marking as fixed. (though if someone actually wants to hack on making nsCOMPtr strict-aliasing-safe they could reopen this)
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
nsCOMPtr is tricky API glue; the may_alias solution seems in line with nsCOMPtr's intent and use. Sure, a `pure' non-aliasing solution would be great (though not as great as not needing nsCOMPtr at all); but I really like the may_alias answer, bryner.
You may be interested in bug 472570 which mentions a misunderstanding of "may_alias attribute" as used by nsCOMPtr.h
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: