Closed Bug 212082 Opened 21 years ago Closed 21 years ago

nsCOMPtr<T> breaks C++ aliasing rules when nsCOMPtr_base is used

Categories

(Core :: XPCOM, defect)

PowerPC
Linux
defect
Not set
critical

Tracking

()

RESOLVED FIXED

People

(Reporter: Franz.Sirl-kernel, Assigned: scc)

References

()

Details

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (compatible; Konqueror/3.1; Linux; X11; ppc)
Build Identifier: 

As detailed in http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376 mozilla-1.4 
is miscompiled with gcc-3.3+ (at least), leading to a segfault at startup. The 
reason are aliasing problems in the nsCOMPtr<> code. 
 

Reproducible: Always

Steps to Reproduce:
1.
2.
3.
Severity: blocker → critical
Status: UNCONFIRMED → NEW
Ever confirmed: true
Please add a short summary from the very long GCC bugzilla bug..
->XPCOM
Assignee: jdunn → dougt
Component: ImageLib → XPCOM
QA Contact: tpreston → scc
scc owns nsCOMPtr
Assignee: dougt → scc
It doesn't look too hard to fix this in the case where
NSCAP_FEATURE_DEBUG_PTR_TYPES is not defined (non-DEBUG builds) and
NSCAP_FEATURE_INLINE_STARTASSIGNMENT is not defined (non-VC++ builds), which is
th case that really matters.

Fixing it in general is harder without un-optimizing all the carefully
hand-optimized code here.
Actually, never mind.  I don't think I understand the problem, because I don't
see where we're *using* differently-typed pointers to the same object in that
case (although I can see how it can happen in other cases).
Er, no, I do understand it.  I was reading |ifndef| as |ifdef| in too many
places.  The case I was thinking about above was really the case where
NSCAP_FEATURE_DEBUG_PTR_TYPES is defined.  The problem is that we're using
nsCOMPtr_base here.

We could defined NSCAP_FEATURE_DEBUG_PTR_TYPES in more builds to solve this
problem, although it might need a different name.

NSCAP_FEATURE_INLINE_STARTASSIGNMENT actually isn't relevant thanks to the fact
that "The C/C++ standards say that casting "X**" to "Y**" is OK -- the problem
is *using* that pointer." <URL:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=11376#c26 >
Summary: nsCOMPtr<> not aliasing safe → nsCOMPtr<T> breaks C++ aliasing rules when nsCOMPtr_base is used
In short: the C and C++ aliasing rules specify that (except for rather special
circumstances which are not relevant here), objects can only be
accessed through their real type. If this is not the case, the compiler
may assume that subsequent read/write statements are independent if they
access objects of different types and may then reorder them as it deems fit
because they are independent.

Example: if A and B are types that are not related to each other, then
  A *a;
  B *b;
  *a = some_a_value;
  *b = some_b_value;
may be reordered so that *b is assigned first, if the compiler knows that
the assignment has no side effects. The problem is that if you have been
cheating, and b is actually pointing to an object of type A, and if that
object happens to be the object pointed to by a as well, then the compiler
reordering these statements will generate code that does not do what you expect
it to do. (Note: you've been cheating to the compiler, it's not the compiler's
fault which is keeping strictly to the language rules.) Sometimes the effects
are subtler than above, with the compiler doing other optimizations based on
information it thinks it has (this is likely the case for this report).

In C++, cheating about the actual type also frequently has the unpleasant
side effect that you make yourself dependent on the actual layout of an
object, including things like where the vtable is located.

In this particular case for mozilla, the code looks roughly like so:
-----------------------------------
struct A {
  virtual int Setit(int k) = 0;
};

struct B : A {
    int Setit(int k) { int i; i = k; }
};

A* a;
void ** begin_assign() { return reinterpret_cast< void **> (&a); }

int
main(int argc, char** argv)
{
    B** ppB = reinterpret_cast<B**> (begin_assign());
    *ppB = new B();
    a->Setit(0);
    return 0;
}
----------------------------------
At the site of the new B(), the compiler thinks it writes the result into
a B*. In the next line, an A* is accessed. These types are distinct, so
the compiler may choose to shift these statements around independently of
each other when optimizing. I don't know whether it does that here in
this case, but the point simply is: the language says it is allowed to,
the code is invalid and may produce results that you did not expect, 
including a crash (which is what happens here).

The gcc bug report has more information about how the original mozilla
code looked like. The whole upshot is: don't try to cheat to the
compiler about the true types of variables. Casting from A* to void* to
B* is bound to get you into trouble. If there is absolutely no way to
avoid void* and reinterpret_casts, then you must make sure that you cast
back to the actual type (from which you casted to void*) when you access a
variable.

Last note -- why does this only happen on PPC here: the ways in which a 
compiler can optimize depend on machine features. On x86, for example,
instruction reordering is often impossible because of the lack of registers.
This may or may not be the case on PPC. So it's not surprising that you
may get a crash on one platform and not the other.

Regards
  Wolfgang
Attached patch possible patchSplinter Review
This isn't necessarily the way we'd want to fix this, but it should at least
test my understanding of the problem here.

Reporter:  Any chance you could try this patch and see if it fixes the problem?
That fix would work, because the debugging stuff it adds is probably not helpful
under gdb anyway :)
*** Bug 213309 has been marked as a duplicate of this bug. ***
So has anyone had a chance to try the patch in comment 8?
Oh, actually, bug 213309 comment 6 says it works.  I just wasn't cc:ed on that
bug (and thus asked for comments here).
Comment on attachment 127272 [details] [diff] [review]
possible patch

So, scc, any thoughts on this approach?
Attachment #127272 - Flags: review?(scc)
Comment on attachment 127272 [details] [diff] [review]
possible patch

r=scc.	what a shame though.  I'm still thinking about how to really solve
this.  In the meanwhile, this will have to do.
Attachment #127272 - Flags: review?(scc) → review+
Attachment #127272 - Flags: superreview?(bryner) → superreview+
seems like this broke Ports, mingw sludge is red.
This submit has increased the binary size by about 1.4 megs.  See the graph on
the 'brad' tinderbox.
Re comment 15: I think that's unrelated, and the tinderbox configuration was
changed between builds, although I'm not sure.  I suspect it's the error already
mentioned in bug 217009 (see bug 217009 comment 7, etc.).
The only configuration change that was made to sludge was to enable BloatTests.
 The camino tinderboxes are broken as well, though based upon what happened with
monkey, that may be a dependency issue.

What I propose in bug 220291 might fix the mingw bustage, but I'm not sure if
that will happen soon enough.
Well, inlining actually makes the codesize problem worse, on gcc 3.2 anyway
(luna tinderbox):

Z:19.97MB
Zdiff:+417812 (+775259/-357447)
mZ:12.65MB
mZdiff:+291428 (+464791/-173363)
This fixed my optimized build with Mozilla-1.5 and GCC-3.3.2. My browser started
crashing when I compiled with "-O3 -march=athlon-xp -pipe" instead of "-O2
-march=i686 -pipe". Then I applied the patch mentioned here and now it works.
Just so you know.

/Andreas
Ok, so I'd like to come up with another way to address this problem so we can
get back that 1.4 MB of code size.  There are 3 options I've been looking at:

1. Make this all aliasing-safe.  This isn't straightforward, since in order to
get the benefit of nsCOMPtr_base, mRawPtr needs to be part of the
(non-templated) base class.  One idea I've looked at is doing something like this:

union {
  nsISupports *i;
  void        *t;
} mRawPtr;

and treating mRawPtr.t as a T*, from within nsCOMPtr<T>.  nsCOMPtr_base would
use mRawPtr.i to manipulate the pointer as a nsISupports*.  I'm not completely
sure if this addresses the problem -- what happens if I return 
reinterpet_cast<T**>(&mRawPtr.t) from a function and it's then assigned to via a
T** pointer?  Is that ok as long as we only ever use mRawPtr.t as a T**? (you
can't dereference a void pointer, anyway)

I think there may be other subtle things that aren't aliasing-safe that we're
being saved from because they aren't inlined.  For example, looking at the way
our QueryInterface implementation works, a pointer of arbitrary type is cast to
a void** and passed to QI, which then stores through that pointer without
casting it back to the original type.  Is this also a violation?

I'm not saying we have to fix everything at once if we go down this path, but it
could turn out to be a substantial amount of work (especially to not break
frozen interfaces), and I'm hesitant to do that without having some idea of the
payoff (see option #3 below).

2. Turn nsCOMPtr_base back on, after fixing this particular manifestation of the
problem in nsGIFDecoder (I'd like to change all that code to use nsCOMArray
anyway, which will avoid this problem).  If we do this, we may want to make a
general guideline that you should not pass getter_AddRefs() to an inline
function.  The downside is that we depend on gcc optimizer behavior that may change.

3. Turn nsCOMPtr_base back on and use -fno-strict-aliasing.  Do we know for sure
what sort of speedup we get from using -fstrict-aliasing?  I'll run some tests
when I have time.
> If we do this, we may want to make a general guideline that you should not
> pass getter_AddRefs() to an inline function. 

The compiler can inline at will, and in CVS gcc its even able to inline a
function whose definition is after the use of that funciton inthe file (with
-funit-at-a-time, which may or may not be on by default)

> 3. Turn nsCOMPtr_base back on and use -fno-strict-aliasing.  Do we know for sure
> what sort of speedup we get from using -fstrict-aliasing?  I'll run some tests
> when I have time.

You need a recent (>3.3) version to test on, since earlier versions didn't do
aliasing in c++. Even curent versions don't quite do everything, I believe.

The imgContainerGIF cleanup I mentioned is bug 224621 (no direct dependency).
This patch re-enables nsCOMPtr_base for gcc 3, and adds a may_alias attribute
on mRawPtr.  I tested this with a simplified testcase on Mac OS 10.3, gcc 3.3
(where I was able to reproduce a crash resulting from strict aliasing) and
found that this has the desired effect.  The attribute is recognized by gcc 3.3
and later (which happen to be the same versions that caused crashes), and
causes the compiler to assume that mRawPtr may alias any type.

I tried a few different ways of actually fixing nsCOMPtr to not break the
aliasing rules, but couldn't find a way to do so without increasing code size
on platforms other than gcc 3.x.  If anyone can offer up a solution, please do
so, but keep in mind:

- In order to be a T* (or a union containing T*), mRawPtr must be on
nsCOMPtr<T>, not nsCOMPtr_base.
- The best I came up with was to make nsCOMPtr_base no longer be a base class,
but instead be a "helper" class with static functions that would operate on an
nsISupports** (or nsISupports*&).  That worked but still increased code size
over the current approach.
Attachment #135380 - Flags: superreview?(dbaron)
Attachment #135380 - Flags: review?(darin)
Comment on attachment 135380 [details] [diff] [review]
use may_alias attribute for mRawPtr

sr=dbaron.  (Is the attribute really unavailable in versions lower than 3.3?)
Attachment #135380 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 135380 [details] [diff] [review]
use may_alias attribute for mRawPtr

Yeah, the attribute was just added in 3.3.  Earlier versions will ignore it but
give a warning (which I didn't really want to introduce for every source file).
Comment on attachment 135380 [details] [diff] [review]
use may_alias attribute for mRawPtr

r=darin
Attachment #135380 - Flags: review?(darin) → review+
Marking as fixed.  (though if someone actually wants to hack on making nsCOMPtr
strict-aliasing-safe they could reopen this)
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
nsCOMPtr is tricky API glue; the may_alias solution seems in line with
nsCOMPtr's intent and use.  Sure, a `pure' non-aliasing solution would be great
(though not as great as not needing nsCOMPtr at all); but I really like the
may_alias answer, bryner.
You may be interested in bug 472570 which mentions a misunderstanding of "may_alias attribute" as used by nsCOMPtr.h
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: