Open Bug 1536656 Opened 5 years ago Updated 2 years ago

mfbt/RefCounted.h's MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME cannot be used across modules with different MOZILLA_INTERNAL_API setting

Categories

(Core :: MFBT, enhancement)

enhancement

Tracking

()

Tracking Status
firefox68 --- affected

People

(Reporter: mozbugz, Unassigned)

References

(Blocks 1 open bug)

Details

Say we have a ref-counted virtual class:

struct Base {
  MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME(Base)
  virtual ~Base();
};

MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME defines two virtual functions typeName and typeSize, but only if MOZILLA_INTERNAL_API is #defined (and a couple other conditions).

So in a non-MOZILLA_INTERNAL_API module, this class will effectively only have one virtual method (the destructor), but in a MOZILLA_INTERNAL_API module it will have 3 virtual methods.

I experienced the problem when trying to delete an object from mozglue, when that object had been created in libxul. The mozglue code was calling the first vtable function, expecting it to be the destructor, but it was in fact typeName!
And I imagine it could have gone wrong the other way: libxul would try to delete such object by calling the 3rd vtable function, but mozglue would only have provided a 1-function vtable.

I'm not too sure what the best solution should be, especially since I don't quite understand MOZILLA_INTERNAL_API. Suggestions?

(I tried forcing the generation of these virtual functions in my mozglue code; but still had other strange issues, so suspecting it could be because other code in RefCounted.h is #ifdef's, I rolled out my own AddRef/Release functions; but I'm still having other issues I'm investigating now...)

--

That made me realize we may have other similar issues with headers shared between modules, if there are other structures that may effectively be defined differently depending on where they are compiled.
Do we have documentation about that?
Would it be possible to detect at build/link time that structures used in different modules have mismatched footprints?

(In reply to Gerald Squelart [:gerald] from comment #0)

I experienced the problem when trying to delete an object from mozglue, when
that object had been created in libxul.

I'm curious: is this even allowed, in our build?

(In reply to David Major [:dmajor] from comment #1)

(In reply to Gerald Squelart [:gerald] from comment #0)

I experienced the problem when trying to delete an object from mozglue, when
that object had been created in libxul.

I'm curious: is this even allowed, in our build?

Well, I was able to just do it (after fixing the differing-vtable issue), so there's no direct obstacle.

Now, if you meant that we should not do this, then I'd certainly like to know. 😉
It'd be great if there were protections to prevent such misuse.

And what about other uses of shared objects, apart from creation&destruction? The vtable issue would also impact other virtual functions in these objects.

For more context, my use case was:

In mozglue there is a base class for something I want to record.
Subclasses may be defined inside and outside of mozglue.
Concrete derived objects are created in (say) libxul and given to the mozglue code.
The mozglue code keeps owning references to these libxul-created objects through the mozglue base class.
The mozglue code may call virtual functions on these objects, so effectively it would potentially call libxul code.
The mozglue code will usually be the last one to reference these objects, and therefore the one to delete them at the end.

If that whole scenario is not possible to actually implement, I'd like to know.

In which case, are there other ways to achieve my goal of recording non-mozglue things in mozglue?
I'm guessing the last resort would be for non-mozglue code to fill mozglue-provided buffers with POD -- would that work?

The mozglue code will usually be the last one to reference these objects,
and therefore the one to delete them at the end.

I think this is where things get fishy. In general we should never directly call operator delete on an object from another library, because the allocators could be entirely different (regardless of whether or not they are in practice).

The usual pattern (think COM or nsISupports) is to get the object itself to delete this on its last release.

I would have thought that would have happened via https://searchfox.org/mozilla-central/rev/486b64e4bff86b7988af8c8b80845404ad197533/mfbt/RefCounted.h#201 but I don't fully understand the maze of Gecko refcount helpers, so I'm not sure why that's not happening.

Independently of the above, this piece...

Concrete derived objects are created in (say) libxul and given to the mozglue code.

...depends on the C++ rule that "Derived is-a Base", which only works when libxul's idea of Base and mozglue's idea of Base are the same thing. The use of MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME makes that not be true.

Ideally, shared interface definitions should be made as thin as possible, with no implementation inside. Since mozglue's definition of MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME is empty, why use that macro on the base at all? Is there a way to rearrange things such that MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME appears only on the libxul classes?

(In reply to David Major [:dmajor] from comment #4)

The mozglue code will usually be the last one to reference these objects,
and therefore the one to delete them at the end.

I think this is where things get fishy. In general we should never directly call operator delete on an object from another library, because the allocators could be entirely different (regardless of whether or not they are in practice).

So IIUC, you're saying that all objects should be allocated and freed by the same library.
Makes sense I guess, in case they somehow use different allocators and/or pool. (Or some other reason?)
I'll definitely have to revisit my code there...

The usual pattern (think COM or nsISupports) is to get the object itself to delete this on its last release.

I would have thought that would have happened via https://searchfox.org/mozilla-central/rev/486b64e4bff86b7988af8c8b80845404ad197533/mfbt/RefCounted.h#201 but I don't fully understand the maze of Gecko refcount helpers, so I'm not sure why that's not happening.

I don't think delete this guarantees much, as the deletion will happen wherever the last release happens, which could be outside the library that did the initial construction.

Independently of the above, this piece...

Concrete derived objects are created in (say) libxul and given to the mozglue code.

...depends on the C++ rule that "Derived is-a Base", which only works when libxul's idea of Base and mozglue's idea of Base are the same thing. The use of MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME makes that not be true.

So you agree with my original bug. 😉 (or do you think you're saying something different?)

Ideally, shared interface definitions should be made as thin as possible, with no implementation inside.

The main problem here is not the implementation, but the exact list of virtual methods in the vtable layout. Even if these methods were pure-virtual, they would still need to be either A. always absent, or B. always present, in all libraries that #include this header (which currently doesn't work because of external macros).

Since mozglue's definition of MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME is empty, why use that macro on the base at all? Is there a way to rearrange things such that MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME appears only on the libxul classes?

When #included from mozglue, MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME is indeed empty, and also non-virtualAddRef() and Release() don't use them, so in fact we would want MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME to stay empty everywhere.

Since the including code cannot know whether a header is used in other libraries, I see two possible solutions:

  1. MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME could just always declare its methods, at the expense of some wasted space (I think two pointers per non-libxul-defined vtable). Actually I don't think there are any uses outside libxul yet (this may explain why we haven't encountered the issue so far), so I'd be the first and only customer. I think the added safety is worth the extra pointers.
  2. We could separate RefCounted in two classes without macro-controlled code, so that each library would only use the most appropriate one, without risk of differing definitions between libraries. I think it should be fairly safe: The mfbt RefCounted could be used anywhere without risk (though also without leak checking), and the libxul RefCounted could only be used from libxul because it would unconditionally #include "nsXPCOM.h".

What do you think? Any preference? Other solutions?

(In reply to Gerald Squelart [:gerald] from comment #5)

I don't think delete this guarantees much, as the deletion will happen
wherever the last release happens, which could be outside the library that
did the initial construction.

A key piece of the scheme (which I failed to state) is that Release() is virtual, so the call goes to the right DLL's implementation.

...depends on the C++ rule that "Derived is-a Base", which only works when libxul's idea of Base and mozglue's idea of Base are the same thing. The use of MOZ_DECLARE_REFCOUNTED_VIRTUAL_TYPENAME makes that not be true.

So you agree with my original bug. 😉 (or do you think you're saying
something different?)

Maybe. :) What I'm saying is that libxul and mozglue must have the same definition (after preprocessing) for Base. Whether that is achieved by making the macro always consistent, or removing it, is something that needs to be figured out.

I haven't thought about the rest of your question, in case the virtual Release() changes the situation (or is somehow impossible to achieve.) Let me know if it doesn't.

Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.