Closed Bug 1062479 Opened 10 years ago Closed 9 years ago

WeakReference::AddRef sometimes calls NS_LogAddRef with garbage type names

Categories

(Core :: MFBT, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox37 --- wontfix
firefox38 --- fixed
firefox38.0.5 --- fixed
firefox39 --- fixed
firefox40 --- fixed
firefox-esr31 --- wontfix

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

Attachments

(1 file, 1 obsolete file)

I see this when I run some modified code locally:

 0:43.80 == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 21769
 0:43.80 
 0:43.80      |<----------------Class--------------->|<-----Bytes------>|<----------------Objects---------------->|<--------------References-------------->|
 0:43.80                                               Per-Inst   Leaked    Total      Rem      Mean       StdDev     Total      Rem      Mean       StdDev
 0:43.80    0 TOTAL                                          22        0  3776197 18446744073709551491 (571536213717407.38 +/-     0.00)  2381093 18446744073709551471 (983861263474198.12 +/-     0.00)
 0:43.80  459 WeakReference<MessageListener>                 16        0       72 18446744073709551491 (16046610086423922688.00 +/-     0.00)       72 18446744073709551471 (16212709324298360832.00 +/-     0.00)
 0:43.81 

There are two problems here. First, we're clearly underflowing the count.  Secondly, the size should not be zero when we're leaking this many objects.  I looked at WeakReference, but nothing obvious lept out at me.  Something could be wrong with a particular use of it.
John says that's -124.
Weird.  It says they are 16 bytes each (which is surely wrong, but anyways...) but then it says we leaked 0 bytes.
I get this in the parent process with
  ./mach mochitest-plain --e10s testing/mochitest/tests/Harness_sanity/
(disclaimer: this is with a bunch of my patches locally, so something there could be disrupting it.)
See Also: → 1087625
I did some refcount logging, and it seems like the WeakReference<> that is leaking (I think it is leaking) is being allocated by ShadowLayerParent.

<WeakReference<MessageListener>> 0x601B39F0 313 AddRef 1
    #01: PLayerParent (/tmp/./PLayerParent.cpp:27)
    #02: ShadowLayerParent (/home/amccreight/mc/gfx/layers/ipc/ShadowLayerParent.cpp:23)
    #03: mozilla::layers::PLayerTransactionParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerTransactionParent.cpp:353)
    #04: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    #05: mozilla::ipc::MessageChannel::OnMaybeDequeueOne() (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1038)
    #06: MessageLoop::RunTask(Task*) (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:362)
    #07: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:369)
    #08: MessageLoop::DoWork() (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:447)
    #09: base::MessagePumpDefault::Run(base::MessagePump::Delegate*) (/home/amccreight/mc/ipc/chromium/src/base/message_pump_default.cc:34)
    #10: ~AutoRunState (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:508)
    #11: base::Thread::ThreadMain() (/home/amccreight/mc/ipc/chromium/src/base/thread.cc:173)
    #12: ThreadFunc(void*) (/home/amccreight/mc/ipc/chromium/src/base/platform_thread_posix.cc:39)
    #13: start_thread (pthread_create.c:?)
    #14: __clone (/usr/lib/libc.so.6)
    #15: ??? (???:???)

<WeakReference<MessageListener>> 0x601B39F0 313 Release 0
    #01: operator delete(void*) (/tmp/../../dist/include/mozilla/mozalloc.h:232)
    #02: mozilla::layers::LayerTransactionParent::DeallocPLayerParent(mozilla::layers::PLayerParent*) (/home/amccreight/mc/gfx/layers/ipc/LayerTransactionParent.cpp:836 (discriminator 1))
    #03: mozilla::layers::PLayerTransactionParent::RemoveManagee(int, mozilla::ipc::IProtocol*) (/tmp/./PLayerTransactionParent.cpp:215)
    #04: mozilla::layers::PLayerParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerParent.cpp:160)
    #05: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    ... same as above for the rest of it

<WeakReference<MessageListener>> 0x601B39F0 416 AddRef 1
    #01: PLayerParent (/tmp/./PLayerParent.cpp:27)
    #02: ShadowLayerParent (/home/amccreight/mc/gfx/layers/ipc/ShadowLayerParent.cpp:23)
    #03: mozilla::layers::PLayerTransactionParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerTransactionParent.cpp:353)
    #04: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    ... same as above for the rest of it

Do you have any ideas here, Benoit?  It looks like you made the most recent substantial changes to this file.  I can't really figure out what the relationship between all of these weak classes is.
Flags: needinfo?(bjacob)
Hey, just because I have been unfortunate enough to have to touch the code under gfx/layers/ipc (the last time was 3.5 months ago) doesn't imply that I own maintaining it!

There isn't any relevant mention of MessageListener or WeakPtr or SupportsWeakPtr or WeakReference anywhere around gfx/layers. But the IPDL code generator generates MessageListeners, and
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageLink.h?from=MessageListener&case=true#58
MessageListener does inherit from SupportsWeakPtr, which internally uses WeakReference.

So I don't have a lot of clues to contribute, sorry... I'd suggest bisecting it (is it a regression?)
Flags: needinfo?(bjacob)
I was asking because you'd patched some Mozilla weak pointer stuff, not because of the layers stuff.  Sorry, I should have been more explicit.

No, I don't think it is a recent regression.
(In reply to Andrew McCreight [:mccr8] from comment #6)
> I was asking because you'd patched some Mozilla weak pointer stuff, not
> because of the layers stuff.  Sorry, I should have been more explicit.

Ah, OK!

Buuut, as part of my patching WeakPtr stuff (bug 1044658) I expanded TestWeakPtr a lot! If there is a bug in WeakPtr that isn't caught by it, I'd be very interested in knowing what it is, but the above information doesn't allow me to guess what such a bug might be.

> 
> No, I don't think it is a recent regression.

OK.
(In reply to Benoit Jacob [:bjacob] from comment #7)
> Buuut, as part of my patching WeakPtr stuff (bug 1044658) I expanded
> TestWeakPtr a lot! If there is a bug in WeakPtr that isn't caught by it, I'd
> be very interested in knowing what it is, but the above information doesn't
> allow me to guess what such a bug might be.

It doesn't look like there is any XPCOM leak checking in that test, so it wouldn't catch this problem.  But I can try copying that block into a regular browser file and see what happens with leak detection.
Blocks: 1128049
Bug 1128049 remains an ongoing source of orange. Anything we can do to move this along?
Not really.  I'm hoping this will shake out once somebody has a chance to look into the remaining e10s IPC leaks.
(In reply to Andrew McCreight [:mccr8] from comment #8)
> It doesn't look like there is any XPCOM leak checking in that test, so it
> wouldn't catch this problem.  But I can try copying that block into a
> regular browser file and see what happens with leak detection.

I went ahead and did that, and no leaks were detected.
The test case in comment 3 does not trigger this issue against a clean tree any more.
Assignee: nobody → continuation
As seen in bug 1156977, I noticed that these weird overflows seemed to often be accompanied by a leaked object with a blank name.  I landed an assertion, which was caught on TreeHerder.  This revealed that this was being created during the AddRef of an instance of RefCounted<WeakReference<MessageListener>>.

The bloat hash table is keyed on the name of the class, so I think what happens is that we pass in a blank class name to nsTraceRefCnt, which gets treated as a leaking object, but we end up with one more dtor than ctor of the real class of this object, which is WeakReference<MessageListener>. Due to bug 1116550, we get an underflow, and it gets reported as a leak.

But how do we get a mangled class name? Based on reading code, I think the problem is that WeakReference::typeName returns a pointer to the current stack frame.  There's even a comment about how this is "usually not OK": "This is usually not OK, but here we are returning a pointer to a static buffer which will immediately be used by the caller."  However, the static buffer is not immediately used by the caller, RefCounted::AddRef:
    const char* type = static_cast<const T*>(this)->typeName(); // return stack pointer
    uint32_t size = static_cast<const T*>(this)->typeSize();
    const void* ptr = static_cast<const T*>(this);
    MozRefCountType cnt = ++mRefCnt;
    detail::RefCountLogger::logAddRef(ptr, cnt, type, size);

Most of that doesn't do much, but logAddRef does call into NS_LogAddRef, which is a real function, so that probably overwrites parts of the stack, at least some times. When it ends up nulling it out, we end up with an underflow.

This is also consistent with the garbage class names I see sometimes in the bloat view, that look like ""W�������ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"".  Note that W is the first character of WeakReference.

My solution to this is to make subclasses of WeakReference declare a new static method using a new macro:
  static const char* weakReferenceTypeName() { return "WeakReference<" #T ">"; }

WeakReference::typeName can then be implemented as
  return T::weakReferenceTypeName();

Note that this won't properly support subclassing, but the current code does not really seem to support it either. In fact, the way WeakReferences are set up right now I'm not sure they can, because the initial AddRef of a WeakReference (that occurs in WeakPtr's ctor) runs on a WeakReference that has an mPtr of nullptr, so we can't call any methods on it.
Summary: WeakReference<MessageListener> count underflows, has zero size → WeakReference::AddRef sometimes calls NS_LogAddRef with garbage type names
Blocks: 1157304
froydnj pointed out in IRC that the nameBuffer is static, so my theory is wrong.  I don't know how I missed that!  Still, something is corrupting that buffer...
froydnj also pointed out there could be a race condition if this is used across multiple threads, so maybe that's what is happening?

Any thoughts on any of this, Ehsan?
Flags: needinfo?(ehsan)
Sounds like a job for a spin lock!
(Or, can we use std::call_once everywhere?)
The code in question generates a string "WeakReference<Foo>" for each class Foo, so my patch just creates those strings at compile time, so we don't need a buffer for sprintf.
Oh, I see. That works too! (Though, make sure you only generate that string #ifdef MOZ_REFCOUNTED_LEAK_CHECKING!)
(In reply to Andrew McCreight [:mccr8] from comment #16)
> froydnj also pointed out there could be a race condition if this is used
> across multiple threads, so maybe that's what is happening?
> 
> Any thoughts on any of this, Ehsan?

I'm lost as to what is happening here.  Is the issue reported in the bug with or without your patch?
Flags: needinfo?(ehsan) → needinfo?(continuation)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #21)
> I'm lost as to what is happening here.  Is the issue reported in the bug
> with or without your patch?

This is currently happening on tree herder, as tracked in bug 1128049.  The patch here is an attempt to fix that.
Flags: needinfo?(continuation) → needinfo?(ehsan)
This version explicitly does not generate the static strings in non-logging builds.

try run from the previous version looks green:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cfae13245c5

I'll try an all-platform build in case there's some weird platform code that I missed.

Note that this removes a type name for nsOfflineCacheUpdate, but I'm pretty sure that it was never being used anyways.
Attachment #8596248 - Attachment is obsolete: true
Your patch looks good as a solution for this, but please test this on Windows, since IIRC when I was working on this stuff, sometimes things would only break on Windows.
Flags: needinfo?(ehsan)
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

I did a full try run and it looks fine:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d97727e87c2

(The orange is just from another patch in the stack where I adjusted the leak threshold a little too low.)
Attachment #8596749 - Flags: review?(ehsan)
Feel free to pass the review off to an MFBT peer or something...
Blocks: 1158158
Attachment #8596749 - Flags: review?(ehsan) → review+
https://hg.mozilla.org/mozilla-central/rev/39da629933e1
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

Approval Request Comment
[Feature/regressing bug #]: bug 935778
[User impact if declined]: intermittent false failures on TreeHerder
[Describe test coverage new/current, TreeHerder]: this is a failure in tree herder
[Risks and why]: very low, this only affects code we run in debug builds
[String/UUID change made/needed]: none
Attachment #8596749 - Flags: approval-mozilla-beta?
Attachment #8596749 - Flags: approval-mozilla-aurora?
(In reply to Andrew McCreight [:mccr8] from comment #29)
> [User impact if declined]: intermittent false failures on TreeHerder
Specifically, bug 1128049.


Feel free to request approval in whatever other additional branches, Ryan.
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

[Triage Comment]
Taking to simplify the life of sheriffs.
Attachment #8596749 - Flags: approval-mozilla-release+
Attachment #8596749 - Flags: approval-mozilla-beta?
Attachment #8596749 - Flags: approval-mozilla-aurora?
Attachment #8596749 - Flags: approval-mozilla-aurora+
Do we need approval for debug-only patches?
For now, yes. AFAIK, only tests only changesets don't need our approval.
If you think it is crazy, don't hesitate to start a discussion.
Technically, this could affect the binary we ship users, but the stuff it changes shouldn't actually run.
(In reply to Sylvestre Ledru [:sylvestre] from comment #33)
> For now, yes. AFAIK, only tests only changesets don't need our approval.
> If you think it is crazy, don't hesitate to start a discussion.

I think it's relatively crazy, but not something we need too often, so let's carry on.  :-)
(In reply to Andrew McCreight [:mccr8] from comment #34)
> Technically, this could affect the binary we ship users, but the stuff it
> changes shouldn't actually run.

Err, how?  I r+ed under the assumption that when MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to whitespace.  Did I miss something in the patch?
Flags: needinfo?(continuation)
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #36)
> Err, how?  I r+ed under the assumption that when
> MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to
> whitespace.  Did I miss something in the patch?

The new code turns into whitespace in non-leak-checking builds, but the macro the old code used, MOZ_DECLARE_REFCOUNTED_TYPENAME, is always defined, with some explanatory note about how the compiler probably optimizes it out: http://mxr.mozilla.org/mozilla-central/source/mfbt/RefPtr.h#174 Thus, the binary may be changed by my patch, because we no longer define those methods.
Flags: needinfo?(continuation)
(In reply to Andrew McCreight [:mccr8] from comment #37)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #36)
> > Err, how?  I r+ed under the assumption that when
> > MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to
> > whitespace.  Did I miss something in the patch?
> 
> The new code turns into whitespace in non-leak-checking builds, but the
> macro the old code used, MOZ_DECLARE_REFCOUNTED_TYPENAME, is always defined,
> with some explanatory note about how the compiler probably optimizes it out:
> http://mxr.mozilla.org/mozilla-central/source/mfbt/RefPtr.h#174 Thus, the
> binary may be changed by my patch, because we no longer define those methods.

I see, yes, you're right!
esr31 is technically affected, but it isn't worth landing there as it hasn't actually shown up as an issue on Treeherder.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: