WeakReference::AddRef sometimes calls NS_LogAddRef with garbage type names

RESOLVED FIXED in Firefox 38

Status

()

Core
MFBT
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: mccr8, Assigned: mccr8)

Tracking

Trunk
mozilla40
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox37 wontfix, firefox38 fixed, firefox38.0.5 fixed, firefox39 fixed, firefox40 fixed, firefox-esr31 wontfix)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

3 years ago
I see this when I run some modified code locally:

 0:43.80 == BloatView: ALL (cumulative) LEAK AND BLOAT STATISTICS, default process 21769
 0:43.80 
 0:43.80      |<----------------Class--------------->|<-----Bytes------>|<----------------Objects---------------->|<--------------References-------------->|
 0:43.80                                               Per-Inst   Leaked    Total      Rem      Mean       StdDev     Total      Rem      Mean       StdDev
 0:43.80    0 TOTAL                                          22        0  3776197 18446744073709551491 (571536213717407.38 +/-     0.00)  2381093 18446744073709551471 (983861263474198.12 +/-     0.00)
 0:43.80  459 WeakReference<MessageListener>                 16        0       72 18446744073709551491 (16046610086423922688.00 +/-     0.00)       72 18446744073709551471 (16212709324298360832.00 +/-     0.00)
 0:43.81 

There are two problems here. First, we're clearly underflowing the count.  Secondly, the size should not be zero when we're leaking this many objects.  I looked at WeakReference, but nothing obvious lept out at me.  Something could be wrong with a particular use of it.
(Assignee)

Comment 1

3 years ago
John says that's -124.
(Assignee)

Comment 2

3 years ago
Weird.  It says they are 16 bytes each (which is surely wrong, but anyways...) but then it says we leaked 0 bytes.
(Assignee)

Comment 3

3 years ago
I get this in the parent process with
  ./mach mochitest-plain --e10s testing/mochitest/tests/Harness_sanity/
(disclaimer: this is with a bunch of my patches locally, so something there could be disrupting it.)
See Also: → bug 1087625
(Assignee)

Comment 4

3 years ago
I did some refcount logging, and it seems like the WeakReference<> that is leaking (I think it is leaking) is being allocated by ShadowLayerParent.

<WeakReference<MessageListener>> 0x601B39F0 313 AddRef 1
    #01: PLayerParent (/tmp/./PLayerParent.cpp:27)
    #02: ShadowLayerParent (/home/amccreight/mc/gfx/layers/ipc/ShadowLayerParent.cpp:23)
    #03: mozilla::layers::PLayerTransactionParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerTransactionParent.cpp:353)
    #04: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    #05: mozilla::ipc::MessageChannel::OnMaybeDequeueOne() (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1038)
    #06: MessageLoop::RunTask(Task*) (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:362)
    #07: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:369)
    #08: MessageLoop::DoWork() (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:447)
    #09: base::MessagePumpDefault::Run(base::MessagePump::Delegate*) (/home/amccreight/mc/ipc/chromium/src/base/message_pump_default.cc:34)
    #10: ~AutoRunState (/home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:508)
    #11: base::Thread::ThreadMain() (/home/amccreight/mc/ipc/chromium/src/base/thread.cc:173)
    #12: ThreadFunc(void*) (/home/amccreight/mc/ipc/chromium/src/base/platform_thread_posix.cc:39)
    #13: start_thread (pthread_create.c:?)
    #14: __clone (/usr/lib/libc.so.6)
    #15: ??? (???:???)

<WeakReference<MessageListener>> 0x601B39F0 313 Release 0
    #01: operator delete(void*) (/tmp/../../dist/include/mozilla/mozalloc.h:232)
    #02: mozilla::layers::LayerTransactionParent::DeallocPLayerParent(mozilla::layers::PLayerParent*) (/home/amccreight/mc/gfx/layers/ipc/LayerTransactionParent.cpp:836 (discriminator 1))
    #03: mozilla::layers::PLayerTransactionParent::RemoveManagee(int, mozilla::ipc::IProtocol*) (/tmp/./PLayerTransactionParent.cpp:215)
    #04: mozilla::layers::PLayerParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerParent.cpp:160)
    #05: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    ... same as above for the rest of it

<WeakReference<MessageListener>> 0x601B39F0 416 AddRef 1
    #01: PLayerParent (/tmp/./PLayerParent.cpp:27)
    #02: ShadowLayerParent (/home/amccreight/mc/gfx/layers/ipc/ShadowLayerParent.cpp:23)
    #03: mozilla::layers::PLayerTransactionParent::OnMessageReceived(IPC::Message const&) (/tmp/./PLayerTransactionParent.cpp:353)
    #04: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) (/home/amccreight/mc/ipc/glue/MessageChannel.cpp:1110)
    ... same as above for the rest of it

Do you have any ideas here, Benoit?  It looks like you made the most recent substantial changes to this file.  I can't really figure out what the relationship between all of these weak classes is.
Flags: needinfo?(bjacob)
Hey, just because I have been unfortunate enough to have to touch the code under gfx/layers/ipc (the last time was 3.5 months ago) doesn't imply that I own maintaining it!

There isn't any relevant mention of MessageListener or WeakPtr or SupportsWeakPtr or WeakReference anywhere around gfx/layers. But the IPDL code generator generates MessageListeners, and
http://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageLink.h?from=MessageListener&case=true#58
MessageListener does inherit from SupportsWeakPtr, which internally uses WeakReference.

So I don't have a lot of clues to contribute, sorry... I'd suggest bisecting it (is it a regression?)
Flags: needinfo?(bjacob)
(Assignee)

Comment 6

3 years ago
I was asking because you'd patched some Mozilla weak pointer stuff, not because of the layers stuff.  Sorry, I should have been more explicit.

No, I don't think it is a recent regression.
(In reply to Andrew McCreight [:mccr8] from comment #6)
> I was asking because you'd patched some Mozilla weak pointer stuff, not
> because of the layers stuff.  Sorry, I should have been more explicit.

Ah, OK!

Buuut, as part of my patching WeakPtr stuff (bug 1044658) I expanded TestWeakPtr a lot! If there is a bug in WeakPtr that isn't caught by it, I'd be very interested in knowing what it is, but the above information doesn't allow me to guess what such a bug might be.

> 
> No, I don't think it is a recent regression.

OK.
(Assignee)

Comment 8

3 years ago
(In reply to Benoit Jacob [:bjacob] from comment #7)
> Buuut, as part of my patching WeakPtr stuff (bug 1044658) I expanded
> TestWeakPtr a lot! If there is a bug in WeakPtr that isn't caught by it, I'd
> be very interested in knowing what it is, but the above information doesn't
> allow me to guess what such a bug might be.

It doesn't look like there is any XPCOM leak checking in that test, so it wouldn't catch this problem.  But I can try copying that block into a regular browser file and see what happens with leak detection.
(Assignee)

Updated

3 years ago
Blocks: 1128049
Bug 1128049 remains an ongoing source of orange. Anything we can do to move this along?
(Assignee)

Comment 10

3 years ago
Not really.  I'm hoping this will shake out once somebody has a chance to look into the remaining e10s IPC leaks.
(Assignee)

Comment 11

3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #8)
> It doesn't look like there is any XPCOM leak checking in that test, so it
> wouldn't catch this problem.  But I can try copying that block into a
> regular browser file and see what happens with leak detection.

I went ahead and did that, and no leaks were detected.
(Assignee)

Comment 12

3 years ago
The test case in comment 3 does not trigger this issue against a clean tree any more.
(Assignee)

Updated

3 years ago
Assignee: nobody → continuation
(Assignee)

Comment 13

3 years ago
As seen in bug 1156977, I noticed that these weird overflows seemed to often be accompanied by a leaked object with a blank name.  I landed an assertion, which was caught on TreeHerder.  This revealed that this was being created during the AddRef of an instance of RefCounted<WeakReference<MessageListener>>.

The bloat hash table is keyed on the name of the class, so I think what happens is that we pass in a blank class name to nsTraceRefCnt, which gets treated as a leaking object, but we end up with one more dtor than ctor of the real class of this object, which is WeakReference<MessageListener>. Due to bug 1116550, we get an underflow, and it gets reported as a leak.

But how do we get a mangled class name? Based on reading code, I think the problem is that WeakReference::typeName returns a pointer to the current stack frame.  There's even a comment about how this is "usually not OK": "This is usually not OK, but here we are returning a pointer to a static buffer which will immediately be used by the caller."  However, the static buffer is not immediately used by the caller, RefCounted::AddRef:
    const char* type = static_cast<const T*>(this)->typeName(); // return stack pointer
    uint32_t size = static_cast<const T*>(this)->typeSize();
    const void* ptr = static_cast<const T*>(this);
    MozRefCountType cnt = ++mRefCnt;
    detail::RefCountLogger::logAddRef(ptr, cnt, type, size);

Most of that doesn't do much, but logAddRef does call into NS_LogAddRef, which is a real function, so that probably overwrites parts of the stack, at least some times. When it ends up nulling it out, we end up with an underflow.

This is also consistent with the garbage class names I see sometimes in the bloat view, that look like ""W�������ZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZZ"".  Note that W is the first character of WeakReference.

My solution to this is to make subclasses of WeakReference declare a new static method using a new macro:
  static const char* weakReferenceTypeName() { return "WeakReference<" #T ">"; }

WeakReference::typeName can then be implemented as
  return T::weakReferenceTypeName();

Note that this won't properly support subclassing, but the current code does not really seem to support it either. In fact, the way WeakReferences are set up right now I'm not sure they can, because the initial AddRef of a WeakReference (that occurs in WeakPtr's ctor) runs on a WeakReference that has an mPtr of nullptr, so we can't call any methods on it.
(Assignee)

Updated

3 years ago
Summary: WeakReference<MessageListener> count underflows, has zero size → WeakReference::AddRef sometimes calls NS_LogAddRef with garbage type names
(Assignee)

Updated

3 years ago
Blocks: 1157304
(Assignee)

Comment 14

3 years ago
Created attachment 8596248 [details] [diff] [review]
Use static strings for WeakReference type names.
(Assignee)

Comment 15

3 years ago
froydnj pointed out in IRC that the nameBuffer is static, so my theory is wrong.  I don't know how I missed that!  Still, something is corrupting that buffer...
(Assignee)

Comment 16

3 years ago
froydnj also pointed out there could be a race condition if this is used across multiple threads, so maybe that's what is happening?

Any thoughts on any of this, Ehsan?
Flags: needinfo?(ehsan)
Sounds like a job for a spin lock!
(Or, can we use std::call_once everywhere?)
(Assignee)

Comment 19

3 years ago
The code in question generates a string "WeakReference<Foo>" for each class Foo, so my patch just creates those strings at compile time, so we don't need a buffer for sprintf.
Oh, I see. That works too! (Though, make sure you only generate that string #ifdef MOZ_REFCOUNTED_LEAK_CHECKING!)

Comment 21

3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #16)
> froydnj also pointed out there could be a race condition if this is used
> across multiple threads, so maybe that's what is happening?
> 
> Any thoughts on any of this, Ehsan?

I'm lost as to what is happening here.  Is the issue reported in the bug with or without your patch?
Flags: needinfo?(ehsan) → needinfo?(continuation)
(Assignee)

Comment 22

3 years ago
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #21)
> I'm lost as to what is happening here.  Is the issue reported in the bug
> with or without your patch?

This is currently happening on tree herder, as tracked in bug 1128049.  The patch here is an attempt to fix that.
Flags: needinfo?(continuation) → needinfo?(ehsan)
(Assignee)

Comment 23

3 years ago
Created attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

This version explicitly does not generate the static strings in non-logging builds.

try run from the previous version looks green:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=2cfae13245c5

I'll try an all-platform build in case there's some weird platform code that I missed.

Note that this removes a type name for nsOfflineCacheUpdate, but I'm pretty sure that it was never being used anyways.
(Assignee)

Updated

3 years ago
Attachment #8596248 - Attachment is obsolete: true

Comment 24

3 years ago
Your patch looks good as a solution for this, but please test this on Windows, since IIRC when I was working on this stuff, sometimes things would only break on Windows.
Flags: needinfo?(ehsan)
(Assignee)

Comment 25

3 years ago
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

I did a full try run and it looks fine:
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=7d97727e87c2

(The orange is just from another patch in the stack where I adjusted the leak threshold a little too low.)
Attachment #8596749 - Flags: review?(ehsan)
(Assignee)

Comment 26

3 years ago
Feel free to pass the review off to an MFBT peer or something...
(Assignee)

Updated

3 years ago
Blocks: 1158158

Updated

3 years ago
Attachment #8596749 - Flags: review?(ehsan) → review+

Comment 27

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/39da629933e1
https://hg.mozilla.org/mozilla-central/rev/39da629933e1
Status: NEW → RESOLVED
Last Resolved: 3 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
(Assignee)

Comment 29

3 years ago
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

Approval Request Comment
[Feature/regressing bug #]: bug 935778
[User impact if declined]: intermittent false failures on TreeHerder
[Describe test coverage new/current, TreeHerder]: this is a failure in tree herder
[Risks and why]: very low, this only affects code we run in debug builds
[String/UUID change made/needed]: none
Attachment #8596749 - Flags: approval-mozilla-beta?
Attachment #8596749 - Flags: approval-mozilla-aurora?
(Assignee)

Comment 30

3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #29)
> [User impact if declined]: intermittent false failures on TreeHerder
Specifically, bug 1128049.


Feel free to request approval in whatever other additional branches, Ryan.
status-firefox38: --- → affected
status-firefox39: --- → affected
Comment on attachment 8596749 [details] [diff] [review]
Use static strings for WeakReference type names.

[Triage Comment]
Taking to simplify the life of sheriffs.
Attachment #8596749 - Flags: approval-mozilla-release+
Attachment #8596749 - Flags: approval-mozilla-beta?
Attachment #8596749 - Flags: approval-mozilla-aurora?
Attachment #8596749 - Flags: approval-mozilla-aurora+

Comment 32

3 years ago
Do we need approval for debug-only patches?
For now, yes. AFAIK, only tests only changesets don't need our approval.
If you think it is crazy, don't hesitate to start a discussion.
(Assignee)

Comment 34

3 years ago
Technically, this could affect the binary we ship users, but the stuff it changes shouldn't actually run.

Comment 35

3 years ago
(In reply to Sylvestre Ledru [:sylvestre] from comment #33)
> For now, yes. AFAIK, only tests only changesets don't need our approval.
> If you think it is crazy, don't hesitate to start a discussion.

I think it's relatively crazy, but not something we need too often, so let's carry on.  :-)

Comment 36

3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #34)
> Technically, this could affect the binary we ship users, but the stuff it
> changes shouldn't actually run.

Err, how?  I r+ed under the assumption that when MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to whitespace.  Did I miss something in the patch?
Flags: needinfo?(continuation)
(Assignee)

Comment 37

3 years ago
(In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from comment #36)
> Err, how?  I r+ed under the assumption that when
> MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to
> whitespace.  Did I miss something in the patch?

The new code turns into whitespace in non-leak-checking builds, but the macro the old code used, MOZ_DECLARE_REFCOUNTED_TYPENAME, is always defined, with some explanatory note about how the compiler probably optimizes it out: http://mxr.mozilla.org/mozilla-central/source/mfbt/RefPtr.h#174 Thus, the binary may be changed by my patch, because we no longer define those methods.
Flags: needinfo?(continuation)

Comment 38

3 years ago
(In reply to Andrew McCreight [:mccr8] from comment #37)
> (In reply to :Ehsan Akhgari (not reading bugmail, needinfo? me!) from
> comment #36)
> > Err, how?  I r+ed under the assumption that when
> > MOZ_REFCOUNTED_LEAK_CHECKING is not defined, all of this stuff expands to
> > whitespace.  Did I miss something in the patch?
> 
> The new code turns into whitespace in non-leak-checking builds, but the
> macro the old code used, MOZ_DECLARE_REFCOUNTED_TYPENAME, is always defined,
> with some explanatory note about how the compiler probably optimizes it out:
> http://mxr.mozilla.org/mozilla-central/source/mfbt/RefPtr.h#174 Thus, the
> binary may be changed by my patch, because we no longer define those methods.

I see, yes, you're right!
https://hg.mozilla.org/releases/mozilla-aurora/rev/838556f52b8d
status-firefox39: affected → fixed
Flags: in-testsuite-
(Assignee)

Comment 40

3 years ago
esr31 is technically affected, but it isn't worth landing there as it hasn't actually shown up as an issue on Treeherder.
status-firefox37: --- → wontfix
status-firefox-esr31: --- → wontfix
https://hg.mozilla.org/releases/mozilla-release/rev/5d903629f9bd
status-firefox38: affected → fixed
https://hg.mozilla.org/releases/mozilla-beta/rev/5d903629f9bd
status-firefox38.0.5: --- → fixed
You need to log in before you can comment on or make changes to this bug.