Closed Bug 420521 Opened 12 years ago Closed 12 years ago

Leaking nsThread and nsTimerImpl running full set of Mochitests

Categories

(Core :: XPCOM, defect, P1)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla1.9beta5

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

(Keywords: memory-leak)

Attachments

(2 files, 1 obsolete file)

python runtests.py --autorun --close-when-done \
  --setenv=XPCOM_MEM_LOG_CLASSES=nsTimerImpl,nsThread \
  --setenv=XPCOM_MEM_REFCNT_LOG=/Users/jwalden/Temp/refcnt-log.txt \
  --leak-threshold=0

...
Serial Numbers of Leaked Objects:
5988 @0x3c02cb30 (1 references; 0 from COMPtrs)
928 @0x3f9feba0 (1 references; 1 from COMPtrs)
ERROR FAIL leaked 132 bytes during test execution (should have leaked no more than 0 bytes)
ERROR FAIL leaked 1 instance of nsThread with size 72 bytes
ERROR FAIL leaked 1 instance of nsTimerImpl with size 60 bytes

See the attachment for the resulting balance trees.

The thread is held alive by the mCallingThread nsCOMPtr in nsTimerImpl, so it's not relevant to analysis -- the timer, and I believe particularly the timer thread, is the root problem.  Take a look at the timer tree and the relevant source (particularly for the addrefs/releases from TimerThread::Run, which just pass the timer thread's ref to nsTimerEvent), and I think the sequence of events is like so:

new timer, master ref in nsRecyclingAllocator (timer at 1 ref total)
InitWithFuncCallback gives TimerThread 1 ref to timer (timer at 2 refs total)
TimerThread::Run passes its ref to raw pointer (timer at 2 refs total)
raw pointer passes its ref to nsTimerEvent (timer at 2 refs total)
nsRefPtr takes ref from nsTimerEvent (timer at 2 refs total)
TimerThread takes 1 ref to timer via timer::Fire (timer at 3 refs total)
nsRefPtr releases 1 ref (timer at 2 refs total)
TimerThread::Run passes its ref to raw pointer (timer at 2 refs total)
???
nsRecyclingAllocator releases master ref (timer at 1 ref total)

It shouldn't be the case that the timer's being inserted in TimerThread::mTimers too late, because the fully-cleared assertion in ~TimerThread didn't fire.  I think what's happening is that after the ref gets passed to a raw pointer, we attempt to post an nsTimerEvent but (silently) fail when mCallingThread->Dispatch fails.  However, I'm not seeing any of the warnings that I think should happen in this case, so I may be wrong.

In any case, I think we have more than enough evidence to figure out what's wrong here and and fix it, and with this fixed we could require (on OS X, no idea about elsewhere) that Mochitests run without refcounted leaks on tinderboxen, which would be huge for preventing leaks from showing up in the future.

Bug 418409 and bug 417855 leak the same things and so could be the same as this bug.
Flags: blocking1.9?
Attached patch Patch (obsolete) — Splinter Review
I must be misreading nsThread::Dispatch somehow, because that's definitely what's failing (I forgot the delete if that failed and saw a single nsTimerEvent leaking after I added MOZ_COUNT_[CD]TOR to nsTimerEvent) and leaking the nsTimerImpl.

I can run Mochitests with this patch without leaking *anything* refcounted!
Attachment #306830 - Flags: superreview?(brendan)
Attachment #306830 - Flags: review?(sayrer)
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Attachment #306830 - Flags: review?(sayrer) → review?(peterv)
Attachment #306830 - Flags: review?(peterv) → review?(dbaron)
Comment on attachment 306830 [details] [diff] [review]
Patch

>+    if (gThread && NS_FAILED(gThread->AddTimer(this))) {
>+      delete event;
>+      return NS_ERROR_NOT_AVAILABLE;

Why not capture and propagate what gThread->AddTimer(this) returned? Better than the funky (based on non-blocking i/o in Unix, EAGAIN originally) NS_ERROR_NOT_AVAILABLE, which might obscure a more precise and critical code?

Thanks for fixing.

/be
(In reply to comment #2)
> Why not capture and propagate what gThread->AddTimer(this) returned? Better
> than the funky (based on non-blocking i/o in Unix, EAGAIN originally)
> NS_ERROR_NOT_AVAILABLE, which might obscure a more precise and critical code?

In this case "might" is "would not":

http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/xpcom/threads/TimerThread.cpp&rev=1.28&mark=332#325

Given that, returning the exact error seems like a useless data dependency and a noisy, mostly-useless temporary.  Finally, we swallow the error further up the stack anyway, since TimerThread::Run can't reasonably return the error, so returning a precise error helps nobody.
Waldo: is NS_ERROR_NOT_AVAILABLE really the right code, though?

/be
I'm not really sure there is one; the current thread not being available to handle dispatched events does seem like NS_ERROR_NOT_AVAILABLE to me.  I figured it was better than NS_ERROR_FAILURE, at least; if you have a better suggestion, I'm all ears.
Comment on attachment 306830 [details] [diff] [review]
Patch

r=me, even though NOT_AVAILABLE still rubs me the wrong way, but really, any error will do. Wasting 31 bits on nsresult, sigh. Mozilla 2 exceptions or equivalent, I can't wait.

dbaron suggested I'm the peer and he's the SR. I think he's right.

/be
Attachment #306830 - Flags: superreview?(dbaron)
Attachment #306830 - Flags: superreview?(brendan)
Attachment #306830 - Flags: review?(dbaron)
Attachment #306830 - Flags: review+
Comment on attachment 306830 [details] [diff] [review]
Patch

>           // We are going to let the call to PostTimerEvent here handle the
>           // release of the timer so that we don't end up releasing the timer
>           // on the TimerThread instead of on the thread it targets.
>-          timer->PostTimerEvent();
>+          if (NS_FAILED(timer->PostTimerEvent()))
>+            NS_RELEASE(timer);

Why is it OK to ignore the comment above this?  We shouldn't release the timer off of the thread it targets since it could own objects that are not-threadsafe and meant to be accessed only on that thread.  Or are *all* failure modes of PostTimerEvent guaranteed to leave somebody else with a reference to the timer as well?

>-private:

Seems like you should leave this, and then:

>   nsTimerEvent* event = new nsTimerEvent(this, mGeneration);

make this an nsRefPtr<nsTimerEvent> so that you;
 * don't have to call delete on this event in all the failure cases
 * don't use event when nobody holds a reference to it.

As far as the error codes, it seems like AddTimer's failure case should perhaps return NS_ERROR_OUT_OF_MEMORY and you should propagate that, but I don't feel strongly.

Do you know why nsThread::Dispatch is failing?
(In reply to comment #7)
> Why is it OK to ignore the comment above this?  We shouldn't release the timer
> off of the thread it targets since it could own objects that are not-threadsafe
> and meant to be accessed only on that thread.  Or are *all* failure modes of
> PostTimerEvent guaranteed to leave somebody else with a reference to the timer
> as well?

And if this is guaranteed not to be the last reference, you should:
 * change the comment to explain why
 * assert that it's not the last reference by using NS_RELEASE2 and asserting that the result of Release is not 0.
Attachment #306830 - Flags: superreview?(dbaron) → superreview-
(In reply to comment #7)
> Why is it OK to ignore the comment above this?

> Or are *all* failure modes of PostTimerEvent guaranteed to leave somebody
> else with a reference to the timer as well?

> And if this is guaranteed not to be the last reference, you should:
>  * change the comment to explain why
>  * assert that it's not the last reference by using NS_RELEASE2 and asserting
> that the result of Release is not 0.

Timers are required to be held by the user until they fire (see IDL), so anyone not doing this is in error.  Further, there's code in Release to check for this condition to prevent the timer thread from holding the only reference.  I clarified this in a comment in the new patch.
 

> Do you know why nsThread::Dispatch is failing?

No; at a closer source read, I think nsEventQueue::PutEvent is failing due to an allocation failure, which seems exceedingly odd but is the only way I can see for DispatchEvent to fail on a non-sync event without warning.  It would be possible to check (with some effort) by setting a breakpoint inside the relevant if and retrying the Dispatch to see what happened, but given I've never been able to reproduce this without running the entire set (and given that I haven't added debugger support to runtests.py.in yet) I can't quite justify spending the time to do so.
Attachment #306830 - Attachment is obsolete: true
Attachment #307924 - Flags: superreview?(dbaron)
Attachment #307924 - Flags: review+
Duplicate of this bug: 418409
Comment on attachment 307924 [details] [diff] [review]
With better comments

sr=dbaron
Attachment #307924 - Flags: superreview?(dbaron) → superreview+
Comment on attachment 307924 [details] [diff] [review]
With better comments

Go-go-gadget leak-begone!
Attachment #307924 - Flags: approval1.9?
Comment on attachment 307924 [details] [diff] [review]
With better comments

This is a blocker... P1, too!
Attachment #307924 - Flags: approval1.9?
Fixt.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9beta5
You need to log in before you can comment on or make changes to this bug.