Intermittent TEST-UNEXPECTED-FAIL | leakcheck | default process: 84 bytes leaked (Mutex, nsBaseAppShell)

Status

RESOLVED FIXED
Type: defect
Priority: P3
Severity: blocker
Opened: 4 years ago
Closed: 2 years ago

People

(Reporter: philor, Assigned: mccr8)

Tracking

({intermittent-failure, memory-leak})

Version: Trunk

Firefox Tracking Flags

(firefox45 wontfix, firefox47 wontfix, firefox48 wontfix, firefox49 wontfix, firefox-esr45 fixed, firefox50 fixed, firefox51 fixed, firefox52 fixed)

Details

(Whiteboard: tpi:-)

Attachments

(4 attachments, 2 obsolete attachments)

Blocks: 1195295
So this is an orange that existed before bug 1195295 landed, but the impression I'm getting here is that bug 1195295 somehow tickled it and made it more frequent.

The vast majority of the leaks also seem to take place on our Windows machines (though I see one instance of it hitting android-4-3-armv7-api11 in comment 7).

Hey mccr8 - I'm not amazing at sussing out intermittent leaks, but I think you've got a reputation for being pretty good at it. Are there any steps you can suggest for me to investigate this?
Flags: needinfo?(continuation)
Usually when you see big surprises like that one Android instance, it's just a misstar because someone didn't notice they still had another job selected - that one wasn't a leak at all.
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #9)
> Hey mccr8 - I'm not amazing at sussing out intermittent leaks, but I think
> you've got a reputation for being pretty good at it. Are there any steps you
> can suggest for me to investigate this?

This is a small enough leak that just reading code is probably your best bet. What might keep this alive? Racey shutdown issues could cause a leak like this. For instance, I saw an intermittent leak one time because there was an nsITimer that wasn't shut down properly, and if it happened to fire at some point during shutdown (I surmised) then it might leak.

Looking at nsBaseAppShell, I noticed that it adds itself as a strong observer to the observer service, and it also calls into nsIThreadInternal::SetObserver, whatever that is, and it does not clear itself from either of those. Maybe those things are an issue?

Another good approach would be to try refcount logging. Even if you can't reproduce the leak, you can at least see what releases this object late in shutdown, and try to think about what might, depending on timing, sometimes fail to do that release. There's some information about it here:
  https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing
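As a concrete starting point, refcount logging is driven by environment variables; the exact variable names and the log-balancing scripts are described on the page linked above, so treat the names below as my recollection rather than gospel:

```shell
# Sketch of a refcount-logging run; variable names are from the linked
# "Refcount tracing and balancing" page and may need double-checking.
export XPCOM_MEM_LOG_CLASSES=nsBaseAppShell   # class whose refs to trace
export XPCOM_MEM_REFCNT_LOG=refcnt.log        # stack for every AddRef/Release
./mach run
# Afterwards, the balancing scripts under tools/rb/ in mozilla-central can
# pair up the AddRef/Release stacks in refcnt.log and show the odd one out.
```

The payoff is exactly what the comment describes: you can see which release is missing late in shutdown even without reproducing the leak locally.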
Component: General → Widget
Flags: needinfo?(continuation)
Oopsie, a little more serious now, since it's leaking 100% on Aurora on WinXP and Win8, though a mere 50 or 60% on Win7.
Severity: normal → blocker
Blocks: 1234697
I'll take this.
Assignee: nobody → janus926
Backing out 18eec0849a1f seems to fix the permaleak on Aurora.
  https://treeherder.mozilla.org/#/jobs?repo=try&revision=557502e5b88a
(In reply to Andrew McCreight [:mccr8] (PTO-ish) from comment #21)
> Backing out 18eec0849a1f seems to fix the permaleak on Aurora.
>   https://treeherder.mozilla.org/#/jobs?repo=try&revision=557502e5b88a

It doesn't fix the intermittent itself, though. I can still see one in your backout commit:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=76d888a191cf&filter-job_type_symbol=bc7&filter-ref_data_name=Windows

But it now occurs at a significantly lower frequency, so good job!
(In reply to Andrew McCreight [:mccr8] (PTO-ish) from comment #11)
> Looking at nsBaseAppShell, I noticed that it adds itself as a strong
> observer to the observer service, and it also calls into
> nsIThreadInternal::SetObserver, whatever that is, and it does not clear
> itself from either of those. Maybe those things are an issue?

nsThread::ThreadFunc() sets mObserver to null at https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#422.
According to the log [1], the refcounts added in nsAppShell::ScheduleNativeEventCallback() [2] and released in nsAppShell::EventWindowProc() [3] are not balanced.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=2d45f5a81e34
[2] https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsAppShell.cpp#310
[3] https://dxr.mozilla.org/mozilla-central/source/widget/windows/nsAppShell.cpp#191
(In reply to Ting-Yu Chou [:ting] from comment #26)
> the refcnt added/released from nsAppShell::ScheduleNativeEventCallback() [2] and
> nsAppShell::EventWindowProc() [3] are not balanced.

The manual addref/release was added a long time ago [1] [2], and doesn't seem necessary now because:

1. nsBaseAppShell lives as long as the nsThread that initializes it:
   nsBaseAppShell::Init() sets itself as the |mObserver| of the nsThread [3], and nsThread::ThreadFunc() sets |mObserver| to null only after all the pending events are processed [4].
2. nsAppShell::ScheduleNativeEventCallback() is just for triggering NativeEventCallback(); I don't see a reason it needs to keep the nsAppShell alive.

:bsmedberg, is my understanding correct?

[1] https://hg.mozilla.org/users/gszorc_mozilla.com/gecko-full/rev/e6d813355e8b
[2] https://hg.mozilla.org/users/gszorc_mozilla.com/gecko-full/rev/820e65270436
[3] https://dxr.mozilla.org/mozilla-central/source/widget/nsBaseAppShell.cpp#53
[4] https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#447
Flags: needinfo?(benjamin)
froydnj might also have an opinion on comment 27.
Flags: needinfo?(nfroyd)
I don't understand why that manual refcounting exists, but I also don't understand why you say they aren't balanced.

I bet this has something to do with keeping the appshell alive during the shutdown process after it would normally be dead.
Flags: needinfo?(benjamin)
Ting-Yu's try run suggests that we're calling ScheduleNativeEventCallback (and addref'ing the app shell), but we're never getting to EventWindowProc to release the extra reference.  So why isn't the message that we're posting getting processed?

What does the PostMessage() call in ScheduleNativeEventCallback return?  (I'm assuming the message queue isn't full at this point, but you never know...)

Is it possible that we're tearing down the window while unresolved messages remain in the message queue and therefore we're dropping the message on the floor?  (I have no idea what happens to messages in the queue when you try to tear down the window.)

Do any of these questions make sense, since I just read about PostMessage() five minutes ago and am tossing off theories? ;)
Flags: needinfo?(nfroyd)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #31)
> I don't understand why that manual refcounting exists, but I also don't
> understand why you say they aren't balanced.

The commit message of the manual refcounting [1] is: 

  Run method now does an AddRef so it stays alive until the last message can
  come through the pump. 

Note there was a while loop in nsAppShell::Run() calling GetMessage(), so the manual refcounting was added to keep nsAppShell alive until all the messages were processed. But nsAppShell::Run() no longer works like that, so the original reason doesn't seem valid anymore.

They're not balanced because, based on the logs, the number of NS_ADDREF_THIS() calls in nsAppShell::ScheduleNativeEventCallback() and the number of NS_RELEASE() calls in nsAppShell::EventWindowProc() are not the same.

> 
> I bet this has something to do with keeping the appshell alive during the
> shutdown process after it would normally be dead.

I don't see any code that needs nsAppShell::ScheduleNativeEventCallback() to keep the appshell alive during the shutdown process after it would normally be dead, as it's just for triggering NativeEventCallback() now.

As I'd like to remove the manual refcounting, are there other engineers familiar with this part of the code I should talk to?

(In reply to Nathan Froyd [:froydnj] from comment #32)
> to release the extra reference.  So why isn't the message that we're posting
> getting processed?

EventWindowProc() is registered as lpfnWndProc; when it gets called is not under our control.

>
> What does the PostMessage() call in ScheduleNativeEventCallback return? 
> (I'm assuming the message queue isn't full at this point, but you never
> know...)

As my proposal is to remove the irrelevant refcounting, the return value of PostMessage() doesn't make a difference.

> 
> Is it possible that we're tearing down the window while unresolved messages
> remain in the message queue and therefore we're dropping the message on the
> floor?  (I have no idea what happens to messages in the queue when you try
> to tear down the window.)

If you mean the queue PostMessage() posts to, I don't see us calling GetMessage() or PeekMessage() while tearing down. If you mean the event queue nsThread maintains, nsThread::ThreadFunc() seems to process all of those events before returning.

[1] https://hg.mozilla.org/users/gszorc_mozilla.com/gecko-full/rev/e6d813355e8b
Posted patch wip (obsolete) — Splinter Review
NI for comment 33, thanks.
Flags: needinfo?(benjamin)
> They're not balanced since based on the logs, the number of times
> nsAppShell::ScheduleNativeEventCallback() calling NS_ADDREF_THIS() and
> nsAppShell::EventWindowProc() calling NS_RELEASE() are not the same.

This is the bug, though, right? As far as I can tell, those are always supposed to be balanced calls?

I'm worried that the change you've proposed here may be papering over some other real issue, since we don't understand why it's leaking in the first place.

> As I'd like to remove the manual refcounting, are there some other engineers
> familiar with this part I should talk to? 

I doubt that there is a real expert on this code now. Nathan and I and maybe Jim Mathies all sorta know it, but it's terrifying code and so we're all going to be extremely conservative about accepting changes unless they are well-understood.
Flags: needinfo?(benjamin)
(In reply to Ting-Yu Chou [:ting] from comment #33)
> > to release the extra reference.  So why isn't the message that we're posting
> > getting processed?
> 
> EventWindowProc() is registered as lpfnWndProc, when will it get called is
> not in our control.

I was wrong, EventWindowProc() is called when nsAppShell::ProcessNextNativeEvent() calls DispatchMessageW().

It seems nsAppShell::ScheduleNativeEventCallback() somehow gets called many times before the message is dispatched, so there are still queued messages when the process exits, which leaks the appshell. Will dig into this further.
Posted file nsprlog.zip
(In reply to Ting-Yu Chou [:ting] from comment #38)
> (In reply to Ting-Yu Chou [:ting] from comment #33)
> > > to release the extra reference.  So why isn't the message that we're posting
> > > getting processed?
> > 
> > EventWindowProc() is registered as lpfnWndProc, when will it get called is
> > not in our control.
> 
> I was wrong, EventWindowProc() is called when
> nsAppShell::ProcessNextNativeEvent() calls DispatchMessageW().
> 
> It seems somehow nsAppShell::ScheduleNativeEventCallback() gets called many
> times before dispatching the message, so there're still queued messages
> before the process exits and leak the appshell. Will dig this further.

I've attached an nsThread:5 NSPR log from one of these failures. I concur that the leak is probably due to a runnable still being queued when the process wants to exit. Is there a way of logging or knowing what kind of runnable this is?
Flags: needinfo?(janus926)
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #39)
> I've attached an nsThread:5 NSPR log from one of these failures. I concur
> that the leak is probably due to a runnable still being queued when the
> process wants to exit. Is there a way of logging or knowing what kind of
> runnable this is?

I am not sure how to do this.

Though per my understanding, the leak is what comment 26 describes. And it could be fixed by either 1) getting rid of the addref/release in nsAppShell::ScheduleNativeEventCallback() and nsAppShell::EventWindowProc(), or 2) doing nothing in nsAppShell::ScheduleNativeEventCallback() if there's already a "nsAppShell:EventID" message in the Windows message queue.

But per bsmedberg's concern in comment 36, I should figure out how the leak happens.
Flags: needinfo?(janus926)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #36)
> > They're not balanced since based on the logs, the number of times
> > nsAppShell::ScheduleNativeEventCallback() calling NS_ADDREF_THIS() and
> > nsAppShell::EventWindowProc() calling NS_RELEASE() are not the same.
> 
> This is the bug, though, right? As far as I can tell, those are always
> supposed to be balanced calls?

Yes, they are supposed to be, but unfortunately they are not.

Basically, two functions lead to nsAppShell::ScheduleNativeEventCallback() (the addref):

  nsThread::PutEvent(),
  nsThread::ProcessNextEvent()

nsThread::PutEvent() goes through nsBaseAppShell::OnDispatchedEvent(), which calls ScheduleNativeEventCallback() only if !mNativeEventPending, i.e., it expects at most one native event in the queue.

nsThread::ProcessNextEvent() goes through nsAppShell::ProcessNextNativeEvent(), which calls ScheduleNativeEventCallback() whenever |timeSinceLastNativeEventScheduled| is greater than 1 second. Note it does *not* check whether there are already queued native events.

This means there can be more than one native event queued when nsThread::ThreadFunc() calls NS_ProcessNextEvent()/NS_ProcessPendingEvents() to clean up queued Gecko events while exiting, and the code does not guarantee that all the native events will be handled. So there is a chance that native events are still queued in the Windows message queue when the process exits, which leaks the appshell.

The solution can be one of those:

1) Make sure all the queued native events are dispatched in nsAppShell::Exit(), releasing their refcounts
2) Get rid of the addref/release in nsAppShell::ScheduleNativeEventCallback() and nsAppShell::EventWindowProc()
3) Have nsAppShell::ProcessNextNativeEvent() call ScheduleNativeEventCallback() only if there's no pending native event

I chose 2 for the reasons in comment 27. Also, if the addref/release are unnecessary, adding more code to "fix" them seems odd. But if there are other concerns, I will pick 3.

What do you think? Please let me know if anything is still unclear.
Flags: needinfo?(nfroyd)
Flags: needinfo?(benjamin)
(In reply to Ting-Yu Chou [:ting] from comment #33)
> > What does the PostMessage() call in ScheduleNativeEventCallback return? 
> > (I'm assuming the message queue isn't full at this point, but you never
> > know...)
> 
> As my proposal is to remove the irrelevant refcounting, the return value of
> PostMessage() doesn't make a difference.

Confirmed it returns true.
(In reply to Ting-Yu Chou [:ting] from comment #41)
> The solution can be one of those:
> 
> 1) Make sure all the queued native events are dispatched to release refcnt
> in nsAppShell::Exit()

What about doing something like the OS X code?

http://mxr.mozilla.org/mozilla-central/source/widget/cocoa/nsAppShell.mm#708

> 2) Get rid of addref/release in nsAppShell::ScheduleNativeEventCallback()
> and nsAppShell::EventWindowProc()

I *think* the higher level stuff (both the startup code and the main thread itself) hold all the references we need here, but doing this still makes me nervous.

> 3) Let nsAppShell::ProcessNextNativeEvent() call
> ScheduleNativeEventCallback() only if there's no pending native event

Why not check for a pending native event in ScheduleNativeEventCallback itself?
Flags: needinfo?(nfroyd)
What do you think of froydnj's suggestions? I'll happily test a patch - especially since bug 1195295 (which is blocked by this) seems to trigger the leak reliably.
Flags: needinfo?(janus926)
Posted patch patch v1 (obsolete) — Splinter Review
It's this patch:

> > 3) Let nsAppShell::ProcessNextNativeEvent() call
> > ScheduleNativeEventCallback() only if there's no pending native event
> 
> Why not check for a pending native event in ScheduleNativeEventCallback
> itself?

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=be5c6ed1fd7e

(In reply to Mike Conley (:mconley) - Needinfo me! from comment #45)
> What do you think of froydnj's suggestions? I'll happily test a patch -
> especially since bug 1195295 (which is blocked by this), seems to trigger
> the leak reliably.

Sorry for being late; I was hoping to have bsmedberg's opinion on comment 41 as well. Note the Try run above does not include the backout of bug 1195295, and the leak is gone.
Attachment #8705454 - Attachment is obsolete: true
Flags: needinfo?(janus926)
Flags: needinfo?(benjamin)
Attachment #8709313 - Flags: review?(nfroyd)
Comment on attachment 8709313 [details] [diff] [review]
patch v1

Review of attachment 8709313 [details] [diff] [review]:
-----------------------------------------------------------------

First off, thanks for the patch.  The try run shows that it does indeed fix the leak.

What worries me is that this patch fixes the leak, while leaving questions like:

* Why do we have these extra messages in the message queue at shutdown?
* Why does the patch for bug 1195294 cause this problem to happen significantly more frequently?

completely unanswered.  And so we don't understand the root cause of the leak, which means this patch might just be papering over symptoms.  And as this code is delicate, terrifying code, I don't want to accept patches without a really solid understanding/explanation of the problem.  (Comment 41 has a great explanation of the situation the patch fixes, but is lacking in answers to the above questions.)

This patch changes the amount of interaction between Gecko's event loop and the native Windows event loop but doesn't include an explanation of why doing this is 100% safe, which also worries me.  (For instance, did the extra messages we were inserting into Windows's message queue actually do anything useful?  Why or why not?  I am willing to believe that they didn't do anything useful, since nsBaseAppShell::NativeEventCallback has an early return for mNativeEventPending, but the patch didn't include any justification of this, which is unhelpful.)
Attachment #8709313 - Flags: review?(nfroyd)
Will come back with answers for comment 47.
(In reply to Nathan Froyd [:froydnj] from comment #43)
> > 2) Get rid of addref/release in nsAppShell::ScheduleNativeEventCallback()
> > and nsAppShell::EventWindowProc()
> 
> I *think* the higher level stuff (both the startup code and the main thread
> itself) hold all the references we need here, but doing this still makes me
> nervous.

Could you please tell me what makes you nervous? Is there anything you think may be broken, or anything the test cases don't cover? Thank you.
Flags: needinfo?(nfroyd)
(In reply to Ting-Yu Chou [:ting] from comment #49)
> (In reply to Nathan Froyd [:froydnj] from comment #43)
> > > 2) Get rid of addref/release in nsAppShell::ScheduleNativeEventCallback()
> > > and nsAppShell::EventWindowProc()
> > 
> > I *think* the higher level stuff (both the startup code and the main thread
> > itself) hold all the references we need here, but doing this still makes me
> > nervous.
> 
> Could you please tell me what makes you nervous? Are there anything you
> think may be broken, or anything that test cases not covered? Thank you.

Because when we push that message into the message queue, we need to be absolutely sure that the appshell is going to be around when the message gets processed.  Otherwise we have a use-after-free vulnerability, which is bad.  The current refcounting scheme provides us that guarantee.

I am unsure whether we can process messages after the startup code and the main thread release their references to the appshell.  I am also quite willing to believe that our tests do not provide sufficient coverage for conditions like this.
Flags: needinfo?(nfroyd)
ni to :ting for comment 50.
Flags: needinfo?(janus926)
(In reply to Nathan Froyd [:froydnj] from comment #50)
> I am unsure whether we can process messages after the startup code and the
> main thread release their references to the appshell.

Do we need/expect to process messages after these? From nsThread::ThreadFunc(), it seems we do not?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #47)
> * Why do we have these extra messages in the message queue at shutdown?

The leaked-global-windows check [1] that runs before the test finishes goes through the entire heap's CC logs, using a timer to process 5000 entries per run [2]. The extra native events are enqueued during this period like this:

02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ProcessNextNativeEvent > ScheduleNativeEventCallback
02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback mNumQueued=1
02:14:04     INFO -  !!! 0FBD13E0 nsBaseAppShell::OnDispatchedEvent > ScheduleNativeEventCallback
02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback mNumQueued=2
02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ProcessNextNativeEvent > ScheduleNativeEventCallback
02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback mNumQueued=2
02:14:06     INFO -  !!! 0FBD13E0 nsBaseAppShell::OnDispatchedEvent > ScheduleNativeEventCallback
02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback mNumQueued=3

Note that nsAppShell::ProcessNextNativeEvent() crosses the 1s threshold and posts a native event, and then the following nsBaseAppShell::OnDispatchedEvent() also posts another native event because mNativeEventPending has been cleared. Repeat this a few more times and more native events pile up in the queue.

> * Why does the patch for bug 1195294 cause this problem to happen
> significantly more frequently?

The patch of bug 1195294 somehow causes a lot more JS function objects in the CC logs and extends the time spent processing them, so more native events are queued (>10), making the problem happen more frequently. I am still checking what they are.

[1] https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/browser-test.js#702
[2] https://dxr.mozilla.org/mozilla-central/source/testing/mochitest/cc-analyzer.js#47
Flags: needinfo?(janus926)
(In reply to Ting-Yu Chou [:ting] from comment #53)
> The patch of bug 1195294 somehow causes a lot more JS function objects in CC
               ^^^^^^^^^^^

Correction: bug 1195295
(In reply to Ting-Yu Chou [:ting] from comment #54)
> (In reply to Ting-Yu Chou [:ting] from comment #53)
> > The patch of bug 1195294 somehow causes a lot more JS function objects in CC
>                ^^^^^^^^^^^
> 
> Correction: bug 1195295

Note that the patch in bug 1195295 removes a sync message from content to the parent when a new tab loads up. With this sync message gone, we might see a shift in timings - though I'm not sure why that'd put so many more JS functions into the cycle collector.
Gentle ping for nfroyd on the needinfo in comment 52
(In reply to Ting-Yu Chou [:ting] from comment #53)
> (In reply to Nathan Froyd [:froydnj] from comment #47)
> > * Why do we have these extra messages in the message queue at shutdown?
> 
> The leaked global windows check [1] before test finishes will go through
> entire heap's CC logs by using a timer to process 5000 entries for each run
> [2]. The extra native events are enqueued during this period like this:
> 
> 02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ProcessNextNativeEvent >
> ScheduleNativeEventCallback
> 02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback
> mNumQueued=1
> 02:14:04     INFO -  !!! 0FBD13E0 nsBaseAppShell::OnDispatchedEvent >
> ScheduleNativeEventCallback
> 02:14:04     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback
> mNumQueued=2
> 02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ProcessNextNativeEvent >
> ScheduleNativeEventCallback
> 02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback
> mNumQueued=2
> 02:14:06     INFO -  !!! 0FBD13E0 nsBaseAppShell::OnDispatchedEvent >
> ScheduleNativeEventCallback
> 02:14:06     INFO -  !!! 0FBD13E0 nsAppShell::ScheduleNativeEventCallback
> mNumQueued=3
> 
> Note nsAppShell::ProcessNextNativeEvent() cross over 1s threshold and post a
> native event, then the following nsBaseAppShell::OnDispatchedEvent() also
> post another native event as mNativeEventPending has been cleared. Repeat
> this more times, there will be more native events queued.

But these are still native events being queued before the browser actually shuts down, right?  We still have plenty of time to process events (various setTimeouts that will return control to the event loop, browser shutdown itself, etc.); the CC analyzer will implicitly spin the event loop in situations like this.  Why are these new messages not being processed appropriately?
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #58)
> But these are still native events being queued before the browser actually
> shuts down, right? We still have plenty of time to process events (various
> setTimeouts that will return control to the event loop, browser shutdown
> itself, etc.); the CC analyzer will implicitly spin the event loop for
> situations like this  Why are these new messages not being processed
> appropriately?

The number of queued native events decreases but does not reach 0 before shutdown, and there's no guarantee that it will, is there? nsThread::ThreadFunc() only clears the queued Gecko events. Note that native events only get processed when nsBaseAppShell::OnProcessNextEvent() is called, and in the meantime there are still chances to enqueue more native events.

Could you please clarify comment 52? Thanks.
Flags: needinfo?(nfroyd)
(In reply to Ting-Yu Chou [:ting] from comment #59)
> (In reply to Nathan Froyd [:froydnj] from comment #58)
> > But these are still native events being queued before the browser actually
> > shuts down, right? We still have plenty of time to process events (various
> > setTimeouts that will return control to the event loop, browser shutdown
> > itself, etc.); the CC analyzer will implicitly spin the event loop for
> > situations like this  Why are these new messages not being processed
> > appropriately?
> 
> The number of queued native events decreases but not down to 0 before shuts
> down, and there's no guarantee of it, isn't it? nsThread::ThreadFunc() only
> clears all the queued gecko events. Note native events only get processed
> when nsBaseAppShell::OnProcessNextEvent() is called, also there're still
> chances to enqueue more native events.
> 
> Could you please clarify comment 52? Thanks.

Well, if there was a guarantee that the queued native messages got processed, we wouldn't be having this conversation. :)

I don't follow the last half of your comment ("Note native events...") and the reference to nsThread::ThreadFunc makes me think there's some confusion about how the event loop for the main thread works.  To clarify what's going on with event loops here: the main thread doesn't run nsThread::ThreadFunc.  The main thread gets set up by nsThreadManager::Init:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#95

which is called by NS_InitXPCOM2:

http://mxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#549

This function gets called from several different places; the most important for our purposes is from ScopedXPCOMStartup:

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#1539

which is a member of the class XREMain:

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3108
http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#4414

which we create in XRE_main:

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#4519

and then eventually wind up calling XREMain::XRE_mainRun:

http://mxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#4044

which calls nsAppStartup::Run:

http://mxr.mozilla.org/mozilla-central/source/toolkit/components/startup/nsAppStartup.cpp#264

which asks its appshell to run:

http://mxr.mozilla.org/mozilla-central/source/widget/nsBaseAppShell.cpp#148
http://mxr.mozilla.org/mozilla-central/source/widget/windows/nsAppShell.cpp#246

Notice that Windows's nsAppShell has its own override for Run, but which ends up calling nsBaseAppShell::Run anyway, after doing some Windows-specific things.

Anyway, we can see in nsBaseAppShell::Run that we're going to run the current MessageLoop.  This bit exists primarily to integrate with the IPC bits for e10s, but also interacts with the event loop capabilities of nsThread.  It's worth pointing out that we initialized the current event loop all the way back when we initialized XPCOM:

http://mxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#524

Since we're passing TYPE_MOZILLA_UI for the kind of message loop, we arrive here:

http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#109

or here, for child processes:

http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_loop.cc#112

There's not a huge amount of difference between the two kinds of pumps; both of them eventually wind up at MessagePump::Run:

http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#77

Reading through that function, you can see that there's some alternation between running Gecko runnables (the NS_ProcessNextEvent bits) and MessageLoop tasks (the Do*Work bits).  Since we shouldn't be running very many MessageLoop tasks (as those are primarily for IPC), we'll focus on the NS_ProcessNextEvent bits.  NS_ProcessNextEvent itself is just a shortcut for nsIThread::ProcessNextEvent:

http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsThreadUtils.h#167
http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#881

There's a lot of stuff going on here; the core bits are getting an event out of the queue and running it:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#976

But before we do that, we need to notify our thread observers that we're doing something interesting:

http://mxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#964

In the main thread's case, our thread observer is the appshell itself, set up when the appshell was initialized:

http://mxr.mozilla.org/mozilla-central/source/widget/nsBaseAppShell.cpp#53

And that is when the appshell's OnProcessNextEvent is called (!).  (It's worth noting that a similar chain of reasoning holds for non-main threads, as NS_ProcessPendingEvents winds up calling nsThread::ProcessNextEvent.)

What comment 52 was saying is that we still have plenty of chances to go through the event loop, giving nsAppShell::ProcessNextNativeEvent plenty of opportunity to drain the messages.  Why are we not doing that?  It's also interesting that we're generating so many more native events with bug 1195295; investigating why we're doing that would also be worthwhile.
Flags: needinfo?(nfroyd)
(In reply to Nathan Froyd [:froydnj] from comment #60)
> What comment 52 was saying is that we still have plenty of chances to go
> through the event loop, giving nsAppShell::ProcessNextNativeEvent plenty of
> opportunity to drain the messages.  Why are we not doing that?  It's also

Thanks a lot for walking me through the code and correcting me that main thread does not run nsThread::ThreadFunc :)

One thing I don't follow is why there are plenty of chances for nsAppShell::ProcessNextNativeEvent to drain the messages.

From the Try run [1], there are 6 seconds between CCAnalyzer finishing its processing and the process exiting:

  07:15:57     INFO -  CCAnalyzer.processingCount=98373
  07:15:57     INFO -  Completed ShutdownLeaks collections in process 2168
  07:16:03     INFO -  TEST-INFO | Main app process: exit 0

and what I meant by the last half of comment 59 ("Note native events...") was that native events only get popped when nsThread::ProcessNextEvent is called. So we need enough nsThread::ProcessNextEvent calls to drain the >10 native events we posted within those 6 seconds, while nsAppShell::ProcessNextNativeEvent and nsBaseAppShell::OnDispatchedEvent can still enqueue more native events, and there are real Windows native events in the queue as well.

> interesting that we're generating so many more native events with bug
> 1195295; investigating why we're doing that would also be worthwhile.

The reason why many more native events are queued was explained in comment 53. So I'd like to leave the investigation of what creates more JS functions in CC to bug 1195295, as it does not affect how we fix the appshell leak here.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=487f96a26495
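The drain-rate concern above can be illustrated with a toy model. Everything here is hypothetical (the function name, the per-spin numbers); it is not the actual appshell code, just a sketch of why a producer that can post per spin is never drained by a consumer that pops one message per spin:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy model (not Mozilla code): native events are only popped when the
// Gecko event loop spins, but each spin can also enqueue new ones.
// Returns how many native events remain after `spins` iterations.
// `enqueuePerSpin` models OnDispatchedEvent posting without checking
// for an already-pending native event.
std::size_t leftoverNativeEvents(std::size_t initial,
                                 std::size_t spins,
                                 std::size_t enqueuePerSpin) {
  std::deque<int> queue(initial, 0);
  for (std::size_t i = 0; i < spins; ++i) {
    // ProcessNextNativeEvent pops at most one message per spin...
    if (!queue.empty()) {
      queue.pop_front();
    }
    // ...while dispatches during the spin may push several more.
    for (std::size_t j = 0; j < enqueuePerSpin; ++j) {
      queue.push_back(0);
    }
  }
  return queue.size();
}
```

With `enqueuePerSpin >= 1` the queue never drains no matter how many spins happen before exit, which is the shape of the problem described above.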
(In reply to Nathan Froyd [:froydnj] from comment #47)
> This patch changes the amount of interaction between Gecko's event loop and
> the native Windows event loop but doesn't include an explanation of why
> doing this is 100% safe, which also worries me.

I can't prove this is 100% safe :(

With the current code, nsBaseAppShell::NativeEventCallback can get called soon after nsBaseAppShell::OnDispatchedEvent() schedules a native event, if there are already native events queued. For example, assuming nsBaseAppShell::mNativeEventPending is false and nsAppShell::ProcessNextNativeEvent does not cross the 1s threshold, the current code does:

  Q = |N0|N1|X|X|    initial
      |N0|N1|X|X|N2| nsBaseAppShell::OnDispatchedEvent   // enqueue native event N2
      |N1|X|X|N2|    nsAppShell::ProcessNextNativeEvent  // nsBaseAppShell::NativeEventCallback gets called for N0
      |N1|X|X|N2|N3| nsBaseAppShell::OnDispatchedEvent   // enqueue native event N3
      |X|X|N2|N3|    nsAppShell::ProcessNextNativeEvent  // nsBaseAppShell::NativeEventCallback gets called for N1

But if there's at most one native event, the 2nd nsBaseAppShell::NativeEventCallback will come later:

  Q = |N0|X|X|       initial
      |N0|X|X|       nsBaseAppShell::OnDispatchedEvent   // does not enqueue native event
      |X|X|          nsAppShell::ProcessNextNativeEvent  // nsBaseAppShell::NativeEventCallback gets called for N0
      |X|X|N1|       nsBaseAppShell::OnDispatchedEvent   // enqueue native event N1
      |X|N1|         nsAppShell::ProcessNextNativeEvent  // windows native event X
      |N1|           nsAppShell::ProcessNextNativeEvent  // windows native event X
      ||             nsAppShell::ProcessNextNativeEvent  // nsBaseAppShell::NativeEventCallback gets called for N1

I don't know how this would impact the event handling.
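The two traces above can be checked with a small standalone model. This is only a sketch; `AppShellModel` and its members are hypothetical names standing in for the real nsBaseAppShell/nsAppShell machinery:

```cpp
#include <cassert>
#include <deque>

// Toy model of the two traces above (not the real appshell code).
// 'N' = a ScheduleNativeEventCallback message, 'X' = a real Windows message.
// onDispatched() posts an 'N'; when onlyIfNonePending is true it skips
// posting if an 'N' is already queued (the at-most-one-native-event policy).
struct AppShellModel {
  std::deque<char> queue;
  bool onlyIfNonePending;

  void onDispatched() {
    if (onlyIfNonePending) {
      for (char c : queue) {
        if (c == 'N') return;  // an N is already pending; don't post another
      }
    }
    queue.push_back('N');
  }

  // Pops one message; returns true if it was an 'N', i.e. this call
  // would have run NativeEventCallback.
  bool processNextNative() {
    if (queue.empty()) return false;
    char c = queue.front();
    queue.pop_front();
    return c == 'N';
  }
};
```

Under the current policy both ProcessNextNativeEvent calls in the first trace run NativeEventCallback back to back; under the at-most-one policy the second callback only runs after the intervening Windows messages are pumped, which is the behavior difference in question.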

I'll go back to my original plan to get rid of the addref/release, re-trace the code, and verify whether the use-after-free problem Nathan mentioned in comment 50 can occur.
Posted patch patch v2Splinter Review
The previous patch, which allows at most one native event at any time, causes the behavior difference I mentioned in comment 62, and I don't know how much impact that would have. So here is patch v2, which removes the addref/release and does not change how native events are enqueued/processed.

Three places hold a reference to the nsAppShell instance:

1. nsThread::mObserver
   The reference is cleared to null when ShutdownXPCOM() calls nsThreadManager::Shutdown():

   https://dxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#879
   https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThreadManager.cpp#176

   Also note that nsThread always accesses |mObserver| with a null check, so nsThread will not use nsAppShell after it's freed:

   https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#590
   https://dxr.mozilla.org/mozilla-central/source/xpcom/threads/nsThread.cpp#963

2. sAppShell@nsAppShellSingleton.h
   The reference is released in nsAppShellShutdown() when ~KnownModule() is called from nsComponentManagerImpl::Shutdown() in ShutdownXPCOM():

   https://dxr.mozilla.org/mozilla-central/source/widget/nsAppShellSingleton.h#57
   https://dxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#986

   do_GetService() does not return an invalid pointer after ShutdownXPCOM() sets |gXPCOMShuttingDown| to true:

   https://dxr.mozilla.org/mozilla-central/source/xpcom/components/nsComponentManager.cpp#1329
   https://dxr.mozilla.org/mozilla-central/source/xpcom/components/nsComponentManager.cpp#1520
   https://dxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#909
   https://dxr.mozilla.org/mozilla-central/source/xpcom/build/Services.cpp#72

3. nsAppStartup::mAppShell
   The reference is released when ~nsAppStartup() is called:

   a. XRE_main() sets |mScopedXPCOM| null and ~ScopedXPCOMStartup() returns
      https://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#4450
      https://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#1471

   b. ScopedXREEmbed::Stop() winds up in ShutdownXPCOM(), and nsCycleCollector_shutdown() triggers GC:
      https://dxr.mozilla.org/mozilla-central/source/ipc/glue/ScopedXREEmbed.cpp#115
      https://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsEmbedFunctions.cpp#209
      https://dxr.mozilla.org/mozilla-central/source/xpcom/build/XPCOMInit.cpp#952

I don't see any code accessing an invalid pointer to the appshell directly after the references listed above are released.
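The null-check pattern in item 1 can be sketched as a simplified model. The names here (`Observer`, `Thread`) are hypothetical and `std::shared_ptr` stands in for XPCOM refcounting; this is not the actual nsThread code, only the shape of the guarantee:

```cpp
#include <cassert>
#include <memory>

// Sketch of the pattern described in item 1: the thread keeps a strong
// reference to its observer and only calls through it after a null check,
// so once shutdown clears the reference the observer is never reached again.
struct Observer {
  int calls = 0;
  void onProcessNextEvent() { ++calls; }
};

struct Thread {
  std::shared_ptr<Observer> mObserver;

  void processNextEvent() {
    // Copy to a local strong reference, then null-check before use.
    std::shared_ptr<Observer> obs = mObserver;
    if (obs) {
      obs->onProcessNextEvent();
    }
  }

  void shutdown() { mObserver = nullptr; }  // models nsThreadManager::Shutdown()
};
```

After `shutdown()`, further `processNextEvent()` calls are harmless no-ops, which is the property the links above are meant to demonstrate.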
Attachment #8709313 - Attachment is obsolete: true
Attachment #8714230 - Flags: review?(nfroyd)
Duplicate of this bug: 1234697
(In reply to Ting-Yu Chou [:ting] from comment #64)
> I don't see any code accessing an invalid pointer to the appshell directly
> after the references listed above are released.

I think you are correct that there are no direct references.  However, what we're worried about in this case is not direct references, but anything that might access the Windows message queue, because we know there can be messages sitting in the queue that, when processed, will access the appshell.

I'm experimenting with a try run to see if any badness turns up:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b0d205a9517

Obviously, that's no guarantee that this patch works.  If the current patch is approved, I would feel a little better if we landed the patch with the additional MOZ_RELEASE_ASSERT.
(In reply to Nathan Froyd [:froydnj] from comment #66)
> I think you are correct that there are no direct references.  However, what
> we're worried about in this case is not direct references, but anything that
> might access the Windows message queue, because we know there can be
> messages sitting in the queue that, when processed, will access the appshell.

Could you point out to me how a queued Windows message could get processed after all three references in comment 64 are cleared? I don't see an nsThread::ProcessNextEvent() call after that. Thanks.
This is a bit hackish, but should actually make sense, since we do destroy
the browsing context tree explicitly anyhow, and frame message managers are part of that.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f89dec44d94c

I have no idea whether this helps with the issue people have been seeing there. Could someone test?

However, this shouldn't be _needed_. If there is a leak, we should fix that. Though, if the leak is just something very temporary, it might not be too bad.
Comment on attachment 8714230 [details] [diff] [review]
patch v2

Review of attachment 8714230 [details] [diff] [review]:
-----------------------------------------------------------------

Canceling this review, because I think that patches that have landed in the dependent bugs make this fix unnecessary.
Attachment #8714230 - Flags: review?(nfroyd)
The dependent bugs (bug 1249795 and bug 1249439) fix the leak bug 1195295 revealed, which caused this intermittent failure to happen more frequently, but not the appshell leak here. There should still be chances to see this failure again; we can wait a few more days.
Comment on attachment 8714230 [details] [diff] [review]
patch v2

Request for review again because of comment 81, 82, 83.
Attachment #8714230 - Flags: review?(nfroyd)
Comment on attachment 8714230 [details] [diff] [review]
patch v2

Review of attachment 8714230 [details] [diff] [review]:
-----------------------------------------------------------------

Given that the permaleaks seen had a different solution, and a better solution, I am willing to state that these intermittents are in a similar situation.  This is a band-aid patch, not a solution to the actual issue.
Attachment #8714230 - Flags: review?(nfroyd)
This is currently the #1 leak on OrangeFactor.
(In reply to Nathan Froyd [:froydnj] from comment #85)
> Given that the permaleaks seen had a different solution, and a better
> solution, I am willing to state that these intermittents are in a similar
> situation.  This is a band-aid patch, not a solution to the actual issue.

I'm sorry to hear that. The patch is not for fixing the permaleaks that bug 1195295 seemed to cause, but for the unbalanced addref/release calls on the appshell. Also, I do not think the current intermittents are in a similar situation to bug 1195295, because there will always be things for CCAnalyzer to process, creating the chance of leaking the appshell.

Let me recap, and if anyone can make a correct fix, please feel free to take this.

Why the leak
------------
Because the numbers of addrefs and releases at nsAppShell::ScheduleNativeEventCallback() and nsAppShell::EventWindowProc() are not equal.

Here are some notes:

- Two paths lead to nsAppShell::ScheduleNativeEventCallback(); one checks whether there's already an enqueued event, one does not. (comment 41)
- CCAnalyzer makes the path that does not check for existing native messages push more native messages than can all be digested. (comment 53)
- nsAppShell::EventWindowProc() gets called when nsThread::ProcessNextEvent() goes to nsAppShell::ProcessNextNativeEvent() and DispatchMessageW().
- The addref/release were added to make sure the appshell stays alive until the last message is pumped from nsAppShell::Run(), which at that time was a while loop calling GetMessage(); note that is not what we are doing today (comment 33). This is the main reason why I think the addref/release are unnecessary.
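The leak mechanism in the notes above can be sketched with a toy refcount model. The struct and member names are hypothetical, and plain integer bookkeeping stands in for NS_ADDREF/NS_RELEASE and PostMessage/DispatchMessageW; this is not the real nsAppShell:

```cpp
#include <cassert>
#include <deque>

// Sketch of the leak: ScheduleNativeEventCallback addrefs the appshell and
// posts a native message; EventWindowProc releases when that message is
// dispatched. Any message still sitting in the queue at shutdown is a
// release that never runs, so the refcount never reaches zero.
struct AppShellRefModel {
  int refCnt = 1;               // the owner's reference
  std::deque<int> nativeQueue;  // posted-but-undispatched messages

  void scheduleNativeEventCallback() {
    ++refCnt;                   // models NS_ADDREF_THIS()
    nativeQueue.push_back(0);   // models PostMessage(...)
  }

  void dispatchOneMessage() {   // models EventWindowProc for one message
    if (!nativeQueue.empty()) {
      nativeQueue.pop_front();
      --refCnt;                 // models NS_RELEASE()
    }
  }
};
```

Scheduling three callbacks but only dispatching two of them before the owner drops its reference leaves the refcount at one, which is exactly the "84 bytes leaked" appshell in this bug's summary.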

Why the patch
-------------
I had three options (comment 41), and the patch implements (2):

1) Make sure all the queued native messages are handled to have balanced release calls
2) Get rid of the addref/release
3) Call ScheduleNativeEventCallback() only if there's no pending native message

(1) seems odd to me, since it adds more code when the addref/release are not needed, and (3) has a side effect which I can't prove is 100% safe (comment 62). Note I've checked for the use-after-free :froydnj is concerned about when removing the addref/release (comment 50), and I don't see it (comment 64, comment 67).

What's about bug 1195295
------------------------
The patch in bug 1195295 reveals a leak of the xul browser and its nsInProcessTabChildGlobal, which moves a lot of things into the CC graph and makes CCAnalyzer take much longer to process the entries. That pushes more messages onto the native queue than can be processed before exiting, so it looks like a permaleak. Fixing the leak removes the extra entries CCAnalyzer processes, and the frequency goes back to intermittent. As such, I think bug 1195295 has nothing to do with the appshell leak here.
Assignee: janus926 → nobody
Ting-Yu, this seems to be permanent, at least on Aurora - is there any kind of ETA for when this could be fixed?
Flags: needinfo?(janus926)
It seems Aurora is a bit different: it fails bc1 while the other branches usually fail bc7, and it comes with this message:

  OSError: IO Completion Port failed to signal process shutdown

As for the ETA, I am sorry but I don't have one. I've proposed my patch, but please see comments 85 and 90 for the current status; I am not sure how to proceed.
Flags: needinfo?(janus926)
Still extremely frequent. What can we do to get this moving forward again?
Flags: needinfo?(bugs)
This is insanely bad. See the link below. Can we please find someone to investigate this?
https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=win%20debug%20bc2&fromchange=834fed3f42db37b6427d1140a6e32a3cb6af1ed9
Flags: needinfo?(overholt)
I started an email thread with involved parties. Let's see what we can come up with.
Flags: needinfo?(overholt)
Posted patch 4debugSplinter Review
To keep GC/CC graphs from Try runs, so we can check whether there are leaks causing the huge CC graph.

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=768f6dd8cc1d
(In reply to Andrew Overholt [:overholt] from comment #138)
> I started an email thread with involved parties. Let's see what we can come
> up with.

Any updates on that email thread? :)
Flags: needinfo?(overholt)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #171)
> (In reply to Andrew Overholt [:overholt] from comment #138)
> > I started an email thread with involved parties. Let's see what we can come
> > up with.
> 
> Any updates on that email thread? :)

Nothing great came out of it. It seems to be a difficult problem.
Flags: needinfo?(overholt)
Whiteboard: tpi:-
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
The leaks are happening in browser/base/content/test/general. Wonder if we could narrow down the problematic test and disable our way to victory here, since fixing the actual bug is apparently something that can't be done.
Depends on: 1305836
Not counting my retries while we were greening up the tree, 50 on beta on WinXP is 3% green. If you round up.

According to my simulation of merging 51 to beta, on WinXP and Win7 it will be 0% green.
Fixed by bug 1305836.
Assignee: nobody → continuation
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(bugs)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.