Closed Bug 1147390 Opened 5 years ago Closed 5 years ago

Intermittent test_mousecapture_area.html | application crashed [@ xul.dll!][RunnableMethod<SoftwareDisplay,void ( SoftwareDisplay::*)(mozilla::TimeStamp),Tuple1<mozilla::TimeStamp> >::Run()]

Categories

(Core :: XPCOM, defect)

x86
Windows XP
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox38 --- unaffected
firefox39 --- wontfix
firefox40 --- fixed
firefox-esr31 --- unaffected
b2g-v2.2 --- wontfix
b2g-master --- fixed

People

(Reporter: RyanVM, Assigned: mchang)

References

Details

(Keywords: assertion, crash, intermittent-failure, Whiteboard: gfx-noted)

Crash Data

Attachments

(3 files, 1 obsolete file)

Attached file test log
See the attached log and comment 1 for more info.
Flags: needinfo?(aklotz)
Something isn't right here. The main thread I/O logger does not make sense in the context of the surrounding calls, either above or below that frame. The scary part is that this is from a debug build no less!
Flags: needinfo?(aklotz)
The vsync stuff is interesting.
This is confusing. The vsync compositor on windows landed in bug 1140723, which landed on March 18 but this showed up on March 25th. Also, from the stack trace in comment 5, we create a RunDuringShutdown message on the compositor thread, which I don't think we run at all during actual composition. The only place I see RunDuringShutdown messages are in MediaStreamGraph.cpp, but all the uses look like pretty old code.  

Coincidentally, Bug 1144317, the vsync refresh driver on windows, landed on March 25th and backed out on March 26th, which correlates with these failures. The fx-team failure in comment 25 is odd since it occurred after the backout, but there might have not been a merge by then. Let's see if this still happens.
Hmm, so it looks like the vsync refresh driver isn't the problem from comment 8.
Duplicate of this bug: 1144825
The crash from comment 11 is very confusing. A task from MediaStreamGraph.cpp is running on the compositor thread during shutdown.

Thread 37 (crashed)
 0  xul.dll!`mozilla::MediaStream::Destroy'::`2'::Message::RunDuringShutdown [MediaStreamGraph.cpp:ee7cc272d5db : 1586 + 0x0]
    eip = 0x0196f2a0   esp = 0x0dbffdd4   ebp = 0x0dbffe04   ebx = 0x0dbfff0c
    esi = 0x145bc398   edi = 0x0dbffdf0   eax = 0x145bc370   ecx = 0x00000000
    edx = 0x0dbffe24   efl = 0x00010216
    Found by: given as instruction pointer in context
 1  xul.dll!RunnableMethod<mozilla::layers::CompositorVsyncObserver,void ( mozilla::layers::CompositorVsyncObserver::*)(mozilla::TimeStamp),Tuple1<mozilla::TimeStamp> >::Run() [task.h:ee7cc272d5db : 310 + 0x13]
    eip = 0x01369db5   esp = 0x0dbffdd8   ebp = 0x0dbffe04
    Found by: call frame info
This is really really confusing. The crash from comment 15 shows it in nsDisplayList on the compositor thread. That shouldn't be happening at all, none of the display list code should execute outside the main thread. The CompositorVsyncDispatcher has been running on OS X, Windows, Linux, and b2g emulators for a month or so now and this only shows up on Windows XP.

Windows XP uses the software vsync timer, which is used on Linux and the b2g emulators. All these stack traces show some random crash on the compositor thread where the surrounding context also doesn't make sense. I wonder if the crash stack on XP has some bug instead since none of these make sense...
David, do you have any suggestions as to why these stacks are so messed up?
Flags: needinfo?(dmajor)
It's COMDAT folding [0]. nsDisplayItem::GetPerFrameKey is just two lines of disassembly, and libxul has lots of other functions with the same two lines, so the linker merged them all together.

I would just ignore the misleading line, or if you really want, you can try to figure out what it originally was by looking at the caller's frame (task.h:310).

[0] http://blogs.msdn.com/b/oldnewthing/archive/2005/03/22/400373.aspx
Flags: needinfo?(dmajor)
(Same goes for comment 13 -- that MediaStreamGraph function is identical to GetPerFrameKey at the object code level. From build to build the linker often selects a different name as the "canonical" one)
(In reply to David Major [:dmajor] from comment #19)
> (Same goes for comment 13 -- that MediaStreamGraph function is identical to
> GetPerFrameKey at the object code level. From build to build the linker
> often selects a different name as the "canonical" one)

Can we see a list of names for the same address that the linker folded?
Flags: needinfo?(dmajor)
I can't find a full-symbols file for those failing pushes, but here is what my debugger says for my own-built nightly. It should be pretty similar.

0:000> ln xul!nsDisplayItem::GetPerFrameKey
(1024656f)   xul!IPC::Channel::`vcall'{4}'   |  (10246574)   xul![thunk]:mozilla::ipc::IOThreadChild::`vector deleting destructor'
Exact matches:
    xul!pp::Diagnostics::report (pp::Diagnostics::ID, struct pp::SourceLocation *, class std::basic_string<char,std::char_traits<char>,std::allocator<char> > *)
    xul!mozilla::`anonymous namespace'::CreateMessage::RunDuringShutdown (void)
    xul!`mozilla::MediaInputPort::Destroy'::`2'::Message::RunDuringShutdown (void)
    xul!`mozilla::ProcessedMediaStream::AllocateInputPort'::`2'::Message::RunDuringShutdown (void)
    xul!`mozilla::MediaStream::Destroy'::`2'::Message::RunDuringShutdown (void)
    xul!nsDisplayItem::GetPerFrameKey (void)
    xul!webrtc::AudioDecoder::DecodeRedundant (unsigned char *, unsigned int, short *, webrtc::AudioDecoder::SpeechType *)
Flags: needinfo?(dmajor)
(In reply to David Major [:dmajor] from comment #21)
> I can't find a full-symbols file for those failing pushes, but here is what
> my debugger says for my own-built nightly. It should be pretty similar.
> 
> 0:000> ln xul!nsDisplayItem::GetPerFrameKey
> (1024656f)   xul!IPC::Channel::`vcall'{4}'   |  (10246574)  
> xul![thunk]:mozilla::ipc::IOThreadChild::`vector deleting destructor'
> Exact matches:
>     xul!pp::Diagnostics::report (pp::Diagnostics::ID, struct
> pp::SourceLocation *, class
> std::basic_string<char,std::char_traits<char>,std::allocator<char> > *)
>     xul!mozilla::`anonymous namespace'::CreateMessage::RunDuringShutdown
> (void)
>     xul!`mozilla::MediaInputPort::Destroy'::`2'::Message::RunDuringShutdown
> (void)
>    
> xul!`mozilla::ProcessedMediaStream::AllocateInputPort'::`2'::Message::
> RunDuringShutdown (void)
>     xul!`mozilla::MediaStream::Destroy'::`2'::Message::RunDuringShutdown
> (void)
>     xul!nsDisplayItem::GetPerFrameKey (void)
>     xul!webrtc::AudioDecoder::DecodeRedundant (unsigned char *, unsigned
> int, short *, webrtc::AudioDecoder::SpeechType *)

None of these also make sense as to why they would be on the vsync thread. Do we coalesce the addresses for the second frame as well?
Flags: needinfo?(dmajor)
Crash Signature: @nsDisplayItem::GetPerFrameKey()
Crash Signature: @nsDisplayItem::GetPerFrameKey() → nsDisplayItem::GetPerFrameKey()
Crash Signature: nsDisplayItem::GetPerFrameKey() → [@ nsDisplayItem::GetPerFrameKey() ]
It turns out that "GetPerFrameKey" is just being used as a vcall stub (a little forwarder that lets you call a virtual function through a non-virtual function pointer) for the RunnableMethod's |meth_| member. I bet the toolchain considered it too trivial to even give it a name in my list.

The vcall stub uses offset 0x4, and the template type on |meth_| is:
void (__thiscall SoftwareDisplay::*)(mozilla::TimeStamp),Tuple1<mozilla::TimeStamp>
so we're looking for the second virtual function on SoftwareDisplay, and it takes one TimeStamp.

That points to: virtual void NotifyVsync(TimeStamp aVsyncTimestamp);
Flags: needinfo?(dmajor)
...and based on the register state, the |this| pointer for that call to NotifyVsync is null -- meaning that |obj_| is null at RunnableMethod::Run. But the code just recently did a null-check on |obj_|!

https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/task.h#309

Perhaps somebody called Cancel on the RunnableMethod in between the null-check and the call? That could explain why it's intermittent.
Assignee: nobody → mchang
Status: NEW → ASSIGNED
Whiteboard: gfx-noted
(In reply to David Major [:dmajor] from comment #26)
> ...and based on the register state, the |this| pointer for that call to
> NotifyVsync is null -- meaning that |obj_| is null at RunnableMethod::Run.
> But the code just recently did a null-check on |obj_|!
> 
> https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/task.
> h#309
> 
> Perhaps somebody called Cancel on the RunnableMethod in between the
> null-check and the call? That could explain why it's intermittent.

Hmm so essentially that means if we have to call Cancel() on a posted task, it has to be on the same thread that posted the task. We protect against this case in the SoftwareDisplay, but if the actual task posting / execution code isn't thread safe, we can still hit this. Hmmph.
I don't see where the protection comes in (but I'm an outsider to this code). DisableVSync is called on the main thread, and it calls Cancel on a task running on the vsync thread. Sounds like it could cause these symptoms.
The message loop code isn't thread safe and has an owning pointer to the runnable method. When we disable vsync on the main thread, we cancel the running task, which nulls out the object[1]. At the same time, the task runs and passes in the NULL obj on the vsync thread, which causes the crash. Calling cancel() on a different thread than the task is running is bad, even though we protected it with mutexes.

This patch does not null out the task when we disable vsync. Instead it just sets a flag and let's the vsync task run. We bail out early in the NotifyVsync on the vsync thread if vsync is disabled. The shutdown case is still OK because before the SoftwareDisplay can destruct, it waits for the thread to stop.

We also don't leak between disabling vsync / enabling vsync. For example, if we disabled vsync, then re-enable vsync which posts a new vsync task before the original task finishes running. We'll just potentially run vsync twice. 

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/task.h#321
Attachment #8589782 - Flags: review?(bugmail.mozilla)
(In reply to David Major [:dmajor] from comment #28)
> I don't see where the protection comes in (but I'm an outsider to this
> code). DisableVSync is called on the main thread, and it calls Cancel on a
> task running on the vsync thread. Sounds like it could cause these symptoms.

Sorry I should've clarified. We protect the calls to the Cancel() and NotifyVsync with mutexes, but it didn't matter since the actual message loop code is the one nulling it out here.
Comment on attachment 8589782 [details] [diff] [review]
Do not cancel vsync task on main thread

Review of attachment 8589782 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/SoftwareVsyncSource.cpp
@@ +65,5 @@
>    { // scope lock
>      mozilla::MonitorAutoLock lock(mCurrentTaskMonitor);
>      mVsyncEnabled = false;
> +    // do not cancel the current task as
> +    // the message loop code is not not thread safe

too many "not"s

@@ +125,5 @@
>      delay = mozilla::TimeDuration::FromMilliseconds(0);
>    }
>  
> +  { // scope lock
> +    mozilla::MonitorAutoLock lock(mCurrentTaskMonitor);

So if I'm understanding this correctly, there's a potential problem here. Let's say the vsync thread is at the start of the call to ScheduleNextVsync (anywhere between the lock scope at the top of NotifyVsync and the lock scope here). While the vsync thread is here, the main thread quickly calls DisableVsync() followed by EnableVsync(). The DisableVsync() call will flip mVsyncEnabled to false, and then the EnableVsync() call will flip it back to true *and* schedule a new NotifyVsync task. And execution here will continue as if nothing ever happened, scheduling a second NotifyVsync task. So now you'll have two "streams" of vsync events firing instead of one. I know you love vsync, but this is too much vsync! :p

Not sure what the right fix here is - I thought of a few things but they all turned out to be flawed. I'll think about it some more. I feel like if we can come up with the right invariants and maintain them throughout this code it will make it simpler and more correct, but I'm not thinking straight right now.
Attachment #8589782 - Flags: review?(bugmail.mozilla) → review-
Oh yeah, the two streams of vsync would be bad haha. What about stopping and starting the thread when we disable/enable vsync? We originally didn't do this because stopping the base::thread actually destroys the underlying platform thread and starting the base::thread actually creates a new platform thread. But this would solve the two stream problem relatively cleanly. The only downside is that stopping / starting vsync become more expensive. So in psuedo code:

EnableVsync() {
   mVsyncThread->Start(); // creates a new underlying platform thread
   mVsyncThread->PostTask(NotifyVsync);
}

DisableVsync() {
   mVsyncEnabled = false;

   // blocks until the thread stops, which would be when it
   // NotifyVsync() and bails out early. Means could wait up to 16 ms.
   mVsyncThread->Stop();
}

I don't mind the latency that much and the behavior of creating a new thread every time would be equivalent to OS X.
I thought of a better solution: leave the vsync thread running, but when EnableVsync and DisableVsync are called, have those post enable and disable tasks to the vsync thread. Then all the important code will run serially on the vsync thread and you won't need any mutexes at all. On enabling, set the flag and notify vsync/schedule next vsync right away. On disabling, clear the flag and cancel any posted vsync task. That's it!
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #33)
> I thought of a better solution: leave the vsync thread running, but when
> EnableVsync and DisableVsync are called, have those post enable and disable
> tasks to the vsync thread. Then all the important code will run serially on
> the vsync thread and you won't need any mutexes at all. On enabling, set the
> flag and notify vsync/schedule next vsync right away. On disabling, clear
> the flag and cancel any posted vsync task. That's it!

Good idea! 

Enable / Disable vsync post a task to the vsync thread to actually enable/disable vsync. However, we still have to update the internal state that vsync is enabled on the main thread since this is the underlying assumption in all the code. This works as long as we're consistent. We schedule / cancel vsync tasks on the vsync thread and post the enable/disable vsync tasks only on the main thread. Ordering should be the same as well since we post tasks from the main thread in the same order.

The big difference is that the shutdown order has to be changed a bit. We can no longer guarantee that the SoftwareDisplay will be cleaned up on the main thread, so we call shutdown from the SoftwareVsyncSource, which disables vsync and shuts down the thread. We also get to delete locks!
Attachment #8589782 - Attachment is obsolete: true
Attachment #8590435 - Flags: review?(bugmail.mozilla)
Comment on attachment 8590435 [details] [diff] [review]
Enable / disable vsync on the vsync thread only

Review of attachment 8590435 [details] [diff] [review]:
-----------------------------------------------------------------

Will need to ponder this a bit more.

::: gfx/thebes/SoftwareVsyncSource.cpp
@@ +137,1 @@
>    mVsyncThread->Stop();

This DisableVsync call may not work as intended because the task it posts to the mVsyncThread may not run - the thread is killed right after the task is posted.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #35)
> Comment on attachment 8590435 [details] [diff] [review]
> Enable / disable vsync on the vsync thread only
> 
> Review of attachment 8590435 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Will need to ponder this a bit more.
> 
> ::: gfx/thebes/SoftwareVsyncSource.cpp
> @@ +137,1 @@
> >    mVsyncThread->Stop();
> 
> This DisableVsync call may not work as intended because the task it posts to
> the mVsyncThread may not run - the thread is killed right after the task is
> posted.

The base::Thread::Stop() posts a task on the message loop, which should be the last task [1]. It then waits for the task to finish then joins. Stop() guarantees that nothing on the loop should execute. 

[1] http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/thread.cc#113
Comment on attachment 8590435 [details] [diff] [review]
Enable / disable vsync on the vsync thread only

Review of attachment 8590435 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, looks good. Please also add an initializer for mCurrentVsyncTask though.
Attachment #8590435 - Flags: review?(bugmail.mozilla) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #38)
> Comment on attachment 8590435 [details] [diff] [review]
> Enable / disable vsync on the vsync thread only
> 
> Review of attachment 8590435 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Ok, looks good. Please also add an initializer for mCurrentVsyncTask though.

Thanks! Added.

https://hg.mozilla.org/integration/mozilla-inbound/rev/39f5e5456b5f
On a nice note, since we disabled the vsync compositor on Windows on Gecko 39, which was the only place enabled on 39 without hardware vsync, we're seeing the crashes due to this go away on 39.
https://hg.mozilla.org/mozilla-central/rev/39f5e5456b5f
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Your call if you want to uplift to Aurora for completeness' sake or not, but I presume you want this on b2g37 at least?
Flags: needinfo?(mchang)
Rebased for v 2.2. Still not sure we want to uplift this. Nothing on b2g 2.2 should actually be using this as the preferences are still off for the scenarios where this would be used. And there was enough difference between master and 2.2 that some other bug may exist and it's getting late. I'll let this sit on master for a few more days before deciding.
Attachment #8591786 - Flags: review+
FWIW, we do run Windows builds/tests on b2g37.
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #45)
> FWIW, we do run Windows builds/tests on b2g37.

Good to know! But the vsync stuff is disabled on Windows until Gecko 40.
Marking as won't fix. The software vsync code isn't actually used anywhere other than Gecko 40. On b2g 2.2, we're using actual hardware vsync so this problem won't occur. On Gecko 39 on Mac, we use hardware vsync as well so this is unlikely. If we start seeing crashes, we can uplift, but we shouldn't be unless the hardware composer is failing on b2g, in which case we should fix the hardware composer.
Untracking since this is wontfixed for 39.
You need to log in before you can comment on or make changes to this bug.