Closed Bug 1130061 Opened 5 years ago Closed 5 years ago

Firefox crash in webrtc::videocapturemodule::DeviceInfoDS::CreateCapabilityMap(char const*)

Categories

(Core :: WebRTC: Audio/Video, defect, P1, critical)

36 Branch
x86
Windows NT
defect

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox38 + verified
firefox38.0.5 --- fixed
firefox39 --- fixed
firefox40 --- fixed

People

(Reporter: marcia, Assigned: dmajor)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-b296e542-44fa-41bb-bcc4-f81b02150203.
=============================================================

Seen while reviewing Beta 6 crash stats - almost all URLs are facebook.com. All URLs are here: https://crash-stats.mozilla.com/report/list?signature=webrtc::videocapturemodule::DeviceInfoDS::CreateCapabilityMap%28char%20const*%29


Frame 	Module 	Signature 	Source
0 	xul.dll 	webrtc::videocapturemodule::DeviceInfoDS::CreateCapabilityMap(char const*) 	media/webrtc/trunk/webrtc/modules/video_capture/windows/device_info_ds.cc
1 	xul.dll 	webrtc::videocapturemodule::DeviceInfoImpl::NumberOfCapabilities(char const*) 	media/webrtc/trunk/webrtc/modules/video_capture/device_info_impl.cc
2 	xul.dll 	webrtc::ViEInputManager::NumberOfCaptureCapabilities(char const*) 	media/webrtc/trunk/webrtc/video_engine/vie_input_manager.cc
3 	xul.dll 	webrtc::ViECaptureImpl::NumberOfCapabilities(char const*, unsigned int) 	media/webrtc/trunk/webrtc/video_engine/vie_capture_impl.cc
4 	xul.dll 	mozilla::MediaEngineWebRTCVideoSource::SatisfiesConstraintSets(nsTArray<mozilla::dom::MediaTrackConstraintSet const*> const&) 	dom/media/webrtc/MediaEngineWebRTCVideo.cpp
5 	xul.dll 	mozilla::VideoDevice::SatisfiesConstraintSets(nsTArray<mozilla::dom::MediaTrackConstraintSet const*> const&) 	dom/media/MediaManager.cpp
6 	xul.dll 	mozilla::GetSources<mozilla::VideoDevice, mozilla::VideoTrackConstraintsN> 	dom/media/MediaManager.cpp
7 	xul.dll 	mozilla::GetUserMediaTask::SelectDevice(mozilla::MediaEngine*) 	dom/media/MediaManager.cpp
8 	xul.dll 	NS_SetThreadName<9>(nsIThread*, char const (&)[9]) 	xpcom/glue/nsThreadUtils.h
9 	xul.dll 	mozilla::MediaEngineWebRTC::MediaEngineWebRTC(mozilla::MediaEnginePrefs&) 	dom/media/webrtc/MediaEngineWebRTC.cpp
10 	xul.dll 	mozilla::GetUserMediaTask::Run() 	dom/media/MediaManager.cpp
11 	xul.dll 	MessageLoop::DoWork() 	ipc/chromium/src/base/message_loop.cc
12 		@0x5
[Tracking Requested - why for this release]: Somewhat high crash (#31, 0.42%) on 38 beta. Might not be worth tracking, your call.
For some reason |streamConfig| is 0xF0 in these crashes. Any idea?
Flags: needinfo?(rjesup)
Most of those are in https://www.facebook.com/videocall/incall/?...

Almost 50% of those crashes are in the first 60 seconds of Firefox uptime, making it rank high in the importance-rated top crash score stats I'm experimenting with (at least for 38.0b5).
Tracking as it impacts some major website.
Maire, do you know who could help on this? Thanks
Flags: needinfo?(mreavy)
Flags: needinfo?(blassey.bugs)
I believe Bas is the person who wrote this code and is the best person to assess what might be going on and what our options are.  This code hasn't changed in quite some time; so it may be due to a graphics driver update.  I'll needinfo Milan too in case he has more info on that.
Flags: needinfo?(rjesup)
Flags: needinfo?(mreavy)
Flags: needinfo?(milan)
Flags: needinfo?(blassey.bugs)
Flags: needinfo?(bas)
So, looking at one of the crashes:
 RELEASE_AND_CLEAR(streamConfig);

However, RELEASE_AND_CLEAR(p) -> if (p) { (p)->Release(); (p) = NULL; }
So it shouldn't be a null streamConfig.  and streamConfig is initialized to NULL before we fetch it.
Plus nothing modifies streamConfig after it's fetched.

I presume therefore that Release() was inlined somehow and is actually crashing there via a hidden null pointer
This is a COM interface so Release is a virtual call:

620df032 8b4c2410        mov     ecx,dword ptr [esp+10h] ; get p
620df036 85c9            test    ecx,ecx                 ; if (p == nullptr)
620df038 740b            je      ...                     ;   skip the call
620df03a 8b01            mov     eax,dword ptr [ecx]     ; load vtable (*CRASH HERE*)
620df03c 51              push    ecx                     ; push 'this'
620df03d ff5008          call    dword ptr [eax+8]       ; call Release

The crash is while deref'ing streamConfig to get its vtable. And the address is always 0xf0:

Address facet Rank 	Address 	Count 	%
1 	0xf0 	2707 	100.00 %
I agree though that streamConfig must have been valid at some earlier point in the function.

Maybe the stack contents (or the stack pointer) got busted sometime after that?
(In reply to David Major [:dmajor] from comment #9)
> I agree though that streamConfig must have been valid at some earlier point
> in the function.
> 
> Maybe the stack contents (or the stack pointer) got busted sometime after
> that?

Seems unlikely (and a very predictable failure if so for stack trashing).  Perhaps it was somehow released behind our back?  Or some driver bug, though this doesn't seem linked to a specific OS version or driver/manufacturer.  Perhaps we're not doing something to lock it down.  just grabbing at straws here....
The trouble is it's a pointer on the stack. Releasing behind our back wouldn't change that pointer. That's why I too am grabbing at straws :)
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #6)
> I believe Bas is the person who wrote this code and is the best person to
> assess what might be going on and what our options are.  This code hasn't
> changed in quite some time; so it may be due to a graphics driver update. 
> I'll needinfo Milan too in case he has more info on that.

I do not recall writing this code? Did I write this code?

I don't think I did....
Flags: needinfo?(bas)
bas: bug 731407: Remove DShow BaseClass usage from webrtc drop r=cpearce
3 years ago....
Looking at the registers in several dumps, the value of |count| is always 99 (0x63), and |tmp| is some huge number. That could be more evidence for stack bustage.

(I also noticed that the StringFromGUID2 block is missing arguments on the logging, and seems to leak |pmt|, but those shouldn't affect this issue)
So this magic 0xf0 is always right next to a 0x140, in other words 320x240.
https://pastebin.mozilla.org/8831233
99/0x63 is apparently kVideoUnknown
At this point I'd recommend stepping through this code on a live build with an unknown media subtype, and seeing how the stack layout compares to that pastebin.
The stack where we expect streamConfig, actually looks like the params for a GetFrameRateList:

1df4f4cc  0aa60908 ; videoControlConfig this ptr
1df4f4d0  0aa60a04 ; outputCapturePin, in similar memory region
1df4f4d4  00000000 ; tmp
1df4f4d8  00000140 ; size.cx
1df4f4dc  000000f0 ; size.cy (supposed to be streamConfig)
1df4f4e0  1df4f548 ; &listSize, on stack
1df4f4e4  1df4f504 ; &frameDurationList, on stack

Btw - frameDurationList leaks!
Note that sequence will be right below the current SP when we go into (or out of) that call.  Stepping through a Nightly, I do see a sequence like that in the stack at the CALL for GetFrameRateList(); in my case, stepping over the CALL trashed the outputCapturePin stack arg (and popped them all off).

I can't see any sequence where we get from here to RELEASE_AND_CLEAR(streamConfig) with this stack still intact, or even close.  Feels like someone trashed the stack pointer... perhaps while in GetFrameRateList(), so when we pop out it's wrong (but that might hose the return address too) or it got trashed at some later point in the function.  Neither makes a ton of sense.

It's interesting that tmp is 0.

I did try hosing the subtype so it went through the GUID trace; no trashing I could see.

Is there any way we could contact a few of the reporters and find out what HW/drivers they're running (especially cameras)?
Flags: needinfo?(kairo)
> Btw - frameDurationList leaks!

It's not a pointer to memory you're supposed to free, so it doesn't.

We do leak pmt if the subtype is unknown (when we 'continue'), and in that case we also use stack garbage for the width/height in the TRACE debug, though that shouldn't matter to this bug.
Rank: 5
Priority: -- → P1
> > Btw - frameDurationList leaks! 
> It's not a pointer to memory you're supposed to free, so it doesn't.

Hm, the doc for IAMVideoControl::GetFrameRateList says:
The caller is responsible for freeing the memory through a call to CoTaskMemFree.
(In reply to Randell Jesup [:jesup] from comment #19)
> Is there any way we could contact a few of the reporters and find out what
> HW/drivers they're running (especially cameras)?

The graphics chips we have in reports, but stuff like cameras not. That said, I'm not sure how much end users even know about what they actually are running, and we rarely reach out to them at all, and with not too much success when we do, in my experience - but then, all outreach to users on crashes has been done via the User Advocacy team so far.

All that said, could you add info on interesting hardware/drivers to crash reports with annotations?
Flags: needinfo?(kairo)
(In reply to David Major [:dmajor] from comment #21)
> > > Btw - frameDurationList leaks! 
> > It's not a pointer to memory you're supposed to free, so it doesn't.
> 
> Hm, the doc for IAMVideoControl::GetFrameRateList says:
> The caller is responsible for freeing the memory through a call to
> CoTaskMemFree.

Huh.   Missed the Remark.  THanks!  Not the cause of this problem, though
> All that said, could you add info on interesting hardware/drivers to crash
> reports with annotations?

Any pointers to how that's done?
Flags: needinfo?(kairo)
If you want to add a new field, that would be CrashReporter::AnnotateCrashReport.

There's also a free-form notes area that you can access through CrashReporter::AppendAppNotesToCrashReport (mostly used by the gfx team).

To see how they end up looking in crash-stats, you can go to bp-b296e542-44fa-41bb-bcc4-f81b02150203 and click the Metadata tab.
Flags: needinfo?(kairo)
(In reply to David Major [:dmajor] from comment #25)
> There's also a free-form notes area that you can access through
> CrashReporter::AppendAppNotesToCrashReport (mostly used by the gfx team).

New fields are preferred to that, though (ideally, I'd like us to fade out the free-form notes field, it's a leftover from the times when adding fields and searching for those wasn't as easy as now).
Argh, I should have checked for correlations sooner:

    100% (175/175) vs.   0% (175/64366) vwcsource.ax (1.5.0.0)

    Image path: C:\Program Files (x86)\WebcamMax\vwcsource.ax
    Timestamp:        Tue Dec 29 22:01:53 2009 (4B3AC2A1)

Looks like they're all with version 1.5 from 2009. Should we just block it?
(In reply to David Major [:dmajor] from comment #27)
> Looks like they're all with version 1.5 from 2009. Should we just block it?
Probably!
Attached patch BlocklistSplinter Review
Flags: needinfo?(milan)
Attachment #8597297 - Flags: review?(benjamin)
Attachment #8597297 - Flags: review?(benjamin) → review+
Assignee: nobody → dmajor
Keywords: checkin-needed
David, could you fill the uplift request now so that it is ready? (I will start the build of beta 8 during the week end).
FWIW, the next beta for 38 is going to build on Sunday to be released on Monday, if we can get this to m-r before that, would be awesome to see if it works to prevent the crash for the beta audience.
Approval Request Comment
[Feature/regressing bug #]: external app
[User impact if declined]: crashes in webrtc
[Describe test coverage new/current, TreeHerder]: pending
[Risks and why]: the usual blocklisting risks (app is broken, may not fail gracefully, etc)
[String/UUID change made/needed]: none
Comment on attachment 8597297 [details] [diff] [review]
Blocklist

Bad paste; please see above for approval request
Attachment #8597297 - Flags: approval-mozilla-beta?
Attachment #8597297 - Flags: approval-mozilla-aurora?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #32)
> FWIW, the next beta for 38 is going to build on Sunday to be released on
> Monday, if we can get this to m-r before that, would be awesome to see if it
> works to prevent the crash for the beta audience.

If a merge doesn't happen, I'll hand-merge the patch over to m-c to ensure it's in tomorrow's nightly.
https://hg.mozilla.org/mozilla-central/rev/c09eaddb6740
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
NI to kairo to check crash stats since it should be in m-c today; next Beta build will be Sunday
Flags: needinfo?(kairo)
Comment on attachment 8597297 [details] [diff] [review]
Blocklist

[Triage Comment]
we merged m-b into m-r
Attachment #8597297 - Flags: approval-mozilla-release+
Attachment #8597297 - Flags: approval-mozilla-beta?
Attachment #8597297 - Flags: approval-mozilla-aurora?
Attachment #8597297 - Flags: approval-mozilla-aurora+
(In reply to Randell Jesup [:jesup] from comment #37)
> NI to kairo to check crash stats since it should be in m-c today; next Beta
> build will be Sunday

It's going to take a while to verify this; it's super low volume on nightly (no surprise). Even on aurora it will need 2-3 builds before I'd feel confident in the numbers.
(In reply to David Major [:dmajor] from comment #40)
> It's going to take a while to verify this; it's super low volume on nightly
> (no surprise). Even on aurora it will need 2-3 builds before I'd feel
> confident in the numbers.

That said, if we want this fix in 38 release, we should uplift it until Wednesday this week (yes, only two days left) as the only build done after that is the hopefully-final release candidate.
Flags: needinfo?(kairo)
And it has the approval for that anyhow. I'll surely watch Beta 38 data as always, let's leave it as a reminder to check for this one specifically.
Flags: needinfo?(kairo)
Yeah, given the nature of the crash and the fix, I don't think we need to block on a verify to put this on 38 (and it's already a+ for that). Just setting expectations for verification later.
I can confirm that this signature is not recorded for 38.0b9 any more.
Flags: needinfo?(kairo)
You need to log in before you can comment on or make changes to this bug.