Closed Bug 1581898 Opened 3 months ago Closed 2 months ago

Firefox 69: Tab crash when establishing a WebRTC call (CrashChannel::OpenContentStream)

Categories

(Core :: WebRTC: Audio/Video, defect, P2)

69 Branch
defect

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox69 --- wontfix
firefox70 --- fixed
firefox71 --- fixed

People

(Reporter: g, Assigned: bwc, NeedInfo)

References

(Regression)

Details

(Keywords: regression)

Attachments

(6 files)

User Agent: Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0

Steps to reproduce:

We have a Web application that is using the WebRTC Framework. Up until version 68 everything works fine. In version 69 released a few days ago, the tab crashes "Gah. Your tab just crashed." when attempting to establish the audio channel.

The offending function is:
nsresult CrashChannel::OpenContentStream(bool, class nsIInputStream**, class nsIChannel**)

This behavior is deterministic. It is happening in virtually every machine we can find, all running windows (tested with Windows 8 and Windows 10).

Additional info:

We can provide SDP/SIP captures if needed.

The crash id is bp-cc88dcbc-95da-4c95-8fae-f594d0190917

Thank you in advance.

Actual results:

Tab crashed

Expected results:

The WebRTC audio channel gets established

hi, can you try to come up with an exact regression range for when this crash first started using the tool from https://mozilla.github.io/mozregression/?

Hi, thank you for the fast response.
Here's the mozregression output:

2019-09-18T10:28:10: INFO : Narrowed inbound regression window from [196e3189, 1e420eb1] (3 builds) to [0489b95c, 1e420eb1] (2 builds) (~1 steps left)
2019-09-18T10:28:10: DEBUG : Starting merge handling...
2019-09-18T10:28:10: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=1e420eb1ad7a182dc12c8e9a7de290a00e1d09c7&full=1
2019-09-18T10:28:11: DEBUG : Found commit message:
Bug 1333879: Handle multiple codecs in answer. r=mjf

Differential Revision: https://phabricator.services.mozilla.com/D30687

2019-09-18T10:28:11: DEBUG : Did not find a branch, checking all integration branches
2019-09-18T10:28:11: INFO : The bisection is done.
2019-09-18T10:28:11: INFO : Stopped

Status: UNCONFIRMED → NEW
Has Regression Range: --- → yes
Component: Untriaged → WebRTC: Audio/Video
Ever confirmed: true
Flags: needinfo?(docfaraday)
Product: Firefox → Core
Regressed by: 1333879

So it looks like the parent process forcing the content process to crash? I am not sure what situations would cause this.

Keywords: regression

Here's a link to a recent ASAN debug build: https://queue.taskcluster.net/v1/task/QcRm5uqYRzub_P5v073YTQ/runs/0/artifacts/public/build/setup.exe

If you could use that, and run it with stdout/stderr logging enabled, that might give me enough information.

Alternately, if you can give me access to the web app you're developing, I could try to reproduce the problem myself.

Flags: needinfo?(docfaraday) → needinfo?(g)
Attached file install.log
Hi,

That setup appears to have no content (~700KB) and it doesn't download anything.
I get an error when the setup tries to create a shortcut for the firefox.exe which doesn't exist.  Only 3 files are created in the destination folder:
/install.log
/uninstall/shortcuts_log.ini
/uninstall/uninstall.log

Here is the install.log:
Attached file stderr.log
Attached file stdout.log

oops, it appears the comment gets submitted when adding an attachment, sorry about that.

Flags: needinfo?(g)
Attached file sdps.txt
Sure thing, here it goes. 
The first SDP is the original invite from the SIP Server that we set as remote description. The second is the brower's offer. Let me known if you need anything else.

Ok, the only thing in there that looks a little unusual is the fact that telephone-event's pt comes before the actual audio codec on the m-line, but that is completely valid so we shouldn't be getting upset about that.

Assignee: nobody → docfaraday

So, I cannot reproduce this using just the SDP exchange on OS X. Maybe a windows-specific bug? Looking into it.

Fiddle that I'm trying to repro with, very minimal: https://jsfiddle.net/whe950ar/1/

Yeah, this is probably going to require actual media flow.

Ok, this reproduces the bug: https://jsfiddle.net/sa12bjc4/2/

Yeah, webrtc.org just has a fit if telephone event is first in the list of configured codecs.

Bug 1333879 broke this because the code that used to remove all but the first "real" codec also happened to perform this reordering. I am going to implement this workaround logic as close to the boundary with webrtc.org as possible, and make sure there's a comment.

We also need to have a test for this, probably a mochitest or crashtest.

Hi,

Thanks for the update Byron.
When you have a fix for this, we are glad to test it out on our app.

Flags: needinfo?(g)

@Ryan VanderMeulen, I noticed that you changed the tracking flag "firefox69" to "wontfix". May I ask why? Is this fix only be available on ff70+?

there isn't a fix available yet and firefox 70 is scheduled to be released in a month.
until then the issue probably doesn't have a major general impact and doesn't meet the criteria for a patch to be uplifted to the release channel: https://wiki.mozilla.org/Release_Management/Uplift_rules#Release_Uplift

We were just wondering what was the reasoning behind the decision, and it makes perfect sense. Thank you for the clarification.

Priority: -- → P2

Hmm, the crashtest is timing out on OS X. Need to figure out what is happening there.

It looks like PeerConnectionImpl::Initialize is failing maybe?

Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c29af4939f66
Add test-case that moves payload type 101 to the front. r=jib

Backed out changeset c29af4939f66 (Bug 1581898) for crashing mochitests

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&classifiedState=unclassified&searchStr=linux%2Cx64%2Casan%2Cmochitests%2Ctest-linux64-asan%2Fopt-mochitest-media-e10s&fromchange=95114b9893afbbbaae7bb945b42b38d5c8ad01f3&tochange=36847c53be73d10f49fe45891c37104291e389b9

Backout link: https://hg.mozilla.org/integration/autoland/rev/daee59fa527009d0bdc0e34499310018c8da0bff

Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=269376010&repo=autoland&lineNumber=9526

[task 2019-10-02T06:22:00.517Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: D/mtransport Trickle candidate is redundant for stream 'PC:1569997320211340 (id=2147483749 url=https://example.com/tests/dom/media/tests/mochitest/test_peerConnection_telephoneEventFirst transport-id=transport_0' because it is completed: candidate:3 1 TCP 2105524479 fd15:4ba5:5a2b:100a:0:242:ac11:4 9 typ host tcptype active
[task 2019-10-02T06:22:00.517Z] 06:22:00 INFO - GECKO(1380) | (ice/ERR) ICE(PC:1569997320211340 (id=2147483749 url=https://example.com/tests/dom/media/tests/mochitest/test_peerConnection_telephoneEventFirst): peer (PC:1569997320211340 (id=2147483749 url=https://example.com/tests/dom/media/tests/mochitest/test_peerConnection_telephoneEventFirst:default), stream(PC:1569997320211340 (id=2147483749 url=https://example.com/tests/dom/media/tests/mochitest/test_peerConnection_telephoneEventFirst transport-id=transport_0 - dc13ee45:67e90eb3329a522e88eed4f054d5f545) tried to trickle ICE in inappropriate state 4
[task 2019-10-02T06:22:00.517Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: D/mtransport Trickle candidate is redundant for stream 'PC:1569997320211340 (id=2147483749 url=https://example.com/tests/dom/media/tests/mochitest/test_peerConnection_telephoneEventFirst transport-id=transport_0' because it is completed:
[task 2019-10-02T06:22:00.518Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: I/mtransport Flow[transport_0(none)]; Layer[dtls]: ****** SSL handshake completed ******
[task 2019-10-02T06:22:00.521Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: I/mtransport Flow[transport_0(none)]; Layer[dtls]: Selected ALPN string: webrtc
[task 2019-10-02T06:22:00.521Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: D/mtransport Created SRTP flow!
[task 2019-10-02T06:22:00.521Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: I/mtransport Flow[transport_0(none)]; Layer[dtls]: ****** SSL handshake completed ******
[task 2019-10-02T06:22:00.522Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: I/mtransport Flow[transport_0(none)]; Layer[dtls]: Selected ALPN string: webrtc
[task 2019-10-02T06:22:00.522Z] 06:22:00 INFO - GECKO(1380) | [Child 1424: Socket Thread]: D/mtransport Created SRTP flow!
[task 2019-10-02T06:22:00.523Z] 06:22:00 INFO - GECKO(1380) | #
[task 2019-10-02T06:22:00.523Z] 06:22:00 INFO - GECKO(1380) | # Fatal error in /builds/worker/workspace/build/src/media/webrtc/trunk/webrtc/voice_engine/channel.cc, line 1719
[task 2019-10-02T06:22:00.523Z] 06:22:00 INFO - GECKO(1380) | # last system error: 0
[task 2019-10-02T06:22:00.524Z] 06:22:00 INFO - GECKO(1380) | # Check failed: props
[task 2019-10-02T06:22:00.524Z] 06:22:00 INFO - GECKO(1380) | #
[task 2019-10-02T06:22:00.524Z] 06:22:00 INFO - GECKO(1380) | #
[task 2019-10-02T06:22:00.612Z] 06:22:00 ERROR - GECKO(1380) | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2019-10-02T06:22:00.761Z] 06:22:00 INFO - GECKO(1380) | JavaScript error: resource://services-settings/RemoteSettingsClient.jsm, line 149: Error: Unknown callback
[task 2019-10-02T06:22:00.842Z] 06:22:00 INFO - GECKO(1380) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2019-10-02T06:22:01.688Z] 06:22:01 INFO - GECKO(1380) | 1569997321685 Marionette TRACE Received observer notification xpcom-will-shutdown
[task 2019-10-02T06:22:01.690Z] 06:22:01 INFO - GECKO(1380) | 1569997321687 Marionette INFO Stopped listening on port 2828
[task 2019-10-02T06:22:01.690Z] 06:22:01 INFO - GECKO(1380) | 1569997321687 Marionette DEBUG Remote service is inactive
[task 2019-10-02T06:22:01.893Z] 06:22:01 INFO - GECKO(1380) | -----------------------------------------------------
[task 2019-10-02T06:22:01.893Z] 06:22:01 INFO - GECKO(1380) | Suppressions used:
[task 2019-10-02T06:22:01.893Z] 06:22:01 INFO - GECKO(1380) | count bytes template
[task 2019-10-02T06:22:01.894Z] 06:22:01 INFO - GECKO(1380) | 27 832 nsComponentManagerImpl
[task 2019-10-02T06:22:01.895Z] 06:22:01 INFO - GECKO(1380) | 2 288 libfontconfig.so
[task 2019-10-02T06:22:01.895Z] 06:22:01 INFO - GECKO(1380) | -----------------------------------------------------

Flags: needinfo?(docfaraday)

Yeah, I butterfingered that; the second part was supposed to land too, to fix the failure the test-case causes.

Flags: needinfo?(docfaraday)
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/1176f96ac84a
Add test-case that moves payload type 101 to the front. r=jib
https://hg.mozilla.org/integration/autoland/rev/0b26b9b42bd7
Move telephone-event to the back of the codec list, because webrtc.org crashes if we don't. r=mjf
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71

Can you verify the fix worked in the latest nightly? Thanks!

Flags: needinfo?(g)

Is this something you think might be good for beta uplift? We're heading into beta 13 (of 14) so there is not much leeway.

Flags: needinfo?(docfaraday)

This would be a fairly easy uplift I think. I'll request tomorrow.

Comment on attachment 9094691 [details]
Bug 1581898: Move telephone-event to the back of the codec list, because webrtc.org crashes if we don't. r?mjf

Beta/Release Uplift Approval Request

  • User impact if declined: Websites could trivially cause nullptr tab crashes, and certain legitimate webrtc scenarios would also be broken.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This is a fairly simple patch, with a test case that hits the new code.
  • String changes made/needed: None
Flags: needinfo?(docfaraday)
Attachment #9094691 - Flags: approval-mozilla-beta?
Attachment #9094690 - Flags: approval-mozilla-beta?

Comment on attachment 9094691 [details]
Bug 1581898: Move telephone-event to the back of the codec list, because webrtc.org crashes if we don't. r?mjf

Fix for WebRTC crash, let's uplift for beta 13.

Attachment #9094691 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9094690 [details]
Bug 1581898: Add test-case that moves payload type 101 to the front. r?jib

Let's also uplift the test.

Attachment #9094690 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Hi
We have confirmed that the fix works well with our application (tested with nightly 71.0a1 (2019-10-04) (64-bit))
Thank you everyone for the support.

You need to log in before you can comment on or make changes to this bug.