Closed Bug 1025765 Opened 10 years ago Closed 10 years ago

Win64 media failures in debug mochitest-3

Categories

(Core :: WebRTC, defect)

32 Branch
x86_64
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
Tracking Status
firefox-esr31 --- wontfix

People

(Reporter: away, Unassigned)

References

Details

(Keywords: sec-high)

Attachments

(4 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=41404461&tree=Date

Filing one bug for now under the assumption that it's a single root cause. Can split these out as necessary.

808 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | Test timed out.
822 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcLocal)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
823 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | [SimpleTest.finish()] this test already called finish!
824 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | called finish() multiple times
828 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcRemote)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
829 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | [SimpleTest.finish()] this test already called finish!
830 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/identity/test_peerConnection_peerIdentity.html | called finish() multiple times
868 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
869 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcRemote) initial dataChannels[0] failed to switch to 'open'
870 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
871 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Test timed out.
872 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcLocal)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
873 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  [SimpleTest.finish()] this test already called finish!
874 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  called finish() multiple times
875 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcRemote)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
876 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  [SimpleTest.finish()] this test already called finish!
877 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  called finish() multiple times
976 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
980 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | PeerConnectionWrapper (pcRemote) initial dataChannels[0] failed to switch to 'open'
984 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
985 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | Test timed out.
994 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcLocal)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
995 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | [SimpleTest.finish()] this test already called finish!
996 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | called finish() multiple times
1000 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcRemote)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
1001 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | [SimpleTest.finish()] this test already called finish!
1002 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudio.html | called finish() multiple times
1106 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
1110 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | PeerConnectionWrapper (pcRemote) initial dataChannels[0] failed to switch to 'open'
1114 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
1115 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Test timed out.
1124 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcLocal)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
1125 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | [SimpleTest.finish()] this test already called finish!
1126 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | called finish() multiple times
1130 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcRemote)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
1131 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | [SimpleTest.finish()] this test already called finish!
1132 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | called finish() multiple times
1224 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideoCombined.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
1228 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideoCombined.html | PeerConnectionWrapper (pcRemote) initial dataChannels[0] failed to switch to 'open'
1232 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideoCombined.html | PeerConnectionWrapper (pcLocal) initial dataChannels[0] failed to switch to 'open'
1233 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideoCombined.html | Test timed out.
1234 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | 4 test timeouts, giving up.
1235 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | Skipping 1789 remaining tests.
1240 ERROR TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideoCombined.html finished in a non-clean fashion, probably because it didn't call SimpleTest.finish()
1245 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcLocal)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
1246 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | [SimpleTest.finish()] this test already called finish!
1247 ERROR TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | called finish() multiple times
1251 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | Unexpected event 'onsignalingstatechange' fired with message = 'PeerConnectionWrapper (pcRemote)' at: ["PeerConnectionWrapper/this._pc.onsignalingstatechange@http://mochi.test:8888/tests/dom/media/tests/mochitest/pc.js:1374:5",""]
1252 INFO TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | [SimpleTest.finish()] this test already called finish!
1253 ERROR TEST-UNEXPECTED-FAIL | (SimpleTest/TestRunner.js) | called finish() multiple times
Return code: 1
Component: Video/Audio → WebRTC
Log from "mach test dom/media/tests/mochitest/test_dataChannel_basicAudio.html" on 32-bit dbg with max logging level.

I removed pldhash size complaints since it's reasonable for those to differ on win64.
Log from "mach test dom/media/tests/mochitest/test_dataChannel_basicAudio.html" on 64-bit dbg with max logging level.

I closed the browser window at 0:50 when it became clear that the test wasn't going to finish on its own.
I want to debug this but there's another bug that I really ought to be working on.

:jesup are you interested in taking a look? This bug is the last remaining unknown for the win64 tests on Date. If you can't/don't get to this, then I'll come back to it on my next event loop.
Flags: needinfo?(rjesup)
The important piece in the 64bit log is IMHO this:

0:18.42 4380[2212000]: Flow[87a6e8bd765ad168:2,rtp(none)]; Layer[dtls]: PacketReceived(135)
0:18.42 4380[2212000]: Flow[87a6e8bd765ad168:2,rtp(none)]; Layer[ice]:  SendPacket(15) succeeded
0:18.42 4380[2212000]: Flow[87a6e8bd765ad168:2,rtp(none)]; Layer[dtls]: SSL handshake error -12246

For whatever reason the DTLS handshake fails and there is currently no way to inform the JS side about it.

And the log file from the initial report is full of "SSL handshake error". So it seems like we have a DTLS problem on Win64.

@jesup: who would be the best person to look into this, besides Ekr?
@ekr: Looks like DTLS is not working on Win64. Any idea? Who could help debugging this?
Flags: needinfo?(rjesup) → needinfo?(ekr)
Group: media-core-security
OK, so we are doing four SendPacket(135)s and yet there are only four DTLS connections:

 0:17.74 4380[2212000]: Setting up DTLS as client
 0:17.83 4380[2212000]: Setting up DTLS as client
 0:18.23 4380[2212000]: Setting up DTLS as server
 0:18.38 4380[2212000]: Setting up DTLS as server

This is suspicious since the 135s are likely ClientHellos which is what you would expect if both sides thought they were Clients.

Nils: can you make this happen in your machine? If so, I've got some suggestions for places to look.
Flags: needinfo?(ekr)
In case it helps, the test failure is only on debug builds. (Or maybe the underlying issue is present on opt builds and goes unnoticed, I don't know)
Interesting.
(In reply to Eric Rescorla (:ekr) from comment #7) 
> This is suspicious since the 135s are likely ClientHellos which is what you
> would expect if both sides thought they were Clients.
> 
> Nils: can you make this happen in your machine? If so, I've got some
> suggestions for places to look.

To MT and me it looks like both ends get setup as DTLS client and server. The server side is missing an ICE packet so that the ICE connection is not open yet when it receives the client HELLO and therefore discards it. But then the server side suddenly sends this 135 bytes, which look like a client HELLO and that obviously confuses the client side.

David: How do I build this locally? Could you try to re-run the 64bit test with these env variables set SSLTRACE=1 and SSLDEBUG=1?
Flags: needinfo?(dmajor)
(In reply to Nils Ohlmeier [:drno] from comment #10)
> (In reply to Eric Rescorla (:ekr) from comment #7) 
> > This is suspicious since the 135s are likely ClientHellos which is what you
> > would expect if both sides thought they were Clients.
> > 
> > Nils: can you make this happen in your machine? If so, I've got some
> > suggestions for places to look.
> 
> To MT and me it looks like both ends get setup as DTLS client and server.

Yes, I agree.

That's what I cut and pasted above.


> The server side is missing an ICE packet so that the ICE connection is not
> open yet when it receives the client HELLO and therefore discards it. But
> then the server side suddenly sends this 135 bytes, which look like a client
> HELLO and that obviously confuses the client side.

Yes, that's what confuses me as well.

> 
> David: How do I build this locally? Could you try to re-run the 64bit test
> with these env variables set SSLTRACE=1 and SSLDEBUG=1?
Can we put a breakpoint in TLDtls::SetRole() to see if it's being called in some weirdly unexpected place?
Flags: needinfo?(dmajor)
> David: How do I build this locally?

I used "start-shell-msvc2010-x64.bat" and the mozconfig at browser/config/mozconfigs/win64/debug.

Feel free to use the Date branch as a playground if you need it. It does x64 builds and tests (the test results are displayed next to the 32-bit ones).
(In reply to David Major [:dmajor] from comment #14)
> > David: How do I build this locally?
> 
> I used "start-shell-msvc2010-x64.bat" and the mozconfig at
> browser/config/mozconfigs/win64/debug.
> 
> Feel free to use the Date branch as a playground if you need it. It does x64
> builds and tests (the test results are displayed next to the 32-bit ones).

Unfortunately the other log files was not as helpful as we were hoping.
Thanks for the pointer (without I did not succeed to compile 64bit so far). I'm going to give it a try with a local build later.
(In reply to Eric Rescorla (:ekr) from comment #12)
> Can we put a breakpoint in TLDtls::SetRole() to see if it's being called in
> some weirdly unexpected place?

I got the same four hits on both win32 and win64. Two client followed by two server, same stacks.
Group: media-core-security → core-security
I have a 64bit build running locally now as well, which hangs at test_dataChannel_basicAudio.html.
David can you confirm that if you run just test_peerConnection_basicAudio.html it passes?
Flags: needinfo?(dmajor)
Confirmed, test_peerConnection_basicAudio.html passes.
Flags: needinfo?(dmajor)
Here is the confirmation for our theory from my logs:

 0:21.89 2014-07-15 03:48:28.950000 UTC - 3860[2012000]: Setting up DTLS as client
 0:22.43 1796: SSL3[501600656]: send client_hello handshake
 0:22.51 2014-07-15 03:48:29.558000 UTC - 3860[2012000]: Setting up DTLS as server
 0:22.70 1796: SSL3[501602672]: send client_hello handshake
OK, I have a partial theory here. If you look at the code, the client/server switch
keys off the same value, role_, but using the opposite test.

role_ is an enum with only two values, CLIENT and SERVER.

Here is the switch
for whether we set up as the client or the server, where we test against CLIENT:
http://dxr.mozilla.org/mozilla-central/source/media/mtransport/transportlayerdtls.cpp#469

and here is the switch for how we reset the handshake, where we test against SERVER
http://dxr.mozilla.org/mozilla-central/source/media/mtransport/transportlayerdtls.cpp#571

Note that this is the same function.

If role_ somehow got set to some value that we neither client nor server,
(E.g., due to memory error) then we would take the SERVER arm for the first
of these branches and the CLIENT arm in the second one, which would be consistent
with what we are seeing.

Note that we initialize the value in the ctor at:
http://dxr.mozilla.org/mozilla-central/source/media/mtransport/transportlayerdtls.h#50

And the only place that seems to call SetRole() is at:
http://dxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp#3148

Nils: I would suggest that you put a check in TransportLayerDtls::Setup() where you
print out the value of role_ and see if it's neither CLIENT nor SERVER. If so,
then you'll need to track down where it gets messed up. Perhaps you can track
it down with printfs/asserts if that's what the problem is here.

I'm not totally in love with this theory since it would seem to imply that only
one side is getting messed up, which is weird. So, maybe something else is happening,
but at least this seems like a place to start.
Wasted some time today with chasing a false alarm. Martin and me had the same theory/observation on Friday. Will put instrumentation around that tomorrow.

A few observations:
- this happens with all data channel tests, but not with the peerConnection tests
- the test_dataChannel_bug1013809.html initialized first the server side then the client and it looks like the server side sends a client_hello even in that case (which probably means the role_ being initialized wrong from the get go):

 0:24.36 2014-07-15 05:30:57.212000 UTC - 3012[2012000]: Setting up DTLS as server
 0:24.62 3548: SSL3[401248416]: send client_hello handshake
 0:24.86 2014-07-15 05:30:57.761000 UTC - 3012[2012000]: Setting up DTLS as server
 0:24.87 3548: SSL3[188835632]: send client_hello handshake
 0:26.25 2014-07-15 05:30:59.151000 UTC - 3012[2012000]: Setting up DTLS as client
 0:26.26 3548: SSL3[161130720]: send client_hello handshake
 0:26.61 2014-07-15 05:30:59.516000 UTC - 3012[2012000]: Setting up DTLS as client
 0:26.62 3548: SSL3[175048256]: send client_hello handshake
Aha! The second test is: cmp dword ptr [rsi+0E8h],ebx

Where ebx is supposed to be 1, but it's actually garbage due to bug 1025729.

I've applied the NSS patch and it fixes test_dataChannel_basicAudio.html. I'm going to bet it fixes the rest of them too...
Let's see how this turns out: https://tbpl.mozilla.org/?tree=Date&rev=2a11c3906c7e

I should have put two and two together since I've hit the rbx issue before. Sorry for the time waste.
This sounds like we're reading random garbage, so I'm going to set this to sec-high.
Keywords: sec-high
I'm not trying to debate the security level, but as I understand the situation,
we have code here which is correctly trying to read some memory value
(in this case a register) which has been independently trashed. so this is
a side effect of that other problem, which might manifest other side effects
in other cases, right?
Depends on: 1025729
Attachment #8456257 - Flags: review?(ekr) → review?(martin.thomson)
Comment on attachment 8456257 [details] [diff] [review]
Use the same check in DTLS init

Review of attachment 8456257 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not sure that we really need to land this, if it's down to a corruption issue elsewhere.

It is fine though.
Attachment #8456257 - Flags: review?(martin.thomson) → review+
1025729 has landed in m-c and has fixed this issue. Do you still want to check in the DTLS change?
Flags: needinfo?(drno)
If it works without this change it might be better to not touch a working system. So let close this issue if it works now for you.
Flags: needinfo?(drno)
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
Does this bug need to remain private?
Flags: needinfo?(drno)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #30)
> Does this bug need to remain private?

I don't see anything in here which needs to remain private.
Flags: needinfo?(drno)
Group: core-security
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: