Closed Bug 839677 Opened 11 years ago Closed 11 years ago

(bad message queue pointer) Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out

Categories

(Core :: WebRTC: Signaling, defect, P1)

21 Branch
x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla22

People

(Reporter: abr, Assigned: abr)

References

Details

(Whiteboard: [WebRTC],[blocking-webrtc+] [qa-])

Attachments

(1 file, 1 obsolete file)

philor
https://tbpl.mozilla.org/php/getParsedLog.php?id=19459762&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound opt test mochitest-2 on 2013-02-05 10:30:24
slave: talos-r4-lion-032

26010 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out.
Summary: /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH → Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH
Whiteboard: [WebRTC],[blocking-webrtc+]
Ms2ger%gmail.com
https://tbpl.mozilla.org/php/getParsedLog.php?id=19512848&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound debug test mochitest-2 on 2013-02-06 16:39:53
slave: talos-r4-lion-068

26008 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out.
Summary: Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH → Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out
Depends on: 841566, 841457
Any TBPL stars after this comment should contain useful logging information that isolates this problem to a smaller part of the system.
Priority: -- → P1
(In reply to Adam Roach [:abr] from comment #6)
> Any TBPL stars after this comment should contain useful logging information
> that isolates this problem to a smaller part of the system.

Has the logging provided any more insight? :-)
(In reply to Ed Morley [:edmorley UTC+0] from comment #30)
> (In reply to Adam Roach [:abr] from comment #6)
> > Any TBPL stars after this comment should contain useful logging information
> > that isolates this problem to a smaller part of the system.
> 
> Has the logging provided any more insight? :-)

It has, and I spent quite a bit of time on Friday doing analysis of the logs of good runs versus bad runs to try to nail down where things go wrong. I did manage to find a fairly consistent difference that I suspected was the problem; however, after doing work to make things happen in the order that appeared to yield success, I found that forcing the order that I thought would cause failure didn't actually cause failure.

The good news is that Bug 845523, now landed on m-c, will eliminate the ability for this set of events to occur in different orders. Hopefully, this will make the actual differences between successful and failing runs easier to find.

Believe me, I understand that this is annoying for the sheriffs, and getting rid of the intermittent oranges is at the top of my priority list.
Thank you for your work on this so far - much appreciated :-)
Okay, I think I see the problem now. It appears that the failure runs all show the CCApp thread getting on the CPU and starting to process messages before the GSM Task thread has had any CPU cycles at all. The model here is that the first thing each thread does is set its inbound message queue. But since GSM hasn't run at all, its queue is still NULL. This means the CCApp->GSM message "SETPEERCONNECTION" is going to fail to be delivered.

The failure path is pretty self-evident from that point forward.

Rather than trying to synchronize start up further, I think the easy fix here is to initialize all the queues prior to starting any of the threads.
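For illustration only, here is a minimal sketch of the ordering problem and of the proposed fix. It is not the sipcc code: the queue type and every function name below are hypothetical, and the two thread orderings are simulated with plain sequential calls rather than real threads so that the outcome is deterministic.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8

/* Hypothetical minimal queue; the real queue type is not shown in this bug. */
typedef struct {
    const char *items[QUEUE_CAPACITY];
    size_t count;
} msg_queue_t;

static msg_queue_t gsm_queue_storage;
static msg_queue_t *gsm_queue = NULL;   /* stays NULL until someone sets it */

/* Enqueue a message for GSM. Returns 0 on success, -1 if it cannot be delivered. */
int send_to_gsm(const char *msg)
{
    assert(msg != NULL);
    if (gsm_queue == NULL || gsm_queue->count >= QUEUE_CAPACITY) {
        return -1;   /* queue not set yet (or full): the message is lost */
    }
    gsm_queue->items[gsm_queue->count++] = msg;
    return 0;
}

/* Racy model: the GSM thread sets its own queue as its first action. */
void gsm_thread_first_step(void)
{
    gsm_queue = &gsm_queue_storage;
}

/* Proposed ordering: set up every queue before any thread is started. */
void init_queues_before_threads(void)
{
    gsm_queue_storage.count = 0;
    gsm_queue = &gsm_queue_storage;
}

/* Reset helper so both orderings can be tried in one process. */
void reset_queues(void)
{
    gsm_queue = NULL;
    gsm_queue_storage.count = 0;
}
```

If send_to_gsm("SETPEERCONNECTION") runs before gsm_thread_first_step(), it returns -1, which corresponds to the lost message in the failing runs. If init_queues_before_threads() runs first, the send succeeds no matter which thread is scheduled first.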
It turns out that only the GSM task copied its queue to a module-local variable. Everyone else uses the globals declared in init.c. The patch I just attached -- as yet untested -- changes GSM to behave the same way, which should eliminate any possibility of some other thread attempting to enqueue a message to GSM before it's ready.

I'll be requesting review on the patch as soon as I determine that I haven't broken anything, hopefully later today (but before bugzilla goes down for the upgrade).
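A rough sketch of why the module-local copy matters, under stated assumptions: the names gsm_msg_queue and gsm_msgq are taken from the patch title, and their roles (module-local copy versus shared global) are inferred from it. The queue type and all helper functions are hypothetical placeholders, not the real code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical placeholder for the real queue type, which this bug does not show. */
typedef struct {
    const char *last_msg;
    int pending;
} queue_t;

static queue_t gsm_storage;

/* Shared global, set during single-threaded initialization. */
queue_t *gsm_msgq = NULL;

/* Module-local copy, filled in only once the GSM task itself starts. */
static queue_t *gsm_msg_queue = NULL;

/* Shared enqueue logic: fails if the queue pointer has not been set. */
static int enqueue(queue_t *q, const char *msg)
{
    assert(msg != NULL);
    if (q == NULL) {
        return -1;
    }
    q->last_msg = msg;
    q->pending++;
    return 0;
}

/* Runs before any thread exists. */
void init_globals(void)
{
    gsm_storage.pending = 0;
    gsm_msgq = &gsm_storage;
}

/* Runs only when the GSM task gets CPU time; the copy is made late. */
void gsm_task_startup(void)
{
    gsm_msg_queue = gsm_msgq;
}

/* Before the patch (assumed): senders go through the module-local copy. */
int gsm_send_via_local_copy(const char *msg)
{
    return enqueue(gsm_msg_queue, msg);
}

/* After the patch (assumed): senders go through the global directly. */
int gsm_send_via_global(const char *msg)
{
    return enqueue(gsm_msgq, msg);
}

int pending_count(void)
{
    return gsm_storage.pending;
}
```

After init_globals() but before gsm_task_startup(), the send through the local copy fails while the send through the global succeeds. That is the window this change is meant to close.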
Attachment #721471 - Attachment is obsolete: true
Comment on attachment 721493 [details] [diff] [review]
Remove problematic gsm_msg_queue and use gsm_msgq instead

Randell: This passes signaling_unittests and mochi tests on my local machine. Given the state of the try infrastructure, I'm not sure this kind of change warrants a try push. Let me know if you'd prefer to see a try run.
Attachment #721493 - Flags: review?(rjesup)
Attachment #721493 - Flags: review?(rjesup) → review+
https://hg.mozilla.org/mozilla-central/rev/f7acf064582d
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Backed out for now while we investigate bug 848966. I'll re-land whatever comes up clean.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cb432984d5ce
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/d7f59fd537d9
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Flags: in-testsuite+
Whiteboard: [WebRTC],[blocking-webrtc+] → [WebRTC],[blocking-webrtc+] [qa-]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The log for comment 47 shows a very different pathology from the original source of the bug. To avoid confusion, I'm re-closing this bug and moving the new problem into Bug 853858.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out → (bad message queue pointer) Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out