(bad message queue pointer) Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out

RESOLVED FIXED in mozilla22

Status


Core
WebRTC: Signaling
P1
normal
RESOLVED FIXED
5 years ago
5 years ago

People

(Reporter: abr, Assigned: abr)

Tracking

21 Branch
mozilla22
x86
Mac OS X
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [WebRTC],[blocking-webrtc+] [qa-])

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

5 years ago
philor
https://tbpl.mozilla.org/php/getParsedLog.php?id=19459762&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound opt test mochitest-2 on 2013-02-05 10:30:24
slave: talos-r4-lion-032

26010 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out.
Keywords: intermittent-failure

Updated

5 years ago
Summary: /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH → Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH
Whiteboard: [WebRTC],[blocking-webrtc+]
Comment hidden (Treeherder Robot)
(Assignee)

Updated

5 years ago
Duplicate of this bug: 839679
(Assignee)

Comment 3

5 years ago
Ms2ger%gmail.com
https://tbpl.mozilla.org/php/getParsedLog.php?id=19512848&tree=Mozilla-Inbound
Rev4 MacOSX Lion 10.7 mozilla-inbound debug test mochitest-2 on 2013-02-06 16:39:53
slave: talos-r4-lion-068

26008 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out.
Summary: Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out WITH NO CRASH → Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out
(Assignee)

Updated

5 years ago
Depends on: 841566, 841457
(Assignee)

Comment 6

5 years ago
Any TBPL stars after this comment should contain useful logging information that isolates this problem to a smaller part of the system.
(Assignee)

Updated

5 years ago
Priority: -- → P1
(In reply to Adam Roach [:abr] from comment #6)
> Any TBPL stars after this comment should contain useful logging information
> that isolates this problem to a smaller part of the system.

Has the logging provided any more insight? :-)
(Assignee)

Comment 32

5 years ago
(In reply to Ed Morley [:edmorley UTC+0] from comment #30)
> (In reply to Adam Roach [:abr] from comment #6)
> > Any TBPL stars after this comment should contain useful logging information
> > that isolates this problem to a smaller part of the system.
> 
> Has the logging provided any more insight? :-)

It has, and I spent quite a bit of time on Friday analyzing the logs of good runs versus bad runs to try to nail down where things go wrong. I did manage to find a fairly consistent difference that I suspected was the problem; however, after doing work to make things happen in the order that appeared to yield success, I found that forcing the order that I thought would cause failure didn't actually cause failure.

The good news is that Bug 845523, now landed on m-c, will eliminate the ability for this set of events to occur in different orders. Hopefully, this will make the actual differences between successful and failing runs easier to find.

Believe me, I understand that this is annoying for the sheriffs, and getting rid of the intermittent oranges is at the top of my priority list.
Thank you for your work on this so far - much appreciated :-)
(Assignee)

Comment 34

5 years ago
Okay, I think I see the problem now. The failing runs all show the CCApp thread getting on the CPU and starting to process messages before the GSM Task thread has had any CPU time at all. The model here is that the first thing each thread does is set up its inbound message queue. But since GSM hasn't run at all, its queue is still NULL. This means the CCApp->GSM message "SETPEERCONNECTION" will fail to be delivered.

The failure path is pretty self-evident from that point forward.

Rather than trying to synchronize startup further, I think the easy fix here is to initialize all the queues prior to starting any of the threads.
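
A minimal sketch of that ordering, assuming POSIX threads. Every name here (demo_queue_t, demo_gsm_q, demo_send, demo_start_all and the two thread functions) is hypothetical and only stands in for the real signaling code, which is not shown in this bug. The point is simply that the receiver's queue pointer is already valid before any sender thread can run, so the outcome no longer depends on which thread is scheduled first.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a message queue; not the real signaling type. */
typedef struct {
    pthread_mutex_t lock;
    int             pending;   /* number of messages waiting in the queue */
} demo_queue_t;

/* Global queue pointer; NULL until the queue has been created. */
static demo_queue_t  demo_gsm_storage;
static demo_queue_t *demo_gsm_q = NULL;

/* Enqueue one message. Returns 0 on success, -1 if the queue does not exist. */
static int demo_send(demo_queue_t *q)
{
    if (q == NULL) {
        return -1;             /* the failure described above */
    }
    pthread_mutex_lock(&q->lock);
    q->pending++;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* Sender thread: stands in for CCApp sending SETPEERCONNECTION to GSM. */
static void *demo_ccapp_thread(void *arg)
{
    int *result = arg;
    *result = demo_send(demo_gsm_q);
    return NULL;
}

/* Receiver thread: deliberately does no queue setup of its own. */
static void *demo_gsm_thread(void *arg)
{
    (void)arg;
    return NULL;
}

/* Create every queue first, then start the threads.
 * Returns the sender's result: 0 if the message was delivered. */
int demo_start_all(void)
{
    pthread_t ccapp, gsm;
    int send_result = -1;

    /* Step 1: all queues exist before any thread does. */
    pthread_mutex_init(&demo_gsm_storage.lock, NULL);
    demo_gsm_storage.pending = 0;
    demo_gsm_q = &demo_gsm_storage;

    /* Step 2: only now start the threads, in any order. */
    if (pthread_create(&ccapp, NULL, demo_ccapp_thread, &send_result) != 0) {
        return -1;
    }
    if (pthread_create(&gsm, NULL, demo_gsm_thread, NULL) != 0) {
        pthread_join(ccapp, NULL);
        return -1;
    }
    pthread_join(ccapp, NULL);
    pthread_join(gsm, NULL);
    return send_result;
}
```

Calling demo_start_all() once from a test program should report a delivered message regardless of scheduling, because demo_send never sees a NULL queue.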
(Assignee)

Comment 35

5 years ago
Created attachment 721471 [details] [diff] [review]
Remove problematic gsm_msg_queue and use gsm_msgq instead
(Assignee)

Comment 36

5 years ago
It turns out only the GSM Task copied its queue to a module-local variable. Everyone else uses the globals declared in init.c. The patch I just attached -- as yet untested -- changes GSM to behave the same way, which should eliminate any possibility of some other thread attempting to enqueue a message to GSM before it's ready.

I'll be requesting review on the patch as soon as I determine that I haven't broken anything, hopefully later today (but before bugzilla goes down for the upgrade).
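
To illustrate the difference between the two variables, here is a single-threaded sketch. All names are hypothetical; sketch_local_q plays the role that gsm_msg_queue presumably had (a copy made only when the task first runs), and sketch_global_q plays the role of a global such as gsm_msgq that is set during initialization. It is not the patch itself, only a model of why sending through the copy can fail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue type; only counts enqueued messages. */
typedef struct {
    int pending;
} sketch_queue_t;

static sketch_queue_t  sketch_storage;

/* Global pointer, assigned by initialization code before threads start. */
static sketch_queue_t *sketch_global_q = NULL;

/* Module-local copy, assigned only when the owning task first runs. */
static sketch_queue_t *sketch_local_q = NULL;

/* Runs once, before any task is started. */
void sketch_init(void)
{
    sketch_storage.pending = 0;
    sketch_global_q = &sketch_storage;
}

/* First thing the owning task does when it finally gets CPU time. */
void sketch_task_first_run(void)
{
    sketch_local_q = sketch_global_q;
}

/* Returns 0 on success, -1 if the queue pointer is still NULL. */
static int sketch_enqueue(sketch_queue_t *q)
{
    if (q == NULL) {
        return -1;
    }
    q->pending++;
    return 0;
}

/* Old behaviour: senders go through the module-local copy. */
int sketch_send_via_local(void)
{
    return sketch_enqueue(sketch_local_q);
}

/* New behaviour: senders go through the global, like every other task. */
int sketch_send_via_global(void)
{
    return sketch_enqueue(sketch_global_q);
}
```

After sketch_init() but before sketch_task_first_run(), a send through the local copy fails while a send through the global succeeds, which is the window the patch is meant to close.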
(Assignee)

Comment 37

5 years ago
Created attachment 721493 [details] [diff] [review]
Remove problematic gsm_msg_queue and use gsm_msgq instead
(Assignee)

Updated

5 years ago
Attachment #721471 - Attachment is obsolete: true
(Assignee)

Comment 38

5 years ago
Comment on attachment 721493 [details] [diff] [review]
Remove problematic gsm_msg_queue and use gsm_msgq instead

Randell: This passes signaling_unittests and mochitests on my local machine. Given the state of the try infrastructure, I'm not sure this kind of change warrants a try push. Let me know if you'd prefer to see a try run.
Attachment #721493 - Flags: review?(rjesup)

Updated

5 years ago
Attachment #721493 - Flags: review?(rjesup) → review+
https://hg.mozilla.org/mozilla-central/rev/f7acf064582d
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla22
Backed out for now while we investigate bug 848966. I'll re-land whatever comes up clean.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cb432984d5ce
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
https://hg.mozilla.org/mozilla-central/rev/d7f59fd537d9
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

5 years ago
Flags: in-testsuite+
Whiteboard: [WebRTC],[blocking-webrtc+] → [WebRTC],[blocking-webrtc+] [qa-]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 48

5 years ago
The log for comment 47 shows a very different pathology from the original source of this bug. To avoid confusion, I'm re-closing this bug and moving the new problem into Bug 853858.
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Keywords: intermittent-failure
Resolution: --- → FIXED
Summary: Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out → (bad message queue pointer) Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out