Closed Bug 1223394 Opened 5 years ago Closed 5 years ago

Intermittent test_dataChannel_basicAudioVideo.html | Error in test execution: NotFoundError: The object can not be found here.

Categories

(Taskcluster Graveyard :: Docker Images, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)

References

()

Details

(Keywords: intermittent-failure, Whiteboard: [systemsfe][tests disabled])

Attachments

(2 files, 1 obsolete file)

https://treeherder.mozilla.org/logviewer.html#?job_id=17056808&repo=mozilla-inbound

 03:53:21 INFO - 95 INFO TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Error in test execution: NotFoundError: The object can not be found here.
jya: could you take a look? This seems to be failing very frequently.
Flags: needinfo?(jyavenard)
I guess I should turn off webrtc tests on mulet again
(In reply to Gregor Wagner [:gwagner] from comment #3)
> I guess I should turn off webrtc tests on mulet again

Yeah, it's failing very frequently. Can you turn this off?
Flags: needinfo?(anygregor)
Blocks: 1222478
Flags: needinfo?(anygregor)
Attached patch disable.diff (obsolete) — Splinter Review
Assignee: nobody → anygregor
This patch disables the media tests again that were enabled in bug 1222478.
Jean, do you want to take a look and see if you can fix the actual tests, or should we just land this patch?
Whiteboard: [systemsfe]
Target Milestone: --- → 2.6 S1 - 11/20
This is webrtc. I'm not involved with this. :jesup would.
Flags: needinfo?(jyavenard) → needinfo?(rjesup)
Let's wait for jesup to get online. The patch is ready to go if the orange is getting out of control.
Never mind my attempt with the try run in comment #10.

After looking at the logs more, I think this looks more like an infra problem to me. I looked at a few of the logs and they all show an ALSA error message of some form or other. E.g.:

https://treeherder.mozilla.org/logviewer.html#?job_id=2636702&repo=mozilla-central

 02:02:21     INFO -  ALSA lib pcm_direct.c:1118:(snd1_pcm_direct_initialize_poll_fd) unable to open timer 'hw:CLASS=3,SCLASS=0,CARD=0,DEV=0,SUBDEV=1'
 02:02:21     INFO -  ALSA lib pcm_dsnoop.c:681:(snd_pcm_dsnoop_open) unable to initialize poll_fd
 02:02:21     INFO -  ALSA lib pcm_direct.c:1118:(snd1_pcm_direct_initialize_poll_fd) unable to open timer 'hw:CLASS=3,SCLASS=0,CARD=0,DEV=1,SUBDEV=1'
 02:02:21     INFO -  ALSA lib pcm_dsnoop.c:681:(snd_pcm_dsnoop_open) unable to initialize poll_fd

And in all cases it looks like the getUserMedia() request for video succeeds, but the one for audio fails:

 02:02:24     INFO -  118 INFO PeerConnectionWrapper (pcLocal) gUM local stream {95fa6a9c-3548-433a-8b32-fd404bdb3a5b} with video track {d5f9d173-21b7-4aef-ba32-292a93e00c1f}
 02:02:24     INFO -  119 INFO Got media stream: video (local)

I guess the "NotFoundError" comes from the fact that gUM() fails to find a proper sound device.

Now for the particular "unable to initialize poll_fd" error, it sounds like it might simply be a user permission problem, according to this:
http://www.linuxquestions.org/questions/slackware-14/sound-problem-mpg321-mp3-785591/

But other test runs show different error messages around sound. The important thing to point out here is that I think all Linux test machines are supposed to have a fake ALSA loopback driver installed. Is it possible that these are not installed on all the Mulet machines?
Greg, could you have a look?
Flags: needinfo?(garndt)
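The loopback-driver question above can be sanity-checked mechanically. The sketch below is a hypothetical helper (not part of any existing harness) that parses text in the shape of `/proc/asound/cards` and reports whether a given ALSA card is present; the sample strings mimic that file's layout and the `Loopback` card name that the snd-aloop module is expected to register.

```python
import re

def has_alsa_card(cards_text, name="Loopback"):
    """Report whether a card with the given name appears in text shaped
    like /proc/asound/cards (hypothetical helper for illustration)."""
    # Card lines look roughly like: " 0 [Loopback       ]: Loopback - Loopback"
    pattern = r"^\s*\d+\s*\[" + re.escape(name)
    return re.search(pattern, cards_text, re.MULTILINE) is not None

# Sample inputs mimicking /proc/asound/cards on a machine with and
# without the snd-aloop fake loopback driver loaded.
CARDS_WITH_LOOPBACK = " 0 [Loopback       ]: Loopback - Loopback\n"
CARDS_NO_SOUND = "--- no soundcards ---\n"
```

On a real test machine one would read `/proc/asound/cards` and call `has_alsa_card` on its contents; a `False` result would line up with gUM's NotFoundError for the audio half of the request.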
@sheriff's feel free to land gwagner's patch as a temporary solution if we can't fix this quickly enough.
I think the PulseAudio config on the tester images is...odd. We should consider just switching these jobs to use the desktop-test image that dustin has put together.
I was thinking of moving this to WebRTC (since the general Audio/Video category is now a placeholder for untriaged media bugs), but I think this really should be an infra bug.  Can someone advise on the best place in bugzilla for an infra bug like this?  Thanks.
I agree that right now this probably should be marked as infra.  I am looking at this and will move this under Taskcluster for the time being.  If it's determined to be related to something outside of infra we can move it again.
Component: Audio/Video → Docker Images
Flags: needinfo?(garndt)
Product: Core → Taskcluster
Target Milestone: 2.6 S1 - 11/20 → ---
Flags: needinfo?(garndt)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> I think the PulseAudio config on the tester images is...odd. We should
> consider just switching these jobs to use the desktop-test image that dustin
> has put together.

That's a big change -- desktop-test runs tests as 'worker', while tester runs tests as 'root'.
(In reply to Dustin J. Mitchell [:dustin] from comment #18)
> (In reply to Ted Mielczarek [:ted.mielczarek] from comment #15)
> > I think the PulseAudio config on the tester images is...odd. We should
> > consider just switching these jobs to use the desktop-test image that dustin
> > has put together.
> 
> That's a big change -- desktop-test runs tests as 'worker', while tester
> runs tests as 'root'.

Ick. Could we at least harmonize the way they configure Pulseaudio? It looks like tester is trying to hand off an actual audio device?
For those following along at home:

tester (gaia):
  https://dxr.mozilla.org/mozilla-central/source/testing/docker/tester/bin/entrypoint#12 (starts as root)

https://public-artifacts.taskcluster.net/FZDA27TxT3CNWGM207WlYQ/0/public/logs/live_backing.log
+ pulseaudio --start
W: [pulseaudio] main.c: This program is not intended to be run as root (unless --system is specified).

desktop-test (gecko):
  https://dxr.mozilla.org/mozilla-central/source/testing/docker/desktop-test/bin/test.sh (sudo's to worker, invokes..)
  https://dxr.mozilla.org/mozilla-central/source/testing/taskcluster/scripts/tester/test-linux.sh#49

It's worth futzing with, although from my memory and reviewing what I see above, I think the key difference is whether the pulseaudio daemon is started as root. I don't see anything to match "trying to hand off an actual audio device" in the gaia case.

TBH, not a lot changed when we started running tests as 'worker' (once we started using 'sudo' instead of 'su').  It would be pretty easy to parameterize `entrypoint` so it can run as root or worker depending on a task parameter, then switch jobs over one by one.
I started looking into this yesterday but got distracted; I'm resuming today and seeing what happens if we use desktop-test for the mulet tests right now, just to understand the differences. It might make sense to just use that image if there isn't any major difference, or at the very least duplicate how pulseaudio and xvfb are set up.
I should have a better answer by Monday, but so far this is looking promising, incorporating some of the changes that were made to desktop-test into the tester image. https://treeherder.mozilla.org/#/jobs?repo=try&revision=a549cff4d3fe

Not sure yet the implication for other tests so running that now on try.
Slightly depressing that after I landed attachment 8685932 [details] [diff] [review] on m-c and merged it around everywhere, I see that it didn't actually disable the permaorange tests, but rather disabled some other tests.
Attachment #8685932 - Attachment is obsolete: true
Interesting thing about that comment 29 try run: the Mulet mochitest-3 has the failures from this bug, but the run is marked as passing.

I like it!
Ah, turns out that's the reason why this failure, which should be permaorange, appears to only be failing 90% of the runs. So, only 90% of failing TC runs are actually marked as failing?

I don't like it.
(In reply to Phil Ringnalda (:philor) from comment #26)
> Slightly depressing that after I landed attachment 8685932 [details] [diff] [review]
> on m-c and merged it around everywhere, I see that it didn't actually
> disable the permaorange tests, but rather disabled some other tests.

These are the media-related tests that I enabled in bug 1222478. If this doesn't fix the issue then it's a regression.
Right, it's an infra issue, purely coincidental that it hit just after you enabled those. I just assumed without looking that it was disabling the things that were failing, rather than disabling the things that you had enabled.
No longer blocks: 1222478
(In reply to Phil Ringnalda (:philor) from comment #34)
> Right, it's an infra issue, purely coincidental that it hit just after you
> enabled those, I just assumed without looking that it was disabling the
> things that were failing, rather than disabling the things that you had
> enabled.

I assumed it fixed the problem because of https://treeherder.mozilla.org/#/jobs?repo=pine&revision=a8bb648b532b :/
Funny story: comment 32. Dunno how you got that many of them in a row, but if you click on one of those 3s, and then the Failure summary tab, and then look at the full log, you'll see that there were 11 instances of TEST-UNEXPECTED-FAIL, which were summarized oddly partway through the log as "Failed: 5", but because the summary at the end equally oddly claims "Failed: 0" the run is marked as passing. In the cases where a run is marked as failing, there are typically around 18-20 TEST-UNEXPECTED-FAILs, the partway through summary is still "Failed: 5" but the final summary is "Failed: 7" or "Failed: 9".

Apparently math is hard.
Those may, perhaps, I hope, take care of the weirdness - http://mxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/ipc/test_ipc.html?force=1 runs the tests in http://mxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/ipc/ipc.json, making them look like they are running normally, so it looked like we were running test_dataChannel_basicAudioVideo.html even though http://mxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/mochitest.ini actually says twice over, at the top disabling the entire manifest and again under the individual test, to not run it. My hope is that the meaning of "(nested ipc not working)" in http://mxr.mozilla.org/mozilla-central/source/dom/media/tests/mochitest/ipc/mochitest.ini#9 is "it'll put weird partial summaries in the log and be green even when tests fail" and that all that will go away now, and that disabling the webspeech/recognition tests will get us back to both green and actually not having any unexpected-fails in the green.
And so the only one of those garndt wants to revert while landing https://hg.mozilla.org/try/rev/977ab02f7c27 is https://hg.mozilla.org/mozilla-central/rev/4b04af1dc5f5 - 97297187539c we never want to see reverted, and all mucking about in dom/media/tests/mochitest/mochitest.ini without removing that skip-if at the top which skips every test in the manifest on Mulet is a no-op.
And that one can be reverted, too. Looks like you'll still have a little intermittent orange in other webspeech directories, which is deeply weird because it's intermittent, but I'd like that not to be my problem.
Keywords: leave-open
Whiteboard: [systemsfe] → [systemsfe][tests disabled]
Assignee: anygregor → nobody
This commit [1] in the last try push seems to (I hope!) re-enable the tests that were disabled earlier, and everything ran green (I needed to retrigger an intermittent for a test not touched by this rev).

I can push this up for review if someone wants to review it and we can get it landed, or here is the revision in case someone wants to play with it more. All that changed was moving towards using the desktop-test image for mulet mochitests. All other mulet tests are still using the tester image, as they seemed green already. Ideally, in a follow-up patch, the tests are converted over to one common image (such as desktop-test).


[1] https://hg.mozilla.org/try/rev/6722b0fa5510
Flags: needinfo?(garndt)
Blocks: 1224936
(In reply to Phil Ringnalda (:philor) from comment #36)
> Funny story: comment 32. Dunno how you got that many of them in a row, but
> if you click on one of those 3s, and then the Failure summary tab, and then
> look at the full log, you'll see that there were 11 instances of
> TEST-UNEXPECTED-FAIL, which were summarized oddly partway through the log as
> "Failed: 5", but because the summary at the end equally oddly claims
> "Failed: 0" the run is marked as passing. In the cases where a run is marked
> as failing, there are typically around 18-20 TEST-UNEXPECTED-FAILs, the
> partway through summary is still "Failed: 5" but the final summary is
> "Failed: 7" or "Failed: 9".
> 
> Apparently math is hard.

This is very concerning. Selena, is your team maintaining the parsing of the logs?
Flags: needinfo?(sdeckelmann)
That's comment 39 - that part you did cause, with http://hg.mozilla.org/mozilla-central/rev/520daea77ae0#l12.1, because nested ipc is no more supported on Mulet than it is on any other flavor of b2g, and the way it is not supported is by failing in the most incomprehensible way, causing the test run to appear to end while continuing on, failing but not counting the failure. That part is already fixed, because I undid your changes there, and the only thing that remains for this bug is for someone to tell garndt to land https://hg.mozilla.org/try/rev/6722b0fa5510 because the lack of audio support is now causing Gecko patches to bounce, and then to back out https://hg.mozilla.org/mozilla-central/rev/bc74dbdea094 and then to close this bloody bug before anyone else gets baffled by it.

I hereby elect you to tell him to land it.
Flags: needinfo?(sdeckelmann)
Flags: needinfo?(rjesup)
Flags: needinfo?(anygregor)
I'll submit it for review and flag the necessary people for it.
(In reply to Gregor Wagner [:gwagner] from comment #45)
> (In reply to Phil Ringnalda (:philor) from comment #36)
> > Funny story: comment 32. Dunno how you got that many of them in a row, but
> > if you click on one of those 3s, and then the Failure summary tab, and then
> > look at the full log, you'll see that there were 11 instances of
> > TEST-UNEXPECTED-FAIL, which were summarized oddly partway through the log as
> > "Failed: 5", but because the summary at the end equally oddly claims
> > "Failed: 0" the run is marked as passing. In the cases where a run is marked
> > as failing, there are typically around 18-20 TEST-UNEXPECTED-FAILs, the
> > partway through summary is still "Failed: 5" but the final summary is
> > "Failed: 7" or "Failed: 9".
> > 
> > Apparently math is hard.
> 
> This is very concerning. Selena, is your team maintaining the parsing of the
> logs?

Parsing of the logs happens entirely within treeherder right now.  

Looking at this [1], it picks up failure lines that start with "TEST-UNEXPECTED-", but that's only for displaying them in the failure summary, generating the bug suggestions, and populating the log viewer. It does not change the status of a task.

Taskcluster tasks rely on the exit code of the command that's running.  This harness is returning exit code 0, which is assumed to be successful.  Looking at one of the jobs there seems to be one set of tests that had 5 failures, and the last set had 0, within the same task.  Perhaps something is only taking that last run into account when exiting.  That's my uneducated guess about it.

[1] https://github.com/mozilla/treeherder/blob/master/treeherder/log_parser/parsers.py#L363
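As a rough illustration of the split being described — treeherder's parser collecting `TEST-UNEXPECTED-` lines purely for display, with no effect on task status — here is a hypothetical simplification (the real parser linked above is considerably more involved):

```python
def collect_failure_lines(log_lines):
    """Gather lines for the failure summary / bug suggestions only.
    Nothing here feeds back into the task's pass/fail status, which
    comes solely from the command's exit code."""
    return [line for line in log_lines if "TEST-UNEXPECTED-" in line]

# A tiny sample log shaped like the ones quoted in this bug.
sample_log = [
    "119 INFO Got media stream: video (local)",
    "95 INFO TEST-UNEXPECTED-FAIL | test_dataChannel_basicAudioVideo.html"
    " | Error in test execution: NotFoundError",
]
failures = collect_failure_lines(sample_log)
```

So a task can show failure lines in its summary and still be reported green, which is exactly the mismatch observed in comment 32.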
Bug 1223394 - Use desktop-test image for mochitests to fix audio issues r=gwagner
Attachment #8688624 - Flags: review?(anygregor)
Having only those two layers, treeherder and taskcluster, in the abstraction layer-cake would certainly be simpler than the actual case.

Treeherder reports the job status it was told by taskcluster. Taskcluster sets its job status based on the exit code of the command it runs. That command is python ./mozharness/scripts/desktop_unittest.py, so it's mozharness setting that exit code, based on running mochitest/runtest.py and deciding what happened there. Last time I looked, what mozharness did was look for the last instance of "Failed: \d" in the log from runtest.py on the assumption that "Failed: 0" means success, "Failed: >0" means failure, and no "Failed: \d" means a crash or something equally horrible, so there's no failure in log parsing, the only failure is that the mochitest harness is not hardened against whatever horrible thing is happening to it when a flavor of b2g tries to do nested ipc.

If someone really wanted to chase this down, rather than just resolving never again to enable that test, they would need to start by actually looking at what happens, so they could tell the maintainers of our fork of mochitest what sort of thing (one fun thing you can do to get results a bit like this is to navigate the entire browser Back, rather than navigating inside the iframe where the tests are running) they are not hardened against.
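A minimal sketch of the "last `Failed: N` wins" heuristic described above (function and constant names are hypothetical; the real mozharness logic is more elaborate) shows how a log with a mid-run "Failed: 5" but a final "Failed: 0" comes out green:

```python
import re

def status_from_log(log_text):
    """Last 'Failed: N' wins: 0 means success, >0 means failure,
    and no match at all means a crash or something equally horrible."""
    counts = re.findall(r"Failed:\s*(\d+)", log_text)
    if not counts:
        return "CRASH"
    return "SUCCESS" if int(counts[-1]) == 0 else "FAILURE"

# Shaped like the comment-32 runs: real failures mid-log, then a
# bogus final summary of zero.
GREEN_BUT_FAILING = (
    "TEST-UNEXPECTED-FAIL | test_dataChannel_basicAudioVideo.html | ...\n"
    "Failed: 5\n"
    "TEST-UNEXPECTED-FAIL | test_peerConnection_basicAudio.html | ...\n"
    "Failed: 0\n"
)
```

Here `status_from_log(GREEN_BUT_FAILING)` returns "SUCCESS" despite the TEST-UNEXPECTED-FAIL lines, matching the green-but-failing runs; a final "Failed: 7" would flip it to "FAILURE".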
Comment on attachment 8688624 [details]
MozReview Request: Bug 1223394 - Use desktop-test image for mochitests to fix audio issues r=gwagner

Aus probably knows much better what's going on here.
Flags: needinfo?(anygregor)
Attachment #8688624 - Flags: review?(anygregor) → review?(aus)
Attachment #8688624 - Flags: review?(aus) → review+
Comment on attachment 8688624 [details]
MozReview Request: Bug 1223394 - Use desktop-test image for mochitests to fix audio issues r=gwagner

https://reviewboard.mozilla.org/r/25359/#review22881

This looks good. Just wanted to double check that it was our intention to re-enable the tests for Mulet, if that's indeed the case, we're good to go.
Good question... I think it's OK to enable it? Perhaps I'm wrong about which test was the offender. I don't know much about the mochitest configs. I just saw that it was disabled and thought, if the problem is fixed, it could be enabled? Perhaps someone can answer that before I push.

I was basing that assumption on comment 40. I hope I re-enabled the right thing.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: Taskcluster → Taskcluster Graveyard