Closed Bug 994877 Opened 7 years ago Closed 6 years ago

Debug mochitest-1 nearly perma-fail in media mochitests

Categories

(Core :: Audio/Video, defect)

defect
Not set
critical

RESOLVED WORKSFORME

People

(Reporter: RyanVM, Assigned: jwwang)

References

(Depends on 1 open bug)

Attachments

(2 files)

In addition to the frequent leaks reported in bug 994289, since the end of last week, OSX 10.6 debug mochitest-1 has been nearly perma-fail in mochitest, primarily under test_seek.html, test_bug495145.html, and test_replay_metadata.html.

The spike is very visible in bugs like bug 762774 and bug 684173. We need this investigated ASAP or we will have to resort to mass test disablings.
Flags: needinfo?(cpearce)
OrangeFactor suggests that this started around April 6 or 7 PDT 

hg log content/media/ -d ">Apr 4" outputs:

changeset:   177651:26d87e24848b
user:        Chris Pearce <cpearce@mozilla.com>
date:        Wed Apr 09 16:45:32 2014 +1200
summary:     Bug 993003 - Ensure we abort media load if IMFSourceReader creation fails. r=padenot

changeset:   177644:c333abd5318d
user:        Kyle Huey <khuey@kylehuey.com>
date:        Tue Apr 08 17:26:33 2014 -0700
summary:     Back out bug 991812 for bustage on a CLOSED TREE. r=me

changeset:   177639:88ee33546b3a
user:        Kyle Huey <khuey@kylehuey.com>
date:        Tue Apr 08 16:37:05 2014 -0700
summary:     Bug 991812: Remove uses of RefCounted in code that lives solely in Gecko. r=ehsan

changeset:   177628:de7487db16d9
user:        Boris Zbarsky <bzbarsky@mit.edu>
date:        Tue Apr 08 18:27:18 2014 -0400
summary:     Bug 991742 part 8.  Remove the "aScope" argument of WebIDL/nsWrapperCache WrapObject() methods.  r=bholley

changeset:   177626:c438f7b1d1b5
user:        Boris Zbarsky <bzbarsky@mit.edu>
date:        Tue Apr 08 18:27:17 2014 -0400
summary:     Bug 991742 part 6.  Remove the "aScope" argument of binding Wrap() methods.  r=bholley

changeset:   177534:57d7504371af
user:        Gabriele Svelto <gsvelto@mozilla.com>
date:        Mon Apr 07 13:20:57 2014 +0200
summary:     Bug 988760 - Account extra time since blocking correctly. r=karlt

changeset:   177353:a201e70b790e
user:        Peter Van der Beken <peterv@propagandism.org>
date:        Mon Apr 07 22:18:53 2014 +0200
summary:     Back out 75c95dac7fe0 (bug 984497) and f1b0d3d13755 (bug 990475) to fix bustage on a CLOSED TREE.

changeset:   177345:d5b0e9e6a849
user:        Brian Hackett <bhackett1024@gmail.com>
date:        Mon Apr 07 13:04:37 2014 -0700
summary:     Bug 987508 - Create array buffers lazily for small typed arrays, r=sfink.

changeset:   177342:8b87a6adad14
user:        Ryan VanderMeulen <ryanvm@gmail.com>
date:        Mon Apr 07 15:49:48 2014 -0400
summary:     Backed out changeset e35851f07b67 (bug 987508) for non-unified bustage.

changeset:   177339:423df46d8d57
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:42:01 2014 -0400
summary:     Backed out changeset 974c4db3003e (bug 818822)

changeset:   177338:670cb6d1750a
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:40:55 2014 -0400
summary:     Backed out changeset 5349ecd9c313 (bug 818822)

changeset:   177336:5d7494ed030d
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:37:56 2014 -0400
summary:     Backed out changeset 87f437be7de5 (bug 982490)

changeset:   177333:3ae7d42531c7
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:37:52 2014 -0400
summary:     Backed out changeset e3664615ecbf (bug 694814)

changeset:   177332:20aea86b3432
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:37:51 2014 -0400
summary:     Backed out changeset 74e5c32c6fa2 (bug 694814)

changeset:   177331:63be52cd09c5
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:37:50 2014 -0400
summary:     Backed out changeset 6dc08e9fc7e8 (bug 694814)

changeset:   177329:206169eef995
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 15:37:48 2014 -0400
summary:     Backed out changeset daf5df0306b2 (bug 985714)

changeset:   177322:e35851f07b67
user:        Brian Hackett <bhackett1024@gmail.com>
date:        Mon Apr 07 11:46:54 2014 -0700
summary:     Bug 987508 - Create array buffers lazily for small typed arrays, r=sfink.

changeset:   177316:0cb71c012f85
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 13:50:28 2014 -0400
summary:     Bug 991504 - Temporary assertion removal to fix bustage in AudioSegment r=jesup

changeset:   177288:974c4db3003e
user:        Randell Jesup <rjesup@jesup.org>
date:        Mon Apr 07 08:48:24 2014 -0400
summary:     Bug 818822: Reduce fake audio/video rates on b2g debug only to avoid overloading mochitest emulator VMs r=padenot

changeset:   177266:e31ba8d051be
user:        Matt Woodrow <mwoodrow@mozilla.com>
date:        Mon Apr 07 15:17:41 2014 +1200
summary:     Bug 904890 - Part 4: Enable hardware accelerated video decoding for OMTC+D3D9/11. r=cpearce

changeset:   177259:814f77d08ee7
user:        Matt Woodrow <mwoodrow@mozilla.com>
date:        Mon Apr 07 13:32:49 2014 +1200
summary:     Bug 991028 - Remove deprecated IPDL SurfaceDescriptor types. r=nical

changeset:   177229:2579095d0f7e
user:        Phil Ringnalda <philringnalda@gmail.com>
date:        Sun Apr 06 21:21:38 2014 -0700
summary:     Backed out 4 changesets (bug 991028) for nonunified bustage

changeset:   177225:147581a518c3
user:        Matt Woodrow <mwoodrow@mozilla.com>
date:        Mon Apr 07 13:32:49 2014 +1200
summary:     Bug 991028 - Remove deprecated IPDL SurfaceDescriptor types. r=nical

changeset:   177108:fcd79d6f4a7e
user:        Ed Morley <emorley@mozilla.com>
date:        Fri Apr 04 16:32:19 2014 +0100
summary:     Backed out changeset 2ac8fe9a90c5 (bug 948269) for timeouts in gaia-integration tests; CLOSED TREE

changeset:   177107:b327711444ed
user:        Ed Morley <emorley@mozilla.com>
date:        Fri Apr 04 16:31:44 2014 +0100
summary:     Backed out changeset e00d10064639 (bug 948269)

changeset:   177060:e00d10064639
user:        Matthew Gregan <kinetik@flim.org>
date:        Fri Apr 04 15:31:10 2014 +1300
summary:     Bug 948269 - Remove incorrect assertion from AudioSink::Drain.  r=cpearce

changeset:   177054:5fb973d5e276
user:        Neil Rashbrook <neil@parkwaycc.co.uk>
date:        Thu Apr 03 23:06:26 2014 +0100
summary:     Bug 514280 Only use nsCOMPtr for interfaces r=bsmedberg

changeset:   177052:904297de3d1e
user:        Chris Pearce <cpearce@mozilla.com>
date:        Fri Apr 04 10:39:42 2014 +1300
summary:     Bug 986947 - Make MP3 contained in MP4 playback again on Windows with WMF backend. r=padenot

changeset:   177051:9c208ea4d63c
user:        Chris Pearce <cpearce@mozilla.com>
date:        Fri Apr 04 10:39:15 2014 +1300
summary:     Bug 991448 - Skip Theora decode to next keyframe after seek, so that we don't get visual artifacts after a fastSeek. r=cajbir


The only thing that stands out is Bug 991448, but it merged to m-c about a day earlier than the spike started, so I'm hesitant to declare it the cause.

Jwwang, are you able to take this?
Flags: needinfo?(cpearce) → needinfo?(jwwang)
test_seek.html might be related to Bug 995090. I am still debugging test_seek.html.
Assignee: nobody → jwwang
Flags: needinfo?(jwwang)
Depends on: 995090
This is really a cross-platform issue. Failure rates on media mochitests (timeouts, shutdown hangs/leaks, etc.) are currently extremely high; I've heard it ballparked around 40%. Where do we stand on investigating here? I don't want to start indiscriminately disabling tests, but this is having a significantly negative impact on our overall failure rates.
OS: Mac OS X → All
Hardware: x86_64 → All
Summary: OSX 10.6 debug mochitest-1 nearly perma-fail in media mochitests → Debug mochitest-1 nearly perma-fail in media mochitests
In case JW doesn't notice your question here, ni jw here.
Flags: needinfo?(jwwang)
We have 2 bugs here that could cause timeouts:
1. Bug 995090
2. Sometimes timer callbacks fail to fire and leave the MediaDecoderStateMachine stuck; I am still investigating this.

For 1, the bug could be hard to fix given the current design of MediaResource. The cloned ChannelMediaResource doesn't have its own channel and depends on the cached data downloaded by the original ChannelMediaResource. If the original ChannelMediaResource is destroyed before the download completes, there is no way for the cloned ChannelMediaResource to acquire new data. If we create a new channel for the cloned ChannelMediaResource, it will defeat the purpose of resource caching and break some test cases. Moreover, if the cloned ChannelMediaResource seeks to a position where data is not present, there is no way to notify the original ChannelMediaResource to download the requested data.
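To make the failure mode in (1) concrete, here is a toy Python model of it. This is not Gecko code: `SharedCache`, `OriginalResource`, and `ClonedResource` are hypothetical stand-ins for the media cache and the two ChannelMediaResource roles, and the byte counts are made up. It only illustrates that a clone with no channel of its own stalls once the original goes away mid-download.

```python
class SharedCache:
    """Hypothetical stand-in for the cache shared by a resource and its clones."""

    def __init__(self, total_size):
        self.total_size = total_size
        self.data = bytearray()


class OriginalResource:
    """Owns the only channel, so only it can add data to the cache."""

    def __init__(self, cache):
        self.cache = cache
        self.channel_open = True

    def download(self, nbytes):
        # Once the channel is gone, nothing more can be fetched.
        if not self.channel_open:
            return 0
        remaining = self.cache.total_size - len(self.cache.data)
        got = min(nbytes, remaining)
        self.cache.data.extend(b"x" * got)
        return got

    def close(self):
        self.channel_open = False


class ClonedResource:
    """Has no channel of its own; it can only read what is already cached."""

    def __init__(self, original):
        self.cache = original.cache

    def read(self, offset, length):
        # May return fewer bytes than requested, or none at all.
        return bytes(self.cache.data[offset:offset + length])


cache = SharedCache(total_size=100)
original = OriginalResource(cache)
clone = ClonedResource(original)

original.download(40)
original.close()                  # destroyed before the download completes

cached_part = clone.read(0, 40)   # data that was already cached is readable
stalled_part = clone.read(40, 10) # past the cached range: empty, and it stays empty
```

In this model the clone has no way to ask for bytes 40 onward, which corresponds to the seek case described above as well.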

For 2, it looks like a bug in our nsITimer implementation, which I am afraid could affect the whole system.

I can try to find a workaround for the test failures, but (2) really concerns me and is worth investigating further.

Hi Chris, can you share your opinion about (1)? I could be wrong, since I am not very familiar with MediaResource.
Flags: needinfo?(jwwang) → needinfo?(cpearce)
Can we keep a count on the ChannelMediaResource of the number of clones, and only destroy the ChannelMediaResource when it reaches 0?
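A rough sketch of that counting idea, written as a Python model rather than Gecko C++. The names `CountedResource`, `CloneHandle`, and `channel_open` are hypothetical and chosen only for illustration; the real ChannelMediaResource lifetime rules may well be more involved.

```python
class CountedResource:
    """Hypothetical original resource that tracks how many clones depend on it."""

    def __init__(self):
        self.clone_count = 0
        self.close_requested = False
        self.channel_open = True

    def clone(self):
        self.clone_count += 1
        return CloneHandle(self)

    def close(self):
        # The owner is done, but teardown waits until no clones remain.
        self.close_requested = True
        self._maybe_destroy()

    def _release_clone(self):
        self.clone_count -= 1
        self._maybe_destroy()

    def _maybe_destroy(self):
        if self.close_requested and self.clone_count == 0:
            self.channel_open = False


class CloneHandle:
    """Hypothetical clone; releases its hold on the original exactly once."""

    def __init__(self, original):
        self._original = original
        self._closed = False

    def close(self):
        if not self._closed:
            self._closed = True
            self._original._release_clone()


resource = CountedResource()
first = resource.clone()
second = resource.clone()

resource.close()
open_with_two_clones = resource.channel_open   # True: two clones still alive

first.close()
first.close()                                  # double close is ignored
open_with_one_clone = resource.channel_open    # True: one clone still alive

second.close()
open_with_no_clones = resource.channel_open    # False: last clone released
```

The design choice here is that closing the original only records the request; the channel is actually torn down by whichever call brings the clone count to zero.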

Roc wrote the MediaResource, so he may have something to say too.
Flags: needinfo?(cpearce) → needinfo?(roc)
Let's discuss that in bug 995090.
Flags: needinfo?(roc)
Disable resource cloning for some test cases that fail due to Bug 995090.
Attachment #8407499 - Flags: review?(cpearce)
Workaround for timer callbacks with timeout == 0 sometimes not firing.
Attachment #8407500 - Flags: review?(cpearce)
try: https://tbpl.mozilla.org/?tree=Try&rev=a89ab19dfdea

No test_seek.html or test_bug495145.html timeouts on OSX 10.6 debug in 50 runs.
If a timer with timeout == 0 isn't firing, that's a bug, a serious bug, and we need to fix it and any fallout, and not wallpaper it or force everyone to 0-check their timer starts.

Please spin off a bug on that and CC/needinfo bsmedberg, ehsan, and bz (and me). I'm sure there are others, but that's a start.
> 2. sometimes timer callbacks fail to fire and cause the MediaDecoderStateMachine stuck which I am still investigating

Please try adding something that logs timers with 0 timeouts (before actually starting them) and logs when they fire.  Then we can see if they ever fail to do so, or if it's some other problem.
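As an illustration of the kind of logging meant here, the following Python sketch wraps timer creation so that every zero-delay timer is logged when it is started and again when it fires, leaving any that never fire in a `pending` table. It uses `asyncio` purely as a stand-in event loop, not nsITimer; `ZeroTimerAudit` and the labels are hypothetical, and the missed timer is simulated by cancelling its handle.

```python
import asyncio
import itertools
import logging

log = logging.getLogger("zero_timers")


class ZeroTimerAudit:
    """Hypothetical helper: records zero-delay timers at start and at fire time."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = {}   # timer id -> label, for timers started but not yet fired
        self.fired = []     # timer ids in the order they fired

    def start(self, loop, delay, callback, label):
        if delay != 0:
            # Non-zero timers are passed through untouched.
            return loop.call_later(delay, callback)

        timer_id = next(self._ids)
        self.pending[timer_id] = label
        log.info("start zero timer %d (%s)", timer_id, label)

        def wrapped():
            log.info("fired zero timer %d (%s)", timer_id, label)
            del self.pending[timer_id]
            self.fired.append(timer_id)
            callback()

        return loop.call_later(0, wrapped)


async def demo():
    audit = ZeroTimerAudit()
    loop = asyncio.get_running_loop()
    hits = []

    audit.start(loop, 0, lambda: hits.append("a"), "state-machine")
    handle = audit.start(loop, 0, lambda: hits.append("b"), "cancelled")
    handle.cancel()             # simulates a timer that never fires

    await asyncio.sleep(0.05)   # give the loop time to run due timers
    return audit, hits


audit, hits = asyncio.run(demo())
# Anything left in audit.pending was started but never fired.
```

If `pending` stays empty across a failing run, the stuck state machine is more likely caused by something other than a lost timer callback.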
Comment on attachment 8407499 [details] [diff] [review]
part1_disable_resource_clone.patch

Review of attachment 8407499 [details] [diff] [review]:
-----------------------------------------------------------------

Let's try and fix the underlying issue in bug 995090. We can use this patch if we really need to.

Roc should review your patch for bug 995090.
Attachment #8407499 - Flags: review?(cpearce)
Comment on attachment 8407500 [details] [diff] [review]
part2_dont_schedule_timeout_0.patch

Review of attachment 8407500 [details] [diff] [review]:
-----------------------------------------------------------------

I agree with Jesup, a 0 timer should still work, and we should figure out why. This could cause other bugs too.
Attachment #8407500 - Flags: review?(cpearce)
Depends on: 997844
Depends on: 998168
This has been fixed in other bugs.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME