Bug 1262336 (Open) · Opened 5 years ago · Updated 3 years ago

Recent surge in Windows plugin mochitest hangs/timeouts

Categories

(Core :: Plug-ins, defect, P3)

Platform: Unspecified / Windows

Tracking


Tracking Status
firefox48 --- affected

People

(Reporter: RyanVM, Unassigned)

References

(Blocks 8 open bugs)

Details

I've noticed lately when looking at Treeherder that Windows has become extremely timeout-prone in plugins tests. It appears to be a combination of new failures and old ones exacerbated (or, more likely, improperly reopened).

Both 32bit and 64bit builds are affected. The problem mainly affects non-e10s runs (though we don't run a lot of Windows e10s tests in full production, so that may not be indicative of much). Many of the stacks have a top frame of either "application crashed [None]" or "[@ KiFastSystemCallRet + 0x0]" (and I see that Bill has a pending ni? request in related-looking bug 1256077). Also, PGO appears to exacerbate the failures, but they're not exclusive to those builds either.

From what I can see looking at the graphs on OrangeFactor, the worst of the spike appears to have started around a week ago (around March 24 or thereabouts). However, bug 1207918 in particular appears to have spiked closer to the beginning of March.

I'm filing this bug so that we can hopefully pinpoint whatever is going on and share the results in one place instead of being spread across a wide number of bugs.

Some example logs:
https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=8348698
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3614480
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3610295
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=25198424
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=3614859
Flags: needinfo?(aklotz)
Depends on: 1262337
Blocks: 1114897
Blocks: 1260202, 1260200
Blocks: 1223243
Blocks: 909899
Blocks: 1261278
Blocks: 1246440
Blocks: 1097617
Blocks: 1006234
Blocks: 1183302
Jim, do you have any suggestions for who might be able to investigate this? As the deps show, this can basically strike any test that happens to use plugins on Windows, so it's pretty widespread. Possibly related to the plugin hang rate we're seeing in the wild these days too? I'll work on getting better test plugin stacks over in bug 1262337 in the meantime.
Flags: needinfo?(jmathies)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Jim, do you have any suggestions for who might be able to investigate this?
> As the deps show, this can basically strike any test that happens to use
> plugins on Windows, so it's pretty widespread. Possibly related to the
> plugin hang rate we're seeing in the wild these days too? I'll work on
> getting better test plugin stacks over in bug 1262337 in the meantime.

Did you have any luck tracking down a regression range?
Flags: needinfo?(jmathies) → needinfo?(ryanvm)
I haven't been able to conclusively narrow it down much past March 22-23, unfortunately. Retriggers on Treeherder have given me a bit of an indication, but not as blatantly obvious as I was hoping for.
Flags: needinfo?(ryanvm)
I'm not really sure what to say here: I haven't touched plugin code in a while now. A regression range would be nice.

qdot has been doing some plugin work lately. Kyle, do you know if anything landed around March 24?
Flags: needinfo?(aklotz) → needinfo?(kyle)
Nothing I'm aware of, I don't have anything I landed that day.
Flags: needinfo?(kyle)
Blocks: 1267022
Blocks: 1267125
Blocks: 1267460
Blocks: 1267533
I see this often in the bugs:
(msgtype=0xAE0008,name=PPluginModule::Msg_PPluginInstanceConstructor) Channel timeout: cannot send/recv

Let me add to this: I have been working on getting the Windows 7 tests running in AWS VMs. Running on different instance types, I see some patterns. We switched from c3.2xlarge to m3.xlarge as an experiment, and the m3.xlarge had a large pile of plugin-related test timeouts, most of which are documented on this bug and follow the channel timeout problem.

While there is a slightly larger disk on the c3.2xlarge, both have the same amount of memory; the m3.xlarge has 4 vCPUs whereas the c3.2xlarge has 8. Our current hardware solution has 1 Intel X3450 CPU.

Is it possible we are CPU-bound somehow? Is the crash reporter an issue here, plugins, or both?
In early March we upgraded to VS2015 for the compiler; that might be partially related.

Regarding the timeout I mentioned above, I cannot find the code it comes from, but it looks to be related to:
https://hg.mozilla.org/mozilla-central/annotate/e5a10bc7dac4ee2453d8319165c1f6578203eac7/dom/plugins/ipc/PluginModuleChild.cpp#l2172
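As a side note, the generic shape of that failure can be sketched outside the tree. This is a toy model, not Mozilla's actual IPC code: a synchronous send blocks on a reply with a deadline, so any message name (Msg_PPluginInstanceConstructor included) can surface the same "Channel timeout: cannot send/recv" text when the peer process is hung. The class name and timeout value here are invented for illustration:

```python
import threading

# Hypothetical value; the real plugin reply deadline is a pref, not a constant.
CHANNEL_TIMEOUT_SECONDS = 0.05

class SyncChannel:
    """Toy model of a synchronous IPC send with a reply deadline.

    Every message sent through the channel shares the same timeout, which is
    why the same 'Channel timeout' text shows up for many different message
    names in the logs.
    """

    def __init__(self, timeout=CHANNEL_TIMEOUT_SECONDS):
        self.timeout = timeout
        self._reply = threading.Event()

    def deliver_reply(self):
        # Called by the peer; never called here, modeling a hung plugin process.
        self._reply.set()

    def send(self, msg_name):
        # Block until the peer replies or the deadline expires.
        if self._reply.wait(self.timeout):
            return True, None
        return False, "(name=%s) Channel timeout: cannot send/recv" % msg_name

chan = SyncChannel()
ok, err = chan.send("PPluginModule::Msg_PPluginInstanceConstructor")
print(ok, err)
```

If slower AWS instances are simply pushing plugin replies past the deadline, the reply-timeout pref (dom.ipc.plugins.timeoutSecs, to the best of my knowledge) would be one knob worth checking.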

I am tempted to put all plugin-related tests into their own job; given how frequently they fail, that job would be hidden by default. We need to find an owner for these and somebody to do this; otherwise we might as well turn off these tests.

:jimm, do you have ideas on how to move forward in debugging this?
Flags: needinfo?(jmathies)
(In reply to Joel Maher (:jmaher) from comment #7)
> In early March we upgraded to VS2015 for the compiler; that might be
> partially related.
> 
> regarding the timeout I mentioned above, I cannot find the code where it
> comes from, but this looks to be related to:

This is an IPC response timeout; it applies generically to all IPDL message sends. There's nothing unique to look at here for plugins.

A couple of comments:

1) I was doing a survey of the first fifteen or so bugs hooked up here and I don't see why some of them are attached. Some haven't had failures reported in a long time. Maybe the failures aren't being reported to the bugs, not sure. I'd love to get a list of the top 15 offenders that involve test timeouts.

2) In cases like this - 

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1207918&startday=2016-04-25&endday=2016-05-01&tree=all

It appears that the main browser thread is caught up in a sync message send, probably to a plugin window in the plugin process. Could we figure out some way to get automation to dump thread stacks for the child processes involved into the logs? It would help if we could see where the other processes are hung up. Generally this would be useful for debugging e10s issues as well.
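As a first step toward that, automation would need to know which PIDs to dump. A minimal sketch of that step (the process table, PID values, and the suggested `wmic process get ProcessId,ParentProcessId` source for the (pid, ppid) pairs are all illustrative assumptions, not existing harness code):

```python
def descendant_pids(root_pid, proc_table):
    """Return all descendant PIDs of root_pid, given (pid, ppid) pairs.

    proc_table: iterable of (pid, ppid) tuples, e.g. parsed from the output
    of 'wmic process get ProcessId,ParentProcessId' on Windows.
    """
    # Index children by parent PID for a simple tree walk.
    children = {}
    for pid, ppid in proc_table:
        children.setdefault(ppid, []).append(pid)
    found, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        for child in children.get(pid, []):
            found.append(child)
            stack.append(child)
    return found

# Toy table: firefox (100) spawned two plugin-container processes (200, 201);
# 201 spawned a helper (300).
table = [(100, 1), (200, 100), (201, 100), (300, 201)]
print(sorted(descendant_pids(100, table)))  # → [200, 201, 300]
```

Each PID found this way could then be handed to whatever stack-dumping tool we settle on, and the dumps attached to the failing job's log.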
Flags: needinfo?(jmathies) → needinfo?(jmaher)
That's one of the big problems with this sort of systemic failure that can hit any of a class of tests: it doesn't mesh well with our utter indifference to low-frequency failures. OrangeFactor Robot only comments in bugs when more than 5 failures per week are starred as being that bug, so you could easily have a failure mode which hits 50 bugs on 50 tests 4 times each per week (which would be our fourth most-common failure) without getting a single bug comment.

The easiest way to not get fooled by our indifference is to go to https://bugzilla.mozilla.org/userprefs.cgi?tab=settings and flip the pref for "When viewing a bug, show its corresponding Orange Factor page" to on, so you get a little graph, the number of failures, and a link to that week's trunk failures on Orange Factor inserted into the bug page up by the flags. Without it, bug 1097617 looks like it was pointlessly reopened; with it, you see both that it has had a microscopic failure rate on 10.6 all along and that it has started failing on Windows once a week since early April.
Thanks :jimm for the response. If there is a programmatic way to get thread stacks, I would be happy to help make that part of the toolchain, especially on failures. With a few dozen thread stacks we could find a pattern or a few bugs and reduce the pain here.
Flags: needinfo?(jmaher)
Blocks: 1272848
I have filed bug 1295977 explicitly about the LdrLoadDll type of hang. I do not think that covers everything that's happening here, so I'm going to leave this one separate although at this point I'm not sure what is actionable here :-(
Depends on: 1301994
Priority: -- → P3