Bug 920952 (Closed): Opened 11 years ago, Closed 11 years ago

Flush metro test harness output to avoid timeouts due to no output

Categories

(Testing :: General, defect)

Platform: x86 Windows 8
Priority: Not set
Severity: critical

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla29

Tracking Status:
firefox27 --- fixed
firefox28 --- fixed
firefox29 --- fixed
firefox-esr24 --- fixed
b2g-v1.2 --- fixed

People

(Reporter: cbook, Assigned: jimm)

References


Details

(Keywords: intermittent-failure, regression)

Attachments

(2 files, 1 obsolete file)

WINNT 6.2 mozilla-inbound pgo test mochitest-metro-chrome on 2013-09-26 00:26:51 PDT for push c3a044164b5f

slave: t-w864-ix-125

timed out after 1000 seconds of no output

https://tbpl.mozilla.org/php/getParsedLog.php?id=28393821&tree=Mozilla-Inbound
Blocks: 915620
So, what happened here was that for a regular mochitest 330-seconds-without-output timeout like comment 15, we hit the normal timeout, then we hit a mozprocess 1000 second timeout, then eventually we hit a 1200-seconds-without-output timeout, which is horrible and stupid and wasteful, and yet still much better than what metro-chrome does.

For metro-chrome, we just completely silently hit the 330 second timeout, logging nothing at all so that bug 915620 thought that it had disappeared, then hit the 1000 second and the 1200 second timeouts.

If I had the slightest idea who did what in what repo somewhere between the 24th and 26th of September, I'd be happily backing them out, because omg, we added *37 minutes* to every timeout on this platform, while completely removing any useful output from one entire test suite. That's *awful*.
Severity: normal → critical
I'm not really sure where the line between harness and automation is anymore to guess who might have done this, but if you're wondering where your win8 capacity has gone, it's gone down the tubes of taking 37 minutes to do a 330 seconds without output timeout.
09/23 was bug 746243, which is sorta plausible.
Blocks: 746243
Keywords: regression
Sounds likely that it is bug 746243. The Summit is distracting developers, but I will have jhammel look at it as soon as he can.
I have been fighting failing wifi and batteries amidst sessions. The "subprocess is not defined" error, as in comment 15, should be fixed by http://hg.mozilla.org/mozilla-central/rev/7d9905b9ba5d . I will look into the brunt of the issue now.
Interesting to note that all occurrences are on opt/pgo builds, none on debug builds.
All of the failures occur during the last test run, at different points.  Could this be a stdout buffering problem?
Opt-only isn't surprising, since we've only had a few non-metro-chrome instances, and we don't run metro-chrome on debug because it's full of fail. Then, "330 seconds without output" from a debug build? They're lucky to go 3.3 seconds without screaming about something totally unrelated, much less 330.

Being the last test to run may well explain the underlying and preexisting metro-chrome failure, but comment 15's mochitest-1 failure was not in the last test in that suite.
comment 15 was the "NameError: global name 'subprocess' is not defined" error, which I believe is fixed now; I believe the only remaining problem is with the metro tests.
(In reply to Jonathan Griffin (:jgriffin) from comment #35)
> All of the failures occur during the last test run, at different points. 
> Could this be a stdout buffering problem?

I wouldn't be surprised at all, given the number of problems recorded with mozprocess output buffering (see bug 924253 for the most recent filed).  Unfortunately, these are difficult to reproduce, difficult to debug, and have hitherto not been either a (big) problem or priority.
I've gone through the various apparatus in an attempt to diagnose what is
going on here.  Take, for example, this log:

https://tbpl.mozilla.org/php/getParsedLog.php?id=28788855&full=1&branch=mozilla-inbound#error0

The last output before the timeouts is at 9:18:23:
"""
09:18:23     INFO -  TEST-INFO |
chrome://mochitests/content/metro/browser/metro/base/tests/mochitest/browser_urlbar_trimURLs.js
| Console message: [JavaScript Error: "The character encoding of the
HTML document was not declared. The document will render with garbled
text in some browser configurations if the document contains
characters from outside the US-ASCII range. The character encoding of
the page must be declared in the document or in the transfer
protocol." {file: "http://example.com/" line: 0}]
"""

There are 3 different timeouts:

* 330s : From runtests.py:
  http://hg.mozilla.org/mozilla-central/file/64b497e6f593/testing/mochitest/runtests.py#l987
  For whatever reason, we do not print to log when we hit this timeout.

* 1000s [@ 9:35:03]: From mozharness:
  http://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/mozharness/base/script.py#l651
  http://hg.mozilla.org/build/mozharness/file/76d08a2d15f6/scripts/desktop_unittest.py#l398

* 1200s [@ 9:55:50]: From buildbot:
"""
========= Started run_script exception (results: 4, elapsed: 43 mins,
15 secs) (at 2013-10-06 09:12:34.393232) =========
'c:/mozilla-build/python27/python' '-u'
'scripts/scripts/desktop_unittest.py' '--cfg'
'unittests/win_unittest.py' '--mochitest-suite'
'mochitest-metro-chrome' '--download-symbols' 'ondemand'
 in dir C:\slave\test\. (timeout 1200 secs) (maxTime 7200 secs)
"""

----

Q: Why isn't the 330s timeout recorded?

  The previous incarnation:
  http://hg.mozilla.org/mozilla-central/file/64b497e6f593/build/automation.py.in#l767

  The current incarnation:
  http://hg.mozilla.org/mozilla-central/file/64b497e6f593/testing/mochitest/runtests.py#l897
  http://hg.mozilla.org/mozilla-central/file/64b497e6f593/testing/mochitest/runtests.py#l1028

  Mozharness invokes the command

"""
09:15:32     INFO - Running command:
['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u',
'C:\\slave\\test\\build\\tests\\mochitest/runtests.py',
'--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe',
'--utility-path=tests/bin',
'--extra-profile-file=tests/bin/plugins',
'--symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1381064449/firefox-27.0a1.en-US.win32.crashreporter-symbols.zip',
'--certificate-path=tests/certs',
'--autorun',
'--close-when-done',
'--console-level=INFO',
'--browser-chrome',
'--metro-immersive'] in C:\slave\test\build
"""

  Why the log.info line doesn't get printed, I do not know.

----

Other interesting tidbits:

"""
09:15:32     INFO - Calling
['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u',
'C:\\slave\\test\\build\\tests\\mochitest/runtests.py',
'--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe',
'--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins',
'--symbols-path=http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32-pgo/1381064449/firefox-27.0a1.en-US.win32.crashreporter-symbols.zip',
'--certificate-path=tests/certs', '--autorun', '--close-when-done',
'--console-level=INFO', '--browser-chrome', '--metro-immersive'] with
output_timeout 1000
ProcessManager NOT managing child processes
ProcessManager UNABLE to use job objects to manage child processes
09:15:34     INFO -  ProcessManager UNABLE to use job objects to
manage child processes
09:15:34     INFO -  ProcessManager NOT managing child processes
"""

This is known, but bad.  Due to a whole slew of complex issues,
including how Twisted's native process handling works as well as our
patch for it, mozprocess cannot manage subprocesses via job objects on Windows.

It is also worth noting that AFAIK mozprocess has not been rigorously
tested on metro.
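
For anyone wondering what "managing child processes via job objects" buys us, here is a minimal Win32 sketch (not mozprocess's actual code; the child command line is a placeholder): processes placed in a job can be torn down as a group, and CREATE_BREAKAWAY_FROM_JOB is exactly the part that can fail when the parent is already confined to a job, which is one way to end up with the "UNABLE to use job objects" message above.

"""
#include <windows.h>
#include <cstdio>

int main() {
    HANDLE job = CreateJobObjectW(nullptr, nullptr);
    JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
    // Kill every process in the job when the last handle to the job closes.
    limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE;
    SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                            &limits, sizeof(limits));

    STARTUPINFOW si = { sizeof(si) };
    PROCESS_INFORMATION pi = {};
    wchar_t cmd[] = L"notepad.exe";  // placeholder child process
    // If the parent (say, buildbot under twistd) already lives in a job that
    // forbids breakaway, this call fails and the harness has to fall back to
    // not managing children at all; the log line
    // "ProcessManager UNABLE to use job objects to manage child processes"
    // corresponds to that kind of failure.
    if (!CreateProcessW(nullptr, cmd, nullptr, nullptr, FALSE,
                        CREATE_SUSPENDED | CREATE_BREAKAWAY_FROM_JOB,
                        nullptr, nullptr, &si, &pi)) {
        fprintf(stderr, "could not create breakaway child: %lu\n", GetLastError());
        CloseHandle(job);
        return 1;
    }
    AssignProcessToJobObject(job, pi.hProcess);
    ResumeThread(pi.hThread);
    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    CloseHandle(job);  // with KILL_ON_JOB_CLOSE set, this kills the whole tree
    return 0;
}
"""
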
> ProcessManager NOT managing child processes
> ProcessManager UNABLE to use job objects to manage child processes
> 09:15:34     INFO -  ProcessManager UNABLE to use job objects to
> manage child processes
> 09:15:34     INFO -  ProcessManager NOT managing child processes

Looking at the logs, this only seems to occur with Win8.  Also, I can't reproduce it locally with a Win8 VM, but that may be due to differences in execution...I'm running things from a console, and that may not be true in buildbot.
>  Why the log.info line doesn't get printed....I do not know.

We have a similar issue with B2G tests, bug 904839.
(In reply to Jonathan Griffin (:jgriffin) from comment #40)
> > ProcessManager NOT managing child processes
> > ProcessManager UNABLE to use job objects to manage child processes
> > 09:15:34     INFO -  ProcessManager UNABLE to use job objects to
> > manage child processes
> > 09:15:34     INFO -  ProcessManager NOT managing child processes
> 
> Looking at the logs, this only seems to occur with Win8.  Also, I can't
> reproduce it locally with a Win8 VM, but that may be due to differences in
> execution...I'm running things from a console, and that may not be true in
> buildbot.

Buildbot (actually twistd) will run scripts so that it can manage the breakaway processes and close them. We modified our scripts for this some time ago, and I don't believe we have this issue on other Windows platforms when running under buildbot. So, unless Win8 changed the way things work under the covers of the CreateProcess API, we should still be good. But if this is happening due to the interaction with the way twistd starts processes, you won't be able to repro this in any way except on a slave where the process has been started by buildbot.  

What is our current status here? Anyone have ideas on how to debug this? My only thought would be to have access to the sendchange script and send a change to the buildbot slave over and over until the thing hangs and then investigate why it hung.  The easiest way to do that is probably by sending a super-intensive logging patch to try server and retriggering the sh*t out of it.
I am hoping that the work I've done for bug 904839 will be applicable here as well.

And yes, I think the best way to test is to add some instrumentation to a patch and retrigger it on a try job many times.
I suspect the root cause of this bug is the screen saver mode.  We (jlund and I) determined in Talos that the Win8 boxes have a screen saver that kicks in after 15 minutes of inactivity.  If the browser is automated (metro mode only?) and the run takes more than 15 minutes, the browser process will be suspended and the Python script will time out.  We would run into all types of timeout problems, and after changing the screen shut-off timer from 15 minutes to 120 minutes (just picked a number) our tests have continued to run reliably.

This might not be the problem here, but we really should make this change and see if this problem goes away.
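
For reference, and not the fix proposed above (that proposal is to change the slaves' screen shut-off timer), a harness could also hold the display awake programmatically for the length of a run; this sketch just shows the standard Win32 call for that:

"""
#include <windows.h>

int main() {
    // Ask Windows not to blank the display or sleep while this thread lives.
    SetThreadExecutionState(ES_CONTINUOUS | ES_DISPLAY_REQUIRED | ES_SYSTEM_REQUIRED);

    // ... run the test suite ...

    // Drop the request once the run finishes.
    SetThreadExecutionState(ES_CONTINUOUS);
    return 0;
}
"""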
Depends on: 929473
(In reply to Joel Maher (:jmaher) from comment #77)
> I suspect the root cause of this bug is the screen saver mode.

Thank you - filed bug 929473 so we can rule it out either way :-)
Whiteboard: [release28]
It looks like the screen saver mode wasn't the culprit here. Also bug 904839 has been marked fixed but this is still happening. Do you know what the next steps here are?
Flags: needinfo?(jgriffin)
I suspect there's some buffering issue at work here; since this only happens in the metro test suite, I've wondered if adding a flush() here would fix the problem:  

http://mxr.mozilla.org/mozilla-central/source/browser/metro/shell/testing/metrotestharness.cpp#202
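
Concretely, the kind of change being suggested looks something like this, as a sketch only, assuming the harness logs through a printf-style Log() helper (the real Log() in metrotestharness.cpp may be shaped differently): flush stdout after every line so the Python side sees output immediately instead of whenever a full stdio buffer happens to drain.

"""
#include <cstdarg>
#include <cstdio>

static void Log(const char* fmt, ...) {
    va_list args;
    va_start(args, fmt);
    vprintf(fmt, args);
    va_end(args);
    printf("\n");
    fflush(stdout);  // the proposed change: never leave log lines buffered
}

int main() {
    Log("INFO | metrotestharness.exe | %s", "Waiting on child process...");
    return 0;
}
"""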
Flags: needinfo?(jgriffin)
Well, we have multiple bugs (bug 914344 and bug 892990 leap to mind) where we have the same pattern of "330 seconds without output" "1000 seconds of no output" "1200 seconds without output" before finally managing to "kill" the job by disconnecting the slave and turning the job purple rather than the orange it should have been 2200 seconds earlier.

I'd agree that eating the 330 second timeout's message is metro-chrome-only, but that failure to log is probably the smallest part of the problem here, compared to the way we waste 2200 seconds by failing to kill after 330 seconds and failing to kill yet again after another 1000 seconds, the way we fail to provide a stack for the hang because we didn't ever manage to kill the browser, the way we set the job's result to look like infra, and the way we (I think) fail to reboot the slave, every time we have a test hang.
Depends on: 942662
See Also: → 942992
Reproduced on try: https://tbpl.mozilla.org/php/getParsedLog.php?id=31104304&tree=Try&full=1#error0 ; unfortunately, none of the diagnostics I added are particularly insightful.

We are using the FirefoxRunner from
https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/local.py#L208 ;

20:30:47     INFO -  Using mozrunner FirefoxRunner
20:30:47     INFO -  ProcessManager UNABLE to use job objects to
manage child processes
20:30:47     INFO -  ProcessManager NOT managing child processes
20:30:47     INFO -  INFO | runtests.py | Application pid: 3816
20:30:47     INFO -  INFO | metrotestharness.exe | Launching
browser...
20:30:47     INFO -  INFO | metrotestharness.exe | App model
id='E4CFE2E6B75AA3A3'
20:30:47     INFO -  INFO | metrotestharness.exe | Harness process id:
3816
20:30:47     INFO -  INFO | metrotestharness.exe | Using bin path:
'C:\slave\test\build\application\firefox\firefox.exe'
20:30:47     INFO -  INFO | metrotestharness.exe | Writing out
tests.ini to: 'C:\slave\test\build\application\firefox\tests.ini'
20:30:47     INFO -  INFO | metrotestharness.exe | Browser command
line args: 'C:\slave\test\build\application\firefox\firefox.exe
about:blank -foreground -profile
c:\users\cltbld~1.t-w\appdata\local\temp\tmp3awdu_'
20:30:49     INFO -  INFO | metrotestharness.exe | Activation
succeeded.
20:30:49     INFO -  INFO | runtests.py | metro browser sub process id
detected: 2476
20:30:49     INFO -  INFO | metrotestharness.exe |
METRO_BROWSER_PROCESS=2476
20:30:49     INFO -  INFO | metrotestharness.exe | Waiting on child
process...
20:30:53     INFO -  XRE_MetroCoreApplicationRun: IsMainThread:0 ThreadId:668

This uses the in-tree metrotestharness.exe:

09:49:44     INFO -  ['C:\\slave\\test\\build\\tests\\bin\\metrotestharness.exe', '-firefoxpath', 'C:\\slave\\test\\build\\application\\firefox\\firefox.exe', 'about:blank', '-foreground', '-profile', 'c:\\users\\cltbld~1.t-w\\appdata\\local\\temp\\tmpssyifs']

I'll try adding a flush to metrotestharness.cpp; even if this works, however, it won't really resolve any of the mystery.
(In reply to Jeff Hammel [:jhammel] from comment #192)
> I'll try adding a flush to metrotestharness.cpp; even if this works,
> however, it won't really resolve any of the mystery.

flush should be done here: http://hg.mozilla.org/mozilla-central/file/34f431863037/browser/metro/shell/testing/metrotestharness.cpp#l201
Attached patch fflush(stdout); (obsolete) — Splinter Review
pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=12b9eb302672

will retrigger a bunch of test jobs
Whiteboard: [release28]
(In reply to Jeff Hammel [:jhammel] from comment #194)
> Created attachment 8338742 [details] [diff] [review]
> fflush(stdout);
> 
> pushed to try: https://tbpl.mozilla.org/?tree=Try&rev=12b9eb302672
> 
> will retrigger a bunch of test jobs

Did we give up on this? I see a bunch of retriggers, but none with this failure. Should we repush and try again?
I wonder if there's some way we can trigger a stack dump when this hang is detected? We added support for getting the right browser stacks in bug 905628.
(In reply to Jim Mathies [:jimm] from comment #229)
> Did we give up on this? I see a bunch of retriggers, but none with this
> failure. Should we repush and try again?

Perhaps this actually fixes the problem.  I've requested 10 more retriggers; if this problem still doesn't show up I'd like to land this patch and see if it goes away. :)
(In reply to Jonathan Griffin (:jgriffin) from comment #236)
> Perhaps this actually fixes the problem.  I've requested 10 more retriggers;
> if this problem still doesn't show up I'd like to land this patch and see if
> it goes away. :)

Builds were gone. :/
Note, if this does fix the problem, we should generate a build and update the test harness bin in talos as well - 

http://mxr.mozilla.org/build/source/talos/metro/
Yours is better in that you actually submitted your bug comment, but mine - https://tbpl.mozilla.org/?tree=Try&rev=e5957a807ffb - is better in that I already retriggered it like a monkey who just figured out exactly why those food pellets are landing in the bowl.
(In reply to Phil Ringnalda (:philor) from comment #248)
> Yours is better in that you actually submitted your bug comment, but mine -
> https://tbpl.mozilla.org/?tree=Try&rev=e5957a807ffb - is better in that I
> already retriggered it like a monkey who just figured out exactly why those
> food pellets are landing in the bowl.

heh, I just did the same on mine before I got this bug post. We should have a great data set to look at once all these jobs finish.
(In reply to Phil Ringnalda (:philor) from comment #248)
> Yours is better in that you actually submitted your bug comment, but mine -
> https://tbpl.mozilla.org/?tree=Try&rev=e5957a807ffb - is better in that I
> already retriggered it like a monkey who just figured out exactly why those
> food pellets are landing in the bowl.

So this is interesting: we're getting stacks now, which is great. Looks like we're hanging in thread manager shutdown. My guess would be that it's OMTC related. Can we go ahead and get the patch here landed? We can file a follow-up on the hang.
Assignee: nobody → jmathies
Attached patch patch — Splinter Review
Not sure what happened to Jeff; his email has gone bad. Taking this to get it landed.
Attachment #8338742 - Attachment is obsolete: true
Attachment #8348007 - Flags: review?(netzen)
Attachment #8348007 - Flags: review?(netzen) → review+
https://hg.mozilla.org/mozilla-central/rev/40ca2354d194
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Jordan, we should get this change into the cached talos metrotestharness.exe.
Flags: needinfo?(jlund)
This updates the exe in our talos repo to reflect recent metrotestharness.exe changes.

It also updates the txt file we use to track changes.
Attachment #8355100 - Flags: review?(jmathies)
Flags: needinfo?(jlund)
Attachment #8355100 - Flags: review?(jmathies) → review+
(Morphing summary to avoid false positives with TBPL bug suggestions)
Summary: Intermittent Win8 - timed out after 1000 seconds of no output → Flush metro test harness output to avoid timeouts due to no output