Closed Bug 647126 Opened 11 years ago Closed 10 years ago

jetpack-win7 build fails tests due to test timeout

Categories

(Add-on SDK Graveyard :: General, defect, P1)

x86
Windows 7
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: myk, Assigned: myk)

References

Details

Attachments

(9 files, 3 obsolete files)

549 bytes, patch; aki: review+
522 bytes, patch; myk: review+
4.42 KB, patch; myk: review+
953 bytes, patch; myk: review+
704 bytes, patch; myk: review+
761 bytes, patch; myk: review+
41.71 KB, image/png
3.51 KB, patch; warner: review+
251.70 KB, text/plain
After running some tests, the jetpack-win7 build fails tests because of a test timeout of unknown provenance.

http://tinderbox.mozilla.org/showlog.cgi?log=Jetpack/1300863114.1300863502.13823.gz&fulltext=1

Traceback (most recent call last):
  File "bin/cfx", line 29, in <module>
    cuddlefish.run()
  File "c:\talos-slave\jetpack-win7\jetpack-poller\addon-sdk\python-lib\cuddlefish\__init__.py", line 440, in run
    test_all(env_root, defaults=options.__dict__)
  File "c:\talos-slave\jetpack-win7\jetpack-poller\addon-sdk\python-lib\cuddlefish\__init__.py", line 280, in test_all
    test_all_packages(env_root, defaults)
  File "c:\talos-slave\jetpack-win7\jetpack-poller\addon-sdk\python-lib\cuddlefish\__init__.py", line 330, in test_all_packages
    defaults=defaults)
  File "c:\talos-slave\jetpack-win7\jetpack-poller\addon-sdk\python-lib\cuddlefish\__init__.py", line 695, in run
    addons=options.addons)
  File "c:\talos-slave\jetpack-win7\jetpack-poller\addon-sdk\python-lib\cuddlefish\runner.py", line 288, in run_app
    timeout)
Exception: Wait timeout exceeded (300s)
program finished with exit code 1

It isn't clear why this is happening.  It seems likely to be a bug in the SDK rather than the build scripts.  Diagnosing it may require releng to give a Jetpack developer access to a Windows 7 buildslave.  However, the first step to diagnosing it is to enable verbose logging by adding the --verbose flag to the "cfx testall" command on line 78 of run-jetpack.sh:

http://mxr.mozilla.org/build/source/tools/buildfarm/utils/run_jetpack.sh#78

Thus I am filing it in the Release Engineering component for now.
Attached patch add --verbose to help debugging (obsolete) — Splinter Review
from irc with myk and ctalbert:

using --verbose will help verify if the problem is this test hanging, or the next test failing to start.
Attachment #523654 - Flags: review?(aki)
Comment on attachment 523654 [details] [diff] [review]
add --verbose to help debugging

http://hg.mozilla.org/build/tools/rev/dbfbaa9e4554
Attachment #523654 - Flags: review?(aki)
Attachment #523654 - Flags: review+
Attachment #523654 - Flags: checked-in+
Further IRC from ctalbert; myk agrees that ctalbert's suggestion is better.
Attachment #523654 - Attachment is obsolete: true
Attachment #523659 - Flags: review?(aki)
Attachment #523659 - Flags: checked-in?
(In reply to comment #4)
> After running some tests, the jetpack-win7 build fails tests because of a test
> timeout of unknown provenance.

> It isn't clear why this is happening.  It seems likely to be a bug in the SDK
> rather than the build scripts.  Diagnosing it may require releng to give a
> Jetpack developer access to a Windows 7 buildslave.  However, the first step to
> diagnosing it is to enable verbose logging by adding the --verbose flag to the
> "cfx testall" command on line 78 of run-jetpack.sh:
> http://mxr.mozilla.org/build/source/tools/buildfarm/utils/run_jetpack.sh#78

> Thus I am filing it in the Release Engineering component for now.
Patches are live in production to generate verbose info.
Bug 647301 is tracking getting myk access to machines.

Nothing left to do here, so pushing this bug to labs to track debugging the failing test.
Component: Release Engineering → Jetpack SDK
Flags: checked-in?
Flags: checked-in+
Product: mozilla.org → Mozilla Labs
QA Contact: release → jetpack-sdk
Version: other → unspecified
Comment on attachment 523659 [details] [diff] [review]
add --verbose to help debugging (v2)

http://hg.mozilla.org/build/tools/rev/111bf8895346
Attachment #523659 - Flags: review?(aki) → review+
Assignee: nobody → myk
Status: NEW → ASSIGNED
Component: Jetpack SDK → General
Priority: -- → P1
Product: Mozilla Labs → Add-on SDK
QA Contact: jetpack-sdk → general
Target Milestone: --- → 1.0b5
No longer blocks: 647302
Over in bug 647301, Aki gave me access to a Windows 7 machine.  I tried running run_jetpack.sh manually there, and it successfully downloaded the SDK, downloaded the latest mozilla-central build of Firefox, and ran the SDK tests.

There were a few test failures, but they didn't prevent the test run from completing, and there was no test timeout.  So either the machine on which I am testing does not have the timeout problem, or the problem is specific to the way the build process runs run_jetpack.sh.

Unfortunately, I'm not familiar with the build process.  I note from the logs that it seems to start with the directory c:\talos-slave\jetpack-win7\, but that directory doesn't exist on the machine I'm using to test (c:\talos-slave exists and contains a bunch of subdirectories, but it doesn't contain jetpack-win7).

Aki: can you point me towards info on kicking off a build process on the machine?  Seeing if that works seems like the right next step.
This ought to reduce some of the confused interleaving of test results. I'm
pretty sure that the python-side tests ("testcfx") finish normally, then
the firefox/mozrunner-side tests start ("testall", which runs "test" in each
package), one of which hangs and triggers the timeout.

Here's what I think is happening:
 * test_all() is run, which first prints "Testing cfx..." to sys.stdout.
   This gets buffered and doesn't appear for a while.
 * python tests (using stdlib "unittest" package) run normally, write their
   results to sys.stderr, don't flush anything. They write enough data to
   fill the buffer, causing it to flush, so the very first thing we see in
   the logs is the first line of the tests: "Doctest:
   cuddlefish.packaging.validate_resource_hostname ... ok"
 * when they complete (successfully), about 2kB of text is still sitting in
   the sys.stderr buffer, starting in the middle of a test line (e.g.
   "_package_doc (cuddlefish.tests.test_webdocs.WebDocTests) ... ok"),
   continuing through the remaining tests, and finishing with "---\nRan 104
   tests in 8.712s\n\nOK\n".
 * now mozrunner tests begin, writing results to sys.stdout (which still has
   "Testing cfx..." buffered). The first thing written is "Testing
   annotator...", followed by some mozrunner "Using binary at.." lines. Then,
   as the test runs (on windows), cuddlefish.runner copies data out of a disk
   file and writes it to sys.stdout (including lots of noise from the JS
   console.info() calls). Eventually enough data is written to fill the
   buffer, causing it to flush, so we see "Testing cfx..." followed by
   everything else show up in the logs.
 * eventually we hit an item which hangs (probably in test-context-menu, as
   that's the last "info: executing ..." line in the logs). This causes
   cuddlefish/runner.py to timeout, signalled by raising an Exception().
   Python deals with uncaught exceptions by writing a stringified
    representation to sys.stderr, then terminating the program with rc=1. All
    that data to sys.stderr fills the buffer (which still has the last 2kBish
    of text from testcfx, finishing with "OK\n"), causing it to flush, so in the
   logs we see the abrupt transition from console.info() messages to python
   unittest messages. After the "OK" we see the exception string. Then the
   process exits, which flushes all buffers as it closes the filehandles. So
   we see the end of stderr ("Exception: Wait timeout exceeded (300s)"), and
   maybe any lingering stdout (but I don't think there is any).
 * finally, some parent process (run_jetpack.sh? the buildslave itself? some
   other buildstep?) writes "program finished with exit code 1" and
   "elapsedTime=384.307000".

So, this one-line patch adds a flush after the python-side tests (testcfx)
have finished. As this is the barrier between code which uses stderr and code
which uses stdout, this ought to fix the most obvious interleaving problem.
(I see some other weirdness in the logs, specifically a console.log() line
that says "em should be present in overflow submenu", which appears to have
dropped some characters, but we can investigate those later).

If this works, I'll expect to see the log finish with the console.info() line
from the hanging test-context-menu test, followed by the timeout traceback,
followed by the end of file. That should make it easier to spot which test is
hanging.
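
For reference, the shape of the fix is just an explicit flush at the stderr/stdout boundary. A minimal, runnable sketch of the idea (the helper functions are placeholders, not the actual cuddlefish code):

  import sys

  def run_python_side_tests():
      # stand-in for the "testcfx" stage: unittest writes its results to stderr
      sys.stderr.write("Ran 104 tests in 8.712s\n\nOK\n")

  def run_package_tests():
      # stand-in for the "testall" stage: progress banners go to stdout
      sys.stdout.write("Testing annotator...\n")

  def test_all():
      sys.stdout.write("Testing cfx...\n")   # buffered; may not appear immediately
      run_python_side_tests()
      # The one-line fix: flush stderr at the boundary so the unittest output
      # (ending in "OK") shows up in the log before the package tests start.
      sys.stderr.flush()
      run_package_tests()

  if __name__ == "__main__":
      test_all()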
Attachment #525751 - Flags: review?(myk)
Comment on attachment 525751 [details] [diff] [review]
flush stderr after 'testcfx', avoid interleaving results

That sounds like a reasonable explanation; let's give it a shot!
Attachment #525751 - Flags: review?(myk) → review+
Brian landed attachment 525751 [details] [diff] [review]:

  https://github.com/mozilla/addon-sdk/commit/49363bbb6335f35c7dfa1525700e58c0eaa45c12

But it doesn't seem to have helped matters (much?):

  https://github.com/mozilla/addon-sdk/commit/49363bbb6335f35c7dfa1525700e58c0eaa45c12


Digging further, the package tests appear to be getting run twice:

  Testing all available packages: addon-kit, api-utils, test-harness, development-mode.
  Using binary at 'C:\talos-slave\jetpack-w764/jetpack-poller/firefox/firefox.exe'.
  Using profile at 'c:\users\cltbld\appdata\local\temp\tmpev-gxb.mozrunner'.
  Running tests on Firefox 6.0a1/Gecko 6.0a1 ({ec8030f7-c20a-464f-9b0e-13a3a9e97384}) under WINNT/x86_64-msvc.
  info: executing 'test-packaging.testPackaging'
  ...
  Running tests on Firefox 6.0a1/Gecko 6.0a1 ({ec8030f7-c20a-464f-9b0e-13a3a9e97384}) under WINNT/x86_64-msvc.
  info: executing 'test-packaging.testPackaging'
  ...


Also:

myk@myk:~$ grep 'info: pass' log.txt | wc
   6286   52146  314678

(That's about twice as many "info: pass" lines as would be expected.)


Interestingly, the "testing all available packages", "using binary", and "running tests" lines are not reprinted.  So the duplication appears to happen within the package test code rather than outside of it.

Brian: if I recall correctly, at one point the test runner was configured to run package tests twice, once with the experimental e10s support enabled and once with it disabled.  But I thought we changed that.  Is it possible that the change didn't take on Windows 7 for some reason?
(In reply to comment #9)
> Interestingly, the "testing all available packages", "using binary", and
> "running tests" lines are not reprinted.  So the duplication appears to happen
> within the package test code rather than outside of it.

Erm, the "running tests" line is actually reprinted.  I meant to say that the "using profile" line is not reprinted.

And if the runner is kicking off two package test runs, both using the same profile (since the duplication starts after the "using profile" line), and MOZ_NO_REMOTE is set, then that would explain why at least one of those runs times out: the second Firefox process will never fully start up in that case; it'll just display a "Close Firefox" dialog complaining "Firefox is already running, but is not responding. To open a new window, you must first close the existing Firefox process, or restart your system."
(In reply to comment #10)
> And if the runner is kicking off two package test runs, both using the same
> profile (since the duplication starts after the "using profile" line), and
> MOZ_NO_REMOTE is set, then that would explain why at least one of those runs
> times out, since the second Firefox process will never fully start up in that
> case, it'll just display a "Close Firefox" dialog complaining "Firefox is
> already running, but is not responding. To open a new window, you must first
> close the existing Firefox process, or restart your system."

On the other hand, it's the test runner addon that writes the "running tests" line, while mozrunner is responsible for starting Firefox, so it might be the case that Firefox is started only once, but then the runner runs tests twice.
What I've learned so far, looking at the 'testall' sequence on my own box, is that all the little console.info() messages from the JS tests go to stderr, not stdout (at least on OS-X). However the progress messages ("Testing cfx...", "Testing annotator..", "Using binary at..", "Testing all available packages: ..", "Program terminated successfully", and "Some tests were unsuccessful") go to stdout. So there's lots of potential for confusion and output sequencing problems.

I'd like to get everything to write to a single place. I spoke to Atul briefly about it, and it sounds like getting the JS code to write to stdout instead of stderr is difficult, but that it might be straightforward to use the windows path of writing everything to a file, and then having cfx read the file. I'll look into that.
Here's another cleanup patch, to send all test output (including the banner lines like "Testing all packages..") to stderr, not stdout. Basically the JS-side console.info() writes to stderr, and Atul says it's hard to change that, so instead this changes the cfx side to match the JS (and unittest.TextTestRunner) side.

It also adds a few more banners, to make it easier to spot the border between testing the examples and testing the normal (addon-kit/api-utils) packages. This might make it a bit easier to investigate the apparent double-test problem.
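
A rough sketch of the direction this patch takes (not the actual diff; the helper name is made up):

  import sys

  def report(message):
      # cfx's own banner lines ("Testing all available packages: ...",
      # "Testing annotator...", etc.) now go to stderr, matching where the
      # JS-side console.info() output and unittest's TextTestRunner already write.
      sys.stderr.write(message + "\n")
      sys.stderr.flush()

  report("Testing all available packages: addon-kit, api-utils, test-harness.")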
Attachment #526104 - Flags: review?(myk)
Comment on attachment 526104 [details] [diff] [review]
send all test output to stderr, not stdout

I sure wish we could do it the other way around (make mozrunner output to stdout).

This reminds me that we do that funky logfile thing on Windows.  I wonder if that could be the source of this log wonkiness.
Attachment #526104 - Flags: review?(myk) → review+
This patch might help us see whether something is hanging, or whether the buildslave is merely slow and the timeout value is too low.
Attachment #526120 - Flags: review?(myk)
Comment on attachment 526120 [details] [diff] [review]
emit elapsed time once every 5 seconds

>+                next_chime += chime_interval

Nit: if `next_chime = elapsed + chime_interval`, it would ensure two chimes never occur less than chime_interval apart.
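
To illustrate the nit, a stripped-down version of the wait loop with the suggested change applied (variable names are approximate; this is not the actual runner.py code):

  import sys
  import time

  def wait_for_results(results_ready, timeout=300, chime_interval=5):
      start = time.time()
      next_chime = chime_interval
      while True:
          elapsed = time.time() - start
          if elapsed >= next_chime:
              sys.stderr.write("(elapsed time: %d seconds)\n" % elapsed)
              # Re-basing on 'elapsed' (rather than next_chime += chime_interval)
              # guarantees two chimes are never less than chime_interval apart,
              # even if the loop stalls for a while.
              next_chime = elapsed + chime_interval
          if results_ready():
              return
          if elapsed > timeout:
              raise Exception("Wait timeout exceeded (%ds)" % timeout)
          time.sleep(0.5)

Called as wait_for_results(lambda: False, timeout=20), this would chime a few times and then raise the same "Wait timeout exceeded" style of exception seen in the log above.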
Attachment #526120 - Flags: review?(myk) → review+
yeah, I'd prefer to write everything to stdout too. From Atul it sounded like something in the mozilla logging machinery. The logfile thing on windows is a likely suspect: from what I can tell (in harness.js: https://github.com/mozilla/addon-sdk/blob/355d84e71665e748ce880d915af8ff954e275691/python-lib/cuddlefish/app-extension/components/harness.js#L548) the logfile is *in addition to* the normal dump() function. It's conceivable that the logFile code was originally written for an environment in which, e.g., the browser launches in a way that discards its normal stderr, but the buildslave is running it in a way which preserves stderr, and then the use of the logFile would result in duplicate messages (one from the real stderr, the other copied from the logfile). A simple way to check this would be to change runner.py to prefix each chunk of logfile data with "LOGFILE:", then see if only one of the two copies has the prefix.
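
The suggested check could be as small as this (a sketch; the real runner.py reads the logfile incrementally, and the prefix is only a debugging aid):

  import sys

  def forward_logfile_chunk(chunk, out=sys.stderr):
      # Prefix every line copied out of the on-disk logfile so that, if a
      # message shows up twice in the buildslave log, we can tell which copy
      # came from the logfile and which from the browser's own stderr.
      for line in chunk.splitlines(True):
          out.write("LOGFILE: " + line)

  forward_logfile_chunk("info: executing 'test-packaging.testPackaging'\n")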
(In reply to comment #18)
> yeah, I'd prefer to write everything to stdout too. From Atul it sounded like
> something in the mozilla logging machinery.

FWIW dump() in sandboxes hardcodes stderr:

http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/xpccomponents.cpp#3011

I haven't been following along but could you just open a stream to stdout instead of using dump()?

(dump() in DOM windows, on the other hand, hardcodes stdout:)

http://mxr.mozilla.org/mozilla-central/source/dom/base/nsGlobalWindow.cpp#4469
Ok, it looks like the win7 tests are down to one failure, a timeout in
test-context-menu. Looking at the test-running code
(packages/api-utils/lib/unit-test.js, specifically
https://github.com/mozilla/addon-sdk/blob/8dd64c64abf7fc3eda997d35590bb9c5ac1ed10b/packages/api-utils/lib/unit-test.js#L230),
it appears that tests which use waitUntilDone() (including all of
test-context-menu) get 10 seconds (DEFAULT_PAUSE_TIMEOUT) to complete. If
they haven't finished in that time, waitUntilDone() marks them as failed and
starts the next test. waitUntilDone() doesn't have any way to stop the work
that's already been queued up: it's not so much halting the test as it is
hoping that the test has hung and can be safely ignored.

In our case, I think test-context-menu.testSortingOverflow takes more than 10
seconds, but it's not hung, it's just slow. So the next test
(testSortingMultipleModules) begins, and overlaps with testSortingOverflow. I
can't tell that this is actually causing problems, but I can easily imagine
other tests where it would.

So I propose to increase DEFAULT_PAUSE_TIMEOUT to 30 seconds, and see if that
helps.

The deeper question is why win7 is taking about seven times as long to run
the test suite as linux and OS-X.
It looks like test-context-menu tests are taking a long (but finite) time to run, exceeding the 10-second per-test timeout. By raising it to 30s, we might avoid this problem.
Attachment #526412 - Flags: review?(myk)
This new version of the patch changes the timeout only for test-context-menu, leaving it at 10s for all other tests.
Attachment #526412 - Attachment is obsolete: true
Attachment #526412 - Flags: review?(myk)
Attachment #526413 - Flags: review?(myk)
Attachment #526413 - Flags: review?(myk) → review+
Could you make logging optional (and non-default)? It took me some time and investigation before I found out why my jetpack started to produce rather confusing messages:

(elapsed time: 5 seconds)
info: Message: cmd = CreateButtons, data = [object Object]
info: onMessage - incoming : msg.cmd = CreateButtons
(elapsed time: 10 seconds)
(elapsed time: 15 seconds)
(elapsed time: 20 seconds)
(elapsed time: 25 seconds)
(elapsed time: 30 seconds)
(elapsed time: 35 seconds)
(elapsed time: 40 seconds)
(elapsed time: 45 seconds)
(elapsed time: 50 seconds)
(elapsed time: 55 seconds)
(elapsed time: 60 seconds)
(elapsed time: 65 seconds)
(elapsed time: 70 seconds)
(elapsed time: 75 seconds)
(elapsed time: 80 seconds)
(elapsed time: 85 seconds)
(elapsed time: 90 seconds)
(elapsed time: 95 seconds)
(elapsed time: 100 seconds)
(elapsed time: 105 seconds)
(elapsed time: 110 seconds)
(elapsed time: 115 seconds)

for normal cfx run. I guess if it was printed only for cfx test -v, I wouldn't have any objections.
Matej: that logging of elapsed time is temporary, while we work out the kinks in our test automation.  We'll remove it once we're done.
The fedora64 buildslave stopped while finishing up a test, possibly because runner.py never saw the "OK" or "FAIL" result it was looking for. This patch prints out the resultfile contents when they are something unexpected.
Attachment #526851 - Flags: review?(myk)
Comment on attachment 526851 [details] [diff] [review]
show unexpected resultsfile contents

>+                    sys.stderr.write("Hrm, resultfile (%s) contained something weird (%d bytes)\n" % (resultfile, len(output)))
>+                    sys.stderr.write("'"+output+"'\n")

Nit: I'd add a colon to the end of the first line to make it more obvious that the weird thing will be output next.

r=myk!
Attachment #526851 - Flags: review?(myk) → review+
Just got timeouts like this on both XP and Win7, on a push that doesn't seem like it should have anything to do with JP:

http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1303835594.1303836477.11203.gz
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1303835633.1303836741.12365.gz
(In reply to comment #27)
> Just got timeouts like this on both XP and Win7, on a push that doesn't seem
> like it should have anything to do with JP:
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1303835594.1303836477.11203.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1303835633.1303836741.12365.gz

Those are debug builds, so I suspect bug 651665, although the signature is slightly different.  We should land the platform fix for that bug and then see if this still happens.
Those were the first run of debug builds on Windows; they were permared, and now they're hidden.
Building debug now to diagnose this...
This screenshot shows that the problem on Windows debug builds is an assertion dialog that shows up and stops the proceedings.

As I recall, there's a way to turn off these dialogs.  Digging...
Ah, adw points me to <https://developer.mozilla.org/en/XPCOM_DEBUG_BREAK>.  Now to figure out where to plug it into the test harness...
Brian pointed out where to plug it into the test harness (runner.py), but it doesn't seem to work.  Maybe it doesn't apply to getenv.c, or perhaps, since getenv.c appears to be for getting environment variables, the variable hasn't been read yet.
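
For reference, the usual way to plumb this through is to set the variable in the environment handed to the child Firefox process. Whether this is the exact spot that was tried here is unclear; a sketch under that assumption:

  import os

  def build_child_environment():
      env = dict(os.environ)
      # Ask debug builds to only warn (no modal assertion dialog) when an
      # NS_ASSERTION fires; see the XPCOM_DEBUG_BREAK documentation.
      env["XPCOM_DEBUG_BREAK"] = "warn"
      env["MOZ_NO_REMOTE"] = "1"
      return env

mozrunner would then be launched with this environment instead of the inherited one.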
The assertion on Firefox Windows debug builds appears to be caused by an overlarge HARNESS_OPTIONS.  When you install an SDK-based addon, harness options are read from a file, but when you do `cfx run` or `cfx test` they come in through an environment variable.  But they can take up 30-50K, which is apparently too large for an environment variable on Windows.

Here's a patch that unconditionally writes them to a file (and then makes the harness read them therefrom).
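
Very roughly, the shape of the change on the Python side (the helper and the exact file location are illustrative; the real patch differs in details):

  import json
  import os

  def write_harness_options(options, profile_dir):
      # Instead of stuffing 30-50K of JSON into the HARNESS_OPTIONS environment
      # variable (which overflows the environment on Windows), write it to a
      # file that the harness can read at startup.
      path = os.path.join(profile_dir, "harness-options.json")
      with open(path, "w") as f:
          json.dump(options, f)
      return path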
Attachment #528750 - Flags: review?(warner-bugzilla)
Comment on attachment 528750 [details] [diff] [review]
don't use a HARNESS_OPTIONS environment variable

r+ by me. So crazy it just might work.

Some nits:
 - the remaining harness.js error message ("HARNESS_OPTIONS env var must exist") should probably mention the harness-options.json file instead
 - in runner.py: use atexit.register() to maybe-remove the options file even if the runner dies for some reason
Attachment #528750 - Flags: review?(warner-bugzilla) → review+
This patch removes the file at exit (via atexit) and cleans up references to HARNESS_OPTIONS in a comment and an error message.
Attachment #528750 - Attachment is obsolete: true
Attachment #528755 - Flags: review?(warner-bugzilla)
Comment on attachment 528755 [details] [diff] [review]
don't use a HARNESS_OPTIONS environment variable, take 2

Looks ok. In the long run (say, before 1.0), I want to find another solution, because this makes it impossible to use a single SDK to run two unit tests (for different addons) at the same time. But it's fine for now. r+.
Attachment #528755 - Flags: review?(warner-bugzilla) → review+
Committed:

https://github.com/mozilla/addon-sdk/commit/dee3fed1b709db8c4e191ec3c0c49d7e98720941

I ended up reverting the atexit change because it queued up multiple attempts to delete the file and then complained that it couldn't find it on the second and subsequent attempts when I ran `cfx testall`.  So the patch I checked in is a combination of the first and second patches.
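
If the atexit approach is revisited, one way to keep repeated cleanup attempts from complaining is to make the removal idempotent (a sketch under that assumption, not what landed):

  import atexit
  import os

  def schedule_cleanup(path):
      def remove_quietly():
          try:
              os.remove(path)
          except OSError:
              # Already removed by an earlier attempt (or never created); not an error.
              pass
      atexit.register(remove_quietly)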
Target Milestone: 1.0b5 → 1.0
I'm not sure why we didn't resolve this before, but in any case the most recent commit resolved the problem in question.  There are still some timeouts due to bug 651665, and it's possible that bug obscures other issues, but in that case we should file new bugs on those issues once that bug is fixed.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
OS: Windows 7 → All
Resolution: FIXED → ---
I am sorry, but this bug (if it is this bug) seems to be still doing just fine with commit e3a7dc6ab6493d078b04bf900950f8e5f6a7747b and
Mozilla/5.0 (X11; Linux x86_64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Matej: that particular test failure (Package Not Found in test_generate_static_docs_does_not_smoke) is now part of bug 659478; it's not a timeout, but still something we're trying to figure out.
(In reply to comment #41)
> Matej: that particular test failure (Package Not Found in
> test_generate_static_docs_does_not_smoke) is now part of bug 659478.. not a
> timeout, but still something we're trying to figure out.

Just to note that both this and other bugs are labeled as happening on Windows. This is Linux (Fedora 15) here. Joining the merry party on the other bug.
Reresolving, as mcepl's issue is a different bug.
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
OS: All → Windows 7
Resolution: --- → FIXED