Intermittent dom/network/tests/unit/test_tcpsocket.js | Test timed out

RESOLVED DUPLICATE of bug 1329245

Status

()

defect
P3
normal
RESOLVED DUPLICATE of bug 1329245
7 years ago
2 years ago

People

(Reporter: philor, Unassigned)

Tracking

({intermittent-failure})

Trunk
x86_64
macOS
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox24 affected, firefox25 affected, firefox26 affected)

Details

(Whiteboard: [test disabled on OSX][leave open][necko-backlog])

Attachments

(3 attachments)

https://tbpl.mozilla.org/php/getParsedLog.php?id=14988385&tree=Firefox
Rev5 MacOSX Mountain Lion 10.8 mozilla-central debug test xpcshell on 2012-09-05 11:55:21 PDT for push 647fabcaa951
slave: talos-mtnlion-r5-060

TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | running test ...

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
Added in last week or two by bug 733573.
Blocks: 733573
Assignee: nobody → josh
So there's no output at *all* from the test? That seems very strange, it seems like the tests should at least start running and then hang, not just hang without any output at all...
I think Honza may have been wrong about having the xpcshell test self-kill itself after all.  The python script that runs the xpcshell tests buffers the output, only printing the output if there is a failure.  (Although the output should end up on disk too.)  I think buildbot is killing that python process (and thereby its children) and so the buffered data does not get dumped.

Barring some other obvious solutoin, we should but the self-kill back in the test so you can get useful log data to figure out what's going on.  We might also try triggering a number of try-server runs on just OS X 10.8 where this seems to be happening.  Donovan, if you haven't gotten your level 1 approval in yet, I'd do that so you can do the try-server stuff.
Good call, Andrew. I think you are right.
It is a known problem (bug 597064) that xpcshell tests that time out show no output.
Here is the manual timing out code that was in older versions of the tcpsocket patch. This would at least give us some context about how far the process is getting before it hangs.
Attachment #659911 - Flags: review?(fabrice)
Attachment #659911 - Flags: review?(fabrice) → review+
I've landed the patch on mozilla-inbound so we get better errors:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e78c3efd115
Whiteboard: [orange] → [orange][leave open]
Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill → Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed.
The interesting bit:
TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | Starting badConnect
TCPSocket: window init: undefined
TCPSocket: startup called
TCPSocket: Host info: 127.0.0.1:2
TCPSocket: SSL: false


The moral seems to be that either we are:
a) experiencing some type of race related to a failure to connect to a ridiculous local port that only happens on OS X 10.8.  For example, if the necko callback fires synchronously?  In theory, this race could also be on the (bad) success case if something is somehow on port 2.
b) There's some kind of weird firewall on localhost for OS X 10.8.  Previously we were dying after 20 minutes; now we kill the test after 10 seconds, so normal TCP socket timeouts could operate differently and the test could actually be failing for different reasons.
I'm guessing the failure is happening so fast that it is happening synchronously, which prevents the "connection refused" event from firing properly, but that's just a guess.

Figuring out when Connection Refused was happening was one of the hardest parts of writing the original tcpsocket patch.
It seems strange that this is only happening on Mac OS X 10.8 Mountain Lion -- are we running bots with 10.7 or 10.6 and it is not happening there? Strange indeed.
Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed. → Intermittent "test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed", followed by "head.js | exception thrown from do_timeout callback: 2147500036"
Whiteboard: [orange][leave open] → [leave open]
Donovan, any other ideas on this? As you can see, we still hit this pretty frequently.
Flags: needinfo?(dpreston)
Yes, this is obviously happening quite frequently. If I could reproduce it locally then I would actually have a shot at fixing it. I will look into it. Perhaps I can write a script which will run just the tcpsocket tests in a tight loop until it fails, and then keep the process around so I can examine it.
Flags: needinfo?(dpreston)
I have managed to repro this on my machine by running the tests locally over and over again until they fail. It's not perfect, because it usually takes about 1000 iterations for them to fail, and if I manage to fix it I still won't be 100% sure that the tests running for a long time means that it is fixed, but at least I can have a little more certainty.

I have a few ideas for things to try to fix it, and now that I can repro locally I won't be making blind guesses in the dark.