Closed Bug 788960 Opened 12 years ago Closed 7 years ago

Intermittent dom/network/tests/unit/test_tcpsocket.js | Test timed out

Categories

(Core :: Networking, defect, P3)

x86_64
macOS
defect

Tracking

()

RESOLVED DUPLICATE of bug 1329245
Tracking Status
firefox24 --- affected
firefox25 --- affected
firefox26 --- affected

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [test disabled on OSX][leave open][necko-backlog])

Attachments

(3 files)

https://tbpl.mozilla.org/php/getParsedLog.php?id=14988385&tree=Firefox
Rev5 MacOSX Mountain Lion 10.8 mozilla-central debug test xpcshell on 2012-09-05 11:55:21 PDT for push 647fabcaa951
slave: talos-mtnlion-r5-060

TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | running test ...

command timed out: 1200 seconds without output, attempting to kill
process killed by signal 9
Added in last week or two by bug 733573.
Blocks: 733573
Assignee: nobody → josh
So there's no output at *all* from the test? That seems very strange, it seems like the tests should at least start running and then hang, not just hang without any output at all...
I think Honza may have been wrong about having the xpcshell test self-kill itself after all.  The python script that runs the xpcshell tests buffers the output, only printing the output if there is a failure.  (Although the output should end up on disk too.)  I think buildbot is killing that python process (and thereby its children) and so the buffered data does not get dumped.

Barring some other obvious solutoin, we should but the self-kill back in the test so you can get useful log data to figure out what's going on.  We might also try triggering a number of try-server runs on just OS X 10.8 where this seems to be happening.  Donovan, if you haven't gotten your level 1 approval in yet, I'd do that so you can do the try-server stuff.
Good call, Andrew. I think you are right.
It is a known problem (bug 597064) that xpcshell tests that time out show no output.
Here is the manual timing out code that was in older versions of the tcpsocket patch. This would at least give us some context about how far the process is getting before it hangs.
Attachment #659911 - Flags: review?(fabrice)
Attachment #659911 - Flags: review?(fabrice) → review+
I've landed the patch on mozilla-inbound so we get better errors:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e78c3efd115
Whiteboard: [orange] → [orange][leave open]
Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill → Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed.
The interesting bit:
TEST-INFO | /builds/slave/talos-slave/test/build/xpcshell/tests/dom/network/tests/unit/test_tcpsocket.js | Starting badConnect
TCPSocket: window init: undefined
TCPSocket: startup called
TCPSocket: Host info: 127.0.0.1:2
TCPSocket: SSL: false


The moral seems to be that either we are:
a) experiencing some type of race related to a failure to connect to a ridiculous local port that only happens on OS X 10.8.  For example, if the necko callback fires synchronously?  In theory, this race could also be on the (bad) success case if something is somehow on port 2.
b) There's some kind of weird firewall on localhost for OS X 10.8.  Previously we were dying after 20 minutes; now we kill the test after 10 seconds, so normal TCP socket timeouts could operate differently and the test could actually be failing for different reasons.
I'm guessing the failure is happening so fast that it is happening synchronously, which prevents the "connection refused" event from firing properly, but that's just a guess.

Figuring out when Connection Refused was happening was one of the hardest parts of writing the original tcpsocket patch.
It seems strange that this is only happening on Mac OS X 10.8 Mountain Lion -- are we running bots with 10.7 or 10.6 and it is not happening there? Strange indeed.
Summary: Intermittent test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed. → Intermittent "test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed", followed by "head.js | exception thrown from do_timeout callback: 2147500036"
Whiteboard: [orange][leave open] → [leave open]
Donovan, any other ideas on this? As you can see, we still hit this pretty frequently.
Flags: needinfo?(dpreston)
Yes, this is obviously happening quite frequently. If I could reproduce it locally then I would actually have a shot at fixing it. I will look into it. Perhaps I can write a script which will run just the tcpsocket tests in a tight loop until it fails, and then keep the process around so I can examine it.
Flags: needinfo?(dpreston)
I have managed to repro this on my machine by running the tests locally over and over again until they fail. It's not perfect, because it usually takes about 1000 iterations for them to fail, and if I manage to fix it I still won't be 100% sure that the tests running for a long time means that it is fixed, but at least I can have a little more certainty.

I have a few ideas for things to try to fix it, and now that I can repro locally I won't be making blind guesses in the dark.
I was kind of hoping that my fix for bug 861196 might improve whatever is happening here, but comments 313 - 315 show it was unrelated.  Can we just modify the xpcshell.ini to just not run the test on OS X machines?  This failure rate seems out of control and mozTCPSocket on OS X is really not an important thing.  badConnect is a useful test case of error mapping, so I would prefer not to just remove that sub-test in general.
I agree that we should probably just disable the test for now. I believe I did see some errors from other OSes than OS X in this huge thread, though. I think it just happens on OS X much more frequently.

I do have a simple fix -- call setTimeout(5) before opening the input or output channel -- but it is more of a bandaid than a fix. Instead of getting a connection refused error immediately when the race to connect beat the registering of the callback to catch the error, the connection refused happens five seconds later. Also when this fix it seems to cause a different crash bug to occur much more frequently on my machine (although this particular test never fails any more)
5 seconds is potentially quite short when on a device like the unagi running on just 2g.  roundtrips can be super sucky.
Yeah probably something like 25 seconds is a better timeout value. The timeout that causes this bug's failure would have to be extended to longer than that.
Maybe there's just an attribute we can check on the connection to check its state immediately after adding the listener?  Presumably the state of the connection isn't a secret.
Yes, this is a good idea. I believe I tried it already and it was not easy to tell the state of the socket, but I may not have tried hard enough.

I think the correct thing would be to check input.available() to see if it raises or not (like in onInputStreamReady), but the problem is it may also raise if the connection simply has not yet been made in addition to raising after the connection has failed.
At this point, I'm probably going to just disable this test on OSX since we aren't making any headway on a fix.
Donovan, remind me what the latest status of your investigation was?
The patch that I had improved the situation significantly, but I was still able to reproduce the crash after running the tests for many many hours.

It might be a good idea to just apply the patch anyway if we can get someone to review it.

I'll attach the patch so I don't lose it.
Comment on attachment 771477 [details] [diff] [review]
Don't actually start opening the socket until the next tick, to give callers of this function a chance to set callbacks before the open operation is initiated.

Theoretically bug 885982 should give us the same benefit as this patch, so it might make more sense to just disable the test on OS X for now.
I'm ok with disabling the test for now. Will there be a time when it can be turned on again, though? When bug 885982 lands maybe?
Yes, that would be a good point at which to try re-enabling it on OS X again.
Summary: Intermittent "test_tcpsocket.js | command timed out: 1200 seconds without output, attempting to kill | The test should never take this long unless the system is hosed", followed by "head.js | exception thrown from do_timeout callback: 2147500036" → Intermittent dom/network/tests/unit/test_tcpsocket.js | Test timed out
In case attachment 771477 [details] [diff] [review] is not the right fix, here is a patch which disables this test on OS X.
Attachment #771477 - Flags: review?(honzab.moz)
Honza, I believe attachment 771477 [details] [diff] [review] is the correct fix for this, but am not 100% sure. Can you take a look at it and see if it makes sense, or suggest someone else who can review it if you are too busy? Thanks.
Comment on attachment 771477 [details] [diff] [review]
Don't actually start opening the socket until the next tick, to give callers of this function a chance to set callbacks before the open operation is initiated.

Review of attachment 771477 [details] [diff] [review]:
-----------------------------------------------------------------

I don't believe this is going to help.  What you are adding a delay for is just asyncWait that will for sure trigger not sooner then until the next thread loop (cannot happen sooner the content script is done, so listener will for sure add them self before the socket starts doing anything) and asynccopier for the mux stream that has nothing to do with callbacks.

r+ just to let you try whether this helps or not.

::: dom/network/src/TCPSocket.js
@@ +54,5 @@
> +function next_turn(func) {
> +  Timer.createInstance(
> +    nsITimer
> +  ).initWithCallback({notify: func}, 0, TYPE_ONE_SHOT);
> +}

Use dispatch to current thread.  Get inspired with:
http://hg.mozilla.org/mozilla-central/annotate/b3fcd828cadc/testing/xpcshell/head.js#l435

but since this patch is not going to fix the issue probably, just land it as it is to confirm.
Attachment #771477 - Flags: review?(honzab.moz) → review+
Donovan, any updates here? Should we just land the disabling patch?
Flags: needinfo?(dpreston)
Yeah, I think Honza is right in comment 421 that my fix won't work. Landing the patch to just disable the test is the right way forward for now I think.
Flags: needinfo?(dpreston)
https://hg.mozilla.org/integration/mozilla-inbound/rev/80c3c49c015e
Keywords: checkin-needed
Whiteboard: [leave open] → [test disabled on OSX][leave open]
Assignee: josh → nobody
Whiteboard: [test disabled on OSX][leave open] → [test disabled on OSX][leave open][necko-backlog]
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Disabling on OSX fixed this, but when I refactored the tests in bug 1087145 and was more clever about the badConnect() test, I apparently reintroduced the same hazard which has now resurfaced with a vengeance.  I'm duping forward to bug 1329245 where the intermittent is currently terrorizing the tree and where I'm going to fix it by avoiding checking for connect errors on OS X.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: