Closed Bug 732363 Opened 12 years ago Closed 6 years ago

Intermittent (on Firefox) or Permanent (on Thunderbird Windows): TEST-UNEXPECTED-FAIL | test_net_addr.js | test failed (with xpcshell return code: 0), see following log from (NS_ERROR_NOT_CONNECTED) [nsISocketTransport.getScriptableSelfAddr]

Categories

(Core :: Networking, defect, P3)

defect

Tracking

()

RESOLVED INCOMPLETE
mozilla14
Tracking Status
firefox13 --- verified

People

(Reporter: standard8, Assigned: drice)

References

()

Details

(Keywords: intermittent-failure, regression, Whiteboard: [test disabled on Windows on mozilla13+; patch 603236 and 606194 fixed in mozilla13] [necko-backlog])

Attachments

(2 files, 1 obsolete file)

Since bug 720846 landed, we've been seeing this failure on our Windows virtual machines, which are the Thunderbird builders:

http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTrunk/1330655617.1330658305.8294.gz#err0

TEST-INFO | e:\buildbot\comm-central-win32-opt-unittest-xpcshell\build\xpcshell\tests\netwerk\test\unit\test_net_addr.js | running test ...
TEST-UNEXPECTED-FAIL | e:\buildbot\comm-central-win32-opt-unittest-xpcshell\build\xpcshell\tests\netwerk\test\unit\test_net_addr.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>

TEST-INFO | (xpcshell/head.js) | test 1 pending

TEST-INFO | e:/buildbot/comm-central-win32-opt-unittest-xpcshell/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | server: listening on

TEST-INFO | (xpcshell/head.js) | test 2 pending

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-INFO | (xpcshell/head.js) | running event loop

TEST-INFO | (xpcshell/head.js) | test 2 pending
TEST-INFO | e:/buildbot/comm-central-win32-opt-unittest-xpcshell/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | Starting testIpv4

TEST-INFO | (xpcshell/head.js) | test 2 finished

TEST-INFO | e:/buildbot/comm-central-win32-opt-unittest-xpcshell/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | server: got client connection

TEST-UNEXPECTED-FAIL | e:/buildbot/comm-central-win32-opt-unittest-xpcshell/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | in TestServer.onSocketAccepted - Unexpected exception [Exception... "Component returned failure code: 0x804b000c (NS_ERROR_NOT_CONNECTED) [nsISocketTransport.getScriptableSelfAddr]"  nsresult: "0x804b000c (NS_ERROR_NOT_CONNECTED)"  location: "JS frame :: e:/buildbot/comm-central-win32-opt-unittest-xpcshell/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js :: <TOP_LEVEL> :: line 134"  data: no], see following stack:
undefined



I'm not quite sure what is going on here, I don't see why it would have got that far and then not be connected...
Whiteboard: [tb-orange]
This isn't a problem with getScriptableSelfAddr.  The error is coming from the non-scriptable getSelfAddr which it uses.  I may have made some incorrect assumptions in my test case, but it seems to me that if the socket was accepted, and the output and input streams are open, we should not be seeing errors about not being connected.  There's a chance that it disconnects (race condition) but if that were the case, this error would be sporadic.

This error is consistent, right?
Does seem to be permanent on Tb Windows, but then based on my memory from when Firefox ran tests on Win2K3 VMs, they are permanently brutally slow, and prone to losing any race condition you offer them.

Sort of surprising that the first Firefox failure would be a Win7 opt, though, typically WinXP debug is the closest match for the slowness of a 2K3 VM.
Blocks: 438871
Summary: Permanent Orange(?) Thunderbird Windows: TEST-UNEXPECTED-FAIL | test_net_addr.js | test failed (with xpcshell return code: 0), see following log → Intermittent (on Firefox) or Permanent (on Thunderbird Windows): TEST-UNEXPECTED-FAIL | test_net_addr.js | test failed (with xpcshell return code: 0), see following log from (NS_ERROR_NOT_CONNECTED) [nsISocketTransport.getScriptableSelfAddr]
Whiteboard: [tb-orange] → [tb-orange][orange]
Version: unspecified → Trunk
so we should backout 720846, right?
Sorry, I haven't been ignoring this; I've just been busy with other things.  The test is failing, but I don't have any reason to believe that the code change is problematic.  I'll provide a patch to back out the test and try to come back and create one that does not cause these failures.

jdm / biesi: Is there anything wrong with removing the test for the time being?  Does this violate some "feature changes must have working tests" rule?
Can you instead disable the test just on Windows? Seems like that would fix this problem while still keeping some test coverage.
I think I figured out what's going on.  I may propose a patch instead of removing it, but I'd like someone to chime in with "yeah, that sounds plausible" or something else to confirm my hypothesis is reasonable.

My assumption is that the result of transport.openOutputStream is being cleaned up because I am not keeping a reference to it.  When that happens, the client side closes the connection.

My patch would be to hold onto the output stream in a var alongside |transport|.
That sounds incredibly likely. nsSocketTransport2::OpenOutputStream looks like it confirms this hypothesis, since in the blocking case it creates a pipe and doesn't store it anywhere but in the return value.
Attached patch test fix 01 (obsolete) — Splinter Review
Patch attached.  Verified that it still passes for me (though it always did) and I'll watch http://build.mozillamessaging.com/tinderboxpushlog/?tree=ThunderbirdTrunk for success.

jdm for review/checkin.
Attachment #602969 - Flags: review?(josh)
Attachment #602969 - Flags: checkin?(josh)
Comment on attachment 602969 [details] [diff] [review]
test fix 01

Review of attachment 602969 [details] [diff] [review]:
-----------------------------------------------------------------

This will have the same problem. The last reference to the stream disappears when the function returns; we need to store a reference to the stream outside of the function.

You can also leave off the checkin? flag in future; I'll ensure your patches are checked in.
Attachment #602969 - Flags: review?(josh)
Attachment #602969 - Flags: review-
Attachment #602969 - Flags: checkin?(josh)
The inner function (set as |acceptCallback|) holds a reference to it.  As long as that function is reachable, so is |output|.  That function is only unreachable when |TestServer.reset| clears out the |acceptCallback|, and we no longer need the stream or socket.
I'm pretty confident that that's an implementation detail that could be defeated by optimizations in Spidermonkey. Let's just take the safe route and avoid any chance of oranges.
As requested.  Passes locally (as always) and stores objects of importance in a global, which should alleviate the optimization concern.
Attachment #602969 - Attachment is obsolete: true
Attachment #603236 - Flags: review?(josh)
Attachment #603236 - Attachment is patch: true
Attachment #603236 - Flags: review?(josh) → review+
https://hg.mozilla.org/mozilla-central/rev/c39fd8e1fb45
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla13
SeaMonkey Windows builds were unbroken and 1st run had this bug :-/

http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1331235913.1331241194.13591.gz
WINNT 5.2 comm-central-trunk debug test xpcshell on 2012/03/08 11:45:13
s: cb-seamonkey-win32-03
{
WARNING: CheckLinkStatus called on main thread! No check performed. Assuming link is up, status is unknown.: file e:/builds/slave/comm-cen-trunk-w32-dbg/build/mozilla/netwerk/system/win32/nsNotifyAddrListener.cpp, line 414

TEST-INFO | e:/builds/slave/test/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | server: listening on

TEST-INFO | e:/builds/slave/test/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | Starting testIpv4

TEST-INFO | e:/builds/slave/test/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | server: got client connection

TEST-UNEXPECTED-FAIL | e:\builds\slave\test\build\xpcshell\tests\netwerk\test\unit\test_net_addr.js | test failed (with xpcshell return code: 0), see following log:

TEST-UNEXPECTED-FAIL | e:/builds/slave/test/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js | in TestServer.onSocketAccepted - Unexpected exception [Exception... "Component returned failure code: 0x804b000c (NS_ERROR_NOT_CONNECTED) [nsISocketTransport.getScriptableSelfAddr]"  nsresult: "0x804b000c (NS_ERROR_NOT_CONNECTED)"  location: "JS frame :: e:/builds/slave/test/build/xpcshell/tests/netwerk/test/unit/test_net_addr.js :: <TOP_LEVEL> :: line 145"  data: no], see following stack:
undefined
}
Assignee: nobody → derrick.rice
Flags: in-testsuite+
http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTry/1331689984.1331693837.25294.gz
WINNT 5.2 try-comm-central opt test xpcshell on 2012/03/13 18:53:04

Still happening: not (fully) fixed.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I need this test run on a build system which is currently failing, but for a debug build (or FORCE_PR_LOG) with NSPR_LOG_MODULES=nsSocketTransport:5 and then see the PR_LOG output.  Ideally run solo, rather than with all tests.

Maybe I can get a windows thunderbird tryserver on loan?

I have not been able to reproduce this by introducing sleeps.  So I'm a bit stumped if someone cannot help me with the above.
This is outside my expertise and ability to debug, specifically because of the Mozilla build infrastructure.  biesi says we can always skip windows to avoid oranges, which this patch does.

I can't commit much more time to this issue, especially because I do not see the value.  No one's been able to point out a flaw in the test, yet it fails anyway.  The failure is unrelated to the changes it is trying to test, so it would be wasteful to throw away the change with the test.  With this skip, we can at least see the advantage of the original change (nsNetAddr).
Attachment #606194 - Flags: review?(josh)
Comment on attachment 606194 [details] [diff] [review]
skip windows
[Checked in: Comment 25 & 30]

Review of attachment 606194 [details] [diff] [review]:
-----------------------------------------------------------------

Agreed on all counts. Thanks for the work you've sunk into this!
Attachment #606194 - Flags: review?(josh) → review+
While here, some nits:

32   do_print('server: listening on', this.port);
shoule be
32   do_print("server: listening on port " + this.port);

139   }
shoule be
139   };

178   testDataStore.output = testDataStore.transport.openOutputStream(Ci.nsITransport.OPEN_BLOCKING,0,0);
shoule be
178   testDataStore.output = testDataStore.transport.openOutputStream(Ci.nsITransport.OPEN_BLOCKING, 0, 0);

163     do_execute_soon(run_next_test);
run_next_test() already does do_execute_soon(_run_next_test).
Does it need to be doubled? (If so, a comment might be welcome.)

***

49       this.selfAddr = trans.getScriptableSelfAddr();
50       this.peerAddr = trans.getScriptablePeerAddr();
succeed,
then
52       this.acceptCallback();
...
141   serv.acceptCallback = function() {
142     // disable the timeoutCallback
143     serv.timeoutCallback = function(){};
144 
145     var selfAddr = testDataStore.transport.getScriptableSelfAddr();
fails.
(which is odd, as all these are supposed to happen synchronously.)

(Trying to guess, facts and code...)

Fwiw, should you pass 'trans' and add something like
do_check_true(trans.isAlive());
do_check_eq(trans.getScriptableSelfAddr(), serv.selfAddr);
do_check_true(testDataStore.transport.isAlive());
do_check_eq(trans, testDataStore.transport);
? (At least as a debug try.)

Should you do_print() in
63   onStopListening: function(socket) {} ,
fwiw?
(In reply to Serge Gautherie (:sgautherie) from comment #23)
> Fwiw, should you pass 'trans' and add something like
> do_check_true(trans.isAlive());
> do_check_eq(trans.getScriptableSelfAddr(), serv.selfAddr);
> do_check_true(testDataStore.transport.isAlive());
> do_check_eq(trans, testDataStore.transport);
> ? (At least as a debug try.)

And probably add "same" checks in onSocketAccepted(), so we have even more details.

***

Afaict this is still Windows (near-)perma-orange on both Dbg SeaMonkey and Opt Thunderbird.
I can understand/support disabling the test (after we run out of (debug) idea).

Yet, the issue should be easily reproducible on TB Try and I don't think we should give up (as in closing bug) on finding out what is happening as this result seems very unexpected.
http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTrunk/1331908000.1331910589.8032.gz
WINNT 5.2 comm-central test xpcshell on 2012/03/16 07:26:40
+
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey/1331918260.1331923715.3372.gz
WINNT 5.2 comm-central-trunk debug test xpcshell on 2012/03/16 10:17:40

V.Fixed wrt to this disabling patch (only).
Whiteboard: [tb-orange][orange] → [test disabled on mozilla14; patch 603236 fixed in mozilla13] [tb-orange][orange]
Comment on attachment 606194 [details] [diff] [review]
skip windows
[Checked in: Comment 25 & 30]

[Approval Request Comment]
Regression caused by (bug #): Bug 720846.
User impact if declined: None, but perma-orange on SeaMonkey and Thunderbird.
Testing completed (on m-c, etc.): m-i, m-c.
Risk to taking this patch (and alternatives if risky): None, test-only.
String changes made by this patch: None.
Attachment #606194 - Flags: approval-mozilla-aurora?
Attachment #603236 - Attachment description: test fix 02 → test fix 02 [Checked in: Comment 15]
Attachment #606194 - Attachment description: skip windows → skip windows [Checked in: Comment 25]
Comment on attachment 606194 [details] [diff] [review]
skip windows
[Checked in: Comment 25 & 30]

[Triage Comment]
Bug 720846 was first in Firefox as of 13, so if disabling this test is the right path forward, we should do the same on Aurora.
Attachment #606194 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Keywords: checkin-needed
Whiteboard: [test disabled on mozilla14; patch 603236 fixed in mozilla13] [tb-orange][orange] → [c-n: 6801ef0d83b8 to m-a] [test disabled on mozilla14; patch 603236 fixed in mozilla13] [tb-orange][orange]
Landed on Aurora.
https://hg.mozilla.org/releases/mozilla-aurora/rev/8b1899e39fa7
Keywords: checkin-needed
Whiteboard: [c-n: 6801ef0d83b8 to m-a] [test disabled on mozilla14; patch 603236 fixed in mozilla13] [tb-orange][orange] → [test disabled on mozilla13/14; patch 603236 fixed in mozilla13] [tb-orange][orange]
Attachment #606194 - Attachment description: skip windows [Checked in: Comment 25] → skip windows [Checked in: Comment 25 & 30]
Whiteboard: [test disabled on mozilla13/14; patch 603236 fixed in mozilla13] [tb-orange][orange] → [test disabled on mozilla13+; patch 603236 and 606194 fixed in mozilla13] [tb-orange][orange]
Target Milestone: mozilla13 → mozilla14
http://tinderbox.mozilla.org/showlog.cgi?log=SeaMonkey-Aurora/1333835055.1333840344.14793.gz
WINNT 5.2 comm-aurora debug test xpcshell on 2012/04/07 14:44:15

firefox13: verified, wrt disabling the test.
Whiteboard: [test disabled on mozilla13+; patch 603236 and 606194 fixed in mozilla13] [tb-orange][orange] → [test disabled on mozilla13+; patch 603236 and 606194 fixed in mozilla13]
Ftr,
Comment 32 to comment 45 are all on "Linux".
This seems a "rare" intermittent-failure there : no permanent orange on TB (or SM).
Whiteboard: [test disabled on mozilla13+; patch 603236 and 606194 fixed in mozilla13] → [test disabled on Windows on mozilla13+; patch 603236 and 606194 fixed in mozilla13]
Whiteboard: [test disabled on Windows on mozilla13+; patch 603236 and 606194 fixed in mozilla13] → [test disabled on Windows on mozilla13+; patch 603236 and 606194 fixed in mozilla13] [necko-backlog]
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P3 → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
no instances in 4 months, lets close this out
Status: REOPENED → RESOLVED
Closed: 12 years ago6 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: