Closed Bug 669819 Opened 13 years ago Closed 13 years ago

[TI] Permaorange on WinXP debug, frequent on 10.5 debug and Win7 debug: test_websocket_basic.html | Test timed out. and test_websocket_hello.html | Connection closed cleanly

Categories

(Core :: JavaScript Engine, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 obsolete file)

Came in on a TM -> JM merge, needs to go out before we can merge TM -> m-c.

The assertion stacks in the Windows logs are clearly crap; if the Mac ones are, it's not so obvious.

http://tinderbox.mozilla.org/showlog.cgi?log=TraceMonkey/1310004908.1310008532.27157.gz

39252 INFO TEST-PASS | /tests/content/base/test/test_websocket_basic.html | Wrong message - "test message100" should equal "test message100"
39253 ERROR TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_websocket_basic.html | Test timed out.
39254 INFO TEST-PASS | /tests/content/base/test/test_websocket_basic.html | Didn't receive all the messages! - 100 should equal 100
39255 INFO TEST-PASS | /tests/content/base/test/test_websocket_basic.html | Connection closed cleanly
39256 INFO TEST-END | /tests/content/base/test/test_websocket_basic.html | finished in 300987ms

39257 INFO TEST-START | /tests/content/base/test/test_websocket_hello.html
39258 INFO TEST-PASS | /tests/content/base/test/test_websocket_hello.html | 'this' should point to the WebSocket. (2) - [object WebSocket @ 0x27acca60 (native @ 0x29f67110)] should equal [object WebSocket @ 0x27acca60 (native @ 0x29f67110)]
39259 ERROR TEST-UNEXPECTED-FAIL | /tests/content/base/test/test_websocket_hello.html | Connection closed cleanly
WARNING: NS_ENSURE_TRUE(mState == STATE_TRANSFERRING) failed: file /builds/slave/tm-osx-dbg/build/netwerk/base/src/nsSocketTransport2.cpp, line 1889
before 991232, after 987136, break 07b34000
++DOMWINDOW == 297 (0x257040c8) [serial = 827] [outer = 0x1f6176b0]
39260 INFO TEST-PASS | /tests/content/base/test/test_websocket_hello.html | Wrong data - "Hello world!" should equal "Hello world!"
###!!! ASSERTION: nsLoadGroup not thread-safe: '_mOwningThread.GetThread() == PR_GetCurrentThread()', file /builds/slave/tm-osx-dbg/build/netwerk/base/src/nsLoadGroup.cpp, line 213
nsLoadGroup::Internal::Release [netwerk/base/src/nsLoadGroup.cpp:213]
nsLoadGroup::Release [netwerk/base/src/nsLoadGroup.cpp:213]
nsCOMPtr<nsILoadGroup>::~nsCOMPtr [nsCOMPtr.h:533]
mozilla::net::BaseWebSocketChannel::~BaseWebSocketChannel [netwerk/protocol/websocket/BaseWebSocketChannel.h:57]
mozilla::net::WebSocketChannel::~WebSocketChannel [netwerk/protocol/websocket/WebSocketChannel.cpp:571]
mozilla::net::WebSocketChannel::Release [netwerk/protocol/websocket/WebSocketChannel.cpp:81]
nsCOMPtr<nsIInputStreamCallback>::assign_assuming_AddRef [nsCOMPtr.h:518]
nsCOMPtr<nsIInputStreamCallback>::assign_with_AddRef [nsCOMPtr.h:1204]
nsCOMPtr<nsIInputStreamCallback>::operator= [nsCOMPtr.h:664]
nsInputStreamReadyEvent::Run [xpcom/io/nsStreamUtils.cpp:117]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:617]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:245]
nsSocketTransportService::Run [netwerk/base/src/nsSocketTransportService2.cpp:639]
nsThread::ProcessNextEvent [xpcom/threads/nsThread.cpp:617]
NS_ProcessNextEvent_P [obj-firefox/xpcom/build/nsThreadUtils.cpp:245]
nsThread::ThreadFunc [xpcom/threads/nsThread.cpp:271]
_pt_root [nsprpub/pr/src/pthreads/ptthread.c:190]
libSystem.B.dylib + 0x32155
libSystem.B.dylib + 0x32012
39261 INFO TEST-END | /tests/content/base/test/test_websocket_hello.html | finished in 2135ms
Blocks: LandTIOnMC
it appears WebSocketChannel::mLoadGroup need to be proxy released on the main thread.

I'll want to checkout tracemonkey and run the patch through (at least os x) try though.. so it will take some time.
the real cause is a regression from 537787, mLoadGroup was supposed to be moved to a base class but was instead duplicated in both the base and derived classes.

It should also be proxy released as I mentioned in comment 6, but that's apparently only relevant on a corner case that isn't being tested.

I'll attach a patch in a minute.
Depends on: 537787
Keywords: regression
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #544505 - Flags: review?(josh)
Assignee: general → nobody
Component: JavaScript Engine → Networking: WebSockets
QA Contact: general → networking.websockets
Comment on attachment 544505 [details] [diff] [review]
patch v1

Review of attachment 544505 [details] [diff] [review]:
-----------------------------------------------------------------
Attachment #544505 - Flags: review?(josh) → review+
Comment on attachment 544505 [details] [diff] [review]
patch v1

The e10s patch landed in aurora, and this fixes a relatively important yet simple bug in it.
Attachment #544505 - Flags: approval-mozilla-aurora?
crud - this patch didn't fix the orange with try. (it did clear up all those assertions, which are legit problems so I'll push the patch.)
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mcmanus@ducksong.com-1761e7290d98/try-macosx-debug/try_leopard-o-debug_test-mochitests-1-build414.txt.gz

JS team - the portion of the test that is failing deals with a very large (225KB) message. Are there any diffs between TM and M-C that might be relevant to that?
Assignee: nobody → general
Component: Networking: WebSockets → JavaScript Engine
QA Contact: networking.websockets → general
Can you attach the message or link to it?  TM recently pulled the TypeInference project (bug 608741) from the JM repo, which has a large effect on how JS is processed.
I was thinking it was the large message in test part 4 (with the huge message, which is both sent and recvd through ws) of this test:

http://mxr.mozilla.org/mozilla-central/source/content/base/test
/test_websocket_basic.html?force=1#22

But it actually looks like part 3 of the same test didn't quite complete.. it moved all 100 of its messages but doesn't move onto the close until a timeout kicks in. Note all those calls to forcegc(). [the test is just to an echo server, so each time a message is sent one is recvd].

hth.
Attachment #544505 - Attachment is obsolete: true
Attachment #544505 - Flags: approval-mozilla-aurora?
I've moved the ASSERTION(loadgroup not threadsafe) patch to a different bug 669975 as it is just a distraction from the TM issue.
I've tried to repro this on both OSX debug and Win7 debug and they both work fine.
Have to wonder if this test is doing several hundred GCs and just hitting a timeout before it finishes.  Will reduce the testCount and tryserver that.
Reduced the testCount to 50 on tryserver and this test was green on OSX, Win7 and WinXP.  I don't think it gets stuck, it just takes longer and usually hits the timeout.
Well, if it hits the timeout when testCount isn't modified, there is probably 
some significant performance regression which should be fixed.
How do we determine that?  It seems like this test has been the cause of random timeout orange in the past, and it may have been close to the timeout threshold already.  There are GC performance regressions we're working on (bug 669815), but if fixing those does not fix this then I'd say the problem is with the test and/or timeout value.
(In reply to comment #26)
> How do we determine that?  It seems like this test has been the cause of
> random timeout orange in the past, and it may have been close to the timeout
> threshold already. 

I'm not a fan of the tests, which certainly rely too much on timeouts and assumptions about processing speeds. but afaik it has not had an appreciable level of timeout flakiness in the last few months at all.

perhaps a measurement of time to completion before and after the merge would be a useful data point.
This hasn't happened on any platform in the JM branch since the patch for bug 670185 went in (which was, in large part, cutting out work done during GC).  How should this get marked as resolved?
JM's the only place it was broken, so if it's fixed there it's FIXED.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Whiteboard: [orange]
You need to log in before you can comment on or make changes to this bug.