Closed Bug 465921 Opened 16 years ago Closed 16 years ago

httpd.js fails to deliver after about 35 reloads of test_seek1.html

Categories

(Testing :: General, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9.1b3

People

(Reporter: cpearce, Assigned: Waldo)

References

Details

(Keywords: fixed1.9.1)

Attachments

(3 files)

If you run the mochitests content/media/video/test/test_seek1.html, wait for it to finish, then reload the page, and repeat approximately 35 time, httpd.js stops delivering data to the browser. I am always able to reproduce this on my Windows XP box, but the number of reloads required varies between 30 and 40. I got up to 60 reloads on AndLinux running inside Vista, so it may not occur on Linux. Once the httpd.js stops delivering data to the browser, if you point another instance of a browser at the mochitest URL, you get no response either, so it's definately a problem with the httpd.js server. When you exit the mochitest browser after httpd.js has stopped delivering, usually (but not always) runtests.py doesn't terminate.

Note that test_seek1.html is doing media seeks, which use byte-range requests. Each seek does at least one extra byte-range request, possibly several. We suspect that this *may* be what is causing the intermittent test timeouts on tinderbox video mochitests.

Note that the test may fail sometimes. This is probably a separate bug.
I reloaded a "bunch" of times on OS X and couldn't get any server failures.  I did occasionally notice errors about "ok" not being a function in test_seek1.html (lines 60 and 71 as I recall, the numbers got blown away when my laptop lost power due to my flipping the switch controlling a power outlet) when I copied the test URL into an iframe and then duplicated the iframe 15x in a data: URL to simulate load, but that wouldn't explain server errors.  Maybe I'll try firing up my long-unused Vista VM to see if it can be reproduced there...
I've been sitting here reloading test_seek1.html on my Vista SP1 laptop for about half an hour, and it's not hung yet, so it may be WinXP specific. Mochitests are a lot slower on Vista than WinXP, so it could be not running fast enough for it to show up.
Could you try adding |var DEBUG = true;| to the top of $OBJDIR/_tests/testing/mochitest/server.js and seeing what sort of error messages get spewed when and after failure happens?
I'll run it now. I've just done 60 reloads of current trunk on Ubuntu 7.1 with no problems.
Attached file Server.js debug output
Output of mochitest when run with |var DEBUG=1;| added to server.js. The server stops delivering data at line 21097 of the log. I don't see an obvious error messages, it just accepts the connection, and doesn't send any data down it.
Attached file Testcase
Testcase. STR:

Steps to reproduce, on WinXP SP2

1. Run mochitests runtests.py, but without --autorun (this just starts httpd.js)
3. Run mozilla-central Firefox build.
4. Load seek_hang.html, this will repeatedly seek video served by from httpd.js.
5. Wait while test runs, once "Seeks Remaining" counts down from 300 to about 260 (after 40 seeks), the server will stop sending data, and the ogg decode thread will be stuck waiting for more data. Stack:

ntdll.dll!_KiFastSystemCallRet@0() 	
ntdll.dll!_ZwWaitForSingleObject@12()  + 0xc bytes	
kernel32.dll!_WaitForSingleObjectEx@12()  + 0x8b bytes	
kernel32.dll!_WaitForSingleObject@8()  + 0x12 bytes	
nspr4.dll!_PR_MD_WAIT_CV(_MDCVar * cv=0x075a21f4, _MDLock * lock=0x065abb04, unsigned int timeout=0xffffffff)  Line 280 + 0x14 bytes	C
nspr4.dll!_PR_WaitCondVar(PRThread * thread=0x06603260, PRCondVar * cvar=0x075a2180, PRLock * lock=0x065abae8, unsigned int timeout=0xffffffff)  Line 204 + 0x17 bytes	C
nspr4.dll!PR_Wait(PRMonitor * mon=0x04731db8, unsigned int ticks=0xffffffff)  Line 175 + 0x1d bytes	C
xpcom_core.dll!nsAutoMonitor::Wait(unsigned int interval=0xffffffff)  Line 340 + 0x11 bytes	C++
xpcom_core.dll!nsPipeInputStream::Wait()  Line 654	C++
xpcom_core.dll!nsPipeInputStream::ReadSegments(unsigned int (nsIInputStream *, void *, const char *, unsigned int, unsigned int, unsigned int *)* writer=0x002d3cf0, void * closure=0x06c86088, unsigned int count=0x00001000, unsigned int * readCount=0x0746fc5c)  Line 778 + 0x8 bytes	C++
xpcom_core.dll!nsPipeInputStream::Read(char * toBuf=0x06c86088, unsigned int bufLen=0x00001000, unsigned int * readCount=0x0746fc5c)  Line 828	C++
gklayout.dll!nsHttpStreamStrategy::Read(char * aBuffer=0x06c86088, unsigned int aCount=0x00001000, unsigned int * aBytes=0x0746fc5c)  Line 488 + 0x28 bytes	C++
gklayout.dll!nsMediaStream::Read(char * aBuffer=0x06c86088, unsigned int aCount=0x00001000, unsigned int * aBytes=0x0746fc5c)  Line 785 + 0x24 bytes	C++
gklayout.dll!nsChannelReader::io_read(char * aBuffer=0x06c86088, unsigned int aCount=0x00001000)  Line 80 + 0x17 bytes	C++
gklayout.dll!oggplay_channel_reader_io_read(void * aReader=0x0660c8c0, void * aBuffer=0x06c86088, unsigned int aCount=0x00001000)  Line 122	C++
gklayout.dll!oggz_io_read(_OGGZ * oggz=0x0660e438, void * buf=0x06c86088, unsigned int n=0x00001000)  Line 71 + 0x16 bytes	C
gklayout.dll!oggz_get_next_page(_OGGZ * oggz=0x0660e438, ogg_page * og=0x0660e468)  Line 218 + 0x12 bytes	C
gklayout.dll!oggz_get_next_start_page(_OGGZ * oggz=0x0660e438, ogg_page * og=0x0660e468)  Line 283 + 0xd bytes	C
gklayout.dll!oggz_get_prev_start_page(_OGGZ * oggz=0x0660e438, ogg_page * og=0x0660e468, __int64 * granule=0x0746fd6c, long * serialno=0x0746fd80)  Line 339 + 0xd bytes	C
gklayout.dll!oggz_seek_set(_OGGZ * oggz=0x0660e438, __int64 unit_target=0x00000000000009c3)  Line 771 + 0x15 bytes	C
gklayout.dll!oggz_seek_units(_OGGZ * oggz=0x0660e438, __int64 units=0x00000000000009c3, int whence=0x00000000)  Line 886 + 0x11 bytes	C
gklayout.dll!oggplay_get_duration(_OggPlay * me=0x06603138)  Line 651 + 0x16 bytes	C
gklayout.dll!oggplay_seek(_OggPlay * me=0x06603138, __int64 milliseconds=0x00000000000007d0)  Line 54 + 0x9 bytes	C
gklayout.dll!nsOggDecodeStateMachine::Run()  Line 962 + 0x1c bytes	C++
xpcom_core.dll!nsThread::ProcessNextEvent(int mayWait=0x00000001, int * result=0x0746ff08)  Line 511	C++
xpcom_core.dll!NS_ProcessNextEvent_P(nsIThread * thread=0x06602c50, int mayWait=0x00000001)  Line 227 + 0x16 bytes	C++
xpcom_core.dll!nsThread::ThreadFunc(void * arg=0x06602c50)  Line 254 + 0xb bytes	C++
nspr4.dll!_PR_NativeRunThread(void * arg=0x06603260)  Line 436 + 0xf bytes	C
nspr4.dll!pr_root(void * arg=0x06603260)  Line 122 + 0xf bytes	C
msvcr80d.dll!_callthreadstartex()  Line 348 + 0xf bytes	C
msvcr80d.dll!_threadstartex(void * ptd=0x06603c30)  Line 331	C
kernel32.dll!_BaseThreadStart@8()  + 0x37 bytes	

This problem is a problem with httpd.js. When I run this testcase reading data served by lighttpd, I don't get this problem... I get a different one in liboggplay, but after more seeks have succeeded, so there's definitely a problem with httpd.js.
I had a look at the system event viewer on my WinXP box, and it's not reporting that connections are being refused due to WinXP SP2's connection limiting, so that's not the problem.
Jeff, have you tried the new testcase? Do you have any insights as to what's going wrong?
I'll try and give this a run later tonight.
I get failures after varying numbers of seeks, and it's not consistent.  Also, the server still responds to requests, and I can load paths like /trace perfectly fine; I don't know yet that this is an httpd.js bug.  I tend to think it might be a bug in the media downloading stuff.

However, there does seem to be at least one bug here that might potentially cause this, recording here as my terminal scrollback is probably going to disappear when I retest with some changes: I'm seeing this output after a long enough time (httpd.js has been modified somewhat, hence the presence of dumps not in the pristine version):

>>> new connection on 127.0.0.1:62977
**count: 11
*** onInputStreamReady(input=[xpconnect wrapped (nsISupports, nsIInputStream, nsIAsyncInputStream, nsISeekableStream, nsISearchableInputStream) @ 0x22397a40 (native @ 0x223978bc)]) on thread [xpconnect wrapped (nsISupports, nsIThread, nsIThreadInternal, nsIEventTarget, nsISupportsPriority) @ 0x45bfc0 (native @ 0x409950)] (main is [xpconnect wrapped (nsISupports, nsIThread, nsIThreadInternal, nsIEventTarget, nsISupportsPriority) @ 0x45bfc0 (native @ 0x409950)])
*** this._state == 0
*** error thrown: [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: ./httpd.js :: anonymous :: line 1064"  data: no]
for path

The error being dumped here suggests that there's a race between the input stream being closed and onInputStreamReady being called.  This could conceivably be the problem, but I need to do more testing.
Correctly catching that exception and destroying the connection (rather than leaving it hanging) results in all connections opened being processed to completion and then closed in a controlled manner.  I used lsof to verify that the only socket left open after the hang was the server socket itself; without catching the exception, lsof showed that a few sockets would remain open until some implicit timeout was hit.  (I don't know the origin of this timeout, but I could definitely observe it given enough delay after the premature cessation of seeks.)  It does not solve the problem of all seeks not finishing.

Given that httpd.js does remain responsive after seeks stop happening, that reopening the test page works fine (but the run of seeks still stops early), and that there are no abandoned connections (with the change to catch and destroy if available() throws in RequestReader.onInputStreamReady), I think the early-seek ending is a video implementation bug.  I tried testing other browsers, but Opera doesn't seem to support onloadedmetadata and Safari doesn't support Ogg, so it's impossible to compare against other implementations.  I don't know why I'm not getting your decode hang in the ogg-decoding thread, but I've never been able to reproduce that.

Next step: get the available() fix cleaned up and landed so that we can eliminate at least one problem found here, even if it doesn't seem to materially affect the problems described here.
This change should be fairly uncontroversial, I think -- biesi, can you review this small patch?

If you happen to know anything offhand about the connection timeout thing I mentioned in a previous comment, I'd be interested to know -- but only if you know offhand; it's not worth spending time investigating.
Attachment #350521 - Flags: review?(cbiesinger)
Thanks Jeff, strange that you can't reproduce the hang in httpd.js, but this patch fixes the Win XP hang for me.

Biesi can you review please? This issue is possibly what's causing intermittent test failures on video tests.

With your patch applied, I do see one connection in the video being sent a StopRequest inside nsInputStreamPump::OnInputStreamReady() while seeking, which is causing a seek failure. Though I don't see this without your patch applied, httpd.js just hangs, I wonder if something related to this was triggering whatever was breaking in httpd.js?
Well, if that's fixing it, yay!  Moving into the shiny new httpd.js component and taking...
Assignee: nobody → jwalden+bmo
No longer blocks: 451958
Status: NEW → ASSIGNED
Component: Testing → httpd.js
Product: Core → Testing
QA Contact: testing → httpd.js
Version: Trunk → unspecified
Blocks: 451958
Comment on attachment 350521 [details] [diff] [review]
Watch out for a closed input, destroy the connection if so

r=bzbarsky with a comment about why this is using destroy() instead of end().
Attachment #350521 - Flags: review?(cbiesinger) → review+
Checked into trunk:

http://hg.mozilla.org/mozilla-central/rev/68395c6df140

...and now 191 (as a test-only change it doesn't need approval):

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/ac175b0457f1

It's probably possible to write a test for this, although it's going to be a little less than fun because that probably means manual state-maintenance to close the connection at the right time, no abstractions to hide channel mucking.  :-\  Should get to that eventually, probably before keep-alive gets implemented in bug 469228, as that has some interesting interactions with this.
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Flags: in-testsuite?
Keywords: fixed1.9.1
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.1b3
Component: httpd.js → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: