Closed Bug 513854 Opened 15 years ago Closed 14 years ago

httpd.js should write its response data asynchronously

Categories

(Testing :: General, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED FIXED
mozilla1.9.3a4

People

(Reporter: cpearce, Assigned: Waldo)

References

Details

Attachments

(1 file, 8 obsolete files)

httpd.js is writing data destined to travel over the network to a pipe on the main thread using blocking writes (in WriteThroughCopier.onInputStreamReady()).

This is a problem because when the <video> media cache reckons it doesn't need to keep reading channel, it suspends its nsHttpChannel. This will cause the connection's buffer to fill up, and eventually httpd.js' output pipe's buffer will also fill up as the stuff it sends stop's being ack'ed. Then when httpd.js writes data on a suspended channel, if the buffer is full or nearly full, it will block its main thread until all the data can be sent. I'm not sure why, but when the nsHttpChannel is resumed by the media cache, the pending data never arrives. We never receive OnDataAvailable callbacks, and httpd.js never unblocks. httpd.js shouldn't block and be unresponsive to other connections while handling a suspended connection.

I tried serving the mochitests with LightHTTPd, and the seek mochitests didn't hang, so I think the problem must lie somewhere in httpd.js' handling of our suspend/resume.

We have had to disable the <video> seek mochitests on Windows, and I narrowed the problem down to this. I have only been able to reproduce this on Windows, not on any other platform, though this could be causing intermittent test failures on other platforms. We are not hitting Windows' or Firefox's connection limits.

STR (On Windows):
1. Edit content/media/test/Makefile.in to enable test_seek.html on Windows (add test_seek.html to _TEST_FILES in the Makefile.in). Rebuild.
2. Repeatedly run mochitest content/media/test/test_seek.html, eventually it will stall "Waiting for localhost". You may need to reload a few times.
3. If you try to open a new connection to httpd.js, it won't respond, it's hung.

You can verify that the write is never unblocking by putting a print/dump before and after "this._output.writeByteArray(data, data.length);", httpd.js:3998. You may also want to enable nsPipe:5 debug logging.

You may also want to add a gc() call to _connectionClosed() to prevent huge memory use, ala bug 508128, else you'll get OOM crashes.

Getting this fixed is important, so that we can fix a large cause of random [orange], and so we can re-enable the seeks tests on Windows.
Simpler STR:

1. Serve a large video with httpd.js, without autobuffer (so you'll need to serve it through an html page, as videodocuments autobuffer by default).
2. Let the video load for a few seconds, so that httpd.js' buffer fills up. (the video will be read enough to read the meta data, and then the connection is suspended).
3. Try to open another thing served off httpd.js in another tab. Note that the new connection doesn't respond.
4. Play the large video, note that the other thing being loaded now starts to load, as the main thread is no longer blocks.
The bug as filed is unfixable given the nature of XPConnect and its utter trashing of any threadsafety guarantees, modifying the summary to something implementable.  (I'm not actually sure if I'm bitter about this any more.)

I really wish I knew why this problem hadn't occurred to me previously when thinking about the previously-unresolved video issues.  It's kind of an embarrassing lapse of foresight, actually, that I haven't noticed and fixed this already...

I haven't thought much about the writing code recently, but I suspect it isn't much work to modify all writes to be asynchronous; past state-machining efforts have made it generally much easier to even contemplate things like this.  I'll aim to fix this sometime within a week or so -- good news (for you) is I'll have more time to do this (and be more likely to be able to make that goal) than usual due to a temporary lack of transportation.
Assignee: nobody → jwalden+bmo
Status: NEW → ASSIGNED
OS: Windows XP → All
Hardware: x86 → All
Summary: httpd.js shouldn't do blocking writes on its main thread → httpd.js should write its response data asynchronously
Thanks Jeff! When you fix this we need to make sure that we test that if a connection queues up a lot of data to write but then the connection closes before it can be written, the unwritten data does not constitute a memory leak.
I believe this should theoretically fix the problems, but I'm having trouble getting the simpler STR posted here to reproduce the problem or to demonstrate its being fixed; perhaps I should try the not-so-simple ones.

Also, while the server tests did an admirable job of catching a large number of mistakes and omissions, this really wants direct API-level testing of the copier, not just indirect testing through the server.
Attachment #398011 - Attachment is obsolete: true
(In reply to comment #5)
> Created an attachment (id=398900) [details]
> Better, more ready patch, still needs testing

Thanks Jeff! The "Simpler STR" is fixed for me with this patch, but I still saw one intermittent failure on Windows of test_seek.html; though that test seems to be failing a lot less with this patch, I still saw it fail exactly once (after many runs).

The mochitest content/media/test/test_bug493187.html also intermittently times out on Windows with this patch.

(In reply to comment #5)
> I'm having trouble getting the simpler STR posted here to reproduce the
> problem or to demonstrate its being fixed;

Oops, the simpler STR really needs an new step between 2 and 3 to improve its reliability:

2.5. Play the video for a few seconds, then pause it. When paused, the <video> will stop reading from its connection after a few more seconds, causing httpd.js' blocking write to block.
Bringing "Simpler STR" all together for easier reading:

Simpler STR:

1. Serve a large video with httpd.js, without autobuffer (so you'll need to
   serve it through an html page, as videodocuments autobuffer by default).
2. Let the video load for a few seconds, so that httpd.js' buffer fills up.
   (the video will be read enough to read the meta data, and then the connection
   is suspended).
3. Play the video for a few seconds, then pause it. When paused, the
   <video> will stop reading from its connection after a few more seconds,
   causing httpd.js' blocking write to block.
4. Try to open another thing served off httpd.js in another tab. Note that
   the new connection doesn't respond.
5. Play the large video, note that the other thing being loaded now starts
   to load, as the main thread is no longer blocks.
Step 3 should not be necessary, AFAIK.
We really should get this fixed; we know it causes some of our video tests to fail randomly, and it could potentially be causing other random failures.
This patch is bigger than the previous one, I think (or at least the copier code appears to take up more lines), but eliminating the previous version's nested try/catch/finally craziness and reducing early-exit code to a very few methods greatly aids readability.

The full run of xpcshell tests pass, but even despite this version's greater simplicity I will not be convinced of sufficient correctness to be willing to entrust the tree (and flaking tinderboxen) with it -- not without methodically written tests that exercise specific modes of copying interaction.  I'm going to work on those in the next couple days and hopefully get this completely wrapped up very soon.
Attachment #398900 - Attachment is obsolete: true
Attached patch Tests proceeding apace (obsolete) — Splinter Review
Whee, I have one test written!  This is perhaps the most convoluted code I've ever written, but I have hopes that it will be flexible enough to make writing lots and lots of weird copying interactions testable in essentially step-by-step declarative format.  There are likely still a ton of bugs left in the testing framework code itself, but this is a huge step forward.
Attachment #411896 - Attachment is obsolete: true
Comment on attachment 413748 [details] [diff] [review]
Tests proceeding apace

Tests proceed...slowly...much more slowly than I'd like.  But:

>+    // The day is ours!  Quantum written, now let's see if we have more data
>+    // still to write.
>+    try
>+    {
>+      if (pendingData.length > 0)
>+        this._waitToWriteData();
>+    }
>+    catch (e)
>+    {
>+      dumpn("!!! unexpected error waiting to write pending data: " + e);
>+      this._doneWritingOutput(Cr.NS_ERROR_UNEXPECTED);
>+      return;
>+    }
>+
>+    // Okay, we have no more pending data to write -- but might we get more in
>+    // the future?  If we know we won't, then we know it's time to notify of
>+    // copy completion.  Victory!
>+    if (this._input === null)
>+    {
>+      this._output = null;
>+      this._cancelOrDispatchCancelCallback(Cr.NS_OK);
>+    }

The tests have discovered their first bug!  WriteThroughCopier.prototype.onOutputStreamReady in this section improperly handles the input stream being closed with more than one segment of data remaining to be written: there's a missing return at the end of the consequent for |pendingData.length > 0|.  If the return's not there we fall through to canceling the copy if input-reading is complete, but _cancelOrDispatchCallback includes an assertion that we have no more pending data -- boom.  None of the existing xpcshell tests caught this; a Mochitest run might or might not (probably wouldn't have, because it requires the copier to be backed up more than 8K bytes).  One potential intermittent failure eliminated, still more testing to go...
I just hit that bug when running mochitests :-). I guess I'll wait for you to get done with your tests before I carry on with more orange analysis.
Attached patch Testing further along (obsolete) — Splinter Review
The new test file actually passes in this patch, for a change of pace.  However, there's still much more testing to do, and more importantly, work remains on getting the testing API in shape to actually do everything that needs to be done.  The current setup is a bit hackish, and while some of that will remain and is even necessary, I don't like the way control flow is passed from one method to another -- closeSource() in particular is a moderately bad hack because it spins the event loop, and it feels to me like there's a better way to do this that doesn't involve random notifications getting processed at that time.
Attachment #413748 - Attachment is obsolete: true
Another bug, long known (since the first test or two I wrote here, even before it was possible to run them, found by examination): the copier doesn't wait on the output end, so if the output end closes before any data has been read from input, and if nothing is ever read from input (and input's never closed) it'll hang.  I don't *think* this is actually an issue in practice, because if no data is to be copied the async copier is never invoked.  Nevertheless it seems better to eliminate this frailty than to leave it in place, even if I'm right that it doesn't actually result in problems.
I'm sure there are more tests I could write here, but I have a good mixture as it stands, and I found the new tests, as I added them, were generally Just Working rather than exposing flaws, once I got to the point where they weren't exposing flaws in the test mocks themselves.  I have one or two more minor ideas to follow up on as a result of the changes made in this iteration to how data is copied to the socket output; aside from those I consider this sufficiently well-tested to be happy siccing it on the tree.

The last remaining issue is that, previously, I relied on synchronous write-out of response preambles to get exactly specified behavior when a response being processed asynchronously threw an exception.  Switching over to writing out headers asynchronously happens to break that, and at least right now preserving the behavior seems difficult.  The easiest solution, since this is very edge-casey, is just to revert to underspecifying behavior in that case.  Once I do that and follow up on the last one or two test ideas, I think this will be good to go.  (And finally!)
Attachment #415282 - Attachment is obsolete: true
Attached patch Final patchSplinter Review
On second look it's entirely possible to fix behavior so that what I'd previously mentioned as specified -- that an async/seized response where the handler threw an exception would have its headers and written data at that point written, and no more -- would still happen.  Having already made changes to the necessary tests, I'm not sure it's worth preserving the behavior, nor is it worth having to add in that extra complexity.  People will deal, and the response is aborted in a timely manner either way, just with less data written out.  I can always add it back later if it becomes necessary for some reason.

Regarding the test: I would spend the majority of review time reading the test functions section that describes the steps each test goes through.  The rest of it is interesting, to an extent, but it's extremely convoluted how stepping through actions occurs, exactly when an action is permitted, and how final wrapup happens.  I would say that code's close to, if not the most, complex code I've ever written.  It's going to take an awful long time to read through that complexity, for not a whole lot of gain, so I'd just read method descriptions to get a feel for the API, figure out whether the tests themselves are reasonably comprehensive and suggest new tests if you think they're needed, and leave it at that.
Attachment #416152 - Attachment is obsolete: true
Attachment #416316 - Flags: review?(honzab.moz)
We need this patch to land, not rot. Jeff, if Honza can't review it, can you get someone else?
This patch is definitly a step in the right direction, but with this patch applied, mochitests sometimes fail to start. It looks like httpd.js sometimes drops connections while loading Mochitkit/packed.js while the initial mochitest page is loading. See the attached log of httpd.js debug logging and nsHTTP logging for details. You can see that the nsHttpTransaction for packed.js (address 0x3f1df38) is closed when download progress reaches 143360/150640. You can see a few lines in the log after the CloseTransaction call for 0x3f1df38, that there's a JS error "JavaScript error: http://localhost:8888/MochiKit/packed.js, line 5874: missing } after property list", presumably this is Firefox failing to parse a truncated JS file.

I recorded this run in my VM (Windows XP Pro SP3 VM running on Win7 64bit Pro host), so I can replay-debug it and can provide more details if that would help. I can attach to both xpcshell.exe or firefox.exe if required. I'm happy to help!

Interestingly, I've observed this dropped connection a number of times, and it seems like the packet.js nsHttpTransaction is always being closed when it reaches 143360/150640. Other connection drops occur too, it's not restricted to packed.js of course. Connection drops are not uncommon while running the <video> mochitests, which is a real pain; I can't ensure the new ogg decoder backend I'm working on doesn't break mochitests, as http.js either drops connections with this patch, or locks up without this patch.
Attachment #426636 - Attachment description: Debug log showing dropped connection → Mochitest debug log showing dropped connection
Yeah, I'll look into getting someone else to look at this.
It would be helpful to know what the JS stack is in the xpcshell process when nsSocketOutputStream::CloseWithStatus is hit for the packed.js connection.  That would tell me how it is that the server's closing the connection, and I suspect that would be enough to figure out the problem or at least make some educated guesses.  As far as identifying some meaningful condition on which to set the breakpoint...not sure yet, trying to think of something reasonable.
Also helpful would be catching this with server debug output, since the log here seems to only be client-side debug logging.
Obtaining a JS stack during replay is very hard to do, because you can't run program functions there. You can crawl the JS engine data structures and make some guesses...

Reproducing with server debug output should be straightforward.
(In reply to comment #23)
> Obtaining a JS stack during replay is very hard to do, because you can't run
> program functions there. You can crawl the JS engine data structures and make
> some guesses...

It's not that bad to do by examining data structures.  Find the relevant JSContext, look at cx->fp, $->down, $->down, ... (until null), and for each stack frame look at fp->script->filename and fp->script->lineno if fp->script is non-null.  (See also the source for js_DumpStackFrame.)
What dbaron said.  (Incidentally, this was the reason I asked about calling side-effect-free functions while record/replay-debugging, if you remember the question from last December's all-hands.)
I'd thought I had httpd.js logging on, but I must have lost it while shuffling around patches in my patch queue, sorry! Here's a log with httpd.js logging on.

The same JS errors are still there, it looks like the same problem.

I guess the error on line 1961 might give you some hints. I'll try and get a stack trace for that error from xpcshell.exe.
Here's the C++ stack when nsBinaryOutputStream::WriteByteArray fails. It's returning failure from here: http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsBinaryStream.cpp#253 note rv==0 ; the Write() at nsBinaryStream.cpp#250 didn't fail.

xul.dll!nsBinaryOutputStream::WriteBytes(const char * aString=0x04411088, unsigned int aLength=8192)  Line 253	C++
xul.dll!nsBinaryOutputStream::WriteByteArray(unsigned char * aBytes=0x04411088, unsigned int aLength=8192)  Line 261	C++
xul.dll!NS_InvokeByIndex_P(nsISupports * that=0x04400df8, unsigned int methodIndex=21, unsigned int paramCount=2, nsXPTCVariant * params=0x0012d1d0)  Line 103	C++
xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx={...}, XPCWrappedNative::CallMode mode=CALL_METHOD)  Line 2727 + 0x29 bytes	C++
xul.dll!XPC_WN_CallMethod(JSContext * cx=0x02171be8, JSObject * obj=0x0a4dc540, unsigned int argc=2, int * argv=0x02550304, int * vp=0x0012d4a0)  Line 1756 + 0xe bytes	C++
mozjs.dll!js_Invoke(JSContext * cx=0x02171be8, unsigned int argc=2, int * vp=0x025502fc, unsigned int flags=2)  Line 1377 + 0x1a bytes	C++
mozjs.dll!js_Interpret(JSContext * cx=0x02171be8)  Line 2304 + 0x16 bytes	C++
mozjs.dll!js_Invoke(JSContext * cx=0x02171be8, unsigned int argc=1, int * vp=0x025502e8, unsigned int flags=0)  Line 1385 + 0x9 bytes	C++
xul.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper=0x044009c0, unsigned short methodIndex=3, const XPTMethodDescriptor * info=0x0109e948, nsXPTCMiniVariant * nativeParams=0x0012dfdc)  Line 1696 + 0x1b bytes	C++
xul.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex=3, const XPTMethodDescriptor * info=0x0109e948, nsXPTCMiniVariant * params=0x0012dfdc)  Line 571	C++
xul.dll!PrepareAndDispatch(nsXPTCStubBase * self=0x04400a28, unsigned int methodIndex=3, unsigned int * args=0x0012e09c, unsigned int * stackBytesToPop=0x0012e08c)  Line 114 + 0x21 bytes	C++
xul.dll!SharedStub()  Line 142	C++
xul.dll!nsOutputStreamReadyEvent::Run()  Line 192	C++
xul.dll!nsOutputStreamReadyEvent::Run()  Line 192	C++
xul.dll!nsThread::ProcessNextEvent(int mayWait=1, int * result=0x0012e358)  Line 527 + 0x19 bytes	C++
xul.dll!NS_InvokeByIndex_P(nsISupports * that=0x00fa2bb8, unsigned int methodIndex=8, unsigned int paramCount=2, nsXPTCVariant * params=0x0012e348)  Line 103	C++
xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx={...}, XPCWrappedNative::CallMode mode=CALL_METHOD)  Line 2727 + 0x29 bytes	C++
xul.dll!XPC_WN_CallMethod(JSContext * cx=0x02171be8, JSObject * obj=0x024df5a0, unsigned int argc=1, int * argv=0x025502c4, int * vp=0x0012e618)  Line 1756 + 0xe bytes	C++
mozjs.dll!js_Invoke(JSContext * cx=0x02171be8, unsigned int argc=1, int * vp=0x025502bc, unsigned int flags=2)  Line 1377 + 0x1a bytes	C++
mozjs.dll!js_Interpret(JSContext * cx=0x02171be8)  Line 2304 + 0x16 bytes	C++
mozjs.dll!js_Execute(JSContext * cx=0x02171be8, JSObject * chain=0x024b0b60, JSScript * script=0x025d39d0, JSStackFrame * down=0x00000000, unsigned int flags=0, int * result=0x0012fd9c)  Line 1619 + 0x9 bytes	C++
mozjs.dll!JS_ExecuteScript(JSContext * cx=0x02171be8, JSObject * obj=0x024b0b60, JSScript * script=0x025d39d0, int * rval=0x0012fd9c)  Line 4949 + 0x19 bytes	C++
xpcshell.exe!ProcessFile(JSContext * cx=0x02171be8, JSObject * obj=0x024b0b60, const char * filename=0x0043f88c, _iobuf * file=0x10311d60, int forceTTY=0)  Line 970 + 0x16 bytes	C++
xpcshell.exe!Process(JSContext * cx=0x02171be8, JSObject * obj=0x024b0b60, const char * filename=0x0043f88c, int forceTTY=0)  Line 1049 + 0x19 bytes	C++
xpcshell.exe!ProcessArgs(JSContext * cx=0x02171be8, JSObject * obj=0x024b0b60, char * * argv=0x0043f80c, int argc=6)  Line 1173 + 0x19 bytes	C++
xpcshell.exe!main(int argc=6, char * * argv=0x0043f80c, char * * envp=0x00433d30)  Line 1879 + 0x15 bytes	C++
xpcshell.exe!__tmainCRTStartup()  Line 597 + 0x19 bytes	C
xpcshell.exe!mainCRTStartup()  Line 414	C
KERNEL32.dll!_BaseProcessStart@4()  + 0x23 bytes	

The only JS frame I could extract was httpd.js: 4144, an onOutputStreamReady call. Looks to me like the "!!! unknown error" at httpd.js:4198 (in onOutputStreamReady) is what's causing the log message at line 1961 of the log.

nsBinaryOutputStream::WriteBytes() calls nsBinaryOutputStream::Write() which calls into nsPipeOutputStream::WriteSegments(), which at http://mxr.mozilla.org/mozilla-central/source/xpcom/io/nsPipe3.cpp#1112 is finding that the output pipe is full, and is returning NS_OK, but it's only written half as many bytes as was asked for. nsBinaryOutputStream::WriteBytes() interprets that as failure (bytesWritten != aLength), and presumably that's throwing an exception in JS.

Does that make sense? The comments at httpd.js:4159 says this exception shouldn't be a problem, but I'm not sure how to verify that...
(In reply to comment #21)
> It would be helpful to know what the JS stack is in the xpcshell process when
> nsSocketOutputStream::CloseWithStatus is hit for the packed.js connection.

If I just break on all calls of CloseWithStatus(), I see it's not running on the main thread, and the main thread appears to be in JS, waiting for a new event, and JS has a null cx->fp. I'd guess the calls to CloseWithStatus() are normal "I've finished with this socket" type calls. They're always either called with reason==0 or reason==NS_BASE_STREAM_CLOSED.
What patch are you applying and testing?  The latest one here doesn't include any writeByteArray calls as far as I can tell, because that method has a useless mode of failure when data is only partially writable (it's not atomic, and there's no way to determine how much wasn't written).
(In reply to comment #29)
> What patch are you applying and testing?

Rats, the patch I was using is indeed out of date. I will update and retest.
(Incidentally, the problem your log reveals is explicitly noted and fixed in the latest patch near the comment containing the sentence "We're only using this because writeByteArray is unusably broken for asynchronous output streams; see bug 532834 for details.")
Waiting on feedback on whether the patch actually has intermittency problems other than the one present in an obsolete patch and since fixed...
Sorry ... AFAIK we have not found any problems in the latest patch. Proceed!
(In reply to comment #32)
> Waiting on feedback on whether the patch actually has intermittency problems
> other than the one present in an obsolete patch and since fixed...

Now that I'm back in the office, I've tested the latest patch on mochitest content/media/test/test_seek.html (this is the test which initially exposed the problem). I looped on this test for about an hour, and it ran fine. Let's get this reviewed and checked in!
Attachment #426636 - Attachment is obsolete: true
Attachment #427261 - Attachment is obsolete: true
Attachment #416316 - Flags: review?(honzab.moz) → review?(sayrer)
Attachment #416316 - Flags: review?(sayrer) → review+
http://hg.mozilla.org/mozilla-central/rev/6880df5b2a3c
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a4
Component: httpd.js → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: