Closed Bug 65146 Opened 24 years ago Closed 24 years ago

deadlock potential in proxy code

Categories

(Core :: XPCOM, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla0.9

People

(Reporter: waterson, Assigned: waterson)

References

Details

Attachments

(4 files)

I am seeing a deadlock between the socket transport thread and the UI thread; I'll attach stack traces. The socket transport thread trying to make a synchronous, proxied call (from nsProxyObject::Release) while still holding on to its monitor. Meanwhile, the UI thread is wedged, unable to respond to the proxy event, trying to enter the socket transport's monitor. I believe this is a proxy bug: specifically, I do not think the proxy code should be synchronously posting an event in nsProxyObject::Release.
This may be the best solution to the problem (provided it doesn't break something else -- doug?) b/c we definitely cannot release the socket transport monitor while processing eSocketState_Done.
Locking should be "leafy" -- non-nesting, short critical sections. Java-style monitors considered harmful. There's almost always a way to convert non-leafy locking to leafy locking (see however http://lxr.mozilla.org/mozilla/source/js/src/jsobj.c#218 and below for a rare hard case where nested locks are not avoidable). /be
cc'ing jud, who sez that we don't need to proxy the nsHTTPChannel's mEventSink anymore (almost).
Depends on: 65160
I am not sure what my reason was to have a sync release. There are many cases in necko where the only remaining reference to an object exists in a message which will be punted across threads. The orignal bug that I fixed this for is: http://bugzilla.mozilla.org/show_bug.cgi?id=47134 If things test okay, r=dougt Oh, one hl test that you should run is -> 1. delete your IMAP password. 2. Connect to your IMAP server. 3. Note that the password dialog does come up. 4. Enter your info 5. Ensure that you see mail in your inbox.
For the record, I'm seeing situations now where HTTP `wraps around' a cache record (sending a PRUint32 that represents `channel size' past zero and into the gazillion range). I don't know if this is another bug or not: things seem to work okay once I evict proxies from HTTP altogether (bug 65160).
in any case, I like the async fix. Can you check it in, or please reassign it to me with a fat sr=waterson. :-)
Assignee: dougt → waterson
mscott, could you give this patch a try? I've not had any problems with it, but would like your feedback...
Status: NEW → ASSIGNED
Ok, so ``fixing'' this bug (which I'm pretty sure is the right thing to do) exposes another deadlock. I'll attach stack traces below, but the general idea is (stacks grow down): [Main Thread] Gets OnDataAvailable Reads from pipe Calls out to writer routine WHILE HOLDING PIPE MONITOR in nsPipe::nsPipeInputStream::ReadSegments() Does some PSM detection voodoo Cancels HTTP request Blocks trying to aquire nsSocketTransport's monitor [SocketTransport] Acquire nsSocketTransport monitor in Process() Does async read WHILE HOLDING SOCKET TRANSPORT MONITOR Calls through to pipe write routine Blocks trying to acquire pipe's monitor in WriteSegments() With the above patch installed, this happens consistently at http://www.iwin.com/home/start.asp. It seems very wrong to me that the main thread is holding the pipe's monitor for such an enormous amount of time! Here are the full stack traces... [MainThread] NTDLL! 77f8a122() NTDLL! 77f8ecf1() PR_EnterMonitor(PRMonitor * 0x04bb0ea0) line 79 + 14 bytes nsAutoMonitor::nsAutoMonitor(PRMonitor * 0x04bb0ea0) line 184 + 13 bytes nsSocketTransport::Cancel(nsSocketTransport * const 0x04bb2a04, unsigned int 0x804b0002) line 1823 nsHTTPPipelinedRequest::Cancel(unsigned int 0x804b0002) line 1034 + 30 bytes nsHTTPRequest::Cancel(nsHTTPRequest * const 0x04be1dd0, unsigned int 0x804b0002) line 164 + 15 bytes nsHTTPChannel::Cancel(nsHTTPChannel * const 0x04be4db0, unsigned int 0x804b0002) line 193 + 22 bytes nsLoadGroup::Cancel(nsLoadGroup * const 0x04be2e70, unsigned int 0x804b0002) line 256 + 16 bytes nsDocLoaderImpl::Stop(nsDocLoaderImpl * const 0x04be2ee0) line 282 + 31 bytes nsURILoader::Stop(nsURILoader * const 0x0279b690, nsISupports * 0x04be2ef8) line 850 + 23 bytes nsDocShell::Stop(nsDocShell * const 0x04be0080) line 1565 nsWebShell::StopDocumentLoad(nsWebShell * const 0x04be019c) line 664 nsMyObserver::Notify(nsMyObserver * const 0x04bf0e00, const char * 0x010461f4, nsDetectionConfident eSureAnswer) line 98 + 18 bytes nsXPCOMDetector::Report(const char * 0x010461f4) line 667 nsPSMDetector::HandleData(const char * 0x00c7b4f0, unsigned int 0x00000288) line 376 nsXPCOMDetector::DoIt(nsXPCOMDetector * const 0x00c3d360, const char * 0x00c7b4f0, unsigned int 0x00000288, int * 0x04bf0f54) line 653 nsDetectionAdaptor::RawBuffer(nsDetectionAdaptor * const 0x04bf0f40, const char * 0x00c7b4f0, unsigned int * 0x0012f7cc) line 275 + 35 bytes ParserWriteFunc(nsIInputStream * 0x04bf1dd0, void * 0x0012f8cc, const char * 0x00c7b4f0, unsigned int 0x00000000, unsigned int 0x00000288, unsigned int * 0x0012f844) line 2280 InterceptStreamListener::IntercepterWriter(nsIInputStream * 0x04bf1dd0, void * 0x0012f88c, const char * 0x00c7b4f0, unsigned int 0x00000000, unsigned int 0x00000288, unsigned int * 0x0012f844) line 1260 + 33 bytes nsPipe::nsPipeInputStream::ReadSegments(nsPipe::nsPipeInputStream * const 0x04bf1dd0, unsigned int (nsIInputStream *, void *, const char *, unsigned int, unsigned int, unsigned int *)* 0x01e9e0c0 InterceptStreamListener::IntercepterWriter(nsIInputStream *, void *, const char *, unsigned int, unsigned int, unsigned int *), void * 0x0012f88c, unsigned int 0x000005b4, ...) line 402 + InterceptStreamListener::ReadSegments(InterceptStreamListener * const 0x04bf5734, unsigned int (nsIInputStream *, void *, const char *, unsigned int, unsigned int, unsigned int *)* 0x0182486a ParserWriteFunc(nsIInputStream *, void *, const char *, unsigned int, unsigned int, unsigned int *), void * 0x0012f8cc, unsigned int 0x000005b4, unsigned int * 0x0012f8dc) line 1279 nsParser::OnDataAvailable(nsParser * const 0x04bf4238, nsIChannel * 0x04be4db0, nsISupports * 0x00000000, nsIInputStream * 0x04bf5734, unsigned int 0x00000000, unsigned int 0x000005b4) line 2334 + 29 bytes nsDocumentOpenInfo::OnDataAvailable(nsDocumentOpenInfo * const 0x04be3380, nsIChannel * 0x04be4db0, nsISupports * 0x00000000, nsIInputStream * 0x04bf5734, unsigned int 0x00000000, unsigned int 0x000005b4) line 261 + 46 bytes nsHTTPFinalListener::OnDataAvailable(nsHTTPFinalListener * const 0x04be1d20, nsIChannel * 0x04be4db0, nsISupports * 0x00000000, nsIInputStream * 0x04bf5734, unsigned int 0x00000000, unsigned int 0x000005b4) line 1215 + 46 bytes InterceptStreamListener::OnDataAvailable(InterceptStreamListener * const 0x04bf5730, nsIChannel * 0x04be4db0, nsISupports * 0x00000000, nsIInputStream * 0x04bf1dd0, unsigned int 0x00000000, unsigned int 0x000005b4) line 1221 nsHTTPServerListener::OnDataAvailable(nsHTTPServerListener * const 0x04bf3d10, nsIChannel * 0x04bb2a04, nsISupports * 0x04be4db0, nsIInputStream * 0x04bf1dd0, unsigned int 0x00007d78, unsigned int 0x000005b4) line 558 + 67 bytes nsOnDataAvailableEvent::HandleEvent(nsOnDataAvailableEvent * const 0x03fffbd0) line 160 + 70 bytes nsStreamObserverEvent::HandlePLEvent(PLEvent * 0x03fffbd4) line 78 PL_HandleEvent(PLEvent * 0x03fffbd4) line 576 + 10 bytes PL_ProcessPendingEvents(PLEventQueue * 0x0055b240) line 509 + 9 bytes _md_EventReceiverProc(HWND__ * 0x000e03d6, unsigned int 0x0000c0ff, unsigned int 0x00000000, long 0x0055b240) line 1054 + 9 bytes USER32! 77e13eb0() USER32! 77e1401a() USER32! 77e192da() nsAppShellService::Run(nsAppShellService * const 0x00577200) line 408 main1(int 0x00000001, char * * 0x004b5e30, nsISupports * 0x00000000) line 978 + 32 bytes main(int 0x00000001, char * * 0x004b5e30) line 1272 + 37 bytes mainCRTStartup() line 338 + 17 bytes KERNEL32! 77e87903() [SocketTransportThread] NTDLL! 77f8a122() NTDLL! 77f8ecf1() PR_EnterMonitor(PRMonitor * 0x04bf7350) line 79 + 14 bytes nsAutoMonitor::nsAutoMonitor(PRMonitor * 0x04bf7350) line 184 + 13 bytes nsPipe::nsPipeOutputStream::WriteSegments(nsPipe::nsPipeOutputStream * const 0x04bf1de0, unsigned int (nsIOutputStream *, void *, char *, unsigned int, unsigned int, unsigned int *)* 0x1005f4f0 nsReadFromInputStream(nsIOutputStream *, void *, char *, unsigned int, unsigned int, unsigned int *), void * 0x04bf1fd0, unsigned int 0x00002000, unsigned int * 0x02f6fe68) line 661 nsPipe::nsPipeOutputStream::WriteFrom(nsPipe::nsPipeOutputStream * const 0x04bf1de0, nsIInputStream * 0x04bf1fd0, unsigned int 0x00002000, unsigned int * 0x02f6fe68) line 827 nsStreamListenerProxy::OnDataAvailable(nsStreamListenerProxy * const 0x04bf50e4, nsIChannel * 0x04bb2a04, nsISupports * 0x04be4db0, nsIInputStream * 0x04bf1fd0, unsigned int 0x0000832c, unsigned int 0x00002000) line 272 + 38 bytes nsSocketTransport::doReadAsync(short 0x0001) line 1290 + 96 bytes nsSocketTransport::Process(short 0x0001) line 808 + 13 bytes nsSocketTransportService::Run(nsSocketTransportService * const 0x0276b854) line 424 + 13 bytes nsThread::Main(void * 0x0276b2b0) line 106 + 26 bytes _PR_NativeRunThread(void * 0x0276b090) line 399 + 13 bytes _threadstartex(void * 0x0276ced0) line 212 + 13 bytes KERNEL32! 77e92ca8()
Yeah, looking at this a bit more, it seems like brendan's already hacked around one deadlock in nsPipe::nsPipeInputStream::ReadSegments(). What is the state that really needs to be protected here?
Leaving the monitor before calling out to `writer()' seems to fix the second deadlock. But it's kinda scary... Index: nsPipe2.cpp =================================================================== RCS file: /cvsroot/mozilla/xpcom/io/nsPipe2.cpp,v retrieving revision 1.33 diff -u -2 -r1.33 nsPipe2.cpp --- nsPipe2.cpp 2001/01/27 01:28:00 1.33 +++ nsPipe2.cpp 2001/02/06 05:25:42 @@ -400,5 +400,7 @@ while (readBufferLen > 0) { PRUint32 writeCount = 0; + mon.Exit(); // XXX avoid deadlock better rv = writer(this, closure, readBuffer, *readCount, readBufferLen, &writeCount); + mon.Enter(); if (NS_FAILED(rv) && rv != NS_BASE_STREAM_WOULD_BLOCK) goto done;
I told warren to avoid monitors and callouts inside monitors, and look what happened! Who claims to own nsPipe2.cpp these days? That file needs a monitor enema. Perhaps the quick fix is to exit and re-enter (using the nsAutoMonitor) around the call-outs? The only invariants that might vary are the read-buffer state variables, but a pipe is single-producer, single-consumer (thread-wise, that is), so the read-buffer state vars should be touched only by this code. /be
Both the socket transport and the pipe implementations use large-scoping monitors for thread synchronization. Fixing nsPipe2 is only part of the problem. We need to do the same thing with the socket transport.
darin: brendan has given me new confidence in relaseing the monitor while calling out through `writer()'. What do you think? Is the change right?
See bug 26511 for removing some of the locking in pipe2.
darin: I totally agree (no surprise) that monitors blithely interlocking large, nested, module-callout-full "critical sections" (hah! I say hah! critical only to lazy coding ;-) need to be fixed, everywhere. But we can divide and conquer, especially with nsPipe2.cpp, in order to break the deadly embrace; then fix the socket transport later. mozilla0.8 freezes tonight -- can we get a patch from an ownerly person? /be
Brendan's suggestion looks safe to me. An overlapped write would only expand the amount that could be read and/or change the value of nsPipe::mCondition. Unless the writer writes zero bytes or returns NS_BASE_STREAM_WOULD_BLOCK, these changes would never come into play. With the way nsPipe2 is written, I think this special case would be handled correctly.
I'm all for waterson's patch. r=darin.
I think I'll try to get this in once the tree opens for m0.9. That'll give it plenty of time to cook.
Target Milestone: --- → mozilla0.9
We should make the same fix inside WriteSegments where we call out to the reader.
For what it's worth, I also noticed about a 150ms speed up on my linux box with this patch applied (including the one for WriteSegments). This was using John Morrison's loader test at http://jrgm.mcom.com/page-loader/loader.pl (sorry this is only a netscape internal link).
I've been running with this patch for a while now, and everything seems good. r=darin
sr=brendan@mozilla.org. Can we have a bug on further improvements, e.g., to use a lock rather than a monitor? In general, to remove these doubt-inducing XXX comments altogether. /be
Yes, but it would require splitting the blocking pipe impl from the non-blocking pipe impl, since the blocking impl will definitely need to wait on a monitor.
A monitor is a lock, a condition variable, and a reentrancy count. If we keep critical sections leafy, whether blocking or non-blocking, then I see no need for the reentrancy count. That leaves a condition, which can be null for non-blocking and non-null, pointing to a valid PRCondVar, for blocking, pipes. I don't think we need to split impls just to support blocking while avoiding a monitor; there may of course be other reasons to split impls. /be
Created bug 68614 to cover removing the monitors from the non-blocking pipe implementation.
brendan: good point.
IOW, you wait on a condition, not (necessarily, or primitively) on a monitor. Java-style monitors are a plague on the land and an abomination in mine eyes! /be
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: