Closed Bug 206947 Opened 22 years ago Closed 21 years ago

Synchronous XMLHttpRequest crashes the browser on approx. 500th call on Mac and Linux [@ nsEventQueueImpl::GetYoungestActive]

Categories

(Core :: XML, defect, P1)

defect

Tracking


RESOLVED FIXED
mozilla1.6alpha

People

(Reporter: arysin, Assigned: keeda)

References

Details

(Keywords: crash, testcase, Whiteboard: patch, need review)

Crash Data

Attachments

(10 files, 3 obsolete files)

4.54 KB, text/plain | Details
647 bytes, text/html | Details
43.47 KB, text/plain | Details
835 bytes, text/html | Details
8.44 KB, patch | Details | Diff | Splinter Review
496 bytes, patch | timeless: review- | Details | Diff | Splinter Review
701 bytes, text/html | Details
1.19 KB, patch | hjtoi-bugzilla: review+ | Details | Diff | Splinter Review
9.42 KB, patch | Details | Diff | Splinter Review
6.90 KB, patch | keeda: review+ | Details | Diff | Splinter Review
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b) Gecko/20030521
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4b) Gecko/20030521

XMLHttpRequest crashes the browser on approximately the 500th call. This number is cumulative: if the call is made 300 times, the window is closed, and the page is opened in another window, the browser will crash after roughly 200 more calls. The same effect occurs in Mozilla on Linux and Windows; on Windows it also reports a problem in the XPCOM library.

Reproducible: Always

Steps to Reproduce:
1. Open the page test.html.
2. Wait until the counter gets to ~500.

Actual Results: The browser crashes.

Expected Results: The browser should keep working.

This bug does not depend on setTimeout() (a plain for(;;) loop gives the same crash), and it does not depend on setting innerHTML of the element.

<html><head>
</head><body onload="changeSize();">
<script>
var count = 0;
var xml = null;
if (document.all)
  xml = new ActiveXObject('Microsoft.XMLHTTP');
else
  xml = new XMLHttpRequest();

function changeSize() {
  xml.open("GET", "about:blank", false);
  xml.send(null);
  document.getElementById("count").innerHTML = eval(++count);
  setTimeout("changeSize()", 50);
}
</script>
<br>
Count: <span id="count">0</span>
</body>
</html>
Attached file stacktrace
Confirmed using FizzillaMach/2003-05-20-08-trunk. Setting All/All.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: crash, testcase
OS: Linux → All
Hardware: PC → All
Summary: XMLHttpRequest crashes the browser on approx. 500th call → XMLHttpRequest crashes the browser on approx. 500th call [@ nsEventQueueImpl::GetYoungestActive]
Crash occurred at iteration 575. Also, TalkBack did not run following the crash.
Blocks: 201894
I'm also seeing this on RH Linux 9.0 / Mozilla 1.3.1.

Heikki - This is critical for us to get fixed. Is there a way to get some luvin' on this bug?? Thanks!!!

- Bill Edney - Technical Pursuit Inc.
I landed some memory leak fixes on 5/20, and I don't see any crashes on Win2k debug build even with 3500 loads of the testcase, and the memory fluctuated between 40 and 60 MB. Please try this with a recent nightly.
Still crashing using FizzillaMach/2003-05-25-03-trunk, again at iteration 575.
While my Linux and Mac builds are cranking, can you try some variations to see if it still crashes? Specifically:

1. Make sure that you don't have more than one XMLHttpRequest connection active at any time.
2. Reuse the same XMLHttpRequest object instead of creating new objects.
Well, I've been trying different variations and all of them crash. As to your two points, I think they're covered in the HTML testcase from comment 0: only one XMLHttpRequest is created there, and no new call is made until the previous one has completed.
I made the testcase use asynchronous loading and it does not seem to crash anymore (at least on Mac OS X). Please try this as a workaround while I am looking at the issue with sync loading and Mac/Linux.
Summary: XMLHttpRequest crashes the browser on approx. 500th call [@ nsEventQueueImpl::GetYoungestActive] → Synchronous XMLHttpRequest crashes the browser on approx. 500th call on Mac and Linux [@ nsEventQueueImpl::GetYoungestActive]
Yes, with asynchronous calls it does not crash on either Linux or Windows. Thanks, now at least we can do some workaround.
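For reference, here is a minimal sketch of the asynchronous variant (an illustration under the same assumptions as the comment 0 testcase, not the actual attached file): passing true to open() makes send() return immediately, so no nested thread event queue is pushed for the duration of the request.

// Sketch only: async variant of the comment 0 loop; assumes the same
// <span id="count"> element as the original testcase.
var count = 0;
var xml = new XMLHttpRequest();

function changeSize() {
  xml.open("GET", "about:blank", true);   // true = asynchronous
  xml.onreadystatechange = function () {
    if (xml.readyState == 4) {            // request completed
      document.getElementById("count").innerHTML = ++count;
      setTimeout(changeSize, 50);
    }
  };
  xml.send(null);
}

changeSize();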
Bug 208127 is a dupe of this bug. Looking at this issue, here's what I came up with: in nsXMLHttpRequest, the call to PushThreadEventQueue in ::Send is the cause of the problem. If PushThreadEventQueue is called exactly 501 times, we crash. This appears to hit a limit in the event queue, and it appears PushThreadEventQueue is not being properly unwound. I tested explicitly calling PopThreadEventQueue, but that didn't help. I have a backtrace I'll post here.
PR_EXTERN(PRBool)
PL_IsQueueNative(PLEventQueue *queue)
{
    return queue->type == EventQueueIsNative ? PR_TRUE : PR_FALSE;
}

********** CREATE (0x8c92c30) **********
out: modalEventQueue(0x8c8fe90)
Document http://linuxbox/leak/page1.html loaded successfully
********** DESTROY (0x8c92c30) **********
Document http://linuxbox/leak/page2.html loaded successfully
********** CREATE (0x8c8b918) **********

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1080897920 (LWP 7575)]
0x40252679 in PL_IsQueueNative (queue=0x0) at plevent.c:1279
/backups/MOZILLA_1_2_1_RELEASE/mozilla/xpcom/threads/plevent.c:1279:33657:beg:0x40252679
Current language: auto; currently c
(gdb) where
#0 0x40252679 in PL_IsQueueNative (queue=0x0) at plevent.c:1279
#1 0x402542f5 in nsEventQueueImpl::IsQueueNative(int*) (this=0x87e4a70, aResult=0xbfffcea8) at nsEventQueue.cpp:350
#2 0x41464f03 in nsAppShellService::Observe(nsISupports*, char const*, unsigned short const*) (this=0x81274e8, aSubject=0x87e4a70, aTopic=0x402c35d8 "nsIEventQueueActivated") at nsAppShellService.cpp:1067
#3 0x401e31bf in nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*) (this=0x80de618, aSubject=0x87e4a70, aTopic=0x402c35d8 "nsIEventQueueActivated", someData=0x0) at nsObserverService.cpp:211
#4 0x40253e6d in nsEventQueueImpl::NotifyObservers(char const*) (this=0x87e4a70, aTopic=0x402c35d8 "nsIEventQueueActivated") at nsEventQueue.cpp:224
#5 0x40253a81 in nsEventQueueImpl::InitFromPRThread(PRThread*, int) (this=0x87e4a70, thread=0x809dfd8, aNative=1) at nsEventQueue.cpp:176
#6 0x402560a3 in nsEventQueueServiceImpl::MakeNewQueue(PRThread*, int, nsIEventQueue**) (this=0x80a4320, thread=0x809dfd8, aNative=1, aQueue=0xbfffd13c) at nsEventQueueService.cpp:166
#7 0x4025665a in nsEventQueueServiceImpl::PushThreadEventQueue(nsIEventQueue**) (this=0x80a4320, aNewQueue=0xbfffd48c) at nsEventQueueService.cpp:285
#8 0x44f452d7 in nsXMLHttpRequest::Send(nsIVariant*) (this=0x8c8b918, aBody=0x8c35910) at nsXMLHttpRequest.cpp:1261
#9 0x40282467 in XPTC_InvokeByIndex () at xptcinvoke_gcc_x86_unix.cpp:66
#10 0x4098eb6e in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) (ccx=@0xbfffd7d0, mode=CALL_METHOD) at xpcwrappednative.cpp:2011
#11 0x40998d99 in XPC_WN_CallMethod(JSContext*, JSObject*, unsigned, long*, long*) (cx=0x86589e0, obj=0x83be218, argc=1, argv=0x8ca208c, vp=0xbfffd970) at xpcwrappednativejsops.cpp:1281
#12 0x400cbc2d in js_Invoke (cx=0x86589e0, argc=1, flags=0) at jsinterp.c:839
#13 0x400d9d6c in js_Interpret (cx=0x86589e0, result=0xbfffdf8c) at jsinterp.c:2803
#14 0x400cc3dd in js_Execute (cx=0x86589e0, chain=0x85bd210, script=0x8c3f0a8, down=0x0, special=0, result=0xbfffdf8c) at jsinterp.c:1020
#15 0x4009c77c in JS_EvaluateUCScriptForPrincipals (cx=0x86589e0, obj=0x85bd210, principals=0x8c7e988, chars=0x8c921b8, length=448, filename=0xbfffe0c0 "http://linuxbox/leak/page1.html", lineno=7, rval=0xbfffdf8c) at jsapi.c:3382
#16 0x42b12d5c in nsJSContext::EvaluateString(nsAString const&, void*, nsIPrincipal*, char const*, unsigned, char const*, nsAString&, int*) (this=0x8658970, aScript=@0xbfffe240, aScopeObject=0x85bd210, aPrincipal=0x8c7e984, aURL=0xbfffe0c0 "http://linuxbox/leak/page1.html", aLineNo=7, aVersion=0x40120cc8 "default", aRetValue=@0xbfffe100, aIsUndefined=0xbfffe0ac) at nsJSEnvironment.cpp:694
#17 0x4124b7de in nsScriptLoader::EvaluateScript(nsScriptLoadRequest*, nsAFlatString const&) (this=0x8c862b8, aRequest=0x8c0edd0, aScript=@0xbfffe240) at nsScriptLoader.cpp:582
#18 0x4124b177 in nsScriptLoader::ProcessRequest(nsScriptLoadRequest*) (this=0x8c862b8, aRequest=0x8c0edd0) at nsScriptLoader.cpp:493
#19 0x4124ae91 in nsScriptLoader::ProcessScriptElement(nsIDOMHTMLScriptElement*, nsIScriptLoaderObserver*) (this=0x8c862b8, aElement=0x8a41224, aObserver=0x8a41228) at nsScriptLoader.cpp:436
#20 0x40f7bac8 in nsHTMLScriptElement::MaybeProcessScript() (this=0x8a41200) at nsHTMLScriptElement.cpp:422
#21 0x40f7b053 in nsHTMLScriptElement::SetDocument(nsIDocument*, int, int) (this=0x8a41200, aDocument=0x8c0e3f8, aDeep=0, aCompileEventHandlers=1) at nsHTMLScriptElement.cpp:195
#22 0x40efdce3 in nsGenericHTMLContainerElement::AppendChildTo(nsIContent*, int, int) (this=0x8c30ad0, aKid=0x8a41200, aNotify=0, aDeepSetDocument=0) at nsGenericHTMLElement.cpp:4093
#23 0x40fbd424 in HTMLContentSink::ProcessSCRIPTTag(nsIParserNode const&) (this=0x89d5940, aNode=@0x8c8f090) at nsHTMLContentSink.cpp:5690
#24 0x40fb65df in HTMLContentSink::AddLeaf(nsIParserNode const&) (this=0x89d5940, aNode=@0x8c8f090) at nsHTMLContentSink.cpp:3706
#25 0x41ea823e in CNavDTD::AddLeaf(nsIParserNode const*) (this=0x8c8b7e8, aNode=0x8c8f090) at CNavDTD.cpp:3799
#26 0x41ea8452 in CNavDTD::AddHeadLeaf(nsIParserNode*) (this=0x8c8b7e8, aNode=0x8c8f090) at CNavDTD.cpp:3862
#27 0x41ea4a06 in CNavDTD::HandleStartToken(CToken*) (this=0x8c8b7e8, aToken=0x8ca01b0) at CNavDTD.cpp:1757
#28 0x41ea2d84 in CNavDTD::HandleToken(CToken*, nsIParser*) (this=0x8c8b7e8, aToken=0x0, aParser=0x8c9f5d0) at CNavDTD.cpp:910
#29 0x41ea1f9f in CNavDTD::BuildModel(nsIParser*, nsITokenizer*, nsITokenObserver*, nsIContentSink*) (this=0x8c8b7e8, aParser=0x8c9f5d0, aTokenizer=0x8c2d6c0, anObserver=0x0, aSink=0x89d5940) at CNavDTD.cpp:521
#30 0x41ebcdb9 in nsParser::BuildModel() (this=0x8c9f5d0) at nsParser.cpp:1885
#31 0x41ebc9f2 in nsParser::ResumeParse(int, int, int) (this=0x8c9f5d0, allowIteration=1, aIsFinalChunk=0, aCanInterrupt=1) at nsParser.cpp:1752
#32 0x41ebe4f6 in nsParser::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned, unsigned) (this=0x8c9f5d0, request=0x8b96bd8, aContext=0x0, pIStream=0x8bdccb0, sourceOffset=0, aLength=928) at nsParser.cpp:2386
#33 0x4295a700 in nsDocumentOpenInfo::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned, unsigned) (this=0x8a19a38, request=0x8b96bd8, aCtxt=0x0, inStr=0x8bdccb0, sourceOffset=0, count=928) at nsURILoader.cpp:244
#34 0x40bdc841 in nsHttpChannel::OnDataAvailable(nsIRequest*, nsISupports*, nsIInputStream*, unsigned, unsigned) (this=0x8b96bd8, request=0x8c4636c, ctxt=0x0, input=0x8bdccb0, offset=0, count=928) at nsHttpChannel.cpp:3096
#35 0x40b84e6b in nsOnDataAvailableEvent::HandleEvent() (this=0x444370d8) at nsStreamListenerProxy.cpp:193
#36 0x40b64ab5 in nsARequestObserverEvent::HandlePLEvent(PLEvent*) (plev=0x444370dc) at nsRequestObserverProxy.cpp:115
#37 0x402520ad in PL_HandleEvent (self=0x444370dc) at plevent.c:644
#38 0x40251edf in PL_ProcessPendingEvents (self=0x80de6b8) at plevent.c:574
#39 0x4025446e in nsEventQueueImpl::ProcessPendingEvents() (this=0x80a4278) at nsEventQueue.cpp:388
#40 0x414b449a in event_processor_callback (data=0x80a4278, source=6, condition=GDK_INPUT_READ) at nsAppShell.cpp:184
#41 0x414b3ed5 in our_gdk_io_invoke (source=0x82600e0, condition=G_IO_IN, data=0x825ff38) at nsAppShell.cpp:76
#42 0x404ea0a6 in g_io_add_watch () from /usr/lib/libglib-1.2.so.0
#43 0x404eb9ae in g_get_current_time () from /usr/lib/libglib-1.2.so.0
#44 0x404ebe89 in g_get_current_time () from /usr/lib/libglib-1.2.so.0
#45 0x404ec124 in g_main_run () from /usr/lib/libglib-1.2.so.0
#46 0x403f727f in gtk_main () from /usr/lib/libgtk-1.2.so.0
#47 0x414b49bc in nsAppShell::Run() (this=0x8141530) at nsAppShell.cpp:332
#48 0x41463657 in nsAppShellService::Run() (this=0x81274e8) at nsAppShellService.cpp:471
#49 0x0805dfd4 in main1 (argc=3, argv=0xbffff3a4, nativeApp=0x809f838) at nsAppRunner.cpp:1541
#50 0x0805ec48 in main (argc=3, argv=0xbffff3a4) at nsAppRunner.cpp:1902
#51 0x420156a4 in __libc_start_main () from /lib/tls/libc.so.6
(gdb)
What happens is that on the 501st call, the PL_IsQueueNative parameter "queue" is nil. The stack runs from PushThreadEventQueue down to PL_IsQueueNative.
I can confirm this still crashes (linux) on a new trunk build from this morning.
Doug, do you know anything about limitations on the number of calls to PushThreadEventQueue()? It seems like popping the queue does not help, and we crash on the 501st call.
Attached patch FIX (obsolete) — Splinter Review
This seems to fix the crash. My test run is at 3,000 iterations and climbing, so the crash at 501 is fixed by this patch. --pete
Man, this is leaking like a sieve. Using my test case posted in bug 208127, after about 3,031 iterations: start memory approx. 22 MB, end memory approx. 50 MB. Serious leakage here.
Assignee: heikki → petejc
Whiteboard: patch, need review
Target Milestone: --- → mozilla1.5beta
Status: NEW → ASSIGNED
Priority: -- → P1
Leaked 30 Megs in about 4 minutes.
*** Bug 208127 has been marked as a duplicate of this bug. ***
Anyone around to review this one line patch? CC'ing Brendan for some review love.
Attachment #127820 - Flags: superreview?(brendan)
Attachment #127820 - Flags: review?(dougt)
Why is adding NS_RELEASE a crash fix? Wouldn't that cure a memory leak? /be
It fixes the crash. Remove it and we crash on the 501st call to PushThreadEventQueue. The memory leak here is actually a different issue; this bug is about crashing on the 501st call. My guess is that perhaps there is a limit to how many ADDREFs can be made on a single class? I really don't know this code, just that the explicit RELEASE in the patch fixes this crasher.
There's nothing magic about 500. Dougt, lend us a clue. /be
Brendan, if Dougt is too busy, who else knows this code well enough to review? It's a one-liner; I've been running with it for some time and haven't seen any related problems. Can you help me get some traction on this nasty crasher fix so I can check it in? Thanks.
OK, here's the testcase:

const eqs = Components.classes["@mozilla.org/event-queue-service;1"].getService(Components.interfaces.nsIEventQueueService)
for (a = 0; a < 10000; print(++a))
  eqs.popThreadEventQueue(eqs.pushThreadEventQueue());

On raistlin-beos [cvs from now] (only tested once because it took too long):

$ ./run-mozilla.sh ./xpcshell /tmp/206947
...
5821
Kill Thread

I don't know if this was killed for resource reasons or what.

On cobra-darwin [cvs from this week] (only tested once because it took too long):

...
3029
./run-mozilla.sh: line 72: 13142 Segmentation fault "$prog" ${1+"$@"}

On boffo-linux [cvs from last week + some minor changes] (tested a few times):

...
509
./run-mozilla.sh: line 72: 16248 Segmentation fault "$prog" ${1+"$@"}

On deathstarii-solaris [cvs from a long time ago + various changes that I can't remember]:

...
125
Segmentation Fault - core dumped

From this we can conclude that 500 is definitely not a magic number. 512 might be the magic number on Linux (128 on Solaris?), but I'm not sure how reasonable that is. Anyway, I see a whole bunch of things I don't like in the event queue service / event queue code, so I'm going to poke at this stuff today and maybe Sunday. For the time being I'm going to side with Brendan.
> For the time being i'm going to side w/ brendan

Meaning? That there is nothing special about 500? Anyway, did you try your test with the release of aQueue, i.e.

NS_RELEASE(aQueue);
BTW: Timeless's test case doesn't crash on FreeBSD with Mozilla 1.3b.
Well, to put it simply, I claim there are a few problems:

1. nsEventQueue::*Init don't check the NSPR create-event-queue calls for failure.
   (I think I had a bug, or at least plans to change this, a long time ago, but I can't find the tree with those changes and am not going to search for a bug about it; this bug will be hijackable to suit my needs.)

2. Pop doesn't trigger Unlink.
   (The documentation in the two classes talks a lot about Unlink and lists a certain place where it would be called. Except, well... it isn't called.)

Running on cobra-darwin with a patch to make the init methods return failures:

...
4027
Program received signal EXC_BAD_ACCESS, Could not access memory.
0x00f2aa40 in nsCOMPtr_base::~nsCOMPtr_base() (this=0xbff80070, __in_chrg=0) at /Users/timeless/mozilla/xpcom/glue/nsCOMPtr.cpp:62
#0 0x00f2aa40 in nsCOMPtr_base::~nsCOMPtr_base() (this=0xbff80070, __in_chrg=0) at /Users/timeless/mozilla/xpcom/glue/nsCOMPtr.cpp:62
#1 0x00ef8db0 in nsEventQueueImpl::GetYoungestActive(nsIEventQueue**) (this=0x8fde10, aQueue=0xbff800f0) at ../../dist/include/xpcom/nsCOMPtr.h:1063
...
#4027 0x00ef8db0 in nsEventQueueImpl::GetYoungestActive(nsIEventQueue**) (this=0x1b5a20, aQueue=0xbfffde70) at ../../dist/include/xpcom/nsCOMPtr.h:1063
#4028 0x00f4b19c in nsEventQueueServiceImpl::GetYoungestEventQueue(nsIEventQueue*, nsIEventQueue**) (this=0xbff80070, queue=0xf57980, aResult=0xbfffdf30) at ../../dist/include/xpcom/nsCOMPtr.h:1063

Remember, according to the obvious interpretation of my testcase, event queues should not linger in the chain (and Unlink is the function which would make that happen). From the crash we can see that they are lingering, and from the code we can see there isn't a call to Unlink anywhere near this.
Well, it appears that on Linux, in nsEventQueueImpl::InitFromPRThread,

mEventQueue = PL_CreateNativeEventQueue("Thread event queue...", thread);

after the 509th assignment, PL_CreateNativeEventQueue returns (nil). So mEventQueue is assigned a null value and the chain of events leading to the crash starts, specifically in CheckForDeactivation(). Crash, boom.
So it appears that after 508 successful calls to PL_CreateNativeEventQueue, all on the same thread, things break. I notice Unlink is called in the nsEventQueueImpl destructor, and this works if you set the test to 508 iterations: all 508 objects allocated are properly released when the service is released.
Running Timeless's test case using PL_CreateMonitoredEventQueue instead of PL_CreateNativeEventQueue, it all seems to work peachy. I reach 1000 with no problems and the queue is properly destroyed when the service is unloaded. Any reason we need to use PL_CreateNativeEventQueue?
Here's an alternative testcase:

const eqs = Components.classes["@mozilla.org/event-queue-service;1"].getService(Components.interfaces.nsIEventQueueService)
try {
  for (a = 0; a < 10000; print(++a)) {
    eqs.popThreadEventQueue(eqs.pushThreadEventQueue());
    gc();
  }
} finally {
  quit();
}

With this testcase and the patch I'm attaching, things work up to 4099 (and growing, but very slowly; gc at each iteration is expensive).
Timeless, this patch just does error checking and throws NS_ERROR_OUT_OF_MEMORY. On Linux it still dies at 509:

509
uncaught exception: [Exception... "Component returned failure code: 0x8007000e (NS_ERROR_OUT_OF_MEMORY) [nsIEventQueueService.pushThreadEventQueue]" nsresult: "0x8007000e (NS_ERROR_OUT_OF_MEMORY)" location: "JS frame :: test.js :: <TOP_LEVEL> :: line 4" data: no]

I think the root of the problem is this: on some systems, after "n" calls on the same thread, PL_CreateNativeEventQueue will assign nil to mEventQueue. Right here:

http://lxr.mozilla.org/seamonkey/source/xpcom/threads/nsEventQueue.cpp#170

Slap a printf right after this assignment and you'll see something like this (on Linux):

504 mEventQueue(0x813a120)
505 mEventQueue(0x813a358)
506 mEventQueue(0x813a590)
507 mEventQueue(0x813a7c8)
508 mEventQueue(0x813aa00)
509 mEventQueue((nil))

Still seems like some kind of limit is being reached here. Error checking indeed needs to be there, but the root of the problem still needs fixing.
Traced it a little further, to here:

http://lxr.mozilla.org/seamonkey/source/xpcom/threads/plevent.c#852

pipe(self->eventPipe);

Looks like a system limitation: when self->eventPipe[1] reaches 1024, pipe() returns -1.

$ ulimit -a
open files (-n) 1024

It looks like the system allows only 1024 file descriptors. So it appears that thread event queues that are popped are not really fully popped until the service is released and the nsEventQueue destructors are called. Since the service is only released at shutdown, popped queues are not destroyed and keep holding on to their file descriptors. This would explain the disparity of results between the various systems: everything is being properly addrefed and released, but, like I said, only destroyed when the service is unloaded.
Attached patch FIXSplinter Review
This two-liner correctly resolves the problem, I believe. Using Timeless's test case, I ran it up to 10,000:

9992
9993
9994
9995
9996
9997
9998
9999
10000

Timeless, can you verify this patch on some other systems? If it's good, then "r=" it and maybe Brendan can give me some "sr=" love. Gotta close those open files, folks.
Attachment #127820 - Attachment is obsolete: true
Attachment #129029 - Attachment is obsolete: true
Well, patch #129066 fixes the test case but breaks other things. So, still looking for the right fix here...
Timeless, your GC solution may fix your specific test case, but from what I can tell, it would have no impact on the first test case posted here. Is that test case still valid? So which test case are we currently trying to fix?
Comment on attachment 129066 [details] [diff] [review] FIX

This is wrong and breaks Darwin (Mac OS X).
Attachment #129066 - Flags: review-
cc:ing a few folks who might be more clueful than I about why nsXMLContentSink::DidBuildModel is never called. (Did I get your attention?)

You guys have already noticed, but closing the event pipes after each ProcessPendingEvents would not be a good thing (tm). I do think you are right, though, that the problem is leaking event queues holding on to their pipe file descriptors until the system chokes itself to death. However, this is not caused by the event queue service, which does not hold a reference to event queues.

I see this: each iteration of nsXMLHttpRequest::Send in timeless' test case pushes a new event queue (line 1283). Though this queue is properly popped and cleaned up, it is never actually destroyed because of a persistent reference held in nsParser::mEventQueue, set in nsParser's ctor. This ref would be dropped in nsParser's dtor, but nsParser itself is leaking. That's because of a circular reference between nsParser::mSink, an nsXMLContentSink, and that nsXMLContentSink's mParser. That circular reference would be broken in nsXMLContentSink::DidBuildModel, but DidBuildModel is never called, ever. That's as far as I've traced it.

This sucks. Old event queues are building up in a never-ending stack, and when they get 1,000 deep the app goes down. Can't blame it. I feel pretty confident this crash will go away if the leaking nsParser can be plugged.
From what I can tell, nsParser is holding onto them. It gets constructed and grabs one of the event queues for the thread. I'm unsure why the parser isn't getting destructed; I see the destructor hit, but not for all instances.
The timeless patch (attachment.cgi?id=129042) applied to a CVS tree pulled at around 2003-08-02 20:00 UTC+02:00 gives the following results on Mac OS X 10.2.6 (with all the whistles and bells needed to compile the beasts):

1) The counter stops at 65535 = 2^16 - 1.

2) Memory use:

PID   COMMAND    %CPU  TIME      #TH  #PRTS  #MREGS  RPRVT  RSHRD  RSIZE  VSIZE
7404  MozillaFir 0.0%  41:58.71  16   254    2701    517M   29.0M  334M   809M

which is way above normal:

17968 mozilla-bi 0.0%  0:34.56   16   254    512     21.3M  24.5M  29.9M  280M
5543  mozilla-bi 0.0%  25:16.92  16   260    1051    46.1M  30.5M  67.9M  441M
766   mozilla-bi 0.0%  2:43:59   16   266    1677    61.5M  32.5M  85.6M  480M

3) The test browser is unresponsive to loading pages in new tabs, and the JavaScript console won't start; but opening and closing tabs works, and resizing the window works. Closing the window doesn't help: I get an arrow with a busy indicator, and the Window menu indicates that all the windows I've requested are active, but no windows get drawn.
Attachment #127820 - Flags: superreview?(brendan)
Attachment #127820 - Flags: review?(dougt)
I see the ref counts go up on both the parser and the content sink when SetContentSink is called in nsXMLDocument::StartDocumentLoad.
Comment on attachment 129042 [details] [diff] [review]
Error check and Unlink

danm: I know that this doesn't fix all of the problems, but it fixes the testcase in comment 34 and some other crashers. I know we still want some fix for the parser, and I still plan to write some code so that XPConnect won't hold a strong reference to event queues, but would you please review these changes?
Attachment #129042 - Flags: review?(danm-moz)
Restating comment 41 because it seems I wasn't clear: we have rampant leaks, and it looks to me that several of them, including the important nsEventQueue leak, will be plugged if the parser (?) would only call nsXMLContentSink::DidBuildModel. That ain't happening, and it's a problem.

timeless: I'd like to see the attachment 129042 [details] [diff] [review] patch in a separate (perhaps dependent) bug because I consider it a distraction from the core leak problem. The tight push/pop loop in the comment 34 testcase unfortunately pushes N simultaneous queues. It looks to me that there's no long-term leak there: they are all properly unwound, though only after the loop finishes. I imagine this is a problem like the comment 0 testcase for similar reasons. So obviously the two cases are related, but the fixes will be different, the cc list on this bug is growing, and I'd like to compartmentalize the two issues and avoid spamming potential help on this leak. If you'll put that patch in a new bug (sorry!) I have a 100-line comment ready for it :).
I don't believe nsXMLContentSink is being used at all in test case #124114. I put printf's in the class ctor and dtor and don't see them showing up. The nsParser is certainly used and appears to be properly destroyed. nsXMLHTTPRequest is properly being created and destroyed as well.
Actually, I am seeing the parser leak now. It appears to be 2 to 1:

nsParser CREATE (0x43b92aa8)
nsParser CREATE (0x43bc68d8)
nsParser DESTROY (0x43b92aa8)
Document http://linuxbox/leak/page1.html loaded successfully
nsParser CREATE (0x43b0c950)
nsParser CREATE (0x43b9f1c0)
nsParser DESTROY (0x43b0c950)
Document http://linuxbox/leak/page1.html loaded successfully
nsParser CREATE (0x43bb81c0)
nsParser CREATE (0x43bcedd0)
nsParser DESTROY (0x43bb81c0)
Document http://linuxbox/leak/page1.html loaded successfully
nsParser CREATE (0x43bb79b8)
nsParser CREATE (0x43bde3d0)
nsParser DESTROY (0x43bb79b8)
Document http://linuxbox/leak/page1.html loaded successfully

Where the second create is never released.
The XML sink will not be created because XMLHttpRequest will only try to parse things if the document mime type is XML, or if you explicitly override the mime type. It should not even create the parser in this testcase. Have a look at XMLHttpRequest::OnStartRequest: http://lxr.mozilla.org/seamonkey/source/extensions/xmlextras/base/src/nsXMLHttpRequest.cpp#952
Right, that's what I thought. I'm using text/plain in my test case, and I do see the parser leak, so obviously it's not being caused by nsXMLContentSink. There is only one reference from nsXMLHTTPRequest to nsIParser, here: http://lxr.mozilla.org/seamonkey/source/extensions/xmlextras/base/src/nsXMLHttpRequest.cpp#987
Nope, they *are* being created. I did a compile from the tip and now I see the printfs. nsXMLDocument and nsXMLContentSink are indeed being created. Doh!
Try this. Now we call overrideMimeType() to try to force the creation and use of the XML sink, which should result in proper cleanup. I also realized we do have a parser even in the case where there is no XML data and we do not force the mime type. If this workaround actually works (I haven't tried), then we should find a way to clean things up without requiring the sink in cases like this, where it's pointless trying to parse the data as XML.
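As an illustration of that workaround (a sketch only, not the attached testcase; the URL and structure here are assumed), the idea is to force the response to be treated as XML so the XML content sink is created and the parser/sink pair gets torn down through the normal document-load path:

// Sketch only: force XML parsing so the content sink is created and the
// normal cleanup path runs (assumed shape; not the attachment itself).
var xml = new XMLHttpRequest();
xml.open("GET", "about:blank", false);
xml.overrideMimeType("text/xml");
xml.send(null);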
The XML content sink object is being created with a null docshell:

http://lxr.mozilla.org/seamonkey/source/content/xml/document/src/nsXMLDocument.cpp#671

Maybe that is setting off a chain of events that leads to the leak in nsXMLContentSink.
Attached patch Proposed FIXSplinter Review
With this patch there are no leaks, no crash, and everything seems to work peachy. --pete
I can verify that both overrideMimeType (comment 52) and not setting the content sink when there's no docshell (comment 54) prevent event queues from leaking in the comment 0 testcase. (Note I just checked in a patch to nsEventQueue.cpp that makes verification easy.) pete's patch looks interesting but the right reviewer, that guy isn't me. (jst? heikki?)
Attachment #129191 - Flags: superreview?(jst)
Attachment #129191 - Flags: review?(hjtoi-bugzilla)
Changing target to 1.6 alpha.
Target Milestone: mozilla1.5beta → mozilla1.6alpha
I think we should try to avoid creating the objects we don't really need. In this case it seems like we shouldn't need the XML content sink. Please try to change the code so we won't create the XML sink in this case. If you can get that to work I'd rather approve that kind of change. (If it won't work I can re-evaluate this patch as well.)
> Please try to change the code so we won't create the XML sink in this case.
> If you can get that to work I'd rather approve that kind of change.

No, it doesn't work; that's the first thing I tried, and other things started to break. There are lots of instances where there is no docshell but the new content sink object is required. That's why the conditionals in this patch are where they are.
Comment on attachment 129191 [details] [diff] [review] Proposed FIX Okay, r=heikki
Attachment #129191 - Flags: review?(hjtoi-bugzilla) → review+
Thanks, Heikki. jst, or any other super-reviewers cc'd on this bug: ping. I haven't heard from jst in a while.
Comment on attachment 129191 [details] [diff] [review]
Proposed FIX

It seems to me that there's another problem lurking here that should IMO be fixed instead of simply not handing the sink to the parser when there's no docshell around. If there's a leak in that case, we should fix the leak, not hack around it for this single case. If this is needed for 1.5, or whatever the target is, I'll sr, but please do open a new bug about backing out this change and fixing the actual leak.
Attachment #129191 - Flags: superreview?(jst) → superreview+
jst, thanks for the quick review. I'll make sure I clearly comment this and refer to this bug number before checking in. Then whoever has time can track this down even further.
Checked in.
This caused Camino to hang at startup while parsing its bookmarks. Backing it out fixed it. What should I do?
OK, I guess we need to probe deeper to find the root source of the leak, like jst suggested.
Attached patch Another proposed fix (obsolete) — Splinter Review
The core problem here seems to be in the parser APIs. Once you start an async parse by calling Parse(), the parser assumes that you will send it the appropriate nsIStreamListener events. There seems to be no way of getting it to shut down properly otherwise. I tried calling nsIParser::Terminate(), but it doesn't do anything useful in this case. I even tried to send it just an OnStopRequest() without sending it any data, but it doesn't like that very much either; some scary-looking assertions fired inside the parser.

In this particular case, it looks like the Parse() call is set off by the call to nsXMLDocument::StartDocumentLoad() in Send(). However, as Heikki pointed out, we may subsequently decide, in OnStartRequest(), to not forward the nsIStreamListener events into the parser for non-XML mime types etc. (BTW, that was bug 129607.)

I don't know enough about the code here, but it seems to me that one reasonable way to avoid this situation would be to delay notifying the document that a load has started until such a time that nsXMLHttpRequest has made a decision about whether it wants to try building an XML document. This would mean that Parse() is not called and the content sink is not created unless nsXMLHttpRequest is sure that it intends to forward nsIStreamListener events into the parser.

This patch is my stab at taking this approach of lazily calling StartDocumentLoad(). It appears to work OK: the testcase in comment 0 doesn't leak or crash anymore. I tested this a bit (basically the online tests up at http://www.mozilla.org/xmlextras/tests.html) and nothing appears to be broken.
This is something that I started with and abandoned since it was getting into areas that I really don't understand. I can finish/test it if the approach sounds better than the earlier one. I was trying to delay the construction of the document itself until it is absolutely needed. It also fixed the leak, but is untested beyond that. It turns out the document is used to dispatch events when the request completes, and I was not sure that my changes in RequestCompleted() made sense, or how to test them.
Attachment #130523 - Flags: superreview?(jst)
Attachment #130523 - Flags: review?(hjtoi-bugzilla)
Comment on attachment 130523 [details] [diff] [review]
Another proposed fix

Seems like a good idea!

>Index: nsXMLHttpRequest.cpp
>===================================================================
>+  if (!mDocument) {
>+    // We were probably aborted. So no point in attempting to parse
>+    // incoming data.
>+    mState &= ~XML_HTTP_REQUEST_PARSEBODY;
>+  }

Did you actually hit this condition? How is it possible to be aborted before starting data transfer?

>+  } else if (!parser || (parser && parser->IsParserEnabled())) {
>+    // If we do have a parser, then it needs to be enabled for us to
>+    // safely call RequestCompleted(). If the parser is not enabled,
>+    // it means it was blocked, by xml-stylesheet PI for example, and
>+    // is still building the document. RequestCompleted() must be
>     // called later, when we get the load event from the document.

This comment seems lacking. Why do we need to treat not having a parser the same as having an enabled parser?

>-    rv = document->StartDocumentLoad(kLoadAsData, mChannel,
>-                                     loadGroup, nsnull,
>-                                     getter_AddRefs(listener),
>-                                     PR_TRUE);
>     mChannel->SetNotificationCallbacks(this);
>     rv = mChannel->AsyncOpen(this, nsnull);

I am most wary of this change. Is it really OK to open a channel without telling the document first to start loading?
> Is it really ok to open a channel without telling the document first to start
> loading?

Should be -- that's how normal content area loads work. In fact, telling the document to start loading before the channel calls OnStartRequest is very much wrong -- StartDocumentLoad asks the channel for things like the content type, etc. Since the channel's listener is 'this', you just want to call StartDocumentLoad from the OnStartRequest of 'this' and then forward the other notifications to the document's listener (if any). That's essentially what the URILoader does for normal content area loads.
> Did you actually hit this condition? How is it possible to be aborted
> before starting data transfer?

Yes, I did. The reason I put that in was that I saw this condition fairly regularly on the following two tests:

http://www.mozilla.org/xmlextras/xgetabort.html
http://www.mozilla.org/xmlextras/xgetopen.html

I presumed that it would be quite natural for that to happen when async requests are aborted. I'm fairly certain that the current code was also ending up in OnStartRequest() after the Abort() call, before I touched it at all. Was that not supposed to happen?

> This comment seems lacking. Why do we need to treat not having parser the
> same as having enabled parser?

Indeed it is lacking. Is the following better?

// If we don't have a parser, we never attempted to parse the
// incoming data, and we can proceed to call RequestCompleted().
// Alternatively, if we do have a parser, it's possible that we
// have given it some data and thus caused it to block, e.g. by an
// xml-stylesheet PI. In this case, we will have to wait till
// it gets enabled again and RequestCompleted() must be called
// later, when we get the load event from the document. If the
// parser is enabled, it is not blocked and we can still go ahead
// and call RequestCompleted() and expect everything to get
// cleaned up immediately.

I am open to suggestions about improving that further. And I guess bz addressed your last point: it does appear to be safe to open the channel without starting the load. That's kinda the whole point of the patch :-)
Comment on attachment 130523 [details] [diff] [review] Another proposed fix With the new comment, r=
Attachment #130523 - Flags: review?(hjtoi-bugzilla) → review+
Attachment #130523 - Attachment is obsolete: true
Comment on attachment 131388 [details] [diff] [review]
Same patch. Updated comment foo.

Marking heikki's review. Transferring sr request to the updated patch. jst, can you please have a look at this?
Attachment #131388 - Flags: superreview?(jst)
Attachment #131388 - Flags: review+
Attachment #130523 - Flags: superreview?(jst)
Comment on attachment 131388 [details] [diff] [review]
Same patch. Updated comment foo.

- In nsXMLHttpRequest::OnStartRequest():

+    return mXMLParserStreamListener->OnStartRequest(request,ctxt);

While you're at it, please add a space after ','.

- In nsXMLHttpRequest::OnStopRequest():

+  } else if (!parser || (parser && parser->IsParserEnabled())) {

Unnecessary parser check; simply:

+  } else if (!parser || parser->IsParserEnabled()) {

would do (the rhs of an || is only executed if the lhs expression evaluates to false, and in that case we know parser is non-null here).

sr=jst
Attachment #131388 - Flags: superreview?(jst) → superreview+
Taking the bug since I have the latest patch, and I don't want to forget about checking this in. jst, thanks for the sr. I can't believe I wrote that redundant parser check :-) Will update and check this in soon.
Assignee: petejc → keeda
Status: ASSIGNED → NEW
Patch checked in. Hopefully, this time, this really is fixed :-)
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Attachment #129042 - Flags: review?(danm-moz)
Crash Signature: [@ nsEventQueueImpl::GetYoungestActive]