Closed Bug 69145 Opened 24 years ago Closed 17 years ago

leak in repeated document.load calls off of timer in javascript

Categories

(Core :: DOM: Core & HTML, defect)

defect
Not set
major

Tracking


RESOLVED WORKSFORME

People

(Reporter: jeffr, Unassigned)

References


Details

(4 keywords, Whiteboard: (pseudo-leak, have workaround))

Attachments

(5 files)

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.01; Windows 95)
BuildID:    2001010901

In http://209.192.190.33/dtd/leak.htm an intervaltimer fires off repeatedly and 
each time uses a 'hidden' document's load method to load a small xml document.  

In the document's onload event callback, the following line extracts an element 
from the xml and displays it in a <div>:
document.getElementById('display').innerHTML = XMLDom.getElementById('TimeStamp').childNodes[0].nodeValue;

The page in question polls rapidly to highlight the leak, but the leak occurs 
even if you poll less rapidly.  The javascript uses a global flag to prevent 
attempting a new load while the previous one is still pending (i.e. has not yet 
fired the 'load' event.)
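The polling pattern described above can be sketched as follows. This is a minimal stand-in, not the original page (which is offline): `loadXml` simulates the asynchronous `XMLDocument.load` call, and the URL and all names are illustrative.

```javascript
// Sketch of the testcase's polling loop: an interval timer fires
// repeatedly, and a global flag prevents starting a new load while the
// previous one has not yet fired its 'load' event.
var inProgress = false;  // global flag guarding against overlapping loads
var loadCount = 0;

function loadXml(url, onload) {
  // Stand-in for the asynchronous document.load(url); in the real page
  // 'onload' is the hidden document's load-event handler.
  setTimeout(onload, 0);
}

function tick() {
  if (inProgress) return;  // previous load still pending: skip this tick
  inProgress = true;
  loadXml('/dtd/leak.xml', function () {
    loadCount++;
    inProgress = false;    // allow the next interval tick to load again
  });
}

// In the real page: setInterval(tick, someShortInterval);
```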

This leak makes Mozilla look very bad compared to the 'other' browser, which 
does not exhibit this leak.

Reproducible: Always
Steps to Reproduce:
1. Open up page http://209.192.190.33/dtd/leak.htm
2. Monitor the Mozilla process and watch its working set grow without bound

Actual Results:  eventually your system runs out of virtual memory and becomes 
semi-catatonic

Expected Results:  the aforementioned page http://209.192.190.33/dtd/leak.htm 
should run and run and run without fail and without leak!

The same leak also happens on a Win95 machine and a Win2K machine, so it is not
likely specific to WinNT.
Confirmed
Platform: PC
OS: Linux 2.2.16
Mozilla Build: 2001021408


Leaks badly on Linux too. Marking NEW. Adding Keywords.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: footprint, testcase
OS: Windows NT → All
Hardware: PC → All
It seems to me like a DOM issue - reassigning. Please assign back if this is
actually JS Engine...
Assignee: rogerl → jst
Component: Javascript Engine → DOM Level 0
QA Contact: pschwartau → desale
cc'ing jband for advice. Do you think this is a DOM or JS Engine issue? 
Thanks - 
Certainly have to figure this for a DOM problem unless Johnny can show otherwise.
Keywords: dom0
Severity: blocker → major
Keywords: mlk, nsbeta1+
Heikki offered to look into this.
Assignee: jst → heikki
Target Milestone: --- → mozilla0.9
Target Milestone: mozilla0.9 → mozilla0.9.1
It looks like this is not a real leak per se, because when you move to another
page the memory usage drops down to what it was. So it looks like it is a
problem where the JS GC would need to be run more often. Or maybe there is
something holding a ref which prevents the GC from clearing memory until a new
page is loaded. Unfortunately I don't know too much about the JS engine...

On jst's suggestion I placed a "JS_MaybeGC(cx);" call in
GlobalWindowImpl::RunTimeout() right after line

  cx = (JSContext *) mContext->GetNativeContext();

but this did not have any effect. I am not sure if it even should do anything,
since just before that we addreffed the window and script context.

So any help would be appreciated...


Additionally, I cannot run the attached testcase; it gives me

************************************************************
* Call to xpconnect wrapped JSObject produced this error:  *
[Exception... "'Permission denied to access property' when calling method:
[nsIDOMEventListener::handleEvent]"  nsresult: "0x8057001e
(NS_ERROR_XPC_JS_THREW_STRING)"  location: "<unknown>"  data: no]
************************************************************

I suspect this is to be expected since the script (on bugzilla.mozilla.org) is
trying to do document.load and access the properties of a file from another
host. Mitch, is this the desired behaviour?


The testcase in the URL still runs, and I placed another sample at

http://blueviper/heikki/bug69145.html

This is a slightly modified version of the provided testcase: this loads a big
(277 kB) XML document as data, increments a counter and outputs that along with
an element it located in the document. Each load consumes something like 1 meg
of RAM so it is easy to see that we "leak". Wait a while and open another page
and you will see that the memory usage drops back down.

By the way, it appears that when we load the 277 kB doc as data it takes about 1
second on my computer, normal load takes about 12 seconds and cached load about
8 seconds (wall clock measurements).
Ok, running JS_GC() clears the "leak".

I tested with my own testcase, running JS_GC every time we entered the
GlobalWindowImpl::RunTimeout() function versus running it if over 10 seconds had
passed since we last ran JS_GC.

Running JS_GC() every time seems like we might leak for real a little, since
after going to another page and coming back to the testcase, we were 3 megs
higher than what we started with.

Running JS_GC only after 10 seconds had passed since the last run, we slowly seem
to leak (about the same 3 megs), but when I visit another page and come back the
3 megs are reclaimed.

Does JS_GC() clear only stuff in the context that was passed in, or globally? If
it clears only that context I could explain the latter test. I do not know why
the first test seems to leak for real while the latter does not.

I have not yet run page load time test or anything, 'cos I'd like to know if
what I have done seems feasible or if it is such a horrible hack that we want
something else.
Status: NEW → ASSIGNED
Whiteboard: [fixinhand?]
+  LL_ADD(lastGCPlus,lastGC,PR_USEC_PER_SEC);

Portability requires an LL_I2L(usecPerSec, PR_USEC_PER_SEC) first, then use
usecPerSec as LL_ADD's second input parameter -- but don't you want to multiply
PR_USEC_PER_SEC by 10 to do what the comment says?

You *do not* want to run JS_GC on every timeout -- that will kill performance.

GC is a global (per JSRuntime, which contains the GC heap), not window-wise (per
JSContext) operation.  But contexts can hang onto newborn objects and strings
via roots.  JS_GC clears those newborn roots.  We could add an API to clear just
those newborns for the context at hand, but first let's get some traces that
show those are entraining the transient garbage.

/be
I was unclear, or just inaccurate:

> JS_GC clears those newborn roots.

I should have said "... *after* running the full GC."  That means that you might
need two GCs to collect garbage entrained by a newborn root.  The idea of a
JS_ClearNewbornRoots(cx) API is that you could call it first, then call JS_GC,
and be sure that the newborns entrained nothing.

But again, let's be sure the newborns are to blame here.

/be
Yeah, the time interval stuff is still open (and I also thought I should do what
you suggested but for some reason I get 10 secs the way it is implemented now).

I did some measuring of how many times we enter GlobalWindowImpl::RunTimeout().
Normally it appears only about once per page load, if that. Typing in the URLbar
gets us there once per keystroke. Didn't check mailnews yet. DHTML pages are the
exception, they can run timeouts with really short intervals.

jst suggested I measure this in case we might want to run the GC based on some
clever combination of the number of runs in a certain time frame. It looks like
we would want to run GC more often if we get several calls into this function in
a small time frame. So for example if we got over 10 calls in 10 seconds, run
GC, otherwise run GC if it has been over a minute since we last ran it (these
numbers are just examples, I am open to suggestions).

jband mused that JS_MaybeGC() probably didn't clear anything in this case
because we have just a few small JS objects. Unfortunately they reference large
native objects (whole documents) about which the JS engine knows nothing.

One option might be to use the memory flushers (?) to check if we are low on
memory and run GC only then. rpotts thinks they are not enabled because of
various problems (crashes due to thread issues, how to measure low memory, etc.).

I believe we would not need to introduce new functions. If it takes 2 GCs to
collect everything, we just need to keep that in mind when choosing the time
interval after which to force GC. As I was testing with the 10 secs interval (as
measured by wall clock) and the testcase that loaded a large file about once a
second I got 10 megs bloat which was cleared every 10 seconds.

Is it perfectly safe to run the GC here, stability wise?
You need to use PR_Now, not PR_IntervalNow, to get microseconds since the epoch.
PR_IntervalNow returns system-dependent "tick" units.

See and update bug 66381 for JS_MaybeGC improvements.  Shaver kindly took it
from me, not sure when he's going to work on it.

It's safe to run JS_GC here, but we don't want to run it too often, or too
infrequently.  But GlobalWindowImpl::RunTimeout calls mContext->EvaluateString
or mContext->CallEventHandler, and those both call ScriptEvaluated, which may
call JS_MaybeGC.  Rather than directly adding a JS_GC call to RunTimeout, could
we improve the existing unified mechanism, ScriptEvaluated?  In addition, of
course, to working on bug 66381.

/be
Ok, based on Brendan's suggestion I tried to see how this would look in
nsJSContext::ScriptEvaluated(). That is not yet a fix candidate, though.

That function is called a lot. Hundreds of times during startup, dozens of times
when typing in URL bar, dozens of times when mousing over links, and so on.

I am a bit baffled, though. The old implementation there has data member
mNumEvaluations, and we run JS_MaybeGC() when the count goes over 20. So we are
keeping a per context counter and calling a global GC when the per context
counter goes over. So we could basically have dozens of contexts, all of which
call JS_MaybeGC() at almost the same time, and all the calls to JS_MaybeGC() are
useless after the first one?

Shouldn't we keep a global counter instead, and modify the number after which to
run JS_MaybeGC()?
"So we are keeping a per context counter and calling a global GC when the per
context counter goes over."

Not a "global" GC (full vs. incremental, better word), but a "maybe" GC.  The
heap has to appear to have grown to 1.5x the last known population (known
by the GC when it last ran).  As I said, a lot of the smarts arguably should go
in JS_MaybeGC (that's what bug 66381 asks for).

Certainly, just counting (whether with one counter or many) is inferior to some
attempt to measure the garbage generation rate and adapt the GC schedule to a
filtered prediction based on recent rate history.  Shaver, you got anything yet?

/be
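The "maybe" heuristic Brendan describes can be roughly illustrated as follows. This is a sketch, not SpiderMonkey's actual implementation: the expensive full GC runs only when the heap appears to have grown to 1.5x the population known at the last GC.

```javascript
// Illustrative only: run the costly GC when the current heap size
// exceeds 1.5x the size recorded when the GC last ran.
function makeMaybeGC(runGC) {
  var lastKnownBytes = 0;  // population as known by the GC when it last ran
  return function maybeGC(currentBytes) {
    if (currentBytes > lastKnownBytes * 1.5) {
      runGC();
      lastKnownBytes = currentBytes;  // simplification: assume no shrinkage
      return true;
    }
    return false;
  };
}
```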
I don't seem to be able to find any good ways to decide when to trigger JS_GC()
in ScriptEvaluated(). Using the testcase on blueviper (or nowadays green since
blueviper is being retired) I got about 150 calls to ScriptEvaluated() in about
15 seconds. Having www.mozilla.org open and mousing up and down the TOC on the
left called ScriptEvaluated() over a thousand times in approximately 15 seconds.
Any key press also produced at least one call. Starting up produced several
hundreds of calls. The aTerminated argument didn't seem to help that much
either. It was false when mousing over links, and one cycle to load a document
in the testcase got 4 or 6 calls with it being true and one call with it being false.

So, I am leaning more towards the original solution which was to put this stuff
in GlobalWindowImpl::RunTimeout().
The latest patch simply forces us to run GC if we got over 15 timeout calls
since the last forced GC, or if it has been over a minute since we last forced
a GC.
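That scheduling policy can be sketched as follows (names are hypothetical; the real patch would call ::JS_GC(cx) from C++ inside GlobalWindowImpl::RunTimeout()):

```javascript
// Force a GC after more than 15 timeout calls since the last forced GC,
// or after more than 60 seconds, whichever comes first.
var timeoutCallsSinceGC = 0;
var lastForcedGC = Date.now();

function maybeForceGC(runGC) {
  timeoutCallsSinceGC++;
  var now = Date.now();
  if (timeoutCallsSinceGC > 15 || now - lastForcedGC > 60 * 1000) {
    timeoutCallsSinceGC = 0;
    lastForcedGC = now;
    runGC();  // stands in for ::JS_GC(cx) in the C++ patch
    return true;
  }
  return false;
}
```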

I don't think I would know how to track down if newborns are to blame. Brendan's
comment "The idea of a JS_ClearNewbornRoots(cx) API is that you could call it
first, then call JS_GC, and be sure that the newborns entrained nothing."
doesn't make sense to me, unless there was a typo and JS_GC was supposed to be
JS_MaybeGC. If so, then I could see that the following would perhaps be the
closest to perfect solution here:

Index: src/base/nsJSEnvironment.cpp
===================================================================
RCS file: /cvsroot/mozilla/dom/src/base/nsJSEnvironment.cpp,v
retrieving revision 1.139
diff -u -r1.139 nsJSEnvironment.cpp
--- nsJSEnvironment.cpp 2001/05/18 07:25:56     1.139
+++ nsJSEnvironment.cpp 2001/05/22 21:43:02
@@ -1334,6 +1334,7 @@

   if (mNumEvaluations > 20) {
     mNumEvaluations = 0;
+    ::JS_ClearNewbornRoots(mContext);
     ::JS_MaybeGC(mContext);
   }

In any case, writing that function is beyond my knowledge.

So, are we going to proceed with my latest patch (with possible tune-ups), or do
something else (do JS_ClearNewbornRoots, fix JS_MaybeGC, drop this from 0.9.1)?
Jeff Robbins and I had some email conversations, and one of his questions
prompted an approach that I can not dismiss outright...

When a JS object is placed in the list that can be garbage collected, could we
release the native data at that point, instead of waiting and releasing it when
the JS object is freed?
is the content representative of real pages?  can we move this off to
0.9.2 w/o shooting ourselves in the foot?
heikki: yes, I meant JS_MaybeGC in the context you understood (which I did not:
viz, calling JS_ClearNewbornRoots were it to exist, just before calling the best
JS GC API, from ScriptEvaluated).  But there is no list of objects to be GC'd,
that's not how garbage collection works.  Until you run the GC, you don't know
what is garbage and what is not.  It may be that a clever script writer knows
that the old document is garbage once a new one has loaded (until some other
hacker retains a ref to the old one behind the new one's back!).  But the JS
engine has no idea.  There are no reference counts.

GC requires examining the live object graph, starting from the root set, and
following strong refs.  There is no quick and dirty substitute, in general.

If an embedding "knows" for sure that some object is garbage, I bet real money
it will be broken next month by a scripted extension, because once you hand out
JS refs (which are strong by definition), you never know how long they'll live. 
The longest-lived must keep the referred-to object alive at least that long. 
After that, it's garbage, but you have to mark and sweep, or do something
incremental but equivalent, to be sure.

/be
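Brendan's point about reachability can be shown with a toy mark phase (purely illustrative, nothing like SpiderMonkey's real collector): an object is live exactly when a traversal from the roots reaches it, and an "old document" stays live as long as any live object still holds a reference to it.

```javascript
// Toy mark phase: everything reachable from the roots via strong refs
// is live; everything else is garbage. There are no reference counts.
function mark(roots) {
  var live = new Set();
  var stack = roots.slice();
  while (stack.length > 0) {
    var obj = stack.pop();
    if (live.has(obj)) continue;  // already marked; also breaks cycles
    live.add(obj);
    for (var key in obj) {        // follow every strong reference
      if (obj[key] !== null && typeof obj[key] === 'object') {
        stack.push(obj[key]);
      }
    }
  }
  return live;
}

// A hidden reference behind the new document's back keeps the old one alive.
var oldDoc = { name: 'old' };
var newDoc = { name: 'new', previous: oldDoc };
var roots = [newDoc];
```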
I don't mind a short-term fix, so long as it is clearly labeled and the right
way pointed out, and so long as there's follow-through (shaver and I owe that
more than you do, but we all benefit by unifying and improving the "smoothly
scheduled GC" mechanism).

In that light, I don't see why we wouldn't call JS_MaybeGC all the time, without
any damping counter modulus, in ScriptEvaluated.  And, independent of that and
in keeping with the calls to nsIScriptContext::EvaluateString and CallFunction,
we would not have any special code in RunTimeout.

So why not move your FrequencyBalancedGC into ScriptEvaluated, and let it unify
the scheduling policy for all scripts, timeouts or otherwise?  When JS_MaybeGC
or a new API emerges that guarantees a smooth yet effective GC schedule, we can
simplify just that one place in DOM code to call the JS API.

BTW, you need to use LL_EQ, not raw ==, on PRTime and PRInt64/PRUint64 types,
for portability.

/be
Sorry if I was unclear:  I think we might do best to do these things in order:

SHORT TERM: unify something like FrequencyBalancedGC in ScriptEvaluated and make
no hacks to RunTimeout.

LONGER TERM: fix bug 66381 and change ScriptEvaluated to call JS_MaybeGC.

/be
The problem placing FrequencyBalancedGC() in ScriptEvaluated() is that there
does not appear to be any way to make it smartly frequency balanced. About the
best approach is to run it every 10-20 seconds :( As I pointed out, just rapidly
mousing over the mozilla.org TOC will generate hundreds of calls to
ScriptEvaluated() in 10 seconds. A script doing document.load([255kb doc]) in as
efficient a loop as possible via setTimeout (for example) might only be
generating 10 calls to ScriptEvaluated(). Yet the former usage pattern does not
cause any measurable memory increase while the latter would consume 10 megs of
memory.

I don't know how widely document.load() is used on the web; I asked Prashant to
try his search tool (my attempts with some search engines were not too useful,
they seemed to be case insensitive and ignore punctuation characters).

In any case, document.load() might not be the only function that can cause the
situation we see here. And setTimeout() is probably not the only way to do this
kind of thing either, you might be able to use events as well (so
ScriptEvaluated() might be the best place for FrequencyBalancedGC(), after all).

We investigated the memory flushers etc. a bit with vidur, but there doesn't
seem to be anything immediately available, at least not cross-platform. We would
probably want to have the memory flushers force JS GC in low mem situations. And
we would also want finer-grained control over flushing (the current heuristic is
roughly: if we are down to the last 10% of total available memory, flush everything).

Or we could combine it with JS_MaybeGC() somehow, see if the process size grew
rapidly since we last checked and force GC. This might not be needed if the
generic memory flushers work well, since using memory is not bad per se. It is
bad if we are running out of it, or slowing the system down because of swapping.
I am moving this to 0.9.2 since it appears document.load() is not yet very
commonly used and the fix I have here is not what I would want it to be.
JS_MaybeGC() improvement is also slated for 0.9.2, and we would have some more
time to think about the memory flushers as well, which might actually make this
bug go away.

Prashant could not find document.load() on top100 sites. If you know of real
websites using it, please let us know.

A workaround to this bug might be to call some JS-garbage-generator-function
(create lots of small string objects, for example?) often enough so that
JS_MaybeGC() would end up calling JS_GC().
Keywords: nsbeta1+nsbeta1-
Whiteboard: [fixinhand?]
Target Milestone: mozilla0.9.1 → mozilla0.9.2
After five minutes of testing I conclude that it seems to be really simple to
work around this bug in Mozilla. With code like below it takes about 1 second to
load the big XML file which consumes about 1 meg of memory. While the load is
still going on, we generate dummy JS objects to trick the JS GC into kicking in.
I see the browser memory usage stay between 40 and 50 megs during 100 loads
(without this trick it would have gone up to a hundred or so).

var inProgress = false;

setInterval(tick, 10);

function tick() {
  if (!inProgress) {
    inProgress = true;
    XMLDom.load('/heikki/big.xml');
    // inProgress is reset to false in the document's onload handler
  } else {
    // Work around Mozilla bug 69145: generate short-lived JS objects
    // so JS_MaybeGC() eventually decides to run a full GC.
    var i;
    for (i = 0; i < 100; i++) {
      var dummy = "dummy" + i;
    }
  }
}
Whiteboard: (pseudo-leak, have workaround)
Target Milestone: mozilla0.9.2 → mozilla0.9.3
Target Milestone: mozilla0.9.3 → mozilla0.9.4
I don't think I will be doing anything with this one anymore. Giving to shaver...
Assignee: heikki → shaver
Status: ASSIGNED → NEW
Priority: P2 → --
Target Milestone: mozilla0.9.4 → ---
Blocks: 92580
No longer blocks: 92580
Assignee: shaver → general
QA Contact: desale → ian
When I load the testcase, I just get:
"uncaught exception: Permission denied to call method XMLDocument.load"
I've created a working testcase at http://a2pg.com/testcase/xmlleak.html but it does not cause memory use to increase even after tens of thousands of iterations.
After running the testcase all night, I see it "leaks" about 50 bytes per iteration. It is not a true leak, because after I clicked the Back button to leave the test and opened a new tab and navigated to other pages, the memory was released. On the other hand, it caused memory use to increase by over 100 MB temporarily, and that's a problem that should be fixed.
(In reply to comment #32)
> After running the testcase all night, I see it "leaks" about 50 bytes per
> iteration. It is not a true leak, because after I clicked the Back button to
> leave the test and opened a new tab and navigated to other pages, the memory
> was released. On the other hand, it caused memory use to increase by over 100
> MB temporarily, and that's a problem that should be fixed.

Thanks for testing this. The fact that memory use peaked at over 100MB in this case seems bad, but that's unrelated to the leak that this bug is about. Please file a followup bug on that issue and mark this bug fixed (or reference the bug here and ask someone else to mark it fixed if you're unable to). Thanks!
I've filed followup bug 411709. Marking this one as WFM as we don't seem to know which patch fixed the original problem.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → WORKSFORME