Last Comment Bug 508128 - httpd.js running in xpcshell GC's too infrequently
: httpd.js running in xpcshell GC's too infrequently
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: XPConnect (show other bugs)
: Trunk
: All All
: -- normal with 1 vote (vote)
: mozilla7
Assigned To: Josh Matthews [:jdm] (away until 9/3)
:
Mentors:
Depends on: 601557 606592 619164 619167 nsITimer-fail
Blocks:
  Show dependency treegraph
 
Reported: 2009-08-03 15:33 PDT by PTO until Sep 5 NZ time; Chris Pearce (:cpearce)
Modified: 2011-07-05 10:55 PDT (History)
24 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch httpd.js to gc() in _connectionClosed [Backout: Comment 28] (1.08 KB, patch)
2009-08-04 16:55 PDT, PTO until Sep 5 NZ time; Chris Pearce (:cpearce)
jwalden+bmo: review+
Details | Diff | Splinter Review
Periodically gc in the server process (1.27 KB, patch)
2009-08-05 13:27 PDT, Jeff Walden [:Waldo] (remove +bmo to email)
no flags Details | Diff | Splinter Review
fix do_timeout [checked in comment #50] (3.01 KB, patch)
2009-11-01 19:11 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
jwalden+bmo: review+
Details | Diff | Splinter Review
fix staller.sjs (2.93 KB, patch)
2009-11-02 00:50 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
jwalden+bmo: review-
Details | Diff | Splinter Review
fix subtle QI bug in test_cookie_header.js [checked in comment 50] (1.09 KB, patch)
2009-11-02 20:25 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
no flags Details | Diff | Splinter Review
fix staller.sjs (v2) [checked in comment 70] (1.57 KB, patch)
2009-11-03 03:13 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
jwalden+bmo: review+
Details | Diff | Splinter Review
Fix RawTest QI issue [checked in comment 70] (1.61 KB, patch)
2009-11-03 13:01 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
no flags Details | Diff | Splinter Review
Fix additional RawTest bugs [checked in comment 70] (1.37 KB, patch)
2009-11-05 15:05 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
jwalden+bmo: review+
Details | Diff | Splinter Review
Additional diagnostics to report more data about exceptions in tests (10.12 KB, patch)
2009-11-05 15:06 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
jwalden+bmo: review+
Details | Diff | Splinter Review
Additional logging and asyncWait change (2.70 KB, patch)
2009-11-05 15:07 PST, Robert O'Callahan (:roc) (email my personal email if necessary)
no flags Details | Diff | Splinter Review
Ensure input stream lifetime is not cut short by a GC. [checked in comment 80] (1.42 KB, patch)
2010-10-05 16:24 PDT, Josh Matthews [:jdm] (away until 9/3)
jwalden+bmo: review+
Details | Diff | Splinter Review
Additional diagnostics to report more data about exceptions in tests [checked in comment 80] (10.42 KB, patch)
2010-10-05 16:26 PDT, Josh Matthews [:jdm] (away until 9/3)
no flags Details | Diff | Splinter Review

Description PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-03 15:33:43 PDT
When repeatedly running mochitest on content/media/test/test_seek.html, I can see in Task Manager or by using `top` that xpcshell.exe's memory usage rapidly increases with every run. It expands by a few hundred megabytes every reload of test_seek.html. If you wait long enough, (I assume) a garbage collection is done, and the memory shrinks back down to a few hundred megabytes. When running this many times, it's easy to expand the memory usage to over a gigabyte.

Usually the first run expands the memory usage to a few hundred magabytes, but if I add a call to gc() at the end of _connectionClosed in httpd.js:719, memory always stays at around 9MB.

Note that test_seek is disabled on Windows because as it fails intermittently. This is because either necko or httpd.js can't handle suspending or resuming connections, this is a separate issue, and I'm looking into this. If test_seek.html appears to not terminate, that's not related to this gc issue. If I set the media code to not try to suspend/resume connections, this memory bloat still happens, so that's not the cause.

This bug happens on at least Linux and Windows. You have to edit content/media/test/Makefile.in to enable test_seek.html on Windows if you want to test on Windows.

STR:
1. Run mochitest on content/media/test_seek.html
2. Observe memory usage of xpcshell.exe in `top`/Task Manager.
3. Repeatedly reload test_seek.html page, note ever increasing memory usage.
Comment 1 Andreas Gal :gal 2009-08-03 15:39:13 PDT
Please try the patch in bug 506125
Comment 2 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-03 16:30:49 PDT
(In reply to comment #1)
> Please try the patch in bug 506125

It may help, but not enough; it's still only take a few runs to get over 1GB on Linux running seek_test.html.
Comment 3 Andreas Gal :gal 2009-08-03 16:34:05 PDT
We should investigate whether the MaybeGC hook is called properly. It should trigger the GC as your working set size increases (for every 18.5% increase you get a GC).
Comment 4 Mike Shaver (:shaver -- probably not reading bugmail closely) 2009-08-04 06:57:08 PDT
Does xpcshell do any calling of MaybeGC by default?  It may be that we need to run it with a -b argument to get steadier memory behaviour here, or just add the gc() call to httpd.js as alluded above.
Comment 5 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-04 16:00:02 PDT
(In reply to comment #4)
> Does xpcshell do any calling of MaybeGC by default?

I tested this with Anreas' patch from bug 506125, and it looks like xpcshell doesn't ever call MaybeGC(). During a mochitest run the browser does, but xpcshell doesn't. It looks like xpcshell only GCs on exit. I've run through a /content mochitest run on Windows with a break point in xpcshell's instance of js_GC() and it's not hit. Memory use for xpcshell.exe peaked at about 885MB, firefox.exe at about 170MB. Remember this was only for the content mochitests!

> It may be that we need to run it with a -b argument to get steadier memory 
> behaviour here, or just add the gc() call to httpd.js as alluded above.

I looks like xpcshell doesn't have a -b argument, could you elaborate please?
Comment 6 Andreas Gal :gal 2009-08-04 16:18:55 PDT
That explains it. I will look at xpcshell and see if we can easily add MaybeGC calls.
Comment 7 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-04 16:55:27 PDT
Created attachment 392624 [details] [diff] [review]
Patch httpd.js to gc() in _connectionClosed
[Backout: Comment 28]

Patch adds a gc() call, keeps xpcshell memory down to about 15MB or so.

<gal> cpearce: so bent said xpcshell is mostly used for testing
<gal> and if we do an automated gc heuristics, you couldn't test GC stuff easily any more
<gal> so maybe we should simply add that manual gc() call
<cpearce> gal: that's a shame, but ok.
<gal> yeah, I agree
<gal> the regular shell triggers the GC as expected
<cpearce> I'll put a patch up for gc() in httpd.js then.
<cpearce> Are there any other scripts run by xpcshell that we need to do something similar in?
<gal> I wouldn't know
<gal> I know we have tests that hit last ditch gc
<brendan> we shouldn't do too much gc() hacking
Comment 8 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-04 17:39:45 PDT
Comment on attachment 392624 [details] [diff] [review]
Patch httpd.js to gc() in _connectionClosed
[Backout: Comment 28]

I'm sympathetic, but no.  xpcshell should have some way -- I believe by default -- for GC to occur naturally over time, without requiring explicit calls to a gc function.  If there is some reason to need more precise control over GC, that functionality should be opt-in; the common case is not to be testing precise GC behavior.  It is not the duty of every xpcshell user to work around this lack.

As a side effect, your patch probably causes the reftest harness to fail.
Comment 9 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-08-04 18:04:05 PDT
Brendan, see comment #8. Can you, Jeff and Andreas reach an agreement on what needs to be done here, and do it? This is a pretty huge problem --- I don't know how we've struggled on this long without noticing it. (OK, I have noticed huge httpd.js memory usage before, but I didn't consider that GC might just not be happening.)
Comment 10 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-04 18:06:06 PDT
GC will happen, but only at the extreme of $hugemem, not optimistically before then, I wasn't thinking of that, sorry.

If you want to minimize the time it takes to gc, adding |data.length = 0| after |output.writeByteArray(data, data.length)| in httpd.js will probably make a large difference for this particular case; I'd be fine with that addition.
Comment 11 Brendan Eich [:brendan] 2009-08-04 19:11:36 PDT
Can we do something expedient in httpd.js ASAP, with a followup bug for the right fix to xpcshell filed and patched soon enough?

/be
Comment 12 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-04 20:05:49 PDT
(In reply to comment #8)
> (From update of attachment 392624 [details] [diff] [review])
> I'm sympathetic, but no. [...]
> As a side effect, your patch probably causes the reftest harness to fail.

I pushed the patch to tryserver, there were a few random-looking failures but all the tests ran at least, and all tests also run locally. I've pushed to tryserver again to see if the random-looking-failures happen again. How could this change cause the reftest harness to fail?
Comment 13 Andreas Gal :gal 2009-08-04 22:32:08 PDT
Looks like I am going to be fixing this. My memory pressure-based GC heuristic patch removes triggering GC from the allocator path and since xpcshell never calls MaybeGC, nochrome talos fails on some platforms now.
Comment 14 Andreas Gal :gal 2009-08-04 22:33:12 PDT
Chris, can you get the quick fix in? I will provide a generalized fix.
Comment 15 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-04 22:37:35 PDT
(In reply to comment #14)
> Chris, can you get the quick fix in? I will provide a generalized fix.

Sure, who can will review+ it?
Comment 16 Andreas Gal :gal 2009-08-04 22:42:36 PDT
My first pick would be jwalden, followed by sayrer. Both are cc'd.
Comment 17 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-08-04 22:48:58 PDT
jwalden just refused to r+ it.
Comment 18 Andreas Gal :gal 2009-08-04 22:55:08 PDT
Well he knows that corner a lot better me. I can't overrule him. Waldo, whats up with the review? Are you ok with a temporary stop-gap measure until I have the MaybeGC path in place? We can back it out later. It sounds like this issue is biting people right now.
Comment 19 Robert Sayre 2009-08-04 23:48:20 PDT
if this is turning tinderboxen orange as I write, we should take the httpd.js bandaid until we get a real fix.

bandaids are ok as long as they get removed properly.
Comment 20 Andreas Gal :gal 2009-08-04 23:49:59 PDT
is that an r=sayrer?
Comment 21 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-05 12:44:19 PDT
I expect this would break reftests because there wouldn't be a gc() function to call in reftest; the server is run there as a component, and I wouldn't expect components to have such a function in scope.  Thinking a little more, it's possible that the method is called late enough that the thrown ReferenceError happens to be lucky and doesn't break anything.  It's also possible my non-fanatical attempts to make httpd.js code exception-safe (or, rather, don't-hang-if-an-exception-is-thrown-safe) were successful in this case; hard to say without investigating, which I'm not inclined to do.

There are bandaids that change the way existing APIs are used to work around problems in them.  There are also gross, awful hacks that perform some completely random change unmotivated by what the surrounding code does.  I think this is an instance of the latter and not an instance of the former.
Comment 22 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-05 12:49:13 PDT
Here's a hack I'd accept, because it doesn't touch httpd.js itself: if we added testing/mochitest/tests/run-gc.sjs with contents |function handleRequest(request, response) { response.setHeader("Cache-Control", "no-cache", false); Components.utils.forceGC(); }| and occasionally loaded that file, would that also do a JS GC and ameliorate the problem?
Comment 23 Robert Sayre 2009-08-05 13:00:46 PDT
(In reply to comment #22)
> Here's a hack I'd accept, because it doesn't touch httpd.js itself: if we added
> testing/mochitest/tests/run-gc.sjs with contents |function
> handleRequest(request, response) { response.setHeader("Cache-Control",
> "no-cache", false); Components.utils.forceGC(); }| and occasionally loaded that
> file, would that also do a JS GC and ameliorate the problem?

gah. way, way too much ink spilled on a trivial, temporary bandaid patch. 

Making an sjs file and calling is more complicated. We should just call Components.utils.forceGC() on connection close until Gal gets us a better fix, if you ask me.
Comment 24 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-05 13:27:08 PDT
Created attachment 392786 [details] [diff] [review]
Periodically gc in the server process

Actually, easier would be to just do it in the server directly, like so.  Rebuilding and testing this now...
Comment 25 Jeff Walden [:Waldo] (remove +bmo to email) 2009-08-05 15:07:36 PDT
Comment on attachment 392624 [details] [diff] [review]
Patch httpd.js to gc() in _connectionClosed
[Backout: Comment 28]

Please guard this with |if (typeof gc === "function)|.  Also, I hate myself for doing this.

My patch didn't work, I'm guessing because by the time it's called the garbage your gc collects is entrained by XPCOM gunk, and I don't see a way to force a cycle collection in xpcshell.  (The usual way requires nsIDOMWindowUtils, which exists in the browser.)
Comment 26 Robert Sayre 2009-08-05 15:31:37 PDT
let's try this on the tracemonkey tree:

http://hg.mozilla.org/tracemonkey/rev/1391e8b39c73
Comment 27 PTO until Sep 5 NZ time; Chris Pearce (:cpearce) 2009-08-05 21:44:28 PDT
(In reply to comment #26)
> let's try this on the tracemonkey tree:
> 
> http://hg.mozilla.org/tracemonkey/rev/1391e8b39c73

This seems to have made the tracemonkey test boxes consistently orange across the board. :(
Comment 28 Robert Sayre 2009-08-05 22:48:20 PDT
yup, already backed out.

http://hg.mozilla.org/tracemonkey/rev/b0a52a773828
Comment 29 Andreas Gal :gal 2009-08-18 15:25:57 PDT
Fixed by 506117 as a byproduct.

*** This bug has been marked as a duplicate of bug 506117 ***
Comment 30 Andreas Gal :gal 2009-08-18 15:57:21 PDT

*** This bug has been marked as a duplicate of bug 506125 ***
Comment 31 Andreas Gal :gal 2009-08-18 15:59:25 PDT
Fixed by 506125, not 506117. Apologies.
Comment 32 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-10-29 14:48:02 PDT
Un-duping. There's been no progress on bug 506125 for two months, and we need this fixed ASAP, and surely we can do something simple in httpd.js that doesn't require us to implement intelligent memory-pressure-based GC across the board.
Comment 33 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 17:03:38 PST
I pushed the comment #26 patch to the try-servers. It seems have caused a hang in extension manager tests:

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257122789.7235.gz#err0

TEST-PASS | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_LightweightThemeManager.js | test passed
TEST-PASS | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_bug257155.js | test passed

command timed out: 300 seconds without output, killing pid 1083
Comment 34 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 17:20:12 PST
Windows failed in unit tests, too, but at a different test, so I suppose the hang is probably at a random place:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257123928.19656.gz#err0

TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_382825.js | test passed
TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_384744.js | test passed
TEST-PASS | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_dm\unit\test_bug_395092.js | test passed

command timed out: 300 seconds without output

On both Linux and Windows (Mac hasn't reported yet), we have a hang at the end of chrometests, after they've passed:

TEST-PASS | chrome://mochikit/content/browser/xpinstall/tests/browser_whitelist6.js | Install should have been triggered by the right uri

Browser Chrome Test Summary
	Passed: 4000
	Failed: 0
	Todo: 5

*** End BrowserChrome Test Results ***
INFO | automation.py | Application ran for: 0:03:00.187000
SUCCESS: The process with PID 8288 has been terminated.
before 995328, after 995328, break 00000000
Comment 35 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 17:44:32 PST
Interestingly, Mac hung in exactly the same place as Linux:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257114322.1257125858.8527.gz#err1

TEST-PASS | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_LightweightThemeManager.js | test passed
TEST-PASS | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_extensionmanager/unit/test_bug257155.js | test passed

command timed out: 300 seconds without output, killing pid 16779

The hang after running chrome tests also happened on Mac.
Comment 36 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 18:27:27 PST
We hang in test_bug299716.js because it does do_timeout(DELAY, "run_test_pt2()"). That timeout never fires, I think because we GC before it fires and the GC cleans up all references to the timer, which cancels the timer.
Comment 37 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 19:11:54 PST
Created attachment 409642 [details] [diff] [review]
fix do_timeout [checked in comment #50]

Fix do_timeout so pending timers aren't GCed and canceled.
Comment 38 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 21:36:24 PST
The remaining hang is that httpd.js fails to terminate after the browser has exited from the browser mochitests. We get to
>>> shutting down server on port 8888
and then we close the connection to /server/shutdown, but it looks like there are still other connections open. Maybe we missed a connection-closed notification because something got GCed that was supposed to notice when a connection has closed? It's 100% reproducible for me.
Comment 39 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 22:06:50 PST
Yes, the problem is that the connection for browser/components/privatebrowsing/test/browser/staller.sjs never closes.
Comment 40 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-01 22:09:59 PST
It looks like this:

  function stall() {
    response.write("stalling...\n");
  }

  response.setHeader("Content-Type", "text/plain", false);
  stall();

  const nsITimer = Components.interfaces.nsITimer;
  var timer = Components.classes["@mozilla.org/timer;1"]
                        .createInstance(nsITimer);
  timer.initWithCallback(stall, 500, nsITimer.TYPE_REPEATING_SLACK);

so my guess is that this is supposed to write to a closed socket, which will trigger an error that lets httpd.js notice that the connection is closed. But I bet we're GCing away this timer callback so we never write and never notice the socket closed.
Comment 41 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-01 23:04:52 PST
Comment on attachment 409642 [details] [diff] [review]
fix do_timeout [checked in comment #50]

Seems sensible.

The naming convention for head.js-private variables is non-existent, and I was tempted to grumble here, but this isn't any different than what exists -- filed bug 525819 to hit that.  Also, I filed bug 525820 to make do_timeout take a nullary function (not a string) as an argument.
Comment 42 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-02 00:50:40 PST
Created attachment 409662 [details] [diff] [review]
fix staller.sjs

Similar fix for staller.sjs. This is slightly more complicated, we have to keep firing the timer until the response has ended (the browser closes its side and we get an error writing to the socket). This requires adding an isEnded API to Response.
Comment 43 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-02 17:52:32 PST
With those patches, I got some new test failures on tryserver on Windows:

TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\unit\test_cookie_header.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_cookie_header.js | Expected to find header 'Cookie' but didn't find it - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_cookie_header.js | Unexpected exception: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_host.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_request_line_split_in_two_packets.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js |  == HTTP/1.1 200 TEST PASSED - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_seizepower.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js |  == too-late passed - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - 

Linux and Mac had the test_cookie_header.js failures but not the others.
Comment 44 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-02 18:35:22 PST
Comment on attachment 409662 [details] [diff] [review]
fix staller.sjs

>diff --git a/netwerk/test/httpserver/httpd.js b/netwerk/test/httpserver/httpd.js

>+  /**
>+   * Returns true when the response has ended (completely constructed if possible
>+   * and the connection closed); further actions on this will fail.
>+   */
>+  get isEnded()
>+  {
>+    return this._ended;
>+  },

I'd rather not add this.  write() is guaranteed to throw if the response is fully constructed (whether completely sent or not), and it is entirely safe to rely on that behavior.


>diff --git a/browser/components/privatebrowsing/test/browser/staller.sjs b/browser/components/privatebrowsing/test/browser/staller.sjs

There seems to be more complexity here than necessary, some to support a repeating timer using that explicit timer type and some to support multiple timers (except this only ever uses one).  I had trouble following the control flow and the timer-release behavior.  I'd rather see repeated reuse of a one-shot timer as below, which I think is clearer and more concise (this semi-relies on a one-shot timer clearing its callback if not reinitialized; feel free to add a try { } catch (e) { timer = null; throw e; } around the contents of stall if you prefer):

var timer;
function handleRequest(request, response)
{
  response.setStatusLine(request.httpVersion, 200, "OK");
  response.setHeader("Content-Type", "text/plain", false);

  const nsITimer = Components.interfaces.nsITimer;
  timer = Components.classes["@mozilla.org/timer;1"]
                    .createInstance(nsITimer);
  function stall(t)
  {
    response.write("stalling...\n");
    timer.initWithCallback(stall, 500, nsITimer.TYPE_ONE_SHOT);
  }

  stall();
  response.processAsync();
}
Comment 45 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-02 18:41:50 PST
OK, I guess that'll work.
Comment 46 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-02 20:25:25 PST
Created attachment 409843 [details] [diff] [review]
fix subtle QI bug in test_cookie_header.js [checked in comment 50]

This one's a doozy. If someone does a QI on JS-wrapped XPCOM object, the JS wrapper remembers that it has the methods of that interface, so if someone else gets that XPCOM object, they'll see those methods without QI'ing. But if you get a GC in between, the JS wrapper goes away, and the "someone else" won't see those methods.

In this case, the test was QIing to the wrong interface but finding the method anyway ... unless we happen to GC in between!
Comment 47 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 03:13:56 PST
Created attachment 409896 [details] [diff] [review]
fix staller.sjs (v2) [checked in comment 70]

I hope this is what you had in mind? Seems to work.
Comment 48 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 03:15:52 PST
With these patches, things look good locally. My try-server build seems to be in limbo.
Comment 49 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-03 11:06:42 PST
Comment on attachment 409896 [details] [diff] [review]
fix staller.sjs (v2) [checked in comment 70]

Well, what I had in mind was exactly what I typed out.  :-)  Maybe that had some thinkos in it?  No matter, this is as good.
Comment 50 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 12:46:45 PST
Checked in patches from comments #37 and #46:
http://hg.mozilla.org/mozilla-central/rev/89bce43a99d1
http://hg.mozilla.org/mozilla-central/rev/4aa00e390ced

My try-server build with the full set of patches did not work. On Windows the errors in comment #43 recurred (other than test_cookie_header errors). They don't occur locally, as I mentioned. So more investigation is needed.
Comment 51 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 13:01:19 PST
Created attachment 410013 [details] [diff] [review]
Fix RawTest QI issue [checked in comment 70]

I think I found the bug. RawTest's 'writer' gets a stream from XPCOM in onOutputStreamReady and then sometimes calls waitToWriteOutput, which expects that stream to have been previously QIed to nsIAsyncOutputStream. That QI was done on the stream object previously in performNextTest, but GC might have stripped it. There is a similar issue for 'reader'.

Pushed to try server...
Comment 52 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 13:03:02 PST
I think our test-running xpcshells need to GC vigorously if only so we can prevent more tests being added with these bugs!
Comment 53 Brendan Eich [:brendan] 2009-11-03 13:23:17 PST
(In reply to comment #52)
> I think our test-running xpcshells need to GC vigorously if only so we can
> prevent more tests being added with these bugs!

"These bugs" are all bug 307441, a bad bug from the First Age. Who will fix it?

/be
Comment 54 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 17:34:30 PST
My try-server build passed on Linux (again), got what could be a new intermittent failure on Mac:
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_traceable_channel.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-macosx-unittest/mozilla/objdir/_tests/xpcshell/test_necko/unit/test_traceable_channel.js | replaced http response body == original http response body - See following stack:

Windows never reported back :-(
Comment 55 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-03 18:20:28 PST
Ah, Windows just finally reported back. A couple of the errors are still there :-(.

TEST-UNEXPECTED-FAIL | e:\builds\moz2_slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_host.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | error thrown by responseCheck: 2147500036 - See following stack:
TEST-UNEXPECTED-FAIL | e:\builds\moz2_slave\win32-unit\mozilla\objdir\_tests\xpcshell\test_necko\test\test_seizepower.js | test failed (with xpcshell return code: 0), see following log:
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js | exceptions test passed ==  - See following stack:
Comment 56 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-04 12:13:31 PST
Are you manually eliding stacks in the above output, or is the stack-dumping code not working?  (Although, stacks might be fairly generic here given the asynchronicity of the tests and the RawTest gunk, so I'm not sure if they'd actually help or not.)
Comment 57 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-04 12:36:29 PST
I'm copying from the Build Error Summary. The log has more stack data.
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257303442.1257311741.1432.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257304225.1257314321.30148.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1257306712.1257319468.21533.gz

That's from a push with some diagnostic output, but I might have broken something in the process. Checking now.
Comment 58 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-04 13:02:54 PST
The first one at least looks like it's what's expected, although there's some fail there because it's catching an exception and throwing a new one (losing the actually interesting stack in the process).  Still examining...
Comment 59 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-04 14:54:21 PST
      onOutputStreamReady: function(stream)
      {
        var data = testArray[testIndex].data[dataIndex];

        var written = 0;
        try
        {
          written = stream.write(data, data.length);
          if (written == data.length)
            dataIndex++;
          else
            testArray[testIndex].data = data.substring(written);
        }
        catch (e) { /* stream could have been closed, just ignore */ }

        // Keep reading data until there's no more data to read
        if (written != 0)
          waitToWriteOutput(stream);
        else
          stream.close();
      }

So, |testArray[testIndex].data| is an array of strings.  Initially |data| is the next one of those strings to write out.  We try to write it out.  If we wrote it all out, then we increase |dataIndex| to work on the next string to write out.  If we *didn't*, however, we grab what wasn't written and re-queue it:

            testArray[testIndex].data = data.substring(written);

|data.substring(written)| is a string.  |testArray[testIndex].data| was previously an array.  Type error!  The left-hand side should be |testArray[testIndex].data[dataIndex]|.  The TypeError you're seeing would happen if the remaining substring is small but |dataIndex| was large.
Comment 60 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 00:31:05 PST
OK, so I've fixed that, and fixed another bug where we should be checking dataIndex < testArray[testIndex].data.length, but we've still got failures.

On Linux:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257390383.1257399263.19861.gz&buildtime=1257390383&buildname=Linux%20try%20hg%20unit%20test&fulltext=1#err1
TEST-UNEXPECTED-FAIL | /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js |  == async seizure passed - See following stack:
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/testing/xpcshell/head.js :: do_throw :: line 201
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/testing/xpcshell/head.js :: do_check_eq :: line 253
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js :: checkAsyncSeizure :: line 252
JS frame :: /builds/slave/sendchange-linux-unittest/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606

On Windows:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257390383.1257402748.25078.gz&buildtime=1257390383&buildname=WINNT%205.2%20try%20hg%20unit%20test&fulltext=1#err0
TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js :: check400 :: line 303
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606

TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js |  == HTTP/1.1 200 TEST PASSED - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_request_line_split_in_two_packets.js :: checkLotsOfLeadingBlankLines :: line 133
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606


TEST-INFO | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 585"  data: no], see following stack:
undefined
TEST-UNEXPECTED-FAIL | e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js | exceptions test passed ==  - See following stack:
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\moz2_slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_seizepower.js :: checkExceptions :: line 243
JS frame :: e:/builds/moz2_slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 606
Comment 61 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 01:28:43 PST
All of these errors are consistent with RawTest just dropping data on the floor before calling the data-check callback.
Comment 62 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 01:42:32 PST
I wonder if it's possible for reader.onInputStreamReady to be fired sometimes when there's no data in the stream yet? I pushed another patch to the try-server to see if I can narrow that down.
Comment 63 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-05 13:40:17 PST
Plausible.  :-\

It's depressing how complicated this code is, but on the other hand 1) at least it's not spread across a handful of different tests, and 2) if the tests didn't exist we'd be in an even worse position.

Regarding the written != 0 check you mentioned as looking wrong, the reason for it, after thinking a little more, is that if the stream closes before we've finished writing all data (entirely possible when the test is that the server fails fast with a 400 error), we have to stop writing out data.  Switching to just check whether more data remains to be written seems not enough to handle that case, but I'm not entirely sure what the better thing to do is.
Comment 64 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 14:09:34 PST
I guess we can check written != 0 too. onOutputStreamReady shouldn't fire unless we can write at least one byte.

The latest push has no errors on Linux or Mac (although there could just be randomness here, of course) but the same 3 errors on Windows. Here's a relevant snippet:
http://tinderbox.mozilla.org/showlog.cgi?tree=MozillaTry&errorparser=unittest&logfile=1257414317.1257428920.30309.gz&buildtime=1257414317&buildname=WINNT%205.2%20try%20hg%20unit%20test&fulltext=1#err0

*** copying undefined bytes to output: 66,97,100,32,114,101,113,117,101,115,116,10
*** _waitForData
*** onInputStreamReady
*** cancel(0)
*** onStopRequest async callback
*** onStopRequest [status=0x0]
*** closing connection 2 on port 2904
before 1060864, after 1052672, break 00000000
TEST-INFO | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 587"  data: no], see following stack:
undefined
*** reader.onInputStreamReady: available=0
NEXT ERROR TEST-UNEXPECTED-FAIL | e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js |  == HTTP/1.1 400  - See following stack:
JS frame :: e:\builds\slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_throw :: line 201
JS frame :: e:\builds\slave\win32-unit\mozilla\testing\xpcshell\head.js :: do_check_eq :: line 253
JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/test_host.js :: check400 :: line 303
JS frame :: e:/builds/slave/win32-unit/mozilla/objdir/_tests/xpcshell/test_necko/test/head_utils.js :: anonymous :: line 610

I think the key line here is "reader.onInputStreamReady: available=0". That's just wrong.
Comment 65 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 15:01:32 PST
(In reply to comment #63)
> It's depressing how complicated this code is, but on the other hand 1) at least
> it's not spread across a handful of different tests, and 2) if the tests didn't
> exist we'd be in an even worse position.

If we find out that lack of GC in httpd.js is causing a lot of the orange across the board, then I say if the tests didn't exist we'd be in a much *better* position.

In fact I think we should disable RawTests and land these fixes. The tests are clearly broken and I have had enough of trying to fix them.
Comment 66 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 15:05:30 PST
Created attachment 410631 [details] [diff] [review]
Fix additional RawTest bugs [checked in comment 70]
Comment 67 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 15:06:13 PST
Created attachment 410632 [details] [diff] [review]
Additional diagnostics to report more data about exceptions in tests
Comment 68 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-05 15:07:16 PST
Created attachment 410634 [details] [diff] [review]
Additional logging and asyncWait change

I don't know if this patch helps at all, but it might helpful for future work on these tests.
Comment 69 Jeff Walden [:Waldo] (remove +bmo to email) 2009-11-06 18:18:39 PST
Comment on attachment 410632 [details] [diff] [review]
Additional diagnostics to report more data about exceptions in tests

>diff --git a/netwerk/test/httpserver/test/head_utils.js b/netwerk/test/httpserver/test/head_utils.js

Hm, we should have added this noting and reporting functionality ages ago, didn't occur to me that such functionality should be generalized.  I wonder: is this because people aren't testing the exceptions thrown by methods enough?


>@@ -571,45 +571,58 @@ function runRawTests(testArray, done)
>   /** Data received so far from the server. */
>   var received = "";
> 
>   /** Reads data from the socket. */
>   var reader =
>     {
>       onInputStreamReady: function(stream)
>       {
>-        var bis = new BinaryInputStream(stream);
>+        try {

try
{

catch (e)
{

finally
{

throughout.


>+          var av = 0;
>+          try
>+          {
>+            av = bis.available();
>+          }
>+          catch (e) {
>+            do_note_exception(e);
>+            /* default to 0 */

Place the comment before the noting; that seems the more logical ordering (same elsewhere too).  We default to 0, therefore we merely note the exception rather than doing something with it like checking its value.


>diff --git a/netwerk/test/unit/test_traceable_channel.js b/netwerk/test/unit/test_traceable_channel.js

>   onStopRequest: function(request, context, statusCode) {
>-    this.listener.onStopRequest(request, context, statusCode);
>-    httpserver.stop(do_test_finished);
>+    dump("In TracingListener.onStopRequest\n");
>+    try {
>+      this.listener.onStopRequest(request, context, statusCode);
>+      httpserver.stop(do_test_finished);
>+      dump("Did TracingListener.onStopRequest\n");
>+    } catch (e) {
>+      dump("TracingListener.onStopRequest swallowing exception: " + e + "\n");
>+    }
>   },

This formulation is a bad idea because if onStopRequest throws, the server will never be stopped and do_test_finished will never be called.  Instead, move the server-stop call into a finally clause.
Comment 70 Robert O'Callahan (:roc) (email my personal email if necessary) 2009-11-09 19:58:03 PST
Checked in attachment #410013 [details] [diff] [review], attachment #409896 [details] [diff] [review] and attachment #410631 [details] [diff] [review].

http://hg.mozilla.org/mozilla-central/rev/e24c9b4be5c6
http://hg.mozilla.org/mozilla-central/rev/e91e024ace59
http://hg.mozilla.org/mozilla-central/rev/23fa68bf3610

I'm not planning to work on this myself more in the immediate future, but I think this bug still really needs to be resolved quite urgently. I hope someone else can pick up the baton.
Comment 71 Josh Matthews [:jdm] (away until 9/3) 2010-10-04 01:18:18 PDT
I'm looking into this.  I'm seeing a very repeatable assertion failure in test_async_notifications_404.js which looks like a weakness in the scope of the request proxies, which I've filed as bug 601557.
Comment 72 Josh Matthews [:jdm] (away until 9/3) 2010-10-05 11:22:13 PDT
Having applied attachment 410632 [details] [diff] [review] and made Waldo's requested changes, I see this notice before the test failure occurs:

TEST-INFO | /Users/moco/src/m-c/ff-dbg/_tests/xpcshell/netwerk/test/httpserver/test/head_utils.js | Swallowed exception [Exception... "Component returned failure code: 0x80470002 (NS_BASE_STREAM_CLOSED) [nsIBinaryInputStream.available]"  nsresult: "0x80470002 (NS_BASE_STREAM_CLOSED)"  location: "JS frame :: /Users/moco/src/m-c/ff-dbg/_tests/xpcshell/netwerk/test/httpserver/test/head_utils.js :: anonymous :: line 532"  data: no], see following stack:
Comment 73 Josh Matthews [:jdm] (away until 9/3) 2010-10-05 11:45:48 PDT
Apparently that's not the cause, however, because the same exception occurs in successful runs as well.
Comment 74 Josh Matthews [:jdm] (away until 9/3) 2010-10-05 16:14:15 PDT
As far as I can tell, we're suffering from a similar problem to bug 601557.  The input stream created in performNextTest goes out of scope, and apparently nobody else bothers to hold a strong reference to it.  When httpd.js dispatches a runnable that ends up triggering a GC, it races against the stream becoming readable.  If the GC wins, the stream is destroyed and in its dying throes triggers a runnable which calls OnInputStreamReady and keeps the stream alive for the callback.  The easiest solution is to retain a strong ref to the stream somewhere in head_utils.js, such as within the reader object.
Comment 75 Josh Matthews [:jdm] (away until 9/3) 2010-10-05 16:24:41 PDT
Created attachment 481089 [details] [diff] [review]
Ensure input stream lifetime is not cut short by a GC. [checked in comment 80]
Comment 76 Josh Matthews [:jdm] (away until 9/3) 2010-10-05 16:26:04 PDT
Created attachment 481090 [details] [diff] [review]
Additional diagnostics to report more data about exceptions in tests [checked in comment 80]

Here's the updated diagnostic patch for eventual landing.  Carrying forward the previous r+.
Comment 77 Josh Matthews [:jdm] (away until 9/3) 2010-10-06 13:27:23 PDT
I've had several try server runs with the patch from bug 601557 and here applied, and the only failures I've received are known intermittent oranges that don't appear to be any worse with this GC behaviour change.  I'm feeling pretty good about this.
Comment 78 Robert O'Callahan (:roc) (email my personal email if necessary) 2010-10-06 21:20:52 PDT
You rock! Please land this!
Comment 79 Jeff Walden [:Waldo] (remove +bmo to email) 2010-10-21 17:26:35 PDT
Comment on attachment 481089 [details] [diff] [review]
Ensure input stream lifetime is not cut short by a GC. [checked in comment 80]

>diff --git a/netwerk/test/httpserver/test/head_utils.js b/netwerk/test/httpserver/test/head_utils.js
>@@ -517,16 +518,17 @@ function runRawTests(testArray, done)
>   /** Data received so far from the server. */
>   var received = "";
> 
>   /** Reads data from the socket. */
>   var reader =
>     {
>       onInputStreamReady: function(stream)
>       {
>+        do_check_true(stream == this.stream);

=== please, == is a bit broken (not in this instance, but bad choices can snowball into bad habits).

Is it the case that this issue also needs to be fixed in runHttpTests?  Looks like it to me, but maybe I'm missing something.  If you agree, please add:

* a |listener._channel = ch;| before the |ch.asyncOpen(listener, null)| in
  |performNextTest|,
* a |/** Current channel being observed by this. */ _channel: null,| in the
  |listener| literal,
* a |do_check_true(request === this._channel);| in |onStartRequest|,
* and a |this._channel = null;| in |onStopRequest|

while you're making the changes you've already suggested.  Sorry for the delay in getting to the review...
Comment 81 Josh Matthews [:jdm] (away until 9/3) 2010-10-22 13:16:09 PDT
And the actual gc() patch bounced because it lit up the xpcshell tests.  Investigating yet again.  Sigh.
Comment 82 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2010-10-22 13:18:24 PDT
The failures in xpcshell tests all seem to be:

###!!! ASSERTION: nsExceptionManager not thread-safe

at the stack:

 5  libxul.so!nsExceptionManager::Release [nsExceptionService.cpp:031018350d0f :
    eip = 0x02860530   esp = 0xbfe7ac00   ebp = 0xbfe7ac38   ebx = 0x0357750c
    Found by: call frame info
 6  libxul.so!nsExceptionService::DoDropThread [nsExceptionService.cpp:031018350
    eip = 0x0285f744   esp = 0xbfe7ac40   ebp = 0xbfe7ac68   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 7  libxul.so!nsExceptionService::DropAllThreads [nsExceptionService.cpp:0310183
    eip = 0x0285f782   esp = 0xbfe7ac70   ebp = 0xbfe7ac78   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 8  libxul.so!nsExceptionService::Shutdown [nsExceptionService.cpp:031018350d0f 
    eip = 0x0285fa47   esp = 0xbfe7ac80   ebp = 0xbfe7ac98   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
 9  libxul.so!nsExceptionService::Observe [nsExceptionService.cpp:031018350d0f :
    eip = 0x0285fa93   esp = 0xbfe7aca0   ebp = 0xbfe7aca8   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
10  libxul.so!nsObserverList::NotifyObservers [nsObserverList.cpp:031018350d0f :
    eip = 0x027fb5e3   esp = 0xbfe7acb0   ebp = 0xbfe7acd8   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
11  libxul.so!nsObserverService::NotifyObservers [nsObserverService.cpp:03101835
    eip = 0x027fc6ff   esp = 0xbfe7ace0   ebp = 0xbfe7ad18   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info
12  libxul.so!mozilla::ShutdownXPCOM [nsXPComInit.cpp:031018350d0f : 624 + 0x3b]
    eip = 0x027df6d6   esp = 0xbfe7ad20   ebp = 0xbfe7ad98   ebx = 0x0357750c
    esi = 0x0840f998
    Found by: call frame info


Seems like a bug in nsExceptionService, perhaps?
Comment 83 Josh Matthews [:jdm] (away until 9/3) 2010-10-22 13:47:40 PDT
There's also this failure:

TEST-UNEXPECTED-FAIL | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | test failed (with xpcshell return code: 0), see following log:
>>>>>>>
Setting the identity for passphrase
TEST-INFO | (xpcshell/head.js) | test 1 pending
TEST-INFO | (xpcshell/head.js) | test 2 pending
2010-10-22 13:21:27	Test                 INFO	Setting up server and authenticator
2010-10-22 13:21:27	Test                 INFO	Generating keypair + symmetric key
2010-10-22 13:21:27	PubKeyManager        DEBUG	Generating RSA keypair
2010-10-22 13:21:27	PubKeyManager        DEBUG	Generating RSA keypair... done
2010-10-22 13:21:27	Test                 INFO	Setting up keyring
2010-10-22 13:21:27	Test                 INFO	Creating a record
TEST-PASS | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | [run_test : 80] http://localhost:8080/crypto/steam == http://localhost:8080/crypto/steam
TEST-PASS | /home/cltbld/talos-slave/mozilla-central_fedora64_test-xpcshell/build/xpcshell/tests/services/sync/tests/unit/test_records_crypto.js | [run_test : 81] ../crypto/steam == ../crypto/steam
2010-10-22 13:21:27	Test                 INFO	Encrypting a record
2010-10-22 13:21:27	PubKeyManager        TRACE	Importing record: http://localhost:8080/keys/pubkey
2010-10-22 13:21:28	Net.Resource         DEBUG	GET success 200 http://localhost:8080/keys/pubkey
2010-10-22 13:21:28	PrivKeyManager       TRACE	Importing record: http://localhost:8080/keys/privkey
before 1576960, after 1417216, break 01db7000
2010-10-22 13:21:28	Net.Resource         DEBUG	GET success 200 http://localhost:8080/keys/privkey
before 1421312, after 1413120, break 01db7000
TEST-UNEXPECTED-FAIL | (xpcshell/head.js) | 2147500037
Comment 84 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2010-10-22 13:51:10 PDT
(In reply to comment #82)
> Seems like a bug in nsExceptionService, perhaps?

I talked to jst about this, and he sounded ok with just making the reference counting threadsafe in this case.

http://hg.mozilla.org/mozilla-central/file/3a5b84a290d7/xpcom/base/nsExceptionService.cpp#l95 already documents that it can be released on a different thread during shutdown.
Comment 85 Josh Matthews [:jdm] (away until 9/3) 2010-10-24 21:46:40 PDT
Turns out that the patch in bug 606592 to address comment 84 also resolves the quite mysterious test_records_crypto.js failure.  I can't explain where the failure comes from, but I've got multiple try runs demonstrating completely green trees with that patch applied.
Comment 86 Josh Matthews [:jdm] (away until 9/3) 2010-11-03 20:58:48 PDT
Oh joy. Some more try runs revealed that the test_records_crypto.js failure is alive and well, despite the patches in bug 606592 and bug 450812.  More investigation required.
Comment 87 Josh Matthews [:jdm] (away until 9/3) 2010-12-14 00:51:00 PST
Since the last time I sent this to try, test_records_crypto.js has been changed in such a way that we don't hit the mysterious error any longer.  Of course, parser/htmlparser/tests/mochitest/test_bug543062.html now times out, but for a very sensible reason.  In file_bug543062.sjs, the request handler calls armTimer, which creates a timer in a local variable.  When this function finishes and the var goes out of scope, the timer is canceled and destroyed.
Comment 88 Robert O'Callahan (:roc) (email my personal email if necessary) 2010-12-14 09:54:09 PST
Great, if you can fix that maybe we can land this before anyone else accidentally adds new dependencies on not GCing :-)
Comment 89 Josh Matthews [:jdm] (away until 9/3) 2011-05-30 07:31:17 PDT
I had a fully green try run at http://tbpl.mozilla.org/?tree=Try&pusher=josh@joshmatthews.net&rev=0426e35b1403. I'm going for it!
Comment 90 Josh Matthews [:jdm] (away until 9/3) 2011-05-30 07:51:24 PDT
Pushed http://hg.mozilla.org/projects/cedar/rev/7ee0fe6aec58.
Comment 92 Robert O'Callahan (:roc) (email my personal email if necessary) 2011-05-30 14:33:19 PDT
Yeeehaw! Thank you Josh!
Comment 93 David Mandelin [:dmandelin] 2011-07-01 17:40:18 PDT
Josh, jsreftests have started intermittently timing out on Android (see bug 662468), with a lot of GC spew from xpcshell. Do you think this bug might have introduced that?
Comment 94 Joel Maher ( :jmaher) 2011-07-01 17:52:43 PDT
for the tegras we run xpcshell on a host computer and access it remotely from Android.  Maybe it is related to this patch, but it wouldn't be related to httpd.js and xpcshell binary.
Comment 95 Josh Matthews [:jdm] (away until 9/3) 2011-07-01 21:30:16 PDT
If this patch is responsible for a change in behaviour, it would be that httpd.js is now GCing more frequently. As Joel points out, that means a remote machine is doing the GCing, not the fennec process on Android. I'm having a hard time coming up with any reasons this patch could be causing the jsreftest problems.
Comment 96 cmtalbert 2011-07-05 08:32:13 PDT
(In reply to comment #95)
> If this patch is responsible for a change in behaviour, it would be that
> httpd.js is now GCing more frequently. As Joel points out, that means a
> remote machine is doing the GCing, not the fennec process on Android. I'm
> having a hard time coming up with any reasons this patch could be causing
> the jsreftest problems.

See bug 648104, comment 9.  One of the issues with the reftest family of tests on android is that we are forced to run all reftests at a very high resolution (for a tegra), which significantly impacts available memory on the tegras.
Comment 97 Josh Matthews [:jdm] (away until 9/3) 2011-07-05 08:40:53 PDT
(In reply to comment #96)
> See bug 648104, comment 9.  One of the issues with the reftest family of
> tests on android is that we are forced to run all reftests at a very high
> resolution (for a tegra), which significantly impacts available memory on
> the tegras.

Yes, but that was the case before this patch as well, so I don't see how that's relevant.
Comment 98 David Mandelin [:dmandelin] 2011-07-05 10:55:50 PDT
(In reply to comment #95)
> If this patch is responsible for a change in behaviour, it would be that
> httpd.js is now GCing more frequently. As Joel points out, that means a
> remote machine is doing the GCing, not the fennec process on Android. I'm
> having a hard time coming up with any reasons this patch could be causing
> the jsreftest problems.

Josh, thanks for the explanation: I agree, it is almost certainly not due to the GC change.

Note You need to log in before you can comment on or make changes to this bug.