Closed Bug 957666 Opened 10 years ago Closed 10 years ago

Frequent "System JS : ERROR (null):0 - uncaught exception: 2147746065" spammed during desktop mochitest runs, coming from httpd.js (nsIHttpResponse.write)

Categories

(Core :: Networking, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla31
Tracking Status
firefox29 --- fixed
firefox30 --- fixed
firefox31 --- fixed
firefox-esr24 --- fixed
b2g-v1.2 --- fixed
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed

People

(Reporter: RyanVM, Assigned: smaug)

References

Details

(Whiteboard: [qa-])

Attachments

(1 file)

I know this has been around for awhile now, but it's annoying enough that it needs investigation if for no other reason than it being a significant source of log spam (thousands of instances per log in the ones I'm looking at). It has also served as a point of confusion more than once about whether it's contributing to intermittent failures or not.

Obviously, the error message doesn't give much to go on, but there are a few things that might help track it down.

1. It happens on all desktop platforms when running debug mochitests. AFAICT, it doesn't occur in any other test suites.
--> On Linux mochitest-1, it always starts after content/base/test/test_bug338583.html.
--> OSX and Windows seem to be a bit more variable. Presumably due to differences in what tests are run on each platform?
--> On B2G desktop, it starts during dom/imptests/editing/conformancetest/test_event.html.
2. It happens on B2G desktop mochitests, which are opt.
3. It does *not* happen on Android/B2G mochitests, even on Cedar where we run them on debug as well.

Example logs:
https://tbpl.mozilla.org/php/getParsedLog.php?id=32699553&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32699563&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32697267&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32699267&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32699354&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32697430&tree=Mozilla-Central&full=1
https://tbpl.mozilla.org/php/getParsedLog.php?id=32697163&tree=Mozilla-Central&full=1

Given the reproducibility, I'm hoping someone can look at this. It's a very frustrating error for the reason given above.
I did not repro on Mac.  On Linux, I keep hitting bug 957729.  Ryan, can you please get somebody to fix the js engine there?
Depends on: 957729
Given the workaround in bug 957729 comment 5, I managed to run all of content/base tests under the debugger without hitting this error path.  Then I started a full mochitest run and let it run over night and I hit the crash in bug 957729 comment 8.

Ryan, do you have any other ideas/hunches/guesses on how I can reproduce this locally?
Nope, sorry. Everything I know is in comment 0 :(
OK, no luck in repro'ing on Mac either.  I'll file a bug to get a loaner.
No longer depends on: 957729
Depends on: 963287
Blocks: 966408
2147746065 == NS_ERROR_NOT_AVAILABLE fwiw
Status update: I have been trying really hard to get my slave to run tests in a way that I can debug them.  I gave up doing that earlier today and am doing my own build on the slave at this point.  The build is 90 minutes in and still running.  I'm going home now!  More next week.
I finally figured out what's happening here, and why I could never catch this under the debugger.  I pushed this patch to try <https://tbpl.mozilla.org/?tree=Try&rev=cabfea40e5d3> which rewrote a bunch of NS_ERROR_NOT_AVAILABLE error codes thrown in httpd.js to dummy error values to see what the error message in the log <https://tbpl.mozilla.org/php/getParsedLog.php?id=34331844&tree=Try&full=1> would look like, and lo and behold:

System JS : ERROR (null):0 - uncaught exception: 2152136708

So, this means that these are coming from:

    1.44    //
    1.45    // see nsIHttpResponse.write
    1.46    //
    1.47    write: function(data)
    1.48    {
    1.49      if (this._finished)
    1.50 -      throw Cr.NS_ERROR_NOT_AVAILABLE;
    1.51 +      throw Cr.NS_BASE_STREAM_ILLEGAL_ARGS;
    1.52  
    1.53      var dataAsString = String(data);
    1.54      this.bodyOutputStream.write(dataAsString, dataAsString.length);
    1.55    },

Which means that they are caused by something calling nsIHttpResponse.write after the response is finished.

Someone who knows more about that code should investigate.  Filed bug 969760 on the XPConnect side of this puzzle.
Component: General → Networking
Summary: Frequent "System JS : ERROR (null):0 - uncaught exception: 2147746065" spammed during desktop mochitest runs → Frequent "System JS : ERROR (null):0 - uncaught exception: 2147746065" spammed during desktop mochitest runs, coming from httpd.js (nsIHttpResponse.write)
The credit here should go to Karl for comment 6, BTW, which made me think about this idea!
Awesome detective work! Thanks for sticking with this, Ehsan :)
(In reply to comment #10)
> Awesome detective work! Thanks for sticking with this, Ehsan :)

No worries!  I'm assuming that you will hunt down somebody else to finish the investigation here, BTW.  ;-)
Jason, any suggestions for who might be able to help investigate here? :)
Flags: needinfo?(jduell.mcbugs)
httpd.js is sort of unowned at the moment.  Waldo used to do a lot of work here, and might have an idea of what we should do in this case.  It looks like some server-side handler code is writing too late in the game, but I'm not sure what could be causing what are either 1) late write() calls or 2) early reply finish.
Flags: needinfo?(jduell.mcbugs) → needinfo?(jwalden+bmo)
Jason, this could be another good bug for the workweek if you're still looking for ideas.
So I've hunted down the problematic code here, somewhat.  What's happening is that we run content/base/test/test_bug338583.html (wish I'd seen comment 0 before I started debugging!) which as part of doTest7 invokes new EventSource("delayedServerEvents.sjs").

delayedServerEvents.sjs sets a 500ms repeating timer to do this:

  12   if (strings_to_send.length == 0) {
    // clean up stuff here
  18   }
  19 
  20   resp.write(strings_to_send[0]);
  21   strings_to_send = strings_to_send.splice(1, strings_to_send.length - 1);

I added some logging to this sjs, and I see us invoking it 3 times while running this test.  The first two times we send all the strings and empty out the array.  The third time, we send 8 of the 10 strings, and then the write() calls start throwing, which means we never splice out the thing we tried to write, which means that we keep logging this exception message every 500ms.

I assume the reason this happens is that the server closes the connection after sending the data.  But eventsource then goes and reconnects!  So we invoke the SJS again.  And again.  However many times we manage to in the 8s*stress_factor timer the test uses before it checks only the first three returned packets and calls close() on the EventSource.

Is there a way to tell EventSource to not reconnect?  If not, we should fix the sjs to either cancel the timer if write fails or set some sort of persistent cookie after the first time it's called to not try sending any more data or something...
Flags: needinfo?(bugs)
I guess server could send back huge retry reconnection time.
Flags: needinfo?(bugs)
Attached patch possible fixSplinter Review
https://tbpl.mozilla.org/?tree=Try&rev=c5b30960b443

default reconnection timeout is 5000ms, so this increases that quite a bit.
Try run says that fix works.
Attachment #8407174 - Flags: review?(ehsan)
Attachment #8407174 - Flags: review?(ehsan) → review+
https://hg.mozilla.org/mozilla-central/rev/f0d5c27b7257
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Please nominate this for Aurora uplift when you get a chance.
Assignee: nobody → bugs
Flags: needinfo?(jwalden+bmo) → needinfo?(bugs)
Comment on attachment 8407174 [details] [diff] [review]
possible fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): old stuff
User impact if declined: this is testing only
Testing completed (on m-c, etc.): just landed
Risk to taking this patch (and alternatives if risky): testing only 
String or IDL/UUID changes made by this patch: NA
Attachment #8407174 - Flags: approval-mozilla-aurora?
Flags: needinfo?(bugs)
This is NPOTB, why does it require an approval?
Comment on attachment 8407174 [details] [diff] [review]
possible fix

Because I wasn't paying attention. Sorry for the wasted time, smaug.
Attachment #8407174 - Flags: approval-mozilla-aurora?
Whiteboard: [checkin-needed-aurora]
Whiteboard: [qa-]
You need to log in before you can comment on or make changes to this bug.