Firebug: breakpoints in onload handlers cause the browser to hang on OSX.

VERIFIED FIXED

Status

()

Core
XPCOM
VERIFIED FIXED
9 years ago
9 years ago

People

(Reporter: cbartley, Assigned: bz)

Tracking

Trunk
x86
Mac OS X
Points:
---
Dependency tree / graph
Bug Flags:
blocking1.9.1 -
blocking1.9.1.1 -
in-testsuite ?

Firefox Tracking Flags

(blocking1.9.1 -, status1.9.1 wanted)

Details

(Whiteboard: [firebug-p1])

Attachments

(7 attachments, 1 obsolete attachment)

(Reporter)

Description

9 years ago
Created attachment 382229 [details]
This HTML page exhibits the hang on breakpoint bug.

Firebug: breakpoints in onload handlers cause the browser to hang on OSX.

Repro Steps

   1. View the "bad" test page in Firefox.
   2. Open Firebug.
   3. Make sure Firebug is completely enabled for the page.
   4. Select the Script panel.
   5. Place a breakpoint inside onload() on the call to yada().
   6. Hit Cmd-R to reload the page.

Result

    * The browser hangs and the UI completely locks up.

I have reproduced this bug (using a different test case) across three different browser versions (3.0.10, 3.5b4, and a debug build of the 1.9.1 Firefox branch, updated ~ June 2nd) and going all the way back to r2000 of Firebug.
(Reporter)

Comment 1

9 years ago
Created attachment 382230 [details]
This page is virtually identical to the "bad" page but does not exhibit the bug.

This page is virtually identical to the "bad" page but does not exhibit the bug.  The difference is simply a missing SRC attribute on the image.
(Reporter)

Comment 2

9 years ago
The corresponding Firebug bug is 1087 at http://code.google.com/p/fbug/issues/detail?id=1087.
(Reporter)

Comment 3

9 years ago
Created attachment 382232 [details]
Error case: logging+stacktrace from a test run against attachment 382229 [details]

Error case: logging+stacktrace from a test run against attachment 382230 [details].  The stack trace is from right before the hang.
(Reporter)

Comment 4

9 years ago
Created attachment 382233 [details]
Normal (no hang) case: logging+stacktrace from a test run against attachment 382230 [details]

Normal (no hang) case: logging+stacktrace from a test run against attachment 382230 [details].  The stacktrace is from just before breaking (successfully) into Firebug.  It was possible to continue execution and then quit Firefox normally.
(Reporter)

Updated

9 years ago
Attachment #382232 - Attachment description: Error case: logging+stacktrace from a test run against attachment 382230 → Error case: logging+stacktrace from a test run against attachment 382229
(Reporter)

Comment 5

9 years ago
The proximate cause of the hang appears to be nsAppShell::InGeckoMainEventLoop.  In the normal (no-hang) breakpoint case, it returns false.  In the error case it returns true.  The difference seems to be that "mRecursionDepth" has a value of "1" in the former case and a value of "0" in the latter case.

Now, why is "mRecursionDepth" wrong?

The logging and stack trace excerpted in attachment 382233 [details] is the "reference" run. This run was against the no-error case in attachment 382230 [details].  There was no hang in this case.

The logging and stack trace excerpted in attachment 382232 [details] is the "error" run.  This run was against the error case in attachment 382229 [details].

The stack traces were taken just before the JSD calls Firebug's breakpoint hook.

mRecursionDepth tracks nsThread's mRunningEvent member variable.  mRunningEvent is incremented and decremented in nsThread::ProcessNextEvent, in the following code:

    ++mRunningEvent;
    event->Run();
    --mRunningEvent;

the actual code has printf's right after the "++mRunningEvent" and "--mRunningEvent" statements.  These are the "nsThread::ProcessNextEvent BEFORE" and "nsThread::ProcessNextEvent AFTER" lines in the attached files.

We should see something like:

  nsThread::ProcessNextEvent BEFORE
  ...
  stacktrace from right before the breakpoint
  Firebug gets the onbreakpoint call
  ...
  nsThread::ProcessNextEvent AFTER

and indeed that is what we see in the "good" file.

However, in the "bad" file, we see:

  nsThread::ProcessNextEvent BEFORE
  nsThread::ProcessNextEvent AFTER
  ...
  stacktrace from right before the breakpoint
  Firebug gets the onbreakpoint call
  ...
  nsThread::ProcessNextEvent BEFORE
  nsThread::ProcessNextEvent AFTER

Which is to say that we're calling the breakpoint hook from somewhere that is not bracketed by the increment-mRunningEvent and decrement-mRunningEvent statements.

If we examine the stacktrace from the "bad" case, we can see where we're getting called from, if not why.

The call to nsThread::ProcessNextEvent is at #30 on the stacktrace.  Our suspect is at #29, and it's a call to nsCOMPtr::~nsCOMPtr.  It appears that nsThread::ProcessNextEvent has the sole reference to the event object, and when the event variable goes out of scope it gets destructed, which triggers the whole sequence of events resulting in Firebug's onbreakpoint hook being called.  And since it's not protected by a non-zero mRunningEvent count, we get the hang elsewhere in the event handling code.

Comment 6

9 years ago
I want to request blocking for 1.9.1.1 (but there is not a spot yet). This not a very common case, though we don't know because the user experience roughly "I set a breakpoint and later Firefox stopped working". Users may not even connect those two events, and if they hit this they are completely whacked. And Curtis has worked this down to a great pair of test cases and has pinpointed the likely area, at least it seems to match all of the observations. If we have some one from the Mac team take a look they can give a quick assessment: might be an easy fix.
Flags: blocking1.9.1?
Whiteboard: [firebug-p1]
Version: Trunk → 1.9.1 Branch
So the obvious fixes seem to be:

1)  Change nsThread::ProcessNextEvent so that |event| goes out of scope before
    we call AfterProcessNextEvent.
2)  Change nsImageLoadingContent::Event::~Event to pass PR_FALSE to
    UnblockOnload.

I'd prefer #1, since it's very non-intuitive that an event object destructor can't trigger JS (which is where we are right now).  Also not documented, etc...
Component: DOM: Events → XPCOM
QA Contact: events → xpcom
Version: 1.9.1 Branch → Trunk
Created attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?
Attachment #382333 - Flags: superreview?(benjamin)
Attachment #382333 - Flags: review?(benjamin)
(Reporter)

Comment 9

9 years ago
(In reply to comment #8)
> Created an attachment (id=382333) [details]
> Curtis, does this fix the issue you see?

I tried just assigning "event" to NULL right after the call to Run().  This worked on code updated a few days ago (Sunday?), but when I updated last night it stopped working, with a JavaScript assertion failure.  It's possible that the JS assertion failure is essentially an unrelated JS bug, since it looks like (superficially) the sort of thing that shouldn't ever happen.

I'll try your patch and see if it works better.
Component: XPCOM → DOM: Events
Version: Trunk → 1.9.1 Branch
(Reporter)

Comment 10

9 years ago
It seems like the patch works for the provided test case -- no hang, Firebug works as expected, etc.  I tried it on a different test case and hit the JavaScript assertion again.

That assertion is:

Assertion failure: cx->fp, at /Users/cbartley/Dev/mozilla-c/src/js/src/jsdbgapi.cpp:1246
(Reporter)

Comment 11

9 years ago
The other test case I'm using is

http://www.almostinfinite.com/slideshow-tdl/slideshow.html

Repro Steps:

  * Apply the patch in attachment 382333 [details] [diff] [review] and rebuild
  * Run Firefox and open http://www.almostinfinite.com/slideshow-tdl/slideshow.html
  * Open Firebug
  * Set a breakpoint on slideshow.html line 19 (the first call inside Initialize())
  * Hit Cmd-R

Result:

  Assertion failure: cx->fp, at /Users/cbartley/Dev/mozilla-c/src/js/src/jsdbgapi.cpp:1246
(Reporter)

Comment 12

9 years ago
OK, even the attached test case is failing with the JS assertion.  I swear wasn't hallucinating when I said it worked in comment #10, but it's failing consistently now.
(Reporter)

Comment 13

9 years ago
Created attachment 382339 [details]
Problem Report (stack dump, etc) from JS assertion failure
Sounds like we need a separate (blocking?) bug on the JS assert.  My patch is basically equivalent to setting |event| to null right after Run(); maybe that would be a better fix, in fact.
Component: DOM: Events → XPCOM
Version: 1.9.1 Branch → Trunk

Updated

9 years ago
Attachment #382333 - Flags: superreview?(benjamin)
Attachment #382333 - Flags: superreview+
Attachment #382333 - Flags: review?(benjamin)
Attachment #382333 - Flags: review+
Attachment #382333 - Flags: approval1.9.1?
Comment on attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?

Nominating to get this on the m-c-landing radar.

I talked to Benjamin and we both feel that this is a bit scary for 1.9.1 at this point.  We should land on trunk, bake, and consider landing for 1.9.1.1.
Attachment #382333 - Flags: approval1.9.1?
Flags: wanted1.9.1.x?
Whiteboard: [firebug-p1] → [firebug-p1][3.5.1?]
(Reporter)

Updated

9 years ago
Depends on: 497119
What's the effect of not taking this on 1.9.1 for getting Firebug working properly with Firefox 3.5?
We're opening the m-c tree today after Neil's focus patch lands, so you don't need approval to land there.  Long-standing bug, look forward to it being fixed in 1.9.1.1.

(Another bug that just wouldn't matter if FB were out of process, I think!)
Flags: blocking1.9.1? → blocking1.9.1-
> What's the effect of not taking this on 1.9.1 

I think comment 6 covers it.  I believe this is not a regression from 1.9.0, though.  John, Curtis, can you confirm?

It _might_ be safer to change images to fire onload on the document async, which would at least reduce the incidence of this bug... But I'm not sure it really is safer: it leads to a web-observable difference in the order in which events fire in the DOM, and while I would hope pages don't depend on the ordering of certain async things, in practice some may turn out to do so....

Comment 20

9 years ago
(In reply to comment #17)
> What's the effect of not taking this on 1.9.1 for getting Firebug working
> properly with Firefox 3.5?

We will look into disabling breakpoints in events on MacOS.

Comment 21

9 years ago
(In reply to comment #18)
> We're opening the m-c tree today after Neil's focus patch lands, so you don't
> need approval to land there.  Long-standing bug, look forward to it being fixed
> in 1.9.1.1.

We don't really know if this is new or old. The reason that it is now important is a change in the user community. Macs were always popular among designers, but over the last year or so, there has been a large shift of JS devs from Win/Linux onto Mac. 


> (Another bug that just wouldn't matter if FB were out of process, I think!)

I don't think so. The in-process debug server would have to call for a nested event loop to allow the user to interact with the dom while the current thread is halted. If that happened within a event handling method the same bug would occur.  If we had separate event loops for in-page and out-of-page code within the same process, then we could avoid this problem.
(Reporter)

Comment 22

9 years ago
(In reply to comment #17)
> What's the effect of not taking this on 1.9.1 for getting Firebug working
> properly with Firefox 3.5?

I think this is a pretty serious bug for Mac Firebug users.  Trying to break into an onload handler is an extremely common use case.  We have multiple complaints in the newsgroup about it, plus I stumbled onto accidentally using some of my own code as a test case for another bug.  Also consider the severity of the bug -- it's not just Firebug failing to work right -- it's actually hanging the browser forcing the user to kill the process.  This is not so bad for users who have special test profiles, but most users probably don't so they'll be blowing away the entire browser session.

On the other hand, this bug appears to have existed for a very long time -- I can reproduce it with Firebug 1.3.3 in Firefox 3.0.10.  If the users have lived with it that long, then they can probably live with it a little longer.  Making them wait for 3.5.1 is not that big of a deal.
Flags: blocking1.9.1- → blocking1.9.1?
Curtis, did you mean to reset the blocking nom?
(Reporter)

Comment 24

9 years ago
(In reply to comment #23)
> Curtis, did you mean to reset the blocking nom?

Not intentionally, no.  I'd restore it, but I'm not sure what it was before I whacked it.
(Reporter)

Comment 25

9 years ago
(In reply to comment #19)

> It _might_ be safer to change images to fire onload on the document async,
> which would at least reduce the incidence of this bug... But I'm not sure it
> really is safer: it leads to a web-observable difference in the order in which
> events fire in the DOM, and while I would hope pages don't depend on the
> ordering of certain async things, in practice some may turn out to do so....

I'm still surprised that we're firing off an event handler from inside a destructor.  Is that by design?
(restoring flag to blocking1.9.1-)
Flags: blocking1.9.1? → blocking1.9.1-
> Is that by design?

More or less, yes.  Or more precisely, we're allowing the event handler to fire, and not forcing async firing.
(Reporter)

Comment 28

9 years ago
(In reply to comment #8)
> Created an attachment (id=382333) [details]
> Curtis, does this fix the issue you see?

I'm looking over the patch again, and I'm wondering: Does it make a different whether the event object is destructed before or after the call to AfterProcessNextEvent?
I was hoping not, but that's one of the reasons that I'm leery of just taking this on branch with no baking....  Since we need to decrement the counter before calling AfterProcessNextEvent, and since we need to decrement it after the destructor, if the destructor will do something "interesting", the new ordering is sort of forced...
(Reporter)

Comment 30

9 years ago
Created attachment 382581 [details] [diff] [review]
Proposed alternative patch (untested)

A proposed alternative to attachment 382333 [details] [diff] [review].  This patch maintains the same order of operations as the original code, but with the event object release made explicit and wrapped in its own "mRunningEvent" increment-decrement block.  This patch is currently untested, it's just intended as a discussion piece for now.

The code comments can probably be improved as well.
(Reporter)

Updated

9 years ago
Attachment #382581 - Attachment is patch: true
Attachment #382581 - Attachment mime type: application/octet-stream → text/plain
Given comment 5, that patch shouldn't fix this bug, since mRecursionDepth is updated by the OnProcessNextEvent and AfterProcessNextEvent calls...
(Reporter)

Comment 32

9 years ago
(In reply to comment #31)
> Given comment 5, that patch shouldn't fix this bug, since mRecursionDepth is
> updated by the OnProcessNextEvent and AfterProcessNextEvent calls...

So it is.  Hmm.
Pushed http://hg.mozilla.org/mozilla-central/rev/007d182fca17
Assignee: nobody → bzbarsky
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Flags: in-testsuite?
Resolution: --- → FIXED
(Reporter)

Comment 34

9 years ago
(In reply to comment #33)
> Pushed http://hg.mozilla.org/mozilla-central/rev/007d182fca17

I can't verify that this fixes the bug yet since bug 497177 prevents Firefox from breaking into Firebug at all.
(Reporter)

Comment 35

9 years ago
(In reply to comment #33)
> Pushed http://hg.mozilla.org/mozilla-central/rev/007d182fca17

It's working for me now.
(Reporter)

Comment 36

9 years ago
Verified fixed on trunk for 29162:cccc5f021924.  Note that some care was necessary to avoid triggering bug 497177 and bug 497998 during verification.
Status: RESOLVED → VERIFIED
Flags: blocking1.9.1.1?
Whiteboard: [firebug-p1][3.5.1?] → [firebug-p1]
Can we get the alternative patch marked obsolete and the appropriate patch marked for approval1.9.1.1 please? Don't think this blocks, but is wanted.
blocking1.9.1: --- → -
status1.9.1: --- → wanted
Flags: wanted1.9.1.x?
Flags: wanted1.9.1.x+
Flags: blocking1.9.1.1?
Flags: blocking1.9.1.1-
Attachment #382333 - Flags: approval1.9.1.2?
Attachment #382333 - Flags: approval1.9.1.2? → approval1.9.1.1?
Attachment #382581 - Attachment is obsolete: true
(Reporter)

Updated

9 years ago
Attachment #382333 - Flags: approval1.9.1.2?
(Reporter)

Comment 38

9 years ago
Comment on attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?

Requesting approval1.9.1.2?  bsmedberg and bzbarsky felt this was too risky a change to take right before 1.9.0/3.5 released;  However, it's been baking on trunk for over a month now.

It fixes a major Firebug/OS X bug.  And by major, I mean really, really serious.
(Reporter)

Updated

9 years ago
Blocks: 497998
Comment on attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?

Sorry to keep throwing up roadblocks but can we see tests for this? Renominate once you have them, though I think we'll likely wait for 1.9.1.3
Attachment #382333 - Flags: approval1.9.1.2?
Attachment #382333 - Flags: approval1.9.1.2-
Attachment #382333 - Flags: approval1.9.1.1?
(Reporter)

Comment 40

9 years ago
(In reply to comment #39)
> (From update of attachment 382333 [details] [diff] [review])
> Sorry to keep throwing up roadblocks but can we see tests for this? Renominate
> once you have them, though I think we'll likely wait for 1.9.1.3

Do you mean tests that verify that we've fixed Firebug, or tests that verify that we haven't broken Firefox in the process?  (The latter is what we'd really like to know, of course.)

To my knowledge we don't have any Firebug-based tests deployed yet. I don't know much about our JS testing infrastructure, but I was under the impression that we don't have anything automated for the JSD either.  But perhaps I'm wrong about that last point.
Hmm.  So what's Firebug actually doing that causes the hang?

Note that even if we could figure that out and reproduce in a test, the hang fundamentally depends on a race between the image load and other loads, and I'm not sure we can write a test that would consistently make the image lose the race...  So I don't know that it's possible to test the hang case sanely.
Mike, see comment 40 and 41.
(Reporter)

Comment 43

9 years ago
(In reply to comment #41)
> Hmm.  So what's Firebug actually doing that causes the hang?
> 

Basically, Firebug is just trying to process events while the onload event is paused.

Unfortunately, in this case the onload event is triggered outside the "protected" lines of code and the depth counter doesn't get updated properly.

(Disclaimer: My memory's a little hazy at this point, but I think that's right.)
Aha.  So all we need to do is to make sure the image loses the race and then in onload try spinning the event loop?

Does Firebug actually hang (as in deadlock or something) or is it just that no event processing happens?  The former would be easy to detect in a test, of course; the latter not so much...
(Reporter)

Comment 45

9 years ago
(In reply to comment #44)
> Aha.  So all we need to do is to make sure the image loses the race and then in
> onload try spinning the event loop?
> 
> Does Firebug actually hang (as in deadlock or something) or is it just that no
> event processing happens?  The former would be easy to detect in a test, of
> course; the latter not so much...

I don't think this is a race condition.  Here's the situation:

Firebug has a breakpoint in a JavaScript onload handler.
The document has a single image in it.
The document finishes loading first.
Then the image (the last outstanding part of the document) finishes loading
A counter somewhere goes to zero and the window onload event is fired
Firebug detects the onload event, intercepts it, and takes control (or something like that).

OK, so maybe it is a race condition between the document itself and the image.

So if you can guarantee that the image completes loading before the document, then the onload event should get fired while protected by the depth counter.

(Maybe this will all still make sense in the morning...)

In answer to the second question:

When Firebug gets called in the onload handler, it calls EnterNestedEventLoop which in turn calls Firefox event handling code.  The hang actually happens inside Mac-specific event handling code.  Basically it spins the event loop, but can't process any events.  As I recall, this is because InGeckoMainEventLoop() returns the wrong answer, and the event loop thinks that it's not supposed to to process an event at that point.  Since InGeckMainEventLoop() is never going to return a different answer, it just repeats ad infinitum.  But don't quote me on that.  I'd need to go back and look at the code to say for sure, and I'm not sure I ever fully understood what was going on there in the first place.
(In reply to comment #39)
> (From update of attachment 382333 [details] [diff] [review])
> Sorry to keep throwing up roadblocks but can we see tests for this? Renominate
> once you have them, though I think we'll likely wait for 1.9.1.3

this is going to be a hard thing to add to Firefox' existing unittest setup without pulling in all of Firebug. Something we'd be likely able to test for in FBTest in the Firebug testing kit.

Comment 47

9 years ago
As a practical matter we can't do a Mac OS only test on FBTest (not to mention it's not 'our' problem, it will affect eg Venkman just as well).

A Firefox unittest should be straight forward and it would be a great way to verify that Curtis' analysis is correct.

If you put code like below in the onLoad event handler you will simulate the break point in the onload handler. Then you just need to figure out how to adjust the timing of events to recreate your bug.

            jsd = DebuggerService.getService(jsdIDebuggerService);
            jsd.on();
            jsd.flags |= DISABLE_OBJECT_TRACE;

jsd.enterNestedEventLoop({
            onNest: function()
            {
               // you're in the event loop again here. Note that returning
               // does *not* leave the event loop, you have to do that explicitly.
            }
        });
> OK, so maybe it is a race condition between the document itself and the image.

Precisely.  The image needs to lose that race to trigger the bug.

> basically it spins the event loop, but can't process any events.

Yeah... that's harder to detect....

I suppose we could set up a timer and then spin the event loop (ourselves; we can create a nested one like jsd does as needed, or maybe even spinning the main event loop will show the bug?) till it fires.  If it doesn't fire before the unit test timeout, the test fails and we claim that it's this bug.  But we need a way to guarantee the image losing the race...

Comment 49

9 years ago
I've not tried this, but I was thinking about httpd.js on a Worker so we can get moving on FF 3.6 + Firebug. That might also work here. Have the httpd.js block the image-GET call until the onNest run and signals it. Of course this may involve even more event loops and confuse the whole issue here...
(Reporter)

Comment 50

9 years ago
(In reply to comment #48)

> ...  But we need a way to guarantee the image losing the race...

Can we extend the MochiTest webserver to provide a proxy with a built in delay, something like:

http://localhost/delayed-proxy?delay=1000ms&url=http://localhost/yada/test.png

we might not be able to provide a 100% guarantee that the document gets loaded before the the image, but getting the guarantee up into the high 90s ought to be pretty easy.  Then we could structure the test so that a false negative passes the test.

(But keep in mind that when I say this, I'm not even sure we can test JSD from a MochiTest!)
(Reporter)

Comment 51

9 years ago
I think we can come up with some way to test this patch.  I am less confident that we can get a proper test working in time to land on 1.9.1.2.  Since this patch fixes a quite serious Firebug bug on OS X, I want to argue for landing it anyway.  I don't think we should need a test to get approval to land on 1.9.1.2.

Basically, there are two risks with this patch:

1. The patch doesn't actually fix the bug.
2. The patch does fix the bug, but breaks something else, since it can effect event processing on OS X even when Firebug is not installed.

An automated test can address #1.  On the other hand, we can easily verify that manually, and frankly if the patch failed to fix the bug, we wouldn't be any worse off than we are today.

The real risk here is #2, and we simply can't construct an automated test for that.  The best we can do is test a build with the patch with all our other automated tests plus manual testing.  This is, in fact, what we've been doing, since this patch has been baking on trunk for five or six weeks now.

Requiring an automated test as a prerequisite for landing on 1.9.1.2 doesn't make sense, since the presence of an automated test doesn't do anything to mitigate the risk from #2 (and indeed can't).

Requiring the patch to bake on trunk for some amount of time does mitigate #2, and we have in fact done that.

(This said, don't let me derail discussion about *how* to do an automated test here, that's still a good thing in general!)
(Reporter)

Comment 52

9 years ago
Comment on attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?

Re-requesting approval1.9.1.2.  See comment #51 for detailed reasoning.
Attachment #382333 - Flags: approval1.9.1.2- → approval1.9.1.2?
> Can we extend the MochiTest webserver to provide a proxy with a built in delay,

I'm pretty sure that nowadays we can; that might be good enough, I suppose...

> I'm not even sure we can test JSD from a MochiTest

We don't need to test JSD, I'd think; we just need to spin an event loop.  We can certainly do that from a mochitest.  ;)
(Reporter)

Comment 54

9 years ago
(In reply to comment #53)

> We don't need to test JSD, I'd think; we just need to spin an event loop.  We
> can certainly do that from a mochitest.  ;)

We need to spin a *nested* event loop.  Are you saying that we can do that without using jsdService::EnterNestedEventLoop?  That would certainly simplify things!
I recommend reading the source of jsdService::EnterNestedEventLoop (note the comment at the top!) and comparing to, say, the _do_main function in testing/xpcshell/head.js ;)
(Reporter)

Comment 56

9 years ago
(In reply to comment #55)
> I recommend reading the source of jsdService::EnterNestedEventLoop (note the
> comment at the top!) and comparing to, say, the _do_main function in
> testing/xpcshell/head.js ;)

OK, I think I understand the comment about "nested event loops are a thing of the past".  Nevertheless, there seems to be some residual notion of nesting since we've got the "mRecursionDepth", "mNestedLoopLevel", and "mRunningEvent" variables.

In particular, the bug is not simply triggered by the image loosing the load race to the document.  In fact I'm under the impression that this is can occur normally with no ill effects.  So in order to trigger the bug there needs to be that extra ingredient of an out-of-whack "mRecursionDepth".  I imagine there's some way we could set up that condition in a test without using EnterNestedEventLoop(), but I don't know if that's easy or difficult.

Am I completely confused here?
> Nevertheless, there seems to be some residual notion of nesting
> since we've got the "mRecursionDepth", "mNestedLoopLevel", and "mRunningEvent"
> variables.

Sure; while we're in this code we can trigger more content script that will stop at another breakpoint that will reenter this code.  I think that's all that's going on here.

> In particular, the bug is not simply triggered by the image loosing the load
> race to the document. 

Right; you then have to spin the event loop from the onload event triggered by that image load finishing.  I think code similar to what I pointed to in head.js should do the trick.

waldo, do we have an example around of slow loads from httpd.js?
Examples at <https://developer.mozilla.org/En/HTTP_server_for_unit_tests> should illustrate how it might be done.  The following files may also be useful for showing how the state-transition Kabuki dance may be performed:

http://mxr.mozilla.org/mozilla-central/source/netwerk/test/httpserver/test/test_sjs_object_state.js
http://mxr.mozilla.org/mozilla-central/source/netwerk/test/httpserver/test/data/sjs/object-state.sjs

There are currently no examples designed to demonstrate this in a way similar to how real-world tests might use this because, so far, no real-world tests that use it exist (only httpd.js tests that use a number of different abstractions to make writing those tests easy, but which happen to obscure control flow precisely because they make writing those tests easy).  Also, as I've mentioned to bz before, I've been hesitant to write any such tests because I want an impartial actor to write at least the first few such tests to double-check that the current API is sane, fully capable, and not more overly complex or difficult to use than necessary.
(In reply to comment #51)
> anyway.  I don't think we should need a test to get approval to land on
> 1.9.1.2.

Tests are required for all branch landings; it's just good sense. If something's important enough to take on a stability branch, it's also important enough to have a test to make sure that no future changes break that code.

They are not sufficient for branch landings; the requirement is that it's a wanted change that has an acceptable risk profile. Your comments that:

> The real risk here is #2, and we simply can't construct an automated test for
> that.  The best we can do is test a build with the patch with all our other
> automated tests plus manual testing.  This is, in fact, what we've been doing,
> since this patch has been baking on trunk for five or six weeks now.

are what really worry me. The 1.9.1.2/3.5.2 release will have a limited beta period as we're hoping to release it in a few weeks. While it's been baking on trunk for five or six weeks, it's unclear to me if we've done any focused regression hunting on those builds. Have you looked in trunk topcrashes to see if this change is causing problems?

It feels to me like we can do focused testing around areas where we know messing with event processing has had effect. It also feels like we can figure out what parts of the code hit these changes hardest and do focused testing around there.

Mostly, though, it feels like we need a 1M user beta to shake this out instead of a 10k nightly user baking period.

IMO, this should wait for 1.9.1.3
Attachment #382333 - Flags: approval1.9.1.2? → approval1.9.1.3?

Comment 60

9 years ago
Ok, will it make it into FF 3.5.3?
Unlikely, because it still doesn't have tests and code freeze is tomorrow at 11:59pm PDT.

Comment 62

9 years ago
(In reply to comment #59) 
> Mostly, though, it feels like we need a 1M user beta to shake this out instead
> of a 10k nightly user baking period. 

Ok so based on the information here we should expect to wait until 3.6b2. But it will be in 3.6 correct?
(Reporter)

Comment 63

9 years ago
(In reply to comment #61)
> Unlikely, because it still doesn't have tests and code freeze is tomorrow at
> 11:59pm PDT.

I have a test that seems to be working now.  I still need to do a little work to verify that the test is faithfully reproducing the bug.
(In reply to comment #63)
> (In reply to comment #61)
> > Unlikely, because it still doesn't have tests and code freeze is tomorrow at
> > 11:59pm PDT.
> 
> I have a test that seems to be working now.  I still need to do a little work
> to verify that the test is faithfully reproducing the bug.

cool. Hopefully we can get this in 3.5.4.
(Reporter)

Comment 65

9 years ago
Created attachment 393902 [details] [diff] [review]
mochitest v1 (incomplete)

This mochitest reliably triggers the bug, which I verified not just by locking up the browser when running the mochitest, but also by instrumenting nsAppShell::InGeckoMainEventLoop and verifiying the symptoms at the local and member variable level.

The test currently has two major deficiencies:

1. It references an image file that doesn't exist, which is a hack to allow the main document to completely load before the image "finishes" loading, which is a requirement for triggering the bug.  Is there any alternative here beyond extending the webserver?

2. It's not clear to me how to report success to the mochitest framework.  I'm also assuming that we can "report failure" by just timing-out the test framework when the browser hangs, although it would be nice to detect the error explicitly and fail in a controlled (and relatively speedy) fashion.

Any advice on how to proceed would be appreciated.
Attachment #393902 - Flags: review?(bzbarsky)
(In reply to comment #65)
> 1. It references an image file that doesn't exist, which is a hack to allow the
> main document to completely load before the image "finishes" loading, which is
> a requirement for triggering the bug.  Is there any alternative here beyond
> extending the webserver?

You should be able to have two SJS documents that use setObjectState and getObjectState to coordinate response processing in this manner, possibly with a third-party page which, when loaded from a main-page DOMContentLoaded callback, will trigger the completion of the image load.  Or something like that; details as an exercise for the reader, this test might help (or more likely just confuse, given that xpcshell is so close to the metal):

http://mxr.mozilla.org/mozilla-central/source/netwerk/test/httpserver/test/test_sjs_object_state.js
I'm not sure I follow the test.  How would _quit become true?
(Reporter)

Comment 68

9 years ago
It won't.  This is what happens when a programmer cribs code from another test without fully modifying it for its new role.

Actually, I guess that's kind of the point.  I'm not sure how to modify the code for what I need it to do here.  This test reproduces the bug (assuming the fix hasn't been applied, of course) in a way a person can easily detect: It locks up the browser.  But I'm not sure how the test itself can detect the problem.

Question: will an outstanding setTimeout() callback get ignored if the rest of the browser is locked up, too?

If so, we could set the loop up to run for a fixed amount of time, say 30 seconds.  Then we could schedule a function for t + 15 seconds, and have that function clear _quit (or whatever you want to call it).  Then we start the loop.  If the loop breaks because _quit changed, then the test passes.  If the loop ends because of the 30 second limit, however, the test fails.

Well, that seems like a nice idea, anyway.  But I guess the callback can't run until the JavaScript executing the event loop terminates.

Anyway, sorry about the sloppy state of the test code.
> Question: will an outstanding setTimeout() callback get ignored if the rest of
> the browser is locked up, too?

Good question.  Test?  My suggestion earlier in this bug assumed yes....

Note that the right way to fail is to just time out the test timeout, not add your own 30s timeout.

The right way to pass is to not time out the test timeout.  Imo.

> But I guess the callback can't run until the JavaScript executing the event
> loop terminates.

That guess is certainly false.  ;)
(Reporter)

Comment 70

9 years ago
(In reply to comment #69)
 
> > But I guess the callback can't run until the JavaScript executing the event
> > loop terminates.
> 
> That guess is certainly false.  ;)

Well, the good news is that the setTimeout() callback does run.  The bad news is that it seems to run whether the bug is fixed or not.

I tried an alternative approach of dynamically creating and inserting an image and only setting "quit" when that image's onload handler ran.  Alas, that approach worked just like the setTimeout() method.

I'm now hypothesizing that the problem is not so much "processing of events" as it is "processing of *Mac OS* events".  Is there any way we can synthesize an OS event from a mochitest? (Keep in mind that I have never fully grasped what's going on the event loop, so I'm not even sure this makes sense.)
I doubt we can synthesize an OS event from mochitest....
(Reporter)

Comment 72

9 years ago
(In reply to comment #71)
> I doubt we can synthesize an OS event from mochitest....

If we put the issues of a proper mochitest aside for a moment, is there an easy way to test the hypothesis that it's really just OS events that aren't being handled?

Returning to the mochitest issue, I guess the obvious question to consider is whether this bug is going to require a compiled code test.
You could run the test manually in a build without this patch and without the setTimeout.  If the test UI is nonresponsive, but running it with a timeout fires the timer, sounds like the hypothesis would be strongly corroborated....

I don't see how a compiled code test would help.  Our problem is that the failure condition is not one we can easily detect programatically, no?
Attachment #393902 - Flags: review?(bzbarsky) → review-
(Reporter)

Comment 74

9 years ago
(In reply to comment #73)
> You could run the test manually in a build without this patch and without the
> setTimeout.  If the test UI is nonresponsive, but running it with a timeout
> fires the timer, sounds like the hypothesis would be strongly corroborated....

That's pretty much what I've been doing that led me to that conclusion in the first place :-).  I suppose the distinction may be academic anyway.

> I don't see how a compiled code test would help.  Our problem is that the
> failure condition is not one we can easily detect programatically, no?

I'm thinking that we probably *can* synthesize an OS-level event from a C++ test.  We'd also need to be able to specify an event-handler that will get called if and when our synthetic event gets processed.  Essentially it's the same thing I've been trying to do with the mochitest, except with the ability to generate an OS event for the test. Well, that's the theory, anyway.

Another possibility is if we can instrument nsAppShell in some formal way, and then maybe we can just query it directly.  Basically, we'd just be querying it for mRecursionDepth, and checking to make sure that we got back a sensible number.  This kind of a solution might not be ideal as a test, since we wouldn't be testing for the presence of the actual symptom.  Assuming we could do this, I suppose we might be able to expose our instrumented API in a way that would work with a mochitest...

Comment 75

9 years ago
beltzner's 3.6b2 is looking pretty good now...
Flags: wanted1.9.1.x+
Comment on attachment 382333 [details] [diff] [review]
Curtis, does this fix the issue you see?

Clearing approval request. Any more work been done here on getting a test for this? We'll consider it for 1.9.1.4 assuming a test is made.
Attachment #382333 - Flags: approval1.9.1.3?
You need to log in before you can comment on or make changes to this bug.