Closed Bug 67752 (ireflow) Opened 19 years ago Closed 11 years ago

[FIX]need to implement interruptible reflow

Categories

(Core :: Layout, defect, major)

defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9.2a1

People

(Reporter: buster, Assigned: bzbarsky)

References

(Depends on 5 open bugs, Blocks 3 open bugs)

Details

(Keywords: perf)

Attachments

(3 files, 15 obsolete files)

908 bytes, text/html
Details
395 bytes, text/html
Details
94.18 KB, patch
Details | Diff | Splinter Review
We need interruptable reflow for a variety of performance reasons.
Responsiveness during incremental page load, faster window resizing, and faster
percieved responsiveness to multiple incremental reflows (i.e., typing) are just
some of the benefits.

This is a placeholder bug for the enhancement.  Design discussions should take
place on the layout newsgroup.
Keywords: perf
Assigning milestone
Status: NEW → ASSIGNED
Priority: -- → P3
Target Milestone: --- → mozilla1.0
Target Milestone: mozilla1.0 → Future
Blocks: 114584
Keywords: mozilla1.0+
I'll do this one the day before the tree closes for 1.0 :)

Seriously, this is waaay to big and waaay to risky for 1.0, unless we want to
push off 1.0 until the end of 2002.  Besides, I'd argue that effort is better
spent making reflow faster and more predictable as oppssed to making it
interruptible.
Keywords: nsbeta1-
Keywords: mozilla1.0+mozilla1.0-
*** Bug 83732 has been marked as a duplicate of this bug. ***
Note: bug 83732 has a patch (possibly in need of updating) to gather stats on
reflow durations.
Blocks: 71668, 140391
Depends on: 83737
OS: Windows NT → All
Hardware: PC → All
Blocks: 91643
Any ideas on that so far?
I suggest to mark this bugas WONTFIX and focus on a layout/ and content/ code
which can be used by multiple threads at the same time (note that this is easy
to implement (but lots of work)).
That would solve far more issues and has far greater benefits (for example
running the first reflow in a seperate thread while the event queue can process
other windows/frames/TABs/etc. - or running printing and print preview in
seperate threads to keep the browser useable while printing).
No, this is really different from multithreading. We need interruptible reflow
as well as multithreading.
Robert O'Callahan wrote:
> No, this is really different from multithreading. We need interruptible reflow
> as well as multithreading.

Why ? Having the ability to run with multiple threads in layout/&&content/ code
could mainly eliminate the need of interruptible reflow (if we really put the
heavy reflow tasks on sepeate threads and let it send an event when the job is
done).
IMHO there are possible scenarious where a interruptible reflow would make sense
but having a threadsafe version of layout/&&content/ with zero interlocks (which
is technically possible and not hard) would be of a far greater benefit.
We'd like to be able to interact with the reflowing page during long-running
reflows. That requires interruptible reflow.

Multithreaded means that when one page is reflowing you can interact with
another page, which is useful but less beneficial I think, unless the user is
some sort of channel-surfing freak :-).
Robert O'Callahan wrote:
> Multithreaded means that when one page is reflowing you can interact with
> another page, which is useful but less beneficial I think, unless the user is
> some sort of channel-surfing freak :-).

- ... or simply wants to print something without that the browser locks-up
during printing.
- ... or wants to visit print preview while still being able to browse
- ... or wants to load pages like
http://bonsai.mozilla.org/cvsquery.cgi?treeid=default&module=all&branch=HEAD&branchtype=match&dir=&file=mozilla%2F&filetype=match&who=&whotype=match&sortby=Date&date=hours&hours=8760&mindate=&maxdate=&cvsroot=%2Fcvsroot
without getting a Mozilla which locks up after loading 10% of the page

Currently it's not even possible to load one larger page in one window and surf
in a 2nd window without having a very laggy UI. IMHO we should offload some
longer-running jobs (e.g. every thing which needs more than 1/2sec to run) from
the UI thread to seperate threads.
You can do all those things with interruptable reflow.
BTW there is a lot of stuff in Gecko which would require some sort of
interlocking if we wanted to run multiple UI threads. It would introduce a lot
of overhead and be VERY hard to get right. Lots of static shared resources that
have to be protected.

Incidentally part of my day job is developing tools for finding timing bugs in
large multithreaded programs. I really don't want to see Mozilla in the same bad
way.
roland:
is there a bugnumber for making the reflow thread safe?
Robert O'Callahan wrote:
> BTW there is a lot of stuff in Gecko which would require some sort of
> interlocking if we wanted to run multiple UI threads. It would introduce a lot
> of overhead and be VERY hard to get right. Lots of static shared resources 
> that have to be protected.

My solution goes into a different direction - trying to prevent interlocks via
putting global/static data into a clever handle API (e.g. each thread has it's
own set of handle objects) and event/message passing facility for syncronisation
(does anyone remember AmigaOS's nice solution for this ? :)
And we would get interuptable layout with such a design, too (via adding some
kind of "checkpoints" where a job can be canceled/suspended or pushed on another
thread (for example moving a job from the UI to a background thread or the
reverse direction)).

> Incidentally part of my day job is developing tools for finding timing bugs in
> large multithreaded programs. I really don't want to see Mozilla in the same 
> bad way.

I know this kind of h*ll very good. That's why I am proposing a very clean
solution which tries to avoid problems at very early stages (e.g. design level).
Multithreading without interruptible reflow wouldn't be that useful. For
example, if you have one window open with a long running reflow happening in it,
you wouldn't be able issue a command to open a new window (since event handling
in that window would be blocked, just as it is now). You wouldn't be able to
interact with the window in any way.

On the other hand, if we have interruptible reflow, we don't need
multithreading. We can stop reflowing the main window anytime and effectively
"context switch" by going off to process a different event.

If you're not convinced, find me on IRC.
Jens-Uwe wrote:
> is there a bugnumber for making the reflow thread safe?

Not really. I gave up with that project around one year ago since I realised a
major problem:
It requires at least two months (8-10h/day) for me to implement that for the
whole layout/&&content/ tree. Faster is better here to keep the work in sync
with the "trunk" tree.
But I cannot spend that much time into my work for mozilla.org since I am a
student who still has to do other things to get some food into my stomach, not a
full time engineer who is being paid for the work.
Robert O'Callahan wrote:
> Multithreading without interruptible reflow wouldn't be that useful. For
> example, if you have one window open with a long running reflow happening in 
> it, you wouldn't be able issue a command to open a new window (since event 
> handling in that window would be blocked, just as it is now). You wouldn't be
> able to interact with the window in any way.

> On the other hand, if we have interruptible reflow, we don't need
> multithreading.

That's wrong. Your "interupable reflow"-solution does not scale (remember that
SMT will become more and more important in the next couple of years, some
products are already "out" (like Intel's HyperThreading) and some others will
follow soon).

For example:
Take a look at IE6 - if you have 20 windows open you still can use the browser
without problems. Then do the same with Mozilla you'll have luck when you get a
response within some seconds (or better: both browser or any other application
like MailNews is close to unuseable).
IE's solution is to run each window in a seperate process to gurantee that each
window does not affect the others. Mozilla can't do that yet (which would be
another "nice-to-have" to avoid that a browser crash kills MailNews+ChatZilla
etc.) and I doubt we can go into that direction without major rants (we would
need fine-grained locking for the profile data and some kind of fast IPC
facility).

> We can stop reflowing the main window anytime and effectively
> "context switch" by going off to process a different event.

No, you're wrong here. You can't stop it everytime - you can stop it when reflow
returns to a predefined mark. That's not the preemptive scheduling I have in
mind.
Roland, you seem to have ignored my main point, which is that multithreading
alone, without interruptible reflow, doesn't solve most of the problems. I
assure you that one thread per window is the finest granularity you're going to
be able to get, and with that you won't be able to handle user input to the
window while a reflow is happening. (There is *no way* we'll ever be able to
handle events or paint a document tree *while* part of that same document tree
is being reflowed. And you won't be able to handle different documents in a tree
in separate threads, there's too much communication along trees in DOM and layout).

You raise three other issues:

1) You need multiple threads to take advantage of SMP/SMT. That's true but not
very relevant. With one thread per window, it would only improve performance if
you're using two windows simultaneously. That's not a very common scenario.*

2) Threads are a step towards better isolation of windows. Yet, as you observe,
 for real isolation you need separate processes and a different set of
solutions. With standalone apps (Phoenix, Minotaur, etc) coming, we probably
will have to develop those solutions anyway. I think that's a good direction to
go in.

3) Interruptible reflow only lets you switch away at predefined preemption
points. That's true, but it's not a problem as long as you choose the right set
of preemption points. I think even with just a couple of preemption points we
would get decent average preemption latency. (E.g., inside the
nsBlockFrame::ReflowDirtyLines loop, and at the start of nsBlockFrame::Reflow.)

* We might be able to get a really useful performance improvement from SMP by
modifying painting to use multiple threads (e.g., split the painted area up into
bands each painted by a different CPU). This would not be so hard because
painting is pretty much read-only. On the other hand, it would depend on the
underlying platform because many platforms can't handle multithreaded graphics
very well.
For the record, there is another bug open on multithreading on a per-docshell
basis, which would improve UI responsiveness during document reflow.
Hixie: you mean "improve responsiveness of another window during document reflow
in the first window", right?

Roland: I've done multithreaded code for 17 years, starting with kernel code at
SGI, which got into SMPs in a big way in the late '80s.  I'm here to tell you
that your handwaves and talk of "design" are not convincing.  As you noted, we
can't and should not want to "stop the Gecko world" while you take 2 months
(more like 20 months, with help) to try to decouple and compartmentalize data,
and ensure thread-safety.  Incremental reflow, on the other hand, could be
incrementally developed.

Roc's right, hyatt demonstrated three years ago by adding nested event loops to
just a few places in layout (the loop roc mentioned, IIRC) that responsiveness
of a single window loading or reflowing a document could be greatly improved. 
That hack was not safe, of course; too much non-reentrant code doesn't protect
itself.  But it proved the concept.

/be
I can't find Hixie's bug, but there is one docshell for the chrome and another
for content, so if you could have seperate UI threads for those then you would
have a responsive UI while reflowing. Unfortunately we'll never be able to have
per-docshell threads because there's a lot of communication between docshells in
the DOM, layout and views.
OK, I give up fighting the xx@@@!!!-windmills here - someone noone seems to be
interested in a simple and efficient solution to implement both multithreading
and interruptable layout at the same time (for the log: This isn't hard to
implement - on the other hand noone seems to be interested in this (and yes, I
was thinking of an incremental work in this without a seperate branch and
without holding up the normal progress)).
brendan: what roc said (the chrome would be more responsive if each docshell had
its own thread since the content area and the chrome are in different docshells).

roc: bug 40848. The fact that the threads communicate should not be a problem,
messages being posted from one thread to another (e.g. status updates) can be
buffered until the target thread is ready to poll for them (e.g. using a message
pool thread). I can't really think of any inter-docshell message which needs to
be synchronous/blocking, but then I don't really know the architecture that well.
> I can't really think of any inter-docshell message which needs to
> be synchronous/blocking

1) Painting
2) Inter-doc DOM calls
3) Event handling

Those are the obvious ones. There are probably lots more lurking.
Hixie: what roc said (comment 21, maybe you didn't read it before commenting),
all of it, once more with feeling.

Our chrome gets lots and lots of DOM property values from content.  Ask caillon,
or study up before you make bold claims about how easy asynchronous messaging is
as a solution for docshells in threads needing to communicate.

/be
Summary: need to implement interruptable reflow → need to implement interruptible reflow
This bug would need a new owner :)
I don't think there is as great of a need for interruptible reflow now that
fixes for bugs:

http://bugzilla.mozilla.org/show_bug.cgi?id=165039
http://bugzilla.mozilla.org/show_bug.cgi?id=157144

have been checked in.

If there is any user-interaction during page load Gecko dynamically switches to
a mode in which smaller chunks of data are parsed before returning to the native
event loop. This causes the number of frames which are created and reflowed
before returning to the native event loop to remain small which makes Gecko very
responsive. When the user-interaction stops the parser switches back to
processing large chunks of data, which improves overall page load performance. 

On WIN32 the interruptible parsing logic is working very well. Mac and Linux
*may* need some improvements to better detect when there are pending input events.

nsbeta1-.
Keywords: nsbeta1nsbeta1-
Linux is as unresponsive as ever during page load. Clicking to switch tabs at
the begining of page load doesn't execute until page load is complete.
ctrl-clicking in a URL to a page to open it in a new tab during page load is
NEVER executed.

0 visable improvment. Yay for the ignored platforms.
"clicking to switch tabs at the begining of page load doesn't execute until page
load is complete."

That sounds like a linux specific native event starvation issue, which may be
the result of how native PL_Event dispatch is implemented for Linux.
Or it may be fixed when bug 172499 is addressed.
"clicking to switch tabs at the begining of page load doesn't execute until page
load is complete."

Doesn't that sound like bug 76495?

/be
No, bug 76495 is only about the fact that our content area dies, not that the
app becomes unresponsive. You can still resize the window, e.g., while bug 76495
is visible; indeed that is one of the clearest ways of reproducing that bug.
Flags: blocking1.4?
Keywords: mozilla1.3
tough work ahead but would be very fine to have :)
Assignee: attinasi → other
Blocks: 168157
Status: ASSIGNED → NEW
Keywords: nsbeta1-nsbeta1
Priority: P3 → --
Target Milestone: Future → ---
adt: nsbeta1-
Keywords: nsbeta1nsbeta1-
Blocks: 203448
Flags: blocking1.4? → blocking1.4-
I think we could get some interruptibility quite easily if we had a
cross-platform API to quickly check for pending input events. Here's what I'd do:

One key point where we want to be able to interrupt is in the
nsBlockFrame::ReflowDirtyLines loop that iterates over the dirty lines. This is
basically where all children of block elements get reflowed in turn. At each
iteration of this loop we could poll to see whether we need to interrupt. If we
need to interrupt, then we post a dirty reflow command targeted at the current
block, and --- and this is the tricky bit --- capture the unprocessed
incremental reflows targeted at the descendants of this block and repost them as
new commands. Then we can just exit the loop, knowing that those posted reflow
commands eventually be processed and complete the reflow. (That might not work,
but that's the idea :-).)

One tricky thing is that some reflows are really uninterruptible. For example,
sometimes we reflow to get box offsets needed for Javascript code to continue
executing. I don't think we can interrupt those reflows.
This would be easier if we switch to a dirty-bits system.
The last comments on this bug are old, from 2-4 years ago.
I am not a developer so can somebody knowledgeable re-evaluate the situation now, and whether fixing this bug will imrpove mozilla/Firefox responsiveness, and what is the significance of this bug now on Firefox unresponsiveness?

If this bug can improve things, can it be entered to the Firefox 2 or 3 plans?
Although I have little ( > 0 however) experience with multithreaded programming,  I was thinking why roc might be wrong after all regarding the necessity of interruptible reflow in a multi-threaded environment: Why not have 2 (or more) threads per loading page, one for reflow and one for interacting with the page while it is being reflowed.

Also, this:
> Multithreaded means that when one page is reflowing you can interact with
> another page, which is useful but less beneficial I think, unless the user is
> some sort of channel-surfing freak :-).

is like calling everyone who uses multiple tabs or browser windows a channel-surfing freak. It wasn't true 4 years ago and it certainly isn't true today. Many a time have I found myself cursing mozilla for one page with flakey JS or too much content bringing my browsing experience to a halt.

> BTW there is a lot of stuff in Gecko which would require some sort of
> interlocking if we wanted to run multiple UI threads. It would introduce a lot
> of overhead and be VERY hard to get right. Lots of static shared resources 
> that have to be protected.

Is this really true? Also, can't most of this stuff have the need for interlocking removed, or compartmentalized somehow? Surely at the tab-level at least, the interaction and reflowing in one tab shouldn't depend so much on the interaction and reflowing in another page... right?

> I assure you that one thread per window is the finest granularity you're going 
> to be able to get, and with that you won't be able to handle user input to the
> window while a reflow is happening. (There is *no way* we'll ever be able to
> handle events or paint a document tree *while* part of that same document tree
> is being reflowed.

Maybe you're right, I dunno, but can you explain why that is? If we're computing the positions of table cells, or setting colors and borders, or other such work, what's to prevent the user from, say, typing some text in a textbox?

> multiple threads ... would only improve performance if
> you're using two windows simultaneously. That's not a very common scenario.*

Except for your own reservation, here's another one: One very often middle-clicks two or three links from the same webpage, letting them load background tabs simultaneously, while only viewing one of them, or another page altogether. That's not an uncommon scenario - and it's the exact scenario in which mozilla is currently performing somewhat weakly. Especially when the work on other tabs is adversely effecting the user experience in another tab.

> Threads are a step towards better isolation of windows ...
>  for real isolation you need separate processes and a different set of
> solutions. With standalone apps ... coming, we probably
> will have to develop those solutions anyway

Can anyone elaborate on this point, in retrospect?

(In reply to comment #37)
> and what is the significance of this bug now on Firefox unresponsiveness?

IMVHO it is quite significant; that is, in most browsing scenarios in which ff/sm is unresponsive on newer hardware, multithreading or interruptible reflow should help.

> If this bug can improve things, can it be entered to the Firefox 2 or 3 plans?

The answer is "no way" and "most probably not", I would guess.
Thanks for the answers, Eyal.

I have another important question that I forgot: I understand that improving Firefox responsiveness by using threads is complicated, since the work cannot be done incrementally - changes affect the whole design, need to be done at once and in a lot of places before it gets to a stable state.

But what about implementing interruptible reflow? Can the work be done incrementally, in relatively smalll pieces, each time fixing the next low hanging fruit (each time fixing one object type's reflow to be interruptible, for example), or does it have to be done at once for the whole system ?

>Maybe you're right, I dunno, but can you explain why that is? If we're
>computing the positions of table cells, or setting colors and borders, or other
>such work, what's to prevent the user from, say, typing some text in a textbox?

Typing text in a textbox affects the content tree. It also triggers a reflow. And it can trigger event listeners that change the positions of table cells, or the color of an element.
(In reply to comment #36)
> This would be easier if we switch to a dirty-bits system.
> 

I believe this is the case now your reflow branch has landed (bug 300030).
Attached patch proto-patch (obsolete) — Splinter Review
This patch kinda works. You can interrupt reflows, process events, and then the reflow will complete.

The major problem here is that on Mac and GTK I can't find a way to reliably detect whether there's an input event in the event queue. On Mac, Appkit's nextEventMatchingMask:dequeue:NO method actually processes events sometimes (events that don't match the mask???) so that's completely useless (we definitely don't want to process any events inside reflow). I tried using CGEventSourceCounterForEventType and watching for changes, but that doesn't tell us what Appkit has in its queue, it causes interrupts when events are sent to other apps, it causes interrupts for mouse moves, and worst of all it seems to be artifically limited to report with a granularity to the nearest 10 events or so. So the code in this patch just registers an interrupt 500ms since the last time we handled a native event.

GTK/X11 seems to have a similar lack of APIs. The code I have in this patch calls XCheckMaskEvent, which is completely wrong since if it finds an event, the event is discarded without being processed. There doesn't seem to be a way to "peek without blocking". Why can't these platforms have an equivalent of Windows' GetInputState, which has existed for about 20 years?

If we could solve these problems I think the patch would work pretty well :-(.
Attached file testcase
Load this testcase. Click on the button and then start pressing the down-arrow key. If incremental reflow is working, there will be an initial delay (frame construction, which is not interruptible) but then you will be able to scroll before the reflow is complete (much earlier than with a normal build). At some point reflow will complete and line 200,000 will be visible at the bottom of the page.
Stephen, Josh, see comment #42. Maybe you have some insight into the Mac API situation here.
I filed bug 462833 on having an API for checking for pending input events.
(In reply to comment #44)

I'll look into the situation on the Mac (I'll try to find a way to
discover whether there are input events in the system event queue
without using nextEventMatchingMask:dequeue:NO).

My chance of success is no better than 50/50.  I've already failed
once (a couple of years ago) trying to find a way to access the system
event queue directly (without using nextEventMatchingMask:dequeue:).
But I may be able to come up with some sort of hack.  And it may be
easier to detect whether or not the system event queue contains input
events.
(In reply to comment #42)
> GTK/X11 seems to have a similar lack of APIs. The code I have in this patch
> calls XCheckMaskEvent, which is completely wrong since if it finds an event,
> the event is discarded without being processed. There doesn't seem to be a way
> to "peek without blocking". Why can't these platforms have an equivalent of
> Windows' GetInputState, which has existed for about 20 years?

Maybe call XPending to find the number of events in the queue; if > 0, call XPeekIfEvent, making sure to return True if you end up processing the number of events returned by XPending (to ensure that it returns without blocking); that is, if you get to the end of the pending queue.

Though that might have weird interactions with gdk events.. you could just use gdk_events_pending() ?
gdk_events_pending works OK. It's not ideal because it interrupts on mouse moves. I'll fix up some more bugs and post another patch.
(In reply to comment #48)
> gdk_events_pending works OK. It's not ideal because it interrupts on mouse
> moves. I'll fix up some more bugs and post another patch.

That's ok, especially for fennec -- for desktop we might do the XPending/XPeekIfEvent hack.
Attached patch proto-patch v2 (obsolete) — Splinter Review
This uses gdk_events_pending. Seems to work reasonably well. There's some weirdness with the testcase sometimes which I can't pin down. Still, this patch should be worth experimenting with.
QA Contact: chrispetersen → layout
Attached patch Checkpoint with a few tweaks (obsolete) — Splinter Review
Fixes the deferred-vs-nosync view batch end for interruptible reflows to use no-sync instead of deferred.  Fixes SetInterruptState to not interrupt if aInterruptible is being set to false when an event is pending.  Adds some env vars for configuring the way the interrupting will happen, for testing purposes.
Attachment #359086 - Attachment is patch: true
Attachment #359086 - Attachment mime type: application/octet-stream → text/plain
This should handle XUL somewhat better, I hope.  Also fixes some issues with float damage propagation, resize reflows, reflow callbacks in non-interruptible reflow, printing issues.
Attachment #346007 - Attachment is obsolete: true
Attachment #346166 - Attachment is obsolete: true
Attachment #359086 - Attachment is obsolete: true
Depends on: 476927
Depends on: 476724
I should note that I tried that patch on the testcase and it helps some... until session restore kicks in and tries to save the scroll position.  Maybe we should have a followup bug on having a way for session restore to ask for that information without flushing out layout.
No longer depends on: 476724
Depends on: 476724
Attached patch Another checkpoint (obsolete) — Splinter Review
Fixes column-sets, shuts up the timing spew unless one asks for it, passes reftests with my "interrupt in the first leaf block in a depth-first traversal" setup.

This still asserts like crazy, but that can be quieted down if desired by setting sDisableGetUsedXAssertions to true in nsLayoutUtils.h.

I haven't tried this on Linux yet, but if someone gets a chance to try it with Fennec, I'd love to hear how things look, both in terms of UI breakage (hopefully none) and in terms of effect on responsiveness (sadly hard to measure).
Attachment #360423 - Attachment is obsolete: true
Attached patch With better dirty bit management (obsolete) — Splinter Review
The big change here is that we remove bits as we normally would during reflow, but flag interrupted blocks as needing the HAS_DIRTY_CHILDREN bit once reflow finishes and setting those bits at that time.  We also flag all the after-the-interrupt kids of a block that has IS_DIRTY with IS_DIRTY.

There's a remaining issue with font-face, which I think I finally understand.  We load user fonts from reflow, but if the reflow that starts the load is the flush onload is doing, we end up with the load DOM event happening before the user fonts have loaded, and we lose.

I'll think about how best to deal with that and what to do with the remaining assertions.
Attachment #360855 - Attachment is obsolete: true
Depends on: 477880
Depends on: 400226
Depends on: 478101
Attached patch Ready for review, I think (obsolete) — Splinter Review
I think this is more or less ready.  There's still the question of how often we want to interrupt, but that might need to be tuned on a per-app basis.  I didn't really change the widget code from what roc wrote, so if that needs other reviewers please request as needed?

The remaining outstanding issues:

1)  Since we now do an interruptible flush on scroll, I save the node we tried to scroll to and if the flush is interrupted try to scroll again after the next time we do a non-interrupted reflow.  I'm not sure this is desirable behavior, actually; if the user scrolls in between; perhaps we should either not allow interruption of scroll flushes or not worry about scrolling to the wrong place if the flush is interrupted?  If we preserved position better when reflowing, I'd definitely go with the latter...
2)  Various XXX comments in nsBlockFrame that can become followup bugs.
3)  Still needs lots of testing; I'll get in touch with Jesse if nothing else.
Assignee: roc → bzbarsky
Attachment #361454 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #362073 - Flags: review?(roc)
Attachment #362073 - Flags: review?(dbaron)
Summary: need to implement interruptible reflow → [FIX]need to implement interruptible reflow
###!!! ASSERTION: Frame passed in not in reflow?: 'aFrame->GetStateBits() & NS_FRAME_IN_REFLOW', file /Users/jruderman/central/layout/base/nsPresShell.cpp, line 3303

Testing with:
export GECKO_REFLOW_INTERRUPT_MODE=counter
export GECKO_REFLOW_INTERRUPT_CHECKS_TO_SKIP=1
export GECKO_REFLOW_INTERRUPT_FREQUENCY=1
When I'm using Bugzilla, I think this patch causes me to get more:

###!!! ASSERTION: cannot call GetUsedBorder on a dirty frame not currently being reflowed: 'nsLayoutUtils::sDisableGetUsedXAssertions || !NS_SUBTREE_DIRTY(this) || (GetStateBits() & NS_FRAME_IN_REFLOW)', file /Users/jruderman/central/layout/generic/nsFrame.cpp, line 565

than usual.
Yeah.  You probably need the patch from bug 478101 for those with that they should be more or less gone, I think.  Sorry about that.  I'd been hoping to have that checked in by now, but the tree's perma-broken.

I'll look at that other assertion; thanks for the testcase!
Alias: ireflow
OK, that assert isn't actually caused by this patch; I think this patch just makes it easier to trigger.  I filed bug 478465 on the real issue.
Depends on: 478465
Depends on: 478504
Depends on: 478527
Depends on: 478594
Attachment #362073 - Attachment is obsolete: true
Attachment #362784 - Flags: superreview?(dbaron)
Attachment #362784 - Flags: review?(roc)
Attachment #362073 - Flags: review?(roc)
Attachment #362073 - Flags: review?(dbaron)
The only thing here that doesn't quite work on Windows Mobile is the use of random/srandom, which I've commented out locally to proceed.
Comment on attachment 362784 [details] [diff] [review]
Should hopefully compile on Windows Mobile

>diff --git a/content/base/src/nsDocument.cpp b/content/base/src/nsDocument.cpp

>   // Since media queries mean that a size change of our container can
>   // affect style, we need to promote a style flush on ourself to a
>   // layout flush on our parent, since we need our container to be the
>   // correct size to determine the correct style.
>   if (mParentDocument && IsSafeToFlush()) {
>-    mozFlushType parentType = aType;
>-    if (aType == Flush_Style)
>-      parentType = Flush_Layout;
>-    mParentDocument->FlushPendingNotifications(parentType);
>+    mParentDocument->FlushPendingNotifications(PR_MAX(aType, Flush_Layout));
>   }

I don't see why we should promote flushes less than Flush_Style.  Why shouldn't the diff just be:
-    if (aType == Flush_Style)
-      parentType = Flush_Layout;
+    if (aType >= Flush_Style)
+      parentType = PR_MAX(Flush_Layout, aType);

> nsEventStateManager::FlushPendingEvents(nsPresContext* aPresContext)
>-    shell->FlushPendingNotifications(Flush_Display);
>+    shell->FlushPendingNotifications(Flush_InterruptibleLayout);

Did you test responsiveness while dragging?  Are there cases where we needed to force the paint?  (Especially on Windows?)

> nsGenericHTMLFormElement::SetFocusAndScrollIntoView(nsPresContext* aPresContext)
>-    nsIFormControlFrame* formControlFrame = GetFormControlFrame(PR_TRUE);
>+    aPresContext->Document()->
>+      FlushPendingNotifications(Flush_InterruptibleLayout);
>+    nsIFormControlFrame* formControlFrame = GetFormControlFrame(PR_FALSE);

Could you leave a comment that once bug 43114 is fixed, we probably don't need any layout flushing at all?

Likewise, we can remove a bunch of your pres shell changes once bug 43114 is fixed.  Probably worth leaving a comment in bug 43114 once this lands, pointing to the changeset.

>diff --git a/layout/base/nsIPresShell.h b/layout/base/nsIPresShell.h
>+  /**
>+   * Tell the presshell to mark as having dirty children a path from the given
>+   * frame to the nearest ancestor with a dirty subtree, or to the reflow root
>+   * currently being reflown if no such ancestor exists.  This is to be done
>+   * immediately after reflow of the current reflow root completes.  This
>+   * method will also mark the frame passed in (and possibly nothing else, if
>+   * it's already dirty).  This method must only be called during reflow, and
>+   * the frame it's being called on must be in the process of being reflown
>+   * when it's called.  This method doesn't mark any intrinsic widths dirty and
>+   * doesn't add any bits other than NS_FRAME_HAS_DIRTY_CHILDREN.

s/reflown/reflowed/g
fly-flew-flown (irregular), flow-flowed-flowed (regular)

Also, the third sentence is is a bit confusing.  Maybe remove it and just add "(inclusive)" in both places to the first sentence.

>diff --git a/layout/base/nsPresContext.cpp b/layout/base/nsPresContext.cpp

Up to here.  More later.
> I don't see why we should promote flushes less than Flush_Style.

This code is preceded by a check that guarantees that by the time we get here aType > Flush_ContentAndNotify.  So the two are equivalent.  I'm happy to do it your way if we want to be future-safe in case we end up with a flush type between ContentAndNotify and Style.

> Did you test responsiveness while dragging? 

No.  Certainly not on Windows.  Will do.

Will add comments pointing to bug 43114, and comment in that bug.

Will fix the nsIPresShell comments.
Attached patch refreshed to latest trunk (obsolete) — Splinter Review
I didn't change anything but resolved a couple of conflicts (IIDs, mainly).
Attachment #364202 - Attachment is obsolete: true
Comment on attachment 362784 [details] [diff] [review]
Should hopefully compile on Windows Mobile

I just realized the widget parts of this are so totally not ready it's not worth reviewing.  I'll work on those.
Attachment #362784 - Flags: superreview?(dbaron)
Attachment #362784 - Flags: superreview-
Attachment #362784 - Flags: review?(roc)
Attachment #362784 - Flags: review-
Blocks: 481129
Blocks: 481740
Attached patch With event detection all working (obsolete) — Splinter Review
This compiles on Windows, by just removing the "random" testing functionality there.  Also does event detection on all platforms now.  I did test that on Windows this catches at least some events; it doesn't help a huge amount with resizing HTML5 (not catching drag events?) but helps some.  I did test dragging on windows in general, per comment 63, and didn't seem to see any issues, at least in my VMWare VM.
Attachment #362784 - Attachment is obsolete: true
Attachment #364216 - Attachment is obsolete: true
Attachment #369548 - Flags: review?(roc)
Attachment #369548 - Flags: review?(dbaron)
Oh, the remaining issue is that this seems to time out on Linux Talos on the try server.  I haven't figured out why yet.  :(
I just tried turning off script and resizing html5 on Windows.  It certainly looks like GetInputState doesn't see drag events (but does see mouse down and up).  If someone has suggestions for detecting drag events too, I'm all ears.  ;)
For calculating "drag" events, would it be enough in GetInputState or wherever, to simply calculate some delta between the current pointer location and the previous, and if it's beyond some threshold, call that a "drag event"?
We might be able to, but that would catch mouse moves too...
That's okay, as long as we have focus, then mouse-moves are meaningful...  Content JS could be highlighting buttons or doing all sorts of cool stuff, as could chrome, so interrupting for mouse-moves is good, right?
True, but do we always want to interrupt reflow as soon as the user moves the mouse for any reason over our app?  That has potential for a lot of weird behavior, I think...

In any case, we can tweak the event detection as we go if the layout part works right.
Try using GetQueueStatus instead of GetInputState -- QS_INPUT includes QS_MOUSE which explicitly includes MOUSEMOVE.  GetInputState seems to just track mouse buttons.

However, GetQueueStatus may have effects on the rest of the event system; I believe that it marks any events it gets as "not new", and I'm not sure if we depend on that in the win32 appshell event code.
I did just notice the various bugs Jesse has been filing; I wasn't cced on them, sadly.  I'll take a look at them tomorrow...
Attachment #369548 - Attachment is obsolete: true
Attachment #369908 - Flags: review?(roc)
Attachment #369908 - Flags: review?(dbaron)
Attachment #369548 - Flags: review?(roc)
Attachment #369548 - Flags: review?(dbaron)
+  void SetInterruptState(PRBool aInterruptible);

Should probably comment here that this is only for nsPresContext to use. Any reflow code wanting to prevent interrupts should use InterruptPreventer.

+    PRBool mInterruptsEnabled;
+    PRBool mHasPendingInterrupt;

PRPackedBool...

+  nsAutoTArray<nsIFrame*, 16> mFramesToDirty;

Why is it worth making this an nsAutoTArray?

+  for (unsigned int i = 0; i < mFramesToDirty.Length(); ++i) {
+    if (aFrame == mFramesToDirty[i]) {
+      mFramesToDirty.RemoveElementAt(i);
+      --i;
+    }
+  }

Isn't it pretty easy for this to be bad perf-wise? Or is it near-impossible for mFramesToDirty to be more than a few frames? Wouldn't it be simple enough for mFramesToDirty to be a hashtable?

       // XXXwaterson for interruptible reflow, examine the tree and
       // re-enqueue any unflowed reflow targets.
-
-      mIsReflowing = PR_FALSE;

I think we can remove that waterson comment now...

+  // XXXbz do we need to invalidate text runs here?
+  aLine->MarkDirty();

I don't think we need to invalidate textruns here.

+  sEventCount =
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventLeftMouseDown) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventLeftMouseUp) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventRightMouseDown) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventRightMouseUp) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventLeftMouseDragged) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventRightMouseDragged) +
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventKeyDown),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventKeyUp),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventScrollWheel),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventTabletPointer),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventOtherMouseDown),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventOtherMouseUp),
+    CGEventSourceCounterForEventType(kCGEventSourceStateCombinedSessionState,
+                                     kCGEventOtherMouseDragged);

Probably worth just writing an array of the event types and looping over it.

The Mac GetHasPendingInputEvent will return false after returning true once, if no more events have happened. I think that reset is only supposed to happen after resetPendingEventState.

+  /**
+   * Reset the pending input event state.  Input events that happened before
+   * now won't get counted as pending for purposes of hasPendingInputEvent.
+   */
+  void resetPendingEventState();

We should also document that on some platforms, this does nothing and resetting happens automatically when input events are processed.

Maybe we should just take out ResetPendingEventState?
> Why is it worth making this an nsAutoTArray?

Because there usually won't be too many of them; is the question why not to just make this an nsTArray and pay the allocation cost every time we interrupt?

> Isn't it pretty easy for this to be bad perf-wise? Or is it near-impossible
> for mFramesToDirty to be more than a few frames? Wouldn't it be simple enough
> for mFramesToDirty to be a hashtable?

In a typical HTML document, mFramesToDirty is bounded by the depth of the block frame tree, more or less: the innermost block we interrupted on plus all its ancestors will be in the list.

In XUL, conceivably all blocks will end up in the array because xul layout doesn't bail out of reflowing later boxes on detecting an interrupt.

I could switch this to a hashtable just so we don't have this question, sure.

> The Mac GetHasPendingInputEvent will return false after returning true once

Er, yes.  Good catch.  Will fix.

> We should also document that on some platforms, this does nothing and
> resetting happens automatically when input events are processed.

Will do.

> Maybe we should just take out ResetPendingEventState?

It's really needed on mac.  If I take it out, then either I need to call GetHasPendingInputEvent where I currently reset and rely on the fact that on Mac it resets something, or every single reflow on mac will interrupt if the user did something between the start of the previous reflow and the start of this one...  Maybe we could nix this if we reset the event counter after every time we process events in mac appshell or something?
(In reply to comment #78)
> > Why is it worth making this an nsAutoTArray?
> 
> Because there usually won't be too many of them; is the question why not to
> just make this an nsTArray and pay the allocation cost every time we
> interrupt?

Yes, that's the real question sorry :-)

> I could switch this to a hashtable just so we don't have this question, sure.

Let's do that thanks.

> > Maybe we should just take out ResetPendingEventState?
> 
> It's really needed on mac.  If I take it out, then either I need to call
> GetHasPendingInputEvent where I currently reset and rely on the fact that on
> Mac it resets something, or every single reflow on mac will interrupt if the
> user did something between the start of the previous reflow and the start of
> this one...  Maybe we could nix this if we reset the event counter after every
> time we process events in mac appshell or something?

I think that would be more consistent with what the other platforms are doing, so I'd prefer it.
> +    PRBool mInterruptsEnabled;
> +    PRBool mHasPendingInterrupt;
> PRPackedBool...

Why bother, for a rarely-used stack-only class?  I can do this if you think it's worth it.

Switched to a hashtable for mFramesToDirty (would it kill nsTHashSet to have an enumeration and clearing api?).  Fixed the comments.

Switched Mac to reset the counter when it processes native events and there aren't any left.  Added mousemove events to the list on Mac, since we detect them on Windows and Linux. Seems to work ok-ish so far; need to test in an opt build too.
(In reply to comment #80)
> > +    PRBool mInterruptsEnabled;
> > +    PRBool mHasPendingInterrupt;
> > PRPackedBool...
> 
> Why bother, for a rarely-used stack-only class?  I can do this if you think
> it's worth it.

I just do it automatically so people don't get confused and start doing it where it matters. But it doesn't matter here so don't bother.

> Switched Mac to reset the counter when it processes native events and there
> aren't any left.  Added mousemove events to the list on Mac, since we detect
> them on Windows and Linux. Seems to work ok-ish so far; need to test in an opt
> build too.

I would have thought that interrupting on mouse moves would be really bad, but maybe it isn't if periods of mouse activity are frequently punctuated by idle periods when we can complete the reflows.
Yeah, I'm not convinced of the mousemove thing either.  We can take it out easily on Mac if it becomes an issue; and I think we might be able to manage on Windows.  Not sure.  I really have no idea about Linux, which still has the "time out on Tp" thing going on....
Still need to fix Jesse's bugs.
Attachment #369908 - Attachment is obsolete: true
Attachment #370555 - Flags: review?(roc)
Attachment #370555 - Flags: review?(dbaron)
Attachment #369908 - Flags: review?(roc)
Attachment #369908 - Flags: review?(dbaron)
OK, some bad news here.  Talos on Linux is just timing out, partway through cycle 3 of 10.  I did some testing on try server, and it looks like gdk_event_pending is returning true on try server.  A lot.  Somewhere between 3 and 50 times per page, depending on the page.  We have a total of 3500 or so reflow interrupts before the test harness kills us.

I'm guessing that gdk_event_pending sees pending paint events or something silly like that.  Sadly, I see no better API available on Linux...
gdk_event_pending is a pretty big sledgehammer, and it's going to return everything from expose events to window manager pings to mouse moves.  There are ways to filter to see if certain events are pending in the queue, but it means reaching down to the low-level X calls to do so.

It's also possible that the thread event queue code that I wrote 1000 years ago is still coming through the X event queue, which might make things exciting.  But I don't know if that code is still in use or not.
Would simply increasing sInterruptChecksToSkip help on Linux?
It'd help, but not really resolve the underlying issue...  But yes, that number s certainly something we still want to tune.
Pushed-to-try a debug addition to hopefully let us see what events tend to be at the front of the queue when we interrupt.  News in a few hours.
So bug 481566 added an nsIWidget API that more or less does what we want here (and is implemented in almost the same way on Windows, except it also checks for the window being dragged around).  It'd probably make sense to use that directly, right?
That is, undo the appshell api change here and move the code this patch adds to appshell on Linux and Mac into the relevant nsIWidget impls... Then have the prescontext check its widget.  Or we could nix the widget API and switch the sink to appshell, if we think that's better in our "widgets going away" plan.
There will still be a top-level nsIWidget. Personally I think having a per-window interrupted flag is good...
crowder had the idea of checking gdk_event_peek() after gdk_events_pending() returns true to see whether we "really" have an event pending.  The good news is that returning true for pending events only when gdk_events_pending() and gdk_event_peek() returns true makes tinderbox green.  The bad news is that this condition just seems to never be true, at least in my Linux VM.  Whenever gdk_events_pending() returns true in the widget code, gdk_event_peek() hands back null.  So we just never interrupt .  :(
Attached patch With maybe better Linux code (obsolete) — Splinter Review
This moves the check for pending events to nsIWidget, and uses a Linux implementation stuart suggested.

Another option, if we can add a custom XEvent to "the end" of the queue, would be to do that (so the queue contains an event we know about), then call XPeekIfEvent with a predicate that matches the events we care about plus our custom event.  Then remove our custom event from the queue...  Not sure whether there are decent X APIs for all that.
Attachment #370555 - Attachment is obsolete: true
Attachment #371601 - Flags: review?(roc)
Attachment #371601 - Flags: review?(dbaron)
Attachment #370555 - Flags: review?(dbaron)
Oh, that last patch interrupts as needed on Linux and passes Talos on the try server (doesn't even regress it time-wise).
That patch will listen for a lot more than just input events.  The ExposureMask and VisibilityChangeMask will also respond for expose (draw) events and changes in mapping.  You might want that, actually, but it's just FYI.
Er..  I didn't mean to have ExposureMask in there; I've removed it locally.  For the rest, I'd love to look for a smaller set of events, but the problem with this approach is that it can move an event in the set of events I'm looking for to in front of events I'm not looking for.  I'm hoping reordering expose events with others is ok, but VisibilityChange could affect focus, I would think, and reordering focus changes and user input seems bad.
I really wonder now what gdk_events_pending was seeing that this doesn't see, by the way.... ;)
I'd have to look at the actual source of gdk_events_pending but a guess-based response is that it's looking at other I/O sources including random file descriptors - not just input events.

Did you ever see if the thread event queue was still done with X Events or through the GIO loop?
Does gdk_events_pending count the idle-loop events? Do we use those?
Flags: blocking1.4-
I have no idea what the question in comment 98 means, nor what the answer to the question in comment 99 is.
Looking at the docs it looks like gdk_events_pending() only deals with events on the X queue - I suspect that a more generalized mainloop call would pick up anything from the gio queue.
Attached patch merged up to trunk (obsolete) — Splinter Review
Same patch as bz's, just merged up to trunk
Comment on attachment 371601 [details] [diff] [review]
With maybe better Linux code

I don't like the name MarkPathToFrameDirty; it makes the method sound
more general than it is.  Perhaps FrameNeedsToFinishReflow or
FrameNeedstoContinueReflow?

The way sGotInterruptEnv is set far away from where it's used seems odd.
Why not just set it to true at the caller instead of inside
GetInterruptEnv so that all the code using it is in one place?

In nsPresContext::CheckForInterrupt() would you mind deciding whether
your local style is to brace 1-line bodies?  Having four of them in
perfectly alternating styles is somewhat jarring.

Also, maybe add a comment explaining why the mInterruptsEnabled check is
after the mInterruptChecksToSkip check+reset?  (I don't quite
understand.)

I'm also not a big fan of the name SetInterruptState.  Maybe Reset
instead of Set, or maybe something entirely different like
ReflowStarted?

I find nsAbsoluteContainingBlock::MarkFramesDirty confusing since the
boolean parameter does two different things (only one of which is
covered by its name).  I think it would be clearer to have two separate
method names (although maybe one private implementation behind them).

In nsAbsoluteContainingBlock::Reflow, there's no need to check
kidNeedsReflow twice (or factor it into a variable).  However, you can
avoid reindenting by inverting the check and making it a |continue|, and
then making the rest of the function an
if(!aPresContext->HasPendingInterrupt()) { } else { } 
(or maybe the other way around).

Is the code in nsAbsoluteContainingBlock going to mess things up if it
interrupts in the middle of reflowing the absolute kids of a relatively
positioned inline?  (That can happen, since those absolute kids can have
lines inside of them...)

In nsBlockFrame::Reflow, I don't understand why you need to
MarkFramesDirty(PR_TRUE) when !HasPendingInterrupt().  (i.e., you're
changing the existing codepath for the case where
WillReflowAgainForClearance is true and NS_FRAME_IS_DIRTY is set, but
we're not interrupting, and I don't see why you want to change it, which
just seems like an unnecessary slowdown.)

There are three places (nsAbsoluteContainingBlock, nsBlockFrame,
nsColumnSetFrame) where you call MarkPathToFrameDirty if
HasPendingInterrupt is true.  Why do you need to bother?  Wouldn't it be
sufficient to call MarkPathToFrameDirty when you call CheckForInterrupt?
(Maybe CheckForInterrupt should even take the frame as a parameter?)

Maybe MarkAllDescendantLinesDirty should only be moved up to right
before MarkLineDirtyForInterrupt?  Or was there a reason you moved it
earlier other than the order you wrote the code?

What guarantees that a column set reflow gets further than it did the
last time around before interrupting again?

Is there some requirement in the column-set main reflow loop that you
*have* to break out if there's a pending interrupt?  Otherwise do you
risk propagating the effects of an incorrect reflow status to the later
siblings, or something like that?  (This sort of made sense in my head
for a few seconds, but no longer does.)  If so, could you add a comment?


Did the patch that dealt with the nsIFrame::GetUsed* assertions as a
result of this patch already land?  I can't remember what approach we
followed for that...

r=dbaron
Attachment #371601 - Flags: review?(dbaron) → review+
(In reply to comment #103)
> In nsAbsoluteContainingBlock::Reflow, there's no need to check
> kidNeedsReflow twice (or factor it into a variable).  However, you can
> avoid reindenting by inverting the check and making it a |continue|, and
> then making the rest of the function an
> if(!aPresContext->HasPendingInterrupt()) { } else { } 
> (or maybe the other way around).

Sorry, ignore this comment (per discussion on IRC).
> I don't like the name MarkPathToFrameDirty

I've renamed it to FrameNeedsToContinueReflow.

> Why not just set it to true at the caller instead of inside
> GetInterruptEnv so that all the code using it is in one place?

Done.

> In nsPresContext::CheckForInterrupt() would you mind deciding whether
> your local style is to brace 1-line bodies?  Having four of them in
> perfectly alternating styles is somewhat jarring.

I think that's me vs roc styles... I've normalized them all to brace.

> Also, maybe add a comment explaining why the mInterruptsEnabled check is
> after the mInterruptChecksToSkip check+reset?

There was no good reason, and in fact changing the order of those two makes sense.  Did that.
 
> I'm also not a big fan of the name SetInterruptState.  Maybe Reset
> instead of Set, or maybe something entirely different like
> ReflowStarted?

Hmm.  I guess we don't actually need to reset the interrupt state after reflow finishes, because we'll never call Reflow() on a frame without going through PresShell::DoReflow.  OK, changed this to ReflowStarted().

> I find nsAbsoluteContainingBlock::MarkFramesDirty confusing since the
> boolean parameter does two different things (only one of which is
> covered by its name).  I think it would be clearer to have two separate
> method names (although maybe one private implementation behind them).

Done.

> Is the code in nsAbsoluteContainingBlock going to mess things up if it
> interrupts in the middle of reflowing the absolute kids of a relatively
> positioned inline?  

It'll cause that rel pos inline to get flagged with NS_FRAME_HAS_DIRTY_CHILDREN, as well as all its ancestors, and will flag any unreflowed abs pos frames as NS_FRAME_IS_DIRTY if the inline is currently dirty.  That should work fine, I would think...

> I don't understand why you need to MarkFramesDirty(PR_TRUE) when
> !HasPendingInterrupt()

This is for the abs pos kids, right?  The only slowdown is in the case when !HasPendingInterrupt() and the blockframe has NS_FRAME_IS_DIRTY and the slowdown is precisely the time it takes to add NS_FRAME_IS_DIRTY to all the abs pos kids... It didn't seem like a big deal, but in any case it's easy to add a check here.  Did so.

> There are three places (nsAbsoluteContainingBlock, nsBlockFrame,
> nsColumnSetFrame) where you call MarkPathToFrameDirty

That's a holdover from an earlier version of the patch where MarkPathToFrameDirty didn't mark the ancestors dirty, or did it eagerly, I think...  At the time I was still trying to get this playing nice with foreignobject.

> (Maybe CheckForInterrupt should even take the frame as a parameter?)

I like that idea.  Done.

> Maybe MarkAllDescendantLinesDirty should only be moved up to right
> before MarkLineDirtyForInterrupt?

Yeah.  I used to call it directly from ReflowDirtyLines in an earlier incarnation of the patch, but that's certainly not happening now.  Fixed.

> What guarantees that a column set reflow gets further than it did the
> last time around before interrupting again?

Nothing, just like nothing guarantees that reflow in a XUL document with block kids gets further, or that reflow of a table gets further...  If you think I should do so, I can try to come up with a scheme that either eventually does a non-interruptible reflow "eventually" or that backs off on interrupts more and more until we complete.  I'd prefer that to be a followup patch, though, since it'll really only affect the prescontext code that decides whether it's ok to interrupt.

> Is there some requirement in the column-set main reflow loop that you
> *have* to break out if there's a pending interrupt?

No, it's just an optimization and a code clarity thing.  Otherwise we'd do all the setup for the block reflows and then bail out of them all after the first line; if we have a lot of columns and those first lines are inline, that might even take a while.  Or we'd need conditional branches inside the column-set main reflow loop for each frame to mark it dirty instead of reflowing it.  Or something.

I added a comment about this.

> Did the patch that dealt with the nsIFrame::GetUsed* assertions as a
> result of this patch already land?  I can't remember what approach we
> followed for that...

Yes, it's landed.  We disable those asserts while painting, basically.

I checked in a patch with the comments addressed as above, then backed it out due to sporadic Linux Talos orange.  Sporadic in terms of box and which cycle it hits on, not which page it happens on...
I'm not convinced that this really caused problems. All the Talos runs that Joe starred have this message:

> ************************************************************************************************
> *********** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MESSAGE EXPECTED **********
> ************************************************************************************************
> 
> remoteFailed: [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.

> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240362007.1240368337.7268.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240362007.1240372346.2981.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240366397.1240375282.12395.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240365057.1240372871.11342.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240364404.1240368709.7928.gz
> http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1240364404.1240369616.9901.gz

Maybe I'm missing something.
That's how every single talos run ends.  The error messages are above that, near the end of the previous build step, where it says:

FAIL: Busted: tp
FAIL: browser frozen
Oh, I see. I remember seeing different messages at the bottom, but OK. Would be nice if the summary was actually short, or if the failure would be referenced at the top.
Please file a bug every time you notice a summary that isn't actually short, or a failure that isn't referenced at the top.  Those bugs tend to get fixed quickly.
(In reply to comment #110)
> Please file a bug every time you notice a summary that isn't actually short, or
> a failure that isn't referenced at the top.  Those bugs tend to get fixed
> quickly.

In this case, that's bug 489524, which I filed yesterday.
OK, the hang seems to be reliably like so:

NOISE: Cycle 3: loaded http://localhost/page_load_test/pages/www.skype.com/www.skype.com/index.html (next: http://localhost/page_load_test/pages/www.mofile.com/tv.mofile.com/cn/index/main.do.html)
Failed tp: 
		Stopped Tue, 21 Apr 2009 22:32:37
FAIL: Busted: tp
FAIL: browser frozen

(as in, while loading mofile.com).  It's not consistent in terms of which cycle it hits.  It seems to mostly happen on Linux, but there were several failures on Windows too.
I think the next step is to fix the two known issues (in the blocking bugs) and hope that one of them was the problem here.

Note that I just tried a Talos run on try server with this patch and one of those testers went orange too....  So at least I'll have a way to maybe test this.

In that talos run I put in a printf whenever we actually find an event on Linux, and for what it's worth we're interrupting there a good bit, for these event types:

PropertyNotify
MapNotify
DestroyNotify
UnmapNotify
VisibilityNotify
ConfigureNotify
EnterNotify
LeaveNotify
Attachment #371601 - Attachment is obsolete: true
Attachment #372924 - Attachment is obsolete: true
Well, I've managed to catch the hang locally, sort of.  Talos keeps killing off my debugger, so debugging is sort of fun.  But at least I know one thing that's going on: we get a float of height > nscoord_MAX, which causes us to go into an infinite loop in CanPlaceFloat.  Looking into how it gets that height.
OK, the problem is a floating table which has an inner whose mRect.height is > nscoord_MAX.
And the reason for that is a row with a very large height.  And the reason for _that_ is a cell with a GetCellBaseline() of -0x5a5a5a5b.

And that happens because the first block inside the cell's block has an inline line with that value as the ascent (negative and very very big by absolute value).
And that happens because nsLineBox doesn't init mAscent in its constructor, so in this case it's 0xa5a5a5a5.  We end up subtracting it from 0, which gives -0x5a5a5a5b.

0xa5 is the uninitialized memory marker, of course.
Filed bug 491415 on that.
Depends on: 491415
Since this was checked in, I am seeing issues trying to use google reader.  If I click on an article in the list in the main window, it does not "open" until I navigate away from the google reader page.
Bill, do you mind filing a bug on that, blocking this one?  Include the following information:

1)  Your OS (Windows as I recall, but just to make sure).
2)  Precise steps to reproduce (and in particular, which of google reader's many
    different settings for layout and display you're using).

I can't reproduce the issue on Mac, but the behavior is quite different per OS, and the google reader settings almost certainly matter....
Documenting irc discussion on additional tests for this bugfix.
---------------
tchung: bz: any advice on how we can test this monster?
[11:27am] bz: tchung: ummm
[11:27am] bz: tchung: browse?  
[11:27am] bz: tchung: seriously, there's a followup patch with some test infrastructure and tests...
[11:27am] bz: tchung: you could also set the env vars to interrupt every so often instead of on user events, and run automated tests that way
[11:28am] tchung: bz: other than just browse, any content types of regressions that *could* happen?
[11:28am] bz: tchung: but Jesse and I did a fair amount of that
[11:28am] bz: tchung: what needs the most testing is the user event detection
[11:28am] bz: tchung: well.. there is one known crash regression (hard to hit; will work on fixing asap)
[11:28am] bz: tchung: and one known assertion regression
[11:28am] ted: bz: ooh, so we could flip that on for mochitest, say?
[11:28am] tchung: which bugs are those?
[11:28am] bz: ted: which?
[11:29am] ted: the env vars
[11:29am] bz: tchung: they're blocking bug 67752
[11:29am] bz: ted: yes
[11:29am] firebot: bz: Bug https://bugzilla.mozilla.org/show_bug.cgi?id=67752 maj, --, ---, bzbarsky@mit.edu, ASSI, [FIX]need to implement interruptible reflow
[11:29am] bz: ted: though
[11:29am] tchung: bz: okay i'll just comment in the bug to track our discussion here
[11:29am] sayrer: choppy
[11:30am] bz: tchung: one sec
[11:30am] sayrer: can't really hear
[11:30am] bz: tchung: see also bug 481740
[11:30am] firebot: bz: Bug https://bugzilla.mozilla.org/show_bug.cgi?id=481740 enh, --, ---, njain@sta.samsung.com, NEW, test framework for interruptible reflow - trigger a reflow interrupt on textnode containing vertical
Depends on: 491537
(In reply to comment #119)
> Since this was checked in, I am seeing issues trying to use google reader.  If
> I click on an article in the list in the main window, it does not "open" until
> I navigate away from the google reader page.

I filed bug 491537 on this issue.
Pushed http://hg.mozilla.org/mozilla-central/rev/a08d1947ec5a this morning.  Tree's green and all!

Fixed. I'll look into bug 491537 on Friday, unless something magic happens before then (out of town tomorrow and Thursday).
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 491700
Depends on: 491703
Depends on: 491842
Depends on: 491960
Depends on: 492080
Blocks: 462833
No longer depends on: 462833
I suspect this caused Bug 492133
Depends on: 492145
Depends on: 492522
Depends on: 492760
Depends on: 492837
Target Milestone: --- → mozilla1.9.2a1
Depends on: 493025
Depends on: 493863
Depends on: 496500
Depends on: 496742
Depends on: 498040
Depends on: 498593
Depends on: 499307
Depends on: 499138
Depends on: 496788
I am not sure whether bug 401679 is an interruptible reflow problem .
(see comment #1, not just the subject about 100% CPU, the page causes also a   responsiveness problem when loading)

In addition: Can someone explain how is this bug marked fixed, when many of its dependencies are not fixed? 
(I am not saying this wasn't the right thing to do, you know better than me. I am just curious for the explanation )
I doubt bug 401679 is an ireflow issue, since it was filed in 2007 and ireflow landed in 2009.

The unfixed "dependencies" are regressions caused by this patch, discovered after this patch landed.  We don't reopen old bugs when we discover that they caused regressions, because that would be noisy and misleading.
Depends on: 505482
Depends on: 505654
Depends on: 507565
No longer depends on: 507565
Depends on: 507566
Depends on: 508136
Depends on: 512410
Depends on: 516039
Depends on: 517772
Depends on: 519528
No longer depends on: 517772
Depends on: 517772
Depends on: 570090
Depends on: 1278983
Depends on: 1495169
No longer depends on: 1495169
You need to log in before you can comment on or make changes to this bug.