Open Bug 363698 Opened 16 years ago Updated 2 years ago

firefox chews CPU on large table

Categories

(Core :: Layout: Tables, defect, P3)

defect

Tracking

()

People

(Reporter: eagle.lu, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(1 file)

Go to

http://cvs.gnome.org/viewcvs/gtk%2B/ChangeLog.pre-2-8?rev=1.7010&view=log

firefox comsumes over 90% of CPU for over 6 minutes.

This bug occurs both on Linux and OpenSolaris
Product: Firefox → Core
QA Contact: general → general
It's taking forever for me on Windows.  

But how is this different than any of these bugs: bug 54542, bug 318474, bug 352367.

Tables that large are just not wise.  
OS: Linux → All
Hardware: Sun → All
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → DUPLICATE
Summary: firefox chews CPU → firefox chews CPU on large table
Duplicate of bug: 318474
Please don't ever mark performance bugs as duplicates unless:

1)  They have an identical small reduced testcase.
  or
2)  There is profile data indicating that they are in fact duplicates.

If you think they might be duplicates, mark a dependency.

In particular, bug 318474 is more or less useless, so marking something duplicate of it is a good way to make sure that nothing happens with it.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
So a profile shows an obvious difference from the testcase in bug 318474: there are hotspots in GetRowCount and GetEffectiveRowSpan:

Total hit count: 1522412
Count %Total  Function Name
490927   32.2     nsCellMap::GetRowCount(int) const
282600   18.6     nsTableCellMap::GetEffectiveRowSpan(int, int) const
94102   6.2     nsTableCellMap::GetEffectiveColSpan(int, int) const

Looking at the testcase, it looks like in addition to there being lots of rows, there are actually 17950 tbodies, which I suspect is what's triggering the above hotspots.
I wonder whether we could store the nsCellMaps in an array instead of a linked list, store an up-to-date start index for each one, and do binary search...  Right now, about 30% of the time on this testcase is spent under nsTableFrame::GetEffectiveRowSpan.  Each call is O(N) in number of rowgroups, and we have more or less one rowgroup per 2-3 rows, so overall table layout is O(N^2) in number of rows...
Status: REOPENED → NEW
Component: General → Layout: Tables
Keywords: perf
QA Contact: general → layout.tables
Alternately, we'd want to keep more state in the various callers of GetEffectiveRowSpan, so we don't have to keep recomputing the nsCellMap a given cell lives in.  Especially because the answer is the same for all cells in a given row!
So just to keep track of things, the main possible callstacks into nsCellMap::GetRowCount are (with hit counts along the way, and branches that are considered elsewhere marked with a 'R' after the count, for "repeat"):

490927 nsCellMap::GetRowCount
  156251 nsTableCellMap::GetEffectiveRowSpan
    126746 nsTableFrame::GetEffectiveRowSpan(nsTableCellFrame&, nsCellMap*)
       121623 nsTableRowFrame::CalculateCellActualSize
         60921 nsTableRowFrame::ReflowChildren
           252972 nsTableRowFrame::Reflow
         53397 nsTableRowFrame::CalcHeight
           84194 R nsTableRowFrame::ReflowChildren
       55093 nsTableRowFrame::InitHasCellWithStyleHeight
         55093 R nsTableRowFrame::Reflow
       54241 GetHeightOfRowsSpannedBelowFirst
         54241 nsTableRowFrame::DidResize
           54241 nsTableRowGroupFrame::DidResizeRows
             54241 nsTableRowGroupFrame::CalculateRowHeights
               170488 nsTableRowGroupFrame::ReflowChildren
       32813 nsTableRowFrame::UpdateHeight
         30797 R nsTableRowFrame::CalcHeight
         2017 R nsTableRowFrame::ReflowChildren
    29505 nsTableFrame::GetEffectiveRowSpan(int, nsTableCellFrame&)
      65164 R nsTableRowGroupFrame::CalculateRowHeights
  137025 R nsTableFrame::GetEffectiveRowSpan(nsTableCellFrame&, nsCellMap*)
  57082 nsTableFrame::GetEffectiveColSpan(nsTableCellFrame&, nsCellMap*)
    105840 R nsTableRowFrame::ReflowChildren
  48758 nsTableCellMap::GetEffectiveColSpan
    28578 R nsTableFrame::GetEffectiveColSpan(nsTableCellFrame&, nsCellMap*)
  35659 R nsTableFrame::GetEffectiveRowSpan(int, nsTableCellFrame&)
490927 nsCellMap::GetRowCount
  156251 nsTableCellMap::GetEffectiveRowSpan
    126746 nsTableFrame::GetEffectiveRowSpan(nsTableCellFrame&, nsCellMap*)
       121623 nsTableRowFrame::CalculateCellActualSize
         60921 nsTableRowFrame::ReflowChildren
           252972 nsTableRowFrame::Reflow

so this means for given row where we could once compute to which cellmaps it belongs, instead we search for the cellmap over and over again, by going the long way instead of supplying the cellmap at http://lxr.mozilla.org/seamonkey/source/layout/tables/nsTableRowFrame.cpp#631 but supplying null. Silly. More than this the cellmap could be computed at the rowgroup level above and be supplied to all its rows.
 
For my reference when testing whether performance has improved:  I let this page load from local disk for about two hours, and it's still not done.
Bernd, that's exactly what I think we should be doing.  Computing in nsTableFrame or nsTableRowGroupFrame which cellmap is the relevant one and then making use of it throughout.

Not sure what the cleanest way to do this is, yet.  Especially given a table split over multiple pages.
Our in other words

http://lxr.mozilla.org/seamonkey/search?string=GetEffectiveRowSpan

nearly all hits are potential performance issues. Only border collapse uses the fast way :-(
The way border collapse uses is buggy right now.  It gives the wrong answer if there is more than one rowgroup.
It looks like we have an nsTableRowGroupFrame::GetStartRowIndex... Is that guaranteed to return the same thing as adding up the nsCellMap::GetRowCount for the preceding rowgroups's cellmaps?
For that matter, would it make sense to just store the start index on nsCellMap ahd have a pointer from nsTableRowGroupFrame to the corresponding nsCellMap?
The reason I ask the question in comment 14 is that nsTableRowGroupFrame::GetStartRowIndex seems to be a lot more complicated than it really should be, somehow...
>nsTableRowGroupFrame::GetStartRowIndex seems to be a lot more complicated than
>it really should be, somehow...
Which part? The first one which is lame protection against wrong frame construction, where we should rather assert and die  or the second which handles empty rowgroups?


Both.  Why not just store the row index of the first row in the rowgroup directly?
I think thats a wrong design decision from the time (1998) where frames should have been as lean as possible, so the idea was probably to store the index only once in the row and then to look it up. Then lame protection  was necessary at that time 3.55 	buster%netscape.com	1998-09-15 13:36	 	nsTableRowGroupFrame no longer assumes all its children are rows, or that all row children are cells.

Storing means syncing it all the time, also when rows are removed from other rowgroups that are before. This should be done inside  nsTableRowGroupFrame::AdjustRowIndices
Hmm.  Rowgroups can split, right?  But the cellmap lives entirely on the first-in-flow table, correct?

How is this stuff supposed to work, exactly?   It looks to me like adding rows to a not-first-in-flow rowgroup currently just corrupts the cellmap.  We'll try to

  cellMap->InsertRows(aRowGroupFrame, aRowFrames, aRowIndex, aConsiderSpans,
                      damageArea);

Where aRowGroupFrame is the in-flow rowgroup, and then InsertRows() will fail to find the corresponding cellmap and do nothing at all.

Or am I missing something?
The reason this matters is that for the functions being called here we don't need a rowgroup-relative index.  We need an nsCellMap-relative one.

I suppose we could just store the start of each nsCellMap directly in the nsCellMap (which would still be broken with splitting, but whatever) and then make the nsCellMap list an array so we can binary search to find the right cellmap?  That would make this pageload O(N^2 log N) instead of the current O(N^3) in number of rows....  We could make it O(N^2) if we Just Knew in O(1) time which nsCellMap to use, and with what index, of course.

Another option is to not change anything about how nsTableRowGroupFrame works apart from having a pointer from the nsTableRowGroupFrame to the relevant nsCellMap, and store the start row index in the nsCellMap.  That might be the simplest thing to do here...

Instead of changing the mess that is the current rowgroup code (and then dragging along all the information needed 
So the issue is that maintaining the rowgroup-to-cellmap pointer is a little weird when we have continuation rowgroups.  Ideally those would share the same cellmap pointer as the first-in-flow.  But we do in fact give a (semi-bogus?) cellmap to the in-flows of a table, so I'm not quite sure, for example, whether in-flow rowgroups would end up going through nsTableFrame::InsertRowGroups as they get pulled.
I guess I'll assume they don't since that code uses the first-in-flow's cellmap?
Man.  All the InsertRows, etc stuff is just so broken with continuations, since it derives indices based on the frame tree (the nsRowGroupFrame methods) and then uses them to index into the cellmap (which uses first-in-flow-table-relative indexing throughout).

I really hope we don't split tables over columns...
Stopping work on this for now, and praying that someone will tell me that I'm wrong and that this stuff somehow works.... and explaining how.
>I really hope we don't split tables over columns...

We don't do this, it crashes within seconds if Mars ehm Jesse attacks. I wrote a special barrier against this (bug 362275).
see https://bugzilla.mozilla.org/show_bug.cgi?id=362275#c8 for a description of the situation
> it crashes within seconds

Ok.  That's consistent with my reading of this code, too.  OK.

In that case I guess we can just not worry about any of the updating stuff working with in-flows and file followup bugs?
To be precise: the basic assumption of the pagination code inside tables is that it will happen only during printing and NEVER incremental. (See http://lxr.mozilla.org/seamonkey/source/layout/tables/nsTableRowGroupFrame.cpp#1032).  If we decide to change this to be compatible with columns it requires a lot of changes. I think this out for the coming release. Having postponed it, one should seriously think where we are heading I would like to point to roc's comment https://bugzilla.mozilla.org/show_bug.cgi?id=397428#c14

"The continuation model itself is a flawed design."

On the other hand, I have seen the previous big rearchitecture attempt with Chris Karnaze and John Keiser in 2002/2003 and it virtually stopped all the work on this code and got never into trunk.

Flags: blocking1.9?
Flags: blocking1.9? → blocking1.9-
I would like to help with this test case:
http://rty24.webhelye.hu/editReportXml.jsp.html
This does seem to have somewhat regressed since Fx 2 but the testcase above is a really, really complex example with over 7,000 lines of nested tables.
(In reply to comment #30)
> This does seem to have somewhat regressed since Fx 2 but the testcase above is
> a really, really complex example with over 7,000 lines of nested tables.
> 

"the testcase above is a really, really complex example with over 7,000 lines of nested tables" that can be rendered within seconds with Fx2 but completely locks up Fx3 for minutes. That looks to be much more than "somewhat" regressed to me.
If something is a regression from Firefox 2 it is NOT this bug.  This bug is present in Firefox 2.  Please file a new bug with the new testcase and cc me on it?

In general, always do this for performance issues.  "slow" can happen for a large number of reasons, and different reasons need different bugs to address them.  If it's really a duplicate, it'll get marked that way.
I filed bug 438509 on the testcase from comment 29.  As expected, it has nothing to do with this bug, and also as expected it does show a serious performance regression that we need to fix.
Taking, I guess, though I'd love for some who wants to learn layout stuff to pick this up... ;)
Assignee: nobody → bzbarsky
Blocks: 234240
Priority: -- → P1
Blocks: 397829
Mats, any chance you want to take this?
Firefox becomes unusable for over 1 minute when rendering this 9 MB Apache directory listing (essentially a large table):

 http://ftp.mozilla.org/pub/mozilla.org/thunderbird/nightly/latest-comm-aurora-l10n/

I created a new profile and installed the profiler addon mentioned in [1], loaded the page above, and uploaded my profile here:

 http://people.mozilla.com/~bgirard/cleopatra/?report=a730d1b97e935d774ec42130e9f988a326854bec

I'm running the latest Nightly (2012-09-18) on OS X.  If you need me to regenerate a new profiler report with a newer build just let me know.

John

[1] https://developer.mozilla.org/en-US/docs/Performance/Reporting_a_Performance_Problem
John, could you please file a separate bug on that?  Mark it as depending on bug 539356, because a huge part of the profile is invalidation...
(In reply to Boris Zbarsky (:bz) from comment #37)
> John, could you please file a separate bug on that?  Mark it as depending on
> bug 539356, because a huge part of the profile is invalidation...

Ok - I filed bug 792413.
Decreasing the priority as no update for the last 2 years on this bug.
See https://github.com/mozilla/bug-handling/blob/master/policy/triage-bugzilla.md#how-do-you-triage 
about the priority meaning.
Priority: P1 → P5
Priority: P5 → P3
Assignee: bzbarsky → nobody
You need to log in before you can comment on or make changes to this bug.