Closed Bug 631035 Opened 9 years ago Closed 9 years ago

Tinderbox log sends memory usage through the roof

Categories

(Core :: Layout: Text and Fonts, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- final+

People

(Reporter: jdm, Assigned: jfkthame)

References

()

Details

(Keywords: regression, Whiteboard: [softblocker])

Attachments

(5 files, 4 obsolete files)

Viewing a specific full tinderbox log sent my memory usage shooting up to >1.5gb.  About:memory showed no obvious culprits on a repeat run which only pushed it to >1gb:

Overview
            Memory mapped:
            1,223,688,192
          	  
            Memory in use:
            1,202,889,248
    
      Other Information
	malloc/allocated 1,202,891,752
        malloc/mapped 1,223,688,192
        malloc/committed 1,223,688,192
        malloc/dirty 249,856

        js/gc-heap 37,748,736
        js/string-data 7,567,536
        js/mjit-code 21,001,594
        storage/sqlite/pagecache 33,448,616
        storage/sqlite/other 1,286,888
        
        images/chrome/used/raw 0
        images/chrome/used/uncompressed 125,644
        images/chrome/unused/raw 0
        images/chrome/unused/uncompressed 0
        images/content/used/raw 694,316
        images/content/used/uncompressed 3,686,000
        images/content/unused/raw 0
        images/content/unused/uncompressed 0

        layout/all 39,504,685
        layout/bidi1, 916
    
        gfx/surface/image 91,816
blocking2.0: --- → ?
There's clearly _something_ that's using a bunch of memory here that's not listed in that output.

The DOM here contains a 30MB textnode.  That's not included in the above breakdown.  It might be 60MB if there's non-ASCII text in it.

I _think_ layout/all is supposed to include the frame tree and style data and such.  That might be plausible; if there's no bidi in that file (which may be an incorrect assumption) there should be at least 20.5MB of text frames on a 32-bit system.  Were the numbers above 32 or 64 bit?

That still leaves us a few hundred MB short, right?
Summary: Tinderbox log send memory usage through the roof → Tinderbox log sends memory usage through the roof
These are 32bit numbers, fwiw.
What's the URL of the tinderbox log?  I tried one and it only went up to 560MB and I had several other tabs open already.
See the URL field.  Also, http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1296696733.1296700907.12147.gz&fulltext=1 .  The best candidates for the hogging logs in my experience is ones containing process crashes for some reason.
On my Mac, I see 956 MB (mapped) to start with, dropping down to 540 MB.

On my Linux64 box, I see 460 MB dropping down to 262 MB.
262mb should be totally believable.

We should get bugs filed to log the amounts of DOM and textrun data into about:memory.
Attached file Massif results
These are the ones worth looking at:

17.15% (361,126,656B)  gfxTextRun::AllocateStorage(void const*&, unsigned int, unsigned int) (mozalloc.h:247)
12.27% (258,379,116B)  nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:88)
11.55% (243,269,632B)  gfxTextRun::CopyGlyphDataFrom(gfxTextRun*, unsigned int, unsigned int, unsigned int, int) (mozalloc.h:241)
08.03% (169,025,536B)  PresShell::AllocateFrame(nsQueryFrame::FrameIID, unsigned long) (nsPresShell.cpp:2098)
06.26% (131,864,700B)  nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:84)
05.48% (115,343,360B)  nsTextFragment::Append(unsigned short const*, unsigned int) (nsMemory.h:71)
03.59% (75,497,472B)  nsHtml5TreeBuilder::accumulateCharacters(unsigned short const*, int, int) (mozalloc.h:241)
03.23% (67,956,228B)  nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:84)

You can look at the attached file for the full results, which includes the call stacks that each of the above functions appeared in.  Looks like all the space is being used by reflow.
gfxTextRun::AllocateStorage is obvious.  

nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity is almost all the text run's BreakAndMeasureText calling GetTabWidths.  

gfxTextRun::CopyGlyphDataFrom is also clear.

PresShell::AllocateFrame is all continuing textframes (that should be the 20+MB; that being 8% matches njn's numbers).

nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity is the same as the previous one, in GetTabWidths (so in total GetTabWidths is 18.5% here).

nsTextFragment::Append is the DOM node storage for that 30 or 60 MB string; I'm a little surprised this is smaller than the text frame amount, actually, unless there's a _bunch_ of bidi stuff involved.

nsHtml5TreeBuilder::accumulateCharacters should be a transient allocation, I wouls hope.

nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity is GetAdjustedSpacingArray, under textrun text measurement.

njn, are those total allocations or what's still live at the end?
Those numbers are from the point of peak memory usage.  You can see in the attached file that Massif took 82 snapshots, only one of which (the peak) has the full details.  I didn't wait around for things to subside, though the graph shows that memory usage had dropped a bit between the page load finishing and me exiting.
Ah, ok.  We cache textruns, but evict on a timer.
bz, sounds like there's nothing surprising you here.  Is that right?
Well, the size of the textrun allocations is a bit surprising to me.  It looks like the textrun stuff taken all together is about 10x bigger than the original text.  On the other hand, if this is an incremental load, we might have been throwing away and rebuilding textruns as we go...

The other things in comment 7 are about as expected, yes.
I hope I'm not misunderstanding, but sitting around waiting for 30 seconds would show me a drop of ~300mb, bringing me down to about 1GB on the larger runs.  I didn't see any other significant memory changes after that until I closed the tab, at which point everything went back to reasonable levels.
Textruns make a copy of the text (1 or 2 bytes per character) and store at least 4 bytes per character of glyph position data. So textrun storage in AllocateStorage being 3x the DOM storage is expected, which is about what we see in comment #7. As Boris said, it's a time-based cache with a 30s timeout. It's very hard to squeeze more out of the glyph position data, that's already packed very carefully. With significant work we could probably eliminate the text copy for preformatted text at least --- doable, but not easy.

I guess because we have really large textruns we are likely encounter some weird characters that trigger a "detailed glyph" record, so in gfxTextRun::CopyGlyphDataFrom we create an mDetailedGlyphs array --- one pointer per character --- so assuming 32bit that's twice the text memory, again looking about right in comment #7. We could use arraylets or something similar here to save almost all the memory without much effort, in the case where you have a few non-ASCII characters in a sea of ASCII (common).

GetTabWidths looks like huge, huge fail on my part. I basically optimized for the case where there are no preformatted tabs and didn't think about what happens once there's even one preformatted tab. Loading one sample tbox log I see exactly one preformatted tab, naturally :-).
1) mTabWidths is an array of gfxFloat, i.e. doubles, when it should be float (or maybe int).
2) We allocate a tabwidths array for each textframe belonging to a textrun with a preformatted tab, whether the text in that frame has a tab or not. (i.e. every text frame in this example) We can easily avoid allocating an array for the textframes (think lines) that have no tab. That would basically make those allocations go away for this workload. Right now it's 8 bytes per char!
Also
3) The tabwidths arrays never ever go away. That probably explains what Josh observes in comment #13. We could fix that but once we've fixed #2 it won't matter for this workload.

I suspect the HTML5 parser made these bugs worse because it puts all the text into a single text node, which makes us get giant textruns.
I'll try to find someone to fix the second and third paragraphs of comment #14.
More precisely, what I'm suggesting for mDetailedGlyphs is making it an nsAutoArrayPtr<nsAutoArrayPtr<DetailedGlyph>>, where each inner array is at most 256 entries long (or some other power of 2).

For GetTabWidths, we should switch it from gfxFloat to float. Then we should treat absence of a TabWidthProperty() as meaning "all zeroes". Then GetTabWidths would only allocate an array and set the property if it finds a tab character in the text (otherwise the array entries must be all zero).
(In reply to comment #17)
> More precisely, what I'm suggesting for mDetailedGlyphs is making it an
> nsAutoArrayPtr<nsAutoArrayPtr<DetailedGlyph>>, where each inner array is at
> most 256 entries long (or some other power of 2).

I'm not sure this is a good approach here. I'll do some exploration to check behavior with various kinds of pages, but I suspect that the common case is to need a very few detailed glyphs at widely-scattered offsets, so the tendency would be to end up allocating quite a lot of those inner arrays, but only using an occasional entry within each.

For the extremely-sparse distribution of detailed glyphs that I think we'll see, it might be a lot more space-efficient (and not overly costly to performance) to use a hashtable keyed on the offset in the textrun. Possibly converting on the fly to an array-based implementation for textruns that turn out to have a high density of detailed glyphs, if such runs actually occur.
The arraylet approach is strictly better than what we have now, but the hashtable approach makes sense.

Of course I was wrong above, each entry actually needs an nsAutoArrayPtr<DetailedGlyph>, since we can have multiple DetailedGlyphs per character.
Example from the tinderbox log in the URL field above: the vast majority of textruns have no detailed-glyphs, of course, but a couple of large ones do. I instrumented the textrun to determine how many characters required a detailed-glyph entry, and how many separate 256-character "blocks" these would involve:

  detailedGlyphs = 647 of 15510058, touched blocks = 551 of 60587
  detailedGlyphs = 1181 of 28824075, touched blocks = 1078 of 112595

With detailed glyphs as sparse as that, I'd favour a hashtable. But it's possible that pages heavy on complex script might show different characteristics.
I also wonder about combining the actual DetailedGlyph records into a single array, and recording indexes into it for each character that needs details. That could significantly reduce the number of separate allocations we end up doing.
Note that CopyGlyphDataFrom doesn't necessarily process glyphs in-order (at least, we don't want to assume that), so we want the ability to do random updates.

I'd probably just go with the hashtable until we find a workload that has dense DetailedGlyphs. Even then it's probably not too bad, and we could tack on a switch to an array representation later quite easily.
Actually, you could put all the DetailedGlyphs into a single append-only array attached to the textrun, create an append-only DetailedGlyphList array where each DetailedGlyphList is a glyph count and and index into the DetailedGlyphs array, and then use the 27 free bits that are free in CompressedGlyph for non-simple glyphs to index into the DetailedGlyphList array. We'd have to cap textruns at 128M characters...
Well, 128M characters with DetailedGlyphs.
Is this bug primarily about the detailed glyphs memory usage? If so, let's move the component (and roc, can you triage the blocker nom?)
blocking2.0: ? → final+
Component: General → Layout: Text
Keywords: regression
QA Contact: general → layout.fonts-and-text
Actually a 128MB limit on characters with DetailedGlyphs is no problem, each DetailedGlyph struct is 16 bytes so on 32bit machines we'd OOM at 2GB of memory anyway.
I considered using some of the CompressedGlyph bits like that, but I'm a bit reluctant to impose a limit (other than the already-existing limit on textRun length due to the use of 32-bit indexes). Even though we'd currently hit OOM on 32-bit systems, what about 64-bit? Unless we rework text layout in such a way that we never create "huge" textruns - which might be a good thing, but also sounds more invasive and riskier - I'd prefer not to go that way.

I tried the hashtable approach, and that seems to work fine, but on further reflection I think we can do somewhat better. The key to note is that we don't typically need random access to the glyph data; we normally generate it sequentially, and also access it sequentially, usually from the beginning of the run.

Because of this, I propose that we store all the DetailedGlyph data in a single array, and use an auxiliary _sorted_ array of <source offset, detailed-glyph index> pairs to locate the data for each character position that has details. We keep this array sorted (which is cheap, as we normally generate the entries in order) so that we can binary-search it for the entry we need; but in addition we keep a "cursor" pointing to the last-accessed element, so that we can usually find the next entry we need immediately without requiring a binary search at all.

In testing with the tinderbox log file here, this reduces peak memory usage during layout to about 360MB here (from 620MB originally), falling to 220MB (from 255MB) after waiting for cached runs to expire. (These are approximate figures; I also notice that they are _highly_ dependent on the browser window size, so don't expect to reproduce them exactly, but they give an idea of the improvement.)
(In reply to comment #27)
> 
> In testing with the tinderbox log file here, this reduces peak memory usage
> during layout to about 360MB here (from 620MB originally), falling to 220MB
> (from 255MB) after waiting for cached runs to expire.

Nice!  Are these figures from about:memory?
Those were actually from watching the process in Activity Monitor.

Unfortunately, a tryserver run last night showed several worrying unit-test crashes and a large Tp4 regression on MacOSX64 (Opt) _only_. I'm a bit puzzled by this, as I've been building/running on OSX64 locally without problems, and no other tryserver platform showed the same issues. Anyhow, I'm cancelling the review request for now, pending further investigation.....
Comment on attachment 510053 [details] [diff] [review]
patch, optimize storage of DetailedGlyph records in gfxTextRun

Cancelling r?, pending investigation of failures on tryserver.
Attachment #510053 - Flags: review?(roc)
Comment on attachment 510053 [details] [diff] [review]
patch, optimize storage of DetailedGlyph records in gfxTextRun

Looks fine on tryserver now; I think the first push must've hit a build/machine issue of some kind, as repeated pushes with the same patch are all-clear.

I'll run some local pageload tests as well, to check it doesn't regress performance, as tryserver talos results are pretty noisy.
Attachment #510053 - Flags: review?(roc)
Basically looks good.

Why are you adding mTotalAdvance? It's not used.

+    DetailedGlyphStore *mDetailedGlyphs;

nsAutoPtr?

+        struct compareToOffset {
+        struct compareRecordOffsets {

Uppercase C

+            if (mOffsetToIndex.Length() == 0 ||
+                aOffset > mOffsetToIndex[mLastUsed].mOffset)
+            {
+                if (!mOffsetToIndex.AppendElement(DGRec(aOffset, detailIndex)))
+                {
+                    return nsnull;
+                }

{ on the same line as the 'if'

I think the code in Get() would be a little simpler if you just updated mLastUsed to point to the right record, and then return details + mOffsetToIndex[mLastUsed].mIndex.
(In reply to comment #33)
> Why are you adding mTotalAdvance? It's not used.

Oops.... right. That was residue from an experiment that turned out not to be worth pursuing.

> I think the code in Get() would be a little simpler if you just updated
> mLastUsed to point to the right record, and then return details +
> mOffsetToIndex[mLastUsed].mIndex.

OK, I've restructured it in that way, avoiding the multiple early-returns and just doing a series of if-then-else statements to check the likely shortcircuit options before finally doing binary search.

Updated patch to follow after I've done some more performance tests. Also, this didn't apply cleanly on trunk as I had some other stuff in my queue, so I'll fix that.
This addresses the DetailedGlyph storage, and gives a big win on the tinderbox log examples. (I'm working on a separate patch addressing TabWidths, but that still has a reftest failure I need to figure out.)
Attachment #510053 - Attachment is obsolete: true
Attachment #510752 - Flags: review?(roc)
Attachment #510053 - Flags: review?(roc)
Comment on attachment 510752 [details] [diff] [review]
patch, v2 - optimize storage of DetailedGlyph records

+            return details + mOffsetToIndex[mLastUsed].mIndex;;

Lose extra ;
Attachment #510752 - Flags: review?(roc) → review+
This dramatically reduces the memory impact of storing tab widths, by (a) only allocating any tab-width storage at all if a tab character is actually present in the frame's text, and (b) storing a list of (char offset, tab width) records for the tab characters rather than a width for every character in the line.

The first optimization gives a big win for tinderbox logs, where very few of the frames actually need any tab widths at all; the second one is a significant win in cases where many lines include tabs, but the overall proportion of tabs to non-tab characters is still low-to-moderate (think of tabbed source code, for example).
Attachment #511058 - Flags: review?(roc)
+  PRUint32 mWidth;

Should be 'float'.

TabWidthRec can just be called TabWidth I think. Need to document exactly what TabWidth::mOffset means. You should mention that mLimit is used so that when we're measuring text during line breaking we don't have to compute all the tab widths for an entire potentially-huge textrun when only some short section is going to fit on the current line. (This matters for white-space:pre-wrap.)

+    if (mTabWidths) {
+      PRUint32 offset = aStart - mStart.GetSkippedOffset();
+      for (PRUint32 i = 0; i < mTabWidths->mWidthRecs.Length(); ++i) {
+        TabWidthRec& tw = mTabWidths->mWidthRecs[i];
+        if (tw.mOffset < offset) {
+          continue;
+        }
+        if (tw.mOffset - offset >= aLength) {
+          break;
+        }
+        aSpacing[tw.mOffset - offset].mAfter += tw.mWidth;
       }

I think this should be a method on TabWidthStore. Possibly we should do a binary search for the starting TabWidth record, but it may not be necessary so don't bother now.

I think you're introducing a major bug in CalcTabWidths. Currently GetTabWidths always recomputes a new tab store every time we enter it with mReflowing true and mTabWidths null. But in CalcTabWidths, you do a Get(TabWidthProperty()) and if it's non-null and tabsEnd < aStart + aLength, you won't recompute any tab positions. That could leave us with incorrect tab positions. We probably need a reftest to catch this (I don't recall us having many/any tests for preformatted tabs).
(In reply to comment #38)
> I think you're introducing a major bug in CalcTabWidths. Currently GetTabWidths
> always recomputes a new tab store every time we enter it with mReflowing true
> and mTabWidths null. But in CalcTabWidths, you do a Get(TabWidthProperty()) and
> if it's non-null and tabsEnd < aStart + aLength, you won't recompute any tab
> positions. That could leave us with incorrect tab positions. We probably need a
> reftest to catch this (I don't recall us having many/any tests for preformatted
> tabs).

Hmmm.... yep, you're right, looks like I can reproduce a bug on that with a fairly simple testcase. I'll fix the patch and add a reftest tomorrow.
OS: Linux → Windows CE
Pushed part 1 (DetailedGlyph storage):
http://hg.mozilla.org/mozilla-central/rev/199cb6282554

Keeping the bug open for an update of the TabWidths part.
This now avoids reusing a potentially stale TabWidths property, I believe.
Attachment #511058 - Attachment is obsolete: true
Attachment #511321 - Flags: review?(roc)
Attachment #511058 - Flags: review?(roc)
Simple reftest that changes -moz-tab-size in an onload() script. This failed under the previous version of part 2, but now works correctly.
Attachment #511323 - Flags: review?(roc)
OK, the only problem with the patch now is that every time we call CalcTabWidths in the same reflow, if there are no tabs we'll reanalyze the text from the beginning looking for a tab. When we are doing MeasureText on a line that ends up long, I think we may call CalcTabWidths many times and hit bad behavior here.

I think we need to add an mTabWidthsAnalyzedOffset or something like that to PropertyProvider to avoid that.
Comment on attachment 511323 [details] [diff] [review]
reftest for changing the tab width

I'm not sure why this test would fail as written. Normally we wouldn't take a snapshot until after the load event has finished processing.

To make this test work, you should make this a reftest-wait test and listen for MozReftestInvalidate to trigger do_test. See any of the other tests that use MozReftestInvalidate...
Depends on: 633322
Depends on: 633453
OS: Windows CE → All
Hardware: x86 → All
This adds the mTabWidthsAnalyzedLimit field to PropertyProvider as suggested in comment #43, to avoid re-scanning tabless initial text in a long run. It's important to note (as commented in the code) that this is ONLY valid when there's no TabWidthStore attached to the provider; when a store is present, the mLimit field there MUST be used instead.
Attachment #511321 - Attachment is obsolete: true
Attachment #512752 - Flags: review?(roc)
Attachment #511321 - Flags: review?(roc)
The initial test did in fact fail with an earlier version of the patch - apparently we must have triggered tab-width calculation before the class was removed. But I've added a MozReftestInvalidate version for good measure, and also a test that involves changing the text rather than the tab-size, to check the widths get recalculated properly then, too.
Attachment #511323 - Attachment is obsolete: true
Attachment #512933 - Flags: review?(roc)
Attachment #511323 - Flags: review?(roc)
Comment on attachment 512933 [details] [diff] [review]
reftests for changing tab widths

I think we really only need the MozReftestInvalidate versions here.
Attachment #512933 - Flags: review?(roc) → review+
I mean, I think we should probably remove the others.
Comment on attachment 512752 [details] [diff] [review]
part 2, v3 - optimize storage of tab widths, updated as per comment 43

Roc, do you want us to take this for 2.0, or hold it until later? It gives a significant reduction (though it's not as important as the DetailedGlyphs part that already landed) in memory footprint for certain cases such as the tinderbox log example where there are tabs in large chunks of preformatted text, but OTOH that's a fairly rare use-case, I expect. There's no detectable effect on Tp4 memory figures, for example.

The risk here seems pretty low, given that we have some specific test coverage, but given that the benefit only shows up in relatively unusual cases, I don't have a strong opinion on whether we should take it now or later.
Attachment #512752 - Flags: approval2.0?
Whiteboard: [softblocker] → [softblocker][roc to decide a?]
Comment on attachment 512752 [details] [diff] [review]
part 2, v3 - optimize storage of tab widths, updated as per comment 43

The tabs problem happens less often than the DetailedGlyphs problem, but it uses twice as much memory when it occurs *and* the extra memory lives as long as the page, as opposed to the DetailedGlyphs memory which is flushed out when we flush textruns. And the probability of a stray tab rises with the length of the plain-text file we've loaded. Also, I feel this code is simple and very well-understood.
Attachment #512752 - Flags: approval2.0? → approval2.0+
http://hg.mozilla.org/mozilla-central/rev/df69f380b727 (tab-width patch)
http://hg.mozilla.org/mozilla-central/rev/011e28125062 (tests)
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Whiteboard: [softblocker][roc to decide a?] → [softblocker]
Nick, it would be nice if you could verify this fix by rerunning the analysis of comment #7.
I tried (a) the revision before jkew landed the patch, (b) the one after, and (c) Tracemonkey tip.

The peak usages in (a) and (b) were almost the same:  3GB.  The individual entries jumped around a bit.

But (c)'s peak looks *worse*:  3.5GB.  gfxTextRun::AllocateStorage looks bad, see below, though I'm not sure I trust these results.

jkew, did you do any kind of ad-hoc profiling (eg. counting object allocations or something like that) to confirm that your change reduced memory usage as you expected?


o1o> 29.68% (1,037,051,876B) 0x634CD87: gfxTextRun::AllocateStorage(void const*&, unsigned int, unsigned int) (mozalloc.h:247)
o1o> 19.64% (686,346,396B) 0x54E8E38: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:88)
o1o> 10.06% (351,342,592B) 0x400681B: _dl_map_object_from_fd (dl-load.c:1199)
o1o> 07.27% (254,107,604B) 0x54E8E9A: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:84)
o1o> 06.53% (228,241,246B) in 3537 places, all below massif's threshold (01.00%)
o1o> 06.53% (228,182,748B) 0x54E8ECE: nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(unsigned int, unsigned int) (nsTArray.h:84)
o1o> 06.47% (226,099,200B) 0x56005FB: PresShell::AllocateFrame(nsQueryFrame::FrameIID, unsigned long) (nsPresShell.cpp:2097)
o1o> 04.80% (167,854,080B) 0x4C2B695: pthread_create@@GLIBC_2.2.5 (allocatestack.c:483)
o1o> 04.56% (159,383,552B) 0x581AE28: nsTextFragment::Append(unsigned short const*, unsigned int) (nsMemory.h:71)
o1o> 03.24% (113,246,208B) 0x5ADC4C5: nsHtml5TreeBuilder::accumulateCharacters(unsigned short const*, int, int) (mozalloc.h:241)
o1o> 01.21% (42,203,000B) 0x6269FB2: ChangeTable (pldhash.c:563)
He didn't change anything related to AllocateStorage. Are you sure you got the same log file?

I notice that significantly more storage was allocated through nsTextFragment::Append, which definitely shouldn't be affected!
One possibility is that the file loaded faster so we peaked with higher usage because we crammed more textruns into memory at the same time. Or something like that. How long did the entire load take?
The runs took over 5 minutes, Massif is slow and I was doing other stuff in parallel on the machine.  I'll try again to see how non-deterministic this is.  As I said, I don't entirely trust the Massif results, which is why I asked jkew if he did ad hoc profiling.
OK, if it takes over 5 minutes then the gfxTextRun time-based caching is definitely interfering with your results. We probably need a "valgrind scale factor" for all our time-based caches :-).

I guess what we really need to know here are the stacks for the EnsureCapacity calls that are still in your profile.
Interpret these how you will.

If you're looking at a single allocation point that's changed, ad hoc profiling via printfs is a better idea.
I didn't explicitly profile the builds, but I tested by repeatedly loading the same log file (locally, to reduce variations due to network timings, etc) and watching the process in Activity Monitor to see the peak value for "real mem" during loading/reflow, and the final value after allowing time for caches to expire. Not a precise, scientific method, but it seemed consistent enough across multiple runs to give me confidence that the patches were working as expected.

The Massif results in comment #58 don't appear to include a build with the TabWidths patch; I guess it hasn't landed in tracemonkey, as that profile reflects pre-patch code.
(In reply to comment #59)
> I didn't explicitly profile the builds, but I tested by repeatedly loading the
> same log file (locally, to reduce variations due to network timings, etc) and
> watching the process in Activity Monitor to see the peak value for "real mem"
> during loading/reflow, and the final value after allowing time for caches to
> expire. Not a precise, scientific method, but it seemed consistent enough
> across multiple runs to give me confidence that the patches were working as
> expected.

Yeah, good enough for me.

> The Massif results in comment #58 don't appear to include a build with the
> TabWidths patch; I guess it hasn't landed in tracemonkey, as that profile
> reflects pre-patch code.

Is this the right changeset?

 changeset:   62083:199cb6282554
 user:        Jonathan Kew <jfkthame@gmail.com>
 date:        Thu Feb 10 06:50:47 2011 +0000
 summary:     bug 631035 part 1 - optimize storage of DetailedGlyph records. r=roc a=blocking2.0

The date looks wrong -- comment 51 says you landed this on Feb 18... isn't there some problem with hg where it can get dates wrong?

Looking at m-c, I see you also landed this:

 changeset:   62821:df69f380b727
 user:        Jonathan Kew <jfkthame@gmail.com>
 date:        Fri Feb 18 09:07:12 2011 +0000
 summary:     bug 631035 part 2 - optimize storage of tab widths. r+a=roc

which wasn't on the TM build I tried.  I'll remeasure.
The date of an hg changeset is the date it was created; it's unaffected by when it was pushed (because it can get pushed on different dates to different places, etc, etc).  So if I check in something locally today and push it to m-c tomorrow it'll always show up with today's date.
Ok, measuring m-c tip the results are *much* better.  1.8GB peak vs. 3GB I got yesterday.

High level Massif results:

30.77% (554,118,808B): gfxTextRun::AllocateStorage(...) (mozalloc.h:247)
09.55% (171,966,464B): nsTextFragment::Append(...) (nsMemory.h:71)
07.05% (126,943,232B): PresShell::AllocateFrame(...) (nsPresShell.cpp:2110)
04.60% (82,852,816B): nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(...) (nsTArray.h:88)
04.37% (78,687,908B): nsTArray_base<nsTArrayDefaultAllocator>::EnsureCapacity(...) (nsTArray.h:84)
04.25% (76,550,144B): nsHtml5TreeBuilder::accumulateCharacters(...) (mozalloc.h:241)
01.81% (32,579,584B): NS_NewLineBox(...) (nsLineBox.cpp:94)

My "mapped" figures in about:memory also went from ~1.5GB to ~550MB.

Nice work, guys!
Great - that looks much more reasonable. gfxTextRun::AllocateStorage is expected to dominate, as we're creating textruns for an awful lot of text, and they have to store the text itself (usually) and the associated glyphs and widths; but we no longer see the huge extra contributions from CopyGlyphDataFrom and GetTabWidths, allocating extra per-character storage that remained largely empty.
You need to log in before you can comment on or make changes to this bug.