Last Comment Bug 717175 - Crash @ gfxFont::AgeCacheEntry
: Crash @ gfxFont::AgeCacheEntry
Status: VERIFIED FIXED
[qa!]
: crash, regression, topcrash
Product: Core
Classification: Components
Component: Graphics (show other bugs)
: 12 Branch
: x86 Windows 7
: -- critical (vote)
: ---
Assigned To: Jonathan Kew (:jfkthame)
:
Mentors:
: 728133 (view as bug list)
Depends on: 981080 716345
Blocks: 703100 889365
  Show dependency treegraph
 
Reported: 2012-01-11 00:04 PST by Scoobidiver (away)
Modified: 2014-03-07 12:54 PST (History)
12 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
+
verified
+
verified


Attachments
possible experiment/band-aid (698 bytes, patch)
2012-02-17 03:28 PST, Jonathan Kew (:jfkthame)
jd.bugzilla: review+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
patch, protect the word cache with a mutex (3.79 KB, patch)
2012-03-10 03:27 PST, Jonathan Kew (:jfkthame)
no flags Details | Diff | Splinter Review
patch, remove the word-cache expiration timer (1.08 KB, patch)
2012-03-20 15:09 PDT, Jonathan Kew (:jfkthame)
roc: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
patch, flush shaped-word caches if they get excessively large (992 bytes, patch)
2012-04-03 03:58 PDT, Jonathan Kew (:jfkthame)
roc: review+
akeybl: approval‑mozilla‑aurora+
akeybl: approval‑mozilla‑beta+
Details | Diff | Splinter Review
patch for mozilla-beta (ff12): full backout of 703100 and following bugs that depended on it (466.43 KB, patch)
2012-04-03 09:31 PDT, Jonathan Kew (:jfkthame)
no flags Details | Diff | Splinter Review

Description Scoobidiver (away) 2012-01-11 00:04:12 PST
It's a new crash signature that first appeared in 12.0a1/20120106.
The regression range might be:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=4795500b7c1d&tochange=8ae16e346bd0
It might be caused by bug 711063.

Signature 	gfxFont::AgeCacheEntry(gfxFont::CacheHashEntry*, void*) More Reports Search
UUID	6691beb3-f45e-472e-9201-51c642120109
Date Processed	2012-01-09 09:51:08.704829
Uptime	942
Install Age	20.7 hours since version was first installed.
Install Time	2012-01-08 22:10:20
Product	Firefox
Version	12.0a1
Build ID	20120108031024
Release Channel	nightly
OS	Windows NT
OS Version	6.1.7600
Build Architecture	x86
Build Architecture Info	AuthenticAMD family 16 model 4 stepping 2
Crash Reason	EXCEPTION_ACCESS_VIOLATION_WRITE
Crash Address	0x20004e
App Notes 	
AdapterVendorID: 0x10de, AdapterDeviceID: 0x0dc4, AdapterSubsysID: 00000000, AdapterDriverVersion: 8.17.12.8562
D2D? D2D+
DWrite? DWrite+
D3D10 Layers? D3D10 Layers+
EMCheckCompatibility	True

Frame 	Module 	Signature [Expand] 	Source
0 	xul.dll 	gfxFont::AgeCacheEntry 	gfx/thebes/gfxFont.cpp:1154
1 	xul.dll 	nsTHashtable<nsIdentifierMapEntry>::s_EnumStub 	obj-firefox/dist/include/nsTHashtable.h:472
2 	xul.dll 	PL_DHashTableEnumerate 	obj-firefox/xpcom/build/pldhash.cpp:754
3 	xul.dll 	nsTHashtable<gfxFont::CacheHashEntry>::EnumerateEntries 	obj-firefox/dist/include/nsTHashtable.h:241
4 	xul.dll 	gfxFont::AgeCachedWords 	gfx/thebes/gfxFont.h:1402
5 	xul.dll 	nsTHashtable<nsIdentifierMapEntry>::s_EnumStub 	obj-firefox/dist/include/nsTHashtable.h:472
6 	xul.dll 	PL_DHashTableEnumerate 	obj-firefox/xpcom/build/pldhash.cpp:754
7 	xul.dll 	nsTHashtable<gfxFontCache::HashEntry>::EnumerateEntries 	obj-firefox/dist/include/nsTHashtable.h:241
8 	xul.dll 	gfxFontCache::WordCacheExpirationTimerCallback 	gfx/thebes/gfxFont.cpp:1102
9 	xul.dll 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:524
10 	xul.dll 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:660
11 	nspr4.dll 	PR_Unlock 	nsprpub/pr/src/threads/combined/prulock.c:347
12 	xul.dll 	MessageLoop::RunHandler 	ipc/chromium/src/base/message_loop.cc:201
13 	xul.dll 	_SEH_epilog4 	
14 	xul.dll 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:175
15 	xul.dll 	nsInlineFrame::BuildDisplayList 	layout/generic/nsInlineFrame.cpp:188
16 	xul.dll 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:189
17 	dbghelp.dll 	WTI::pSymWidenTi 	

More reports at:
https://crash-stats.mozilla.com/report/list?signature=gfxFont%3A%3AAgeCacheEntry%28gfxFont%3A%3ACacheHashEntry*%2C%20void*%29
Comment 1 Jonathan Kew (:jfkthame) 2012-01-11 03:07:05 PST
This code was introduced in bug 703100, so that's presumably the culprit.
Comment 2 Jonathan Kew (:jfkthame) 2012-02-17 02:50:42 PST
See also bug 728133, which will probably turn out to be the same underlying problem. Not marking as a dupe just yet, though, until we're more certain of that.
Comment 3 Scoobidiver (away) 2012-02-17 03:17:28 PST
*** Bug 728133 has been marked as a duplicate of this bug. ***
Comment 4 Jonathan Kew (:jfkthame) 2012-02-17 03:28:31 PST
Created attachment 598194 [details] [diff] [review]
possible experiment/band-aid

(from bug 728133)
I don't currently understand what scenario could be leading to this crash; would love some steps to reproduce, but I suspect it's too unpredictable for the.

For want of any better ideas at the moment, we could try this change, which aims to clear entries from the word caches a bit earlier during the expiration of fonts; I'd be curious to see if this would affect crash-stats at all.
Comment 5 John Daggett (:jtd) 2012-02-19 16:44:15 PST
Comment on attachment 598194 [details] [diff] [review]
possible experiment/band-aid

Maybe one way to try to reproduce this is to run tests with the notify timer set to a much smaller value to increase the chance of hitting whatever scenario is causing this.
Comment 6 Jonathan Kew (:jfkthame) 2012-02-20 01:01:49 PST
(In reply to John Daggett (:jtd) from comment #5)
> Maybe one way to try to reproduce this is to run tests with the notify timer
> set to a much smaller value to increase the chance of hitting whatever
> scenario is causing this.

Yes, that sounds worth trying - I'll run some more tests locally and see if I can get it to occur.
Comment 7 Jonathan Kew (:jfkthame) 2012-02-20 01:02:05 PST
Pushed experimental patch to m-i:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3588de29a03f

** Please do NOT close this bug when merging to m-c **
Comment 8 Ed Morley [:emorley] 2012-02-21 08:44:01 PST
(In reply to Jonathan Kew (:jfkthame) from comment #7)
> Pushed experimental patch to m-i:
> https://hg.mozilla.org/integration/mozilla-inbound/rev/3588de29a03f

https://hg.mozilla.org/mozilla-central/rev/3588de29a03f
Comment 9 Jonathan Kew (:jfkthame) 2012-02-24 03:49:23 PST
Crash is still present in the 2012022304 build; not enough data yet to see if there's any change to the frequency, but it's definitely there.
Comment 10 Sheila Mooney 2012-03-01 13:35:19 PST
I had a look at the most recent data. The numbers seems unchanged. It would be nice to fix since it's a regression but it's at #74 on the list for 12 so not super high volume.
Comment 11 Scoobidiver (away) 2012-03-06 02:19:36 PST
It's #19 top browser crashes in 12.0a2.
Comment 12 Jonathan Kew (:jfkthame) 2012-03-06 07:23:12 PST
Just an update: I tried drastically reducing the cache timeouts, as suggested in comment #5, to see if I could reproduce this locally, but so far haven't had any success. :(  John, have you encountered this at all?
Comment 13 Marcia Knous [:marcia - use ni] 2012-03-08 13:55:16 PST
One comment mentions "Constantly crashes when macro express is running- regardless if firefox is involved"

http://www.macroexpress.com/
Comment 14 Marcia Knous [:marcia - use ni] 2012-03-09 15:20:02 PST
I did install Macro Express on a few machines and so far haven't had any luck reproducing the crash.
Comment 15 Jonathan Kew (:jfkthame) 2012-03-10 03:27:21 PST
Created attachment 604624 [details] [diff] [review]
patch, protect the word cache with a mutex

I'm concerned that the basic problem here - which I haven't been able to reproduce locally, even dogfooding builds with much shorter timeouts - may be a threading issue, due to aging/expiring the cached words from the timer thread, which could race with hashtable access and modification from the GetShapedWord on the main thread. Plausible?

To protect against that risk, this patch adds a mutex to control access to the shaped-word hashtable. If this seems reasonable, I'd like to try landing it and see if it stops these crash signatures appearing.
Comment 16 Jonathan Kew (:jfkthame) 2012-03-10 09:32:34 PST
(In reply to Jonathan Kew (:jfkthame) from comment #15)
> Created attachment 604624 [details] [diff] [review]
> patch, protect the word cache with a mutex
> 
> I'm concerned that the basic problem here - which I haven't been able to
> reproduce locally, even dogfooding builds with much shorter timeouts - may
> be a threading issue, due to aging/expiring the cached words from the timer
> thread, which could race with hashtable access and modification from the
> GetShapedWord on the main thread. Plausible?

OK, I've realised this is not plausible, because timer callbacks aren't actually executed on the timer thread; the timer posts events that result in the callback itself actually being run on the main thread.
Comment 17 JP Rosevear [:jpr] 2012-03-13 14:09:51 PDT
Adding Bas because of Azure.
Comment 18 Alex Keybl [:akeybl] 2012-03-19 15:46:46 PDT
gfxFont::AgeCacheEntry
Top crash 19 on 12, 70 on 14

PL_DHashTableEnumerate
13 on 12, 35 on 14

So cumulatively, this is much worse on FF12 than it appears.

Jonathan/Bas - what are the next steps here? Can we perform code inspection or look for changes in related code in FF12?
Comment 19 Scoobidiver (away) 2012-03-20 03:11:31 PDT
(In reply to Alex Keybl [:akeybl] from comment #18)
> So cumulatively, this is much worse on FF12 than it appears.
The band aid landed in Fx 13, so it should be transplanted to Beta 12.
Comment 20 Jonathan Kew (:jfkthame) 2012-03-20 03:26:29 PDT
(In reply to Scoobidiver from comment #19)
> (In reply to Alex Keybl [:akeybl] from comment #18)
> > So cumulatively, this is much worse on FF12 than it appears.
> The band aid landed in Fx 13, so it should be transplanted to Beta 12.

Did it actually make any difference? It wasn't clear to me that it helped at all - although transplanting it would be harmless.
Comment 21 Jonathan Kew (:jfkthame) 2012-03-20 03:52:13 PDT
(In reply to Alex Keybl [:akeybl] from comment #18)
> Jonathan/Bas - what are the next steps here? Can we perform code inspection
> or look for changes in related code in FF12?

We know that this signature must originate with bug 703100, as that's where gfxFont::AgeCacheEntry was introduced (in patch 4). However, I've read and re-read that code, and I can't see anything dubious about it that could explain these crashes (Bas, another pair of eyes wouldn't be a bad thing - I've looked at it so many times I might not be seeing things clearly any more!).

What concerns me is that the real problem may lie elsewhere, not directly in the code from 703100 that's showing up in the stack here. The hashtable management there looks pretty straightforward; but if a bug elsewhere is damaging one of the hashtables, that could easily lead to a later crash when we try to enumerate it.

I note that this signature (and the related gfxFont::AgeCachedWords()) only seems to appear on Windows, whereas the font & word cache code in bug 703100 was platform-independent; if there were an error in the hashtable management there, I'd expect to see crashes on all platforms. Which makes me suspect there could be a bug (in the platform-specific font code, maybe) that's causing the damage, and even if we were to back out 703100, thereby eliminating *this* crash signature, all we'd really achieve would be to move the underlying problem and it'd show up with some other stack.

If we could reproduce the crash (even sporadically) and look at it in a record-and-replay debugger I expect it'd be pretty simple to figure out the origin of the hashtable brokenness. But as yet, I haven't been able to reproduce it, even when aggressively shortening the timers to make the cache-aging code run much more frequently.

A possible mitigation might be to back out just part 4 of bug 703100 (the timer-based expiration of shaped words from the per-font caches). That could lead to slightly higher memory use, as we'd keep things hanging around until the font itself gets expired from the font cache; but at that point, we'd discard the word cache en masse rather than by enumerating and aging the individual entries - which might be safer, if there's some risk that the cache has been damaged so that it cannot be safely enumerated. In practice, I don't think the memory impact would be severe - while bug 703100 was in progress, I ran builds that didn't yet include the word expiration, and (because we generally expire font instances before all that long) it wasn't a problem.

Such a backout would certainly eliminate this crash signature (as it would no longer exist in the tree). If, as I currently suspect, the real cause is a bug elsewhere, it might only move the crashiness to a different place; but it might also avoid some of the crashes altogether by safely throwing away the damaged records. Maybe we could try it on trunk for a couple of weeks and see what the stats look like (and whether there's a noticeable memory impact)?
Comment 22 Scoobidiver (away) 2012-03-20 03:58:38 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #20)
> Did it actually make any difference? It wasn't clear to me that it helped at
> all
With combined signatures, it's #8 top browser crasher in 12.0b1, #18 in 13.0a2 (not representative with only 1300 ADU), #31 in 14.0a1.

> although transplanting it would be harmless.
Let's do that before the 12.0b2 go-to-build.
Comment 23 Alex Keybl [:akeybl] 2012-03-20 11:28:41 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #20)
> Did it actually make any difference? It wasn't clear to me that it helped at
> all - although transplanting it would be harmless.

The effect of the band-aid will be most obvious on Beta 12 when comparing across betas. I agree with Scoobidiver that given the low risk nature of the band-aid, we can uplift to beta today on the off chance it reduces crashiness. 

(In reply to Jonathan Kew (:jfkthame) from comment #21)
> Maybe we could try it on trunk for a couple of weeks
> and see what the stats look like (and whether there's a noticeable memory
> impact)?

Sounds like a very reasonable way forward if we don't think this will cause cumulative memory pressure when using a number of fonts. Let's get the backout landed on m-c asap, and target Firefox 12 beta 3 for uplift to mitigate any associated risk (next Tuesday, 3/27). Thanks for investigating Jonathan.
Comment 24 Jonathan Kew (:jfkthame) 2012-03-20 14:21:51 PDT
Comment on attachment 598194 [details] [diff] [review]
possible experiment/band-aid

[Approval Request Comment]
Regression caused by (bug #): 703100
User impact if declined: continued crashiness
Testing completed (on m-c, etc.): has been on m-c for several weeks now with no adverse effects
Risk to taking this patch (and alternatives if risky): minimal risk, though degree of benefit is uncertain
String changes made by this patch: none
Comment 25 Jonathan Kew (:jfkthame) 2012-03-20 15:09:06 PDT
Created attachment 607737 [details] [diff] [review]
patch, remove the word-cache expiration timer

Backing out the whole of bug 703100 pt 4 would require a somewhat tricky/messy merge, as a bunch of other patches have landed in nearby areas of code. However, we can achieve the same effect much more simply by just removing the word-cache expiration timer, so that the functions to "age" and expire individual cached words, though still present, won't ever be triggered.

I'd suggest we land this on trunk to see whether we get an overall reduction in crashes, or merely trade this signature for others; and to observe the impact on memory use.
Comment 26 Alex Keybl [:akeybl] 2012-03-20 15:11:41 PDT
Comment on attachment 598194 [details] [diff] [review]
possible experiment/band-aid

[Triage Comment]
Low risk, approved for Beta.
Comment 27 Jonathan Kew (:jfkthame) 2012-03-20 15:20:48 PDT
Pushed to beta:
https://hg.mozilla.org/releases/mozilla-beta/rev/74655b354316

Not updating any flags, as this doesn't fix the issue, just (maybe) makes it a bit less bad.
Comment 28 Jonathan Kew (:jfkthame) 2012-03-21 02:05:45 PDT
Pushed the timer-removal patch to m-c:
https://hg.mozilla.org/mozilla-central/rev/3f965690ce1f

I believe this will eliminate this crash signature, but may not solve a possible underlying problem. Leaving this bug open for now to track what actually happens over the coming week....
Comment 29 Alex Keybl [:akeybl] 2012-03-28 16:09:27 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #28)
> Pushed the timer-removal patch to m-c:
> https://hg.mozilla.org/mozilla-central/rev/3f965690ce1f
> 
> I believe this will eliminate this crash signature, but may not solve a
> possible underlying problem. Leaving this bug open for now to track what
> actually happens over the coming week....

[@ PL_DHashTableEnumerate] has gone down from 16 crashes per week down to 5

[@ gfxFont::AgeCacheEntry(gfxFont::CacheHashEntry*, void*)] has gone from 9 crashes per week down to 2

I'm not sure if that allows us to make any conclusions as to the effectiveness of the fix. If we think there is something to the latest patch, and it's low risk, perhaps we should consider uplifting to Aurora 13 and Beta 12. What do you think Jonathan?
Comment 30 Jonathan Kew (:jfkthame) 2012-03-28 17:22:40 PDT
(In reply to Alex Keybl [:akeybl] from comment #29)
> [@ PL_DHashTableEnumerate] has gone down from 16 crashes per week down to 5

And the only two PL_DHashTableEnumerate crashes since the patch landed on m-c show stacks that look completely unrelated to fonts/text layout.

> [@ gfxFont::AgeCacheEntry(gfxFont::CacheHashEntry*, void*)] has gone from 9
> crashes per week down to 2

And none in builds since the patch landed.

> I'm not sure if that allows us to make any conclusions as to the
> effectiveness of the fix. If we think there is something to the latest
> patch, and it's low risk, perhaps we should consider uplifting to Aurora 13
> and Beta 12. What do you think Jonathan?

I think the risk is low enough that it's worth uplifting, given the apparent improvement on Nightly, even though with these small numbers it's hard to be sure exactly what they mean - they certainly give the impression of moving in the right direction.

In itself, the patch is very low risk: it introduces no new code, just removes the timer that's firing the cache-aging signals. The risk this brings, indirectly, is that it can increase our memory usage (or delay our releasing of memory after a peak), but I think that's worth it to eliminate these crashes.
Comment 31 Jonathan Kew (:jfkthame) 2012-03-28 17:28:35 PDT
On 12.0beta, we've got around 1000 crashes/week for these two signatures combined, so if we push out a beta build with the timer-removal patch, we should be able to see the effect pretty quickly.
Comment 32 Jonathan Kew (:jfkthame) 2012-03-29 09:06:22 PDT
Comment on attachment 607737 [details] [diff] [review]
patch, remove the word-cache expiration timer

[Approval Request Comment]
Regression caused by (bug #): 703100

User impact if declined: crashiness (on windows)

Testing completed (on m-c, etc.): the related crash numbers on Nightly look significantly reduced since this landed (3/21)

Risk to taking this patch (and alternatives if risky): Minimal code risk; may increase memory usage somewhat (mitigated by the fact that the cached records will still be released on memory-pressure notification). A backout of 703100 would involve extensive code changes and be much riskier.

String changes made by this patch: none
Comment 33 Alex Keybl [:akeybl] 2012-03-29 15:32:32 PDT
Comment on attachment 607737 [details] [diff] [review]
patch, remove the word-cache expiration timer

[Triage Comment]
Approving for Aurora 12. Let's make sure the crash frequency drops there before approving for Beta (4/3 at the latest).

When this came up during today's channel meeting triage, the only concern was the possibility for memory usage increases:

> Risk to taking this patch (and alternatives if risky): Minimal code risk;
> may increase memory usage somewhat (mitigated by the fact that the cached
> records will still be released on memory-pressure notification). A backout
> of 703100 would involve extensive code changes and be much riskier.

Can you provide any more information about how much we expect memory usage to rise by (is this cache always small?), and what operations QA should consider performing to verify that memory pressure isn't an issue?
Comment 34 Jonathan Kew (:jfkthame) 2012-03-30 08:02:09 PDT
Pushed https://hg.mozilla.org/releases/mozilla-aurora/rev/277a265da740.

The relevant memory can be monitored in current Nightly by checking the value for gfx/font-shaped-words in about:memory. (This memory reporter only recently landed, so it's not available in Aurora/Beta.)

In my experience (light use of the browser, a dozen or so tabs), I only see a few MB there, though I have seen it go as high as several tens of MB when deliberately seeking to push it up by visiting a wide variety of text-heavy pages. I'm sure it could be pushed to 100MB or more in extreme cases, though this would require lots of text in a large number of different fonts/styles, which means that our overall memory use for DOM/layout/etc would be higher, too.

The main effect of the crash-fix here is that gfx/font-shaped-words will tend not to drop as readily after such a peak, even if most of the text-heavy pages are closed - as long as any page is keeping the font instance alive, all its shaped-word data will hang around.

So what QA could look at - using Nightly - would be how badly gfx/font-shaped-words grows in comparison to overall memory usage in "intensive" browsing scenarios. I do think that's something we'll eventually want to address again, by restoring the expiration mechanism, but we need to do so without the crashiness.
Comment 35 Alex Keybl [:akeybl] 2012-03-30 13:21:01 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #34)
> So what QA could look at - using Nightly - would be how badly
> gfx/font-shaped-words grows in comparison to overall memory usage in
> "intensive" browsing scenarios. I do think that's something we'll eventually
> want to address again, by restoring the expiration mechanism, but we need to
> do so without the crashiness.

Thanks. Adding qawanted to address this.
Comment 36 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-30 13:50:29 PDT
Here is some initial testing -- please advise on what more I can do...

First Run:
├───0.68 MB (01.26%) -- gfx
│   ├──0.33 MB (00.62%) ── font-tables
│   ├──0.14 MB (00.27%) ── font-cache
│   ├──0.14 MB (00.26%) ── font-list
│   ├──0.04 MB (00.07%) ── font-shaped-words
│   └──0.03 MB (00.05%) ── font-charmaps

Load 10 tabs from differnt links on cbc.ca
├────2.81 MB (01.46%) -- gfx
│    ├──1.31 MB (00.68%) ── font-charmaps
│    ├──0.61 MB (00.32%) ── font-tables
│    ├──0.52 MB (00.27%) ── font-shaped-words
│    ├──0.23 MB (00.12%) ── font-cache
│    └──0.14 MB (00.07%) ── font-list

Load 10 more tabs with different videos from Youtube HTML5
     ├──3.37 MB (00.88%) -- gfx
     │  ├──1.31 MB (00.34%) ── font-charmaps
     │  ├──0.90 MB (00.24%) ── font-shaped-words
     │  ├──0.70 MB (00.18%) ── font-tables
     │  ├──0.32 MB (00.08%) ── font-cache
     │  └──0.14 MB (00.04%) ── font-list

Load 10 more tabs with articles from Huffingtonpost.com
│   ├───4.86 MB (00.70%) -- gfx
│   │   ├──1.88 MB (00.27%) ── font-shaped-words
│   │   ├──1.31 MB (00.19%) ── font-charmaps
│   │   ├──0.87 MB (00.12%) ── font-tables
│   │   ├──0.66 MB (00.09%) ── font-cache
│   │   └──0.14 MB (00.02%) ── font-list

Start Private browsing
│   ├───4.90 MB (00.91%) -- gfx
│   │   ├──1.92 MB (00.35%) ── font-shaped-words
│   │   ├──1.31 MB (00.24%) ── font-charmaps
│   │   ├──0.87 MB (00.16%) ── font-tables
│   │   ├──0.66 MB (00.12%) ── font-cache
│   │   └──0.14 MB (00.03%) ── font-list

Load top 10 sites from Alexa
├────3.11 MB (01.50%) -- gfx
│    ├──1.31 MB (00.63%) ── font-charmaps
│    ├──0.99 MB (00.48%) ── font-tables
│    ├──0.41 MB (00.20%) ── font-shaped-words
│    ├──0.26 MB (00.13%) ── font-cache
│    └──0.14 MB (00.07%) ── font-list

Stop Private Browsing
└────3.25 MB (01.72%) -- gfx
     ├──1.31 MB (00.69%) ── font-charmaps
     ├──1.06 MB (00.56%) ── font-tables
     ├──0.46 MB (00.24%) ── font-shaped-words
     ├──0.28 MB (00.15%) ── font-cache
     └──0.14 MB (00.07%) ── font-list

Quit and restore session
│   ├───2.71 MB (00.64%) -- gfx
│   │   ├──1.17 MB (00.28%) ── font-shaped-words
│   │   ├──0.87 MB (00.20%) ── font-tables
│   │   ├──0.55 MB (00.13%) ── font-cache
│   │   ├──0.08 MB (00.02%) ── font-charmaps
│   │   └──0.04 MB (00.01%) ── font-list
Comment 37 Jonathan Kew (:jfkthame) 2012-03-30 14:06:45 PDT
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #36)
> Here is some initial testing -- please advise on what more I can do...

If this is at all typical of people's browsing behavior, I don't see any cause for concern.

You should be able to push font-shaped-words significantly higher by loading a bunch of very text-heavy pages such as tinderbox logs, source files in mxr, the single-page version of the HTML5 spec, cycling through lots of wikipedia articles, etc.
Comment 38 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-30 14:42:46 PDT
(In reply to Jonathan Kew (:jfkthame) from comment #37)
Continuing where I left off with comment 36...

Add 10 more tabs with tbpl.mozilla.org:
    ├───3.62 MB (00.45%) -- gfx
    │   ├──1.88 MB (00.23%) ── font-shaped-words
    │   ├──0.89 MB (00.11%) ── font-tables
    │   ├──0.72 MB (00.09%) ── font-cache
    │   ├──0.08 MB (00.01%) ── font-charmaps
    │   └──0.04 MB (00.01%) ── font-list

Add 10 more tabs with different mxr pages:
│   ├───6.68 MB (00.83%) -- gfx
│   │   ├──3.45 MB (00.43%) ── font-shaped-words
│   │   ├──1.31 MB (00.16%) ── font-charmaps
│   │   ├──0.99 MB (00.12%) ── font-tables
│   │   ├──0.79 MB (00.10%) ── font-cache
│   │   └──0.14 MB (00.02%) ── font-list

Add 10 more tabs with single-page HTML5 spec:
│     ├──10.50 MB (00.63%) -- gfx
│     │  ├───6.90 MB (00.42%) ── font-shaped-words
│     │  ├───1.31 MB (00.08%) ── font-charmaps
│     │  ├───1.13 MB (00.07%) ── font-tables
│     │  ├───1.01 MB (00.06%) ── font-cache
│     │  └───0.14 MB (00.01%) ── font-list
* Becomes quite unresponsive -- hangs for several minutes but does not crash -- about:memory throws "slow script" warning which I allow to continue -- eventually recovers (explicit memory ballooned to ~1.7GB)

Add 10 more tabs with random wikipedia articles:
│     ├──14.92 MB (00.87%) -- gfx
│     │  ├───8.31 MB (00.48%) ── font-shaped-words
│     │  ├───3.97 MB (00.23%) ── font-cache
│     │  ├───1.31 MB (00.08%) ── font-charmaps
│     │  ├───1.20 MB (00.07%) ── font-tables
│     │  └───0.14 MB (00.01%) ── font-list
Again, I got slow script warnings when loading about:memory

I suspect the hangs/slow-memory-warnings are more of a symptom of the number of tabs and content I was loading (80 tabs by the end of the test).
Comment 39 Jonathan Kew (:jfkthame) 2012-03-30 14:47:13 PDT
Yes, loading the big HTML5 spec tends to bring the browser to its knees - though I believe that's as much to do with scripts that it runs as with text layout issues.

The font-shaped-words usage here is still very reasonable, for the amount of stuff you've loaded. If you then close most of those tabs and continue "normal" browsing (e.g. reading news sites), how much of that memory gets released after a minute or two?
Comment 40 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-30 14:56:48 PDT
And continuing where I left off in comment 38...

Start Private Browsing mode:
├───2.28 MB (04.19%) -- gfx
│   ├──1.31 MB (02.41%) ── font-charmaps
│   ├──0.57 MB (01.05%) ── font-shaped-words
│   └──0.40 MB (00.73%) -- (3 tiny)
│      ├──0.21 MB (00.39%) ── font-tables
│      ├──0.14 MB (00.26%) ── font-list
│      └──0.05 MB (00.08%) ── font-cache

Stop Private Browsing mode:
├─────14.64 MB (01.22%) -- gfx
│     ├───8.01 MB (00.67%) ── font-shaped-words
│     ├───2.96 MB (00.25%) ── font-tables
│     ├───2.22 MB (00.18%) ── font-cache
│     ├───1.31 MB (00.11%) ── font-charmaps
│     └───0.14 MB (00.01%) ── font-list

Right click last tab > Close Other Tabs
├───6.78 MB (07.25%) -- gfx
│   ├──4.08 MB (04.37%) ── font-shaped-words
│   ├──1.31 MB (01.40%) ── font-charmaps
│   ├──0.96 MB (01.02%) ── font-tables
│   └──0.42 MB (00.45%) -- (2 tiny)
│      ├──0.28 MB (00.30%) ── font-cache
│      └──0.14 MB (00.15%) ── font-list

Reload about:memory page after 1 minute:
├───2.58 MB (05.12%) -- gfx
│   ├──1.31 MB (02.60%) ── font-charmaps
│   ├──0.72 MB (01.43%) ── font-shaped-words
│   └──0.55 MB (01.09%) -- (3 tiny)
│      ├──0.35 MB (00.69%) ── font-tables
│      ├──0.14 MB (00.28%) ── font-list
│      └──0.06 MB (00.12%) ── font-cache

Open a tab to news.google.ca:
├───2.87 MB (03.12%) -- gfx
│   ├──1.56 MB (01.69%) -- (4 tiny)
│   │  ├──0.83 MB (00.90%) ── font-shaped-words
│   │  ├──0.48 MB (00.52%) ── font-tables
│   │  ├──0.14 MB (00.15%) ── font-list
│   │  └──0.11 MB (00.12%) ── font-cache
│   └──1.31 MB (01.42%) ── font-charmaps

No crashes experienced
Comment 41 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-03-30 20:53:38 PDT
@jfkthame has qawanted been satisfied, or is there more that QA can do to assist with this bug?
Comment 42 Jonathan Kew (:jfkthame) 2012-03-30 22:03:36 PDT
Thanks for your testing - I'm comfortable that the shaped-word memory usage isn't likely to grow excessively under normal use, given the figures you're seeing.

@akeybl, does this help? ISTM that to create a real problem here, someone would need to deliberately create a pathological example that constantly shapes new words without ever allowing the font instances to expire - yes, it could be done, and would eventually DoS the browser (OOM), but there are already easy ways to do that. It doesn't look like normal users should run into issues with this.
Comment 43 Robert O'Callahan (:roc) (email my personal email if necessary) 2012-04-01 20:13:03 PDT
What about a page that just updates a counter, e.g. a clock? That's effectively a memory leak now?
Comment 44 blinky 2012-04-02 10:24:16 PDT
I tested on a large page size.

When open the page use more than 1GB memory, Then it starts low memory.



Firefox 12.0b3

464.52 MB (100.0%) -- explicit
├──154.68 MB (33.30%) -- dom
│  └──154.68 MB (33.30%) -- window-objects
│     ├──154.57 MB (33.27%) -- active
│     │  ├──154.40 MB (33.24%) -- top=8 (inner=17)
│     │  │  └──154.40 MB (33.24%) ── inner-window(id=17, uri=file:///C:/path.log)
│     │  └────0.17 MB (00.04%) ++ (4 tiny)
│     └────0.12 MB (00.02%) ++ cached
├──137.31 MB (29.56%) ── heap-unclassified
├──132.05 MB (28.43%) -- layout
│  ├──130.30 MB (28.05%) -- shell(file:///C:/path.log)
│  │  ├──130.24 MB (28.04%) ── arenas
│  │  └────0.06 MB (00.01%) ── styledata
│  └────1.75 MB (00.38%) ++ (4 tiny)
├───31.45 MB (06.77%) -- js
│   ├──13.99 MB (03.01%) -- compartment([System Principal], 0x797e000)
│   │  ├───7.83 MB (01.69%) ++ gc-heap
│   │  └───6.16 MB (01.33%) ++ (9 tiny)
│   ├───9.43 MB (02.03%) ── gc-heap-decommitted
│   └───8.03 MB (01.73%) ++ (9 tiny)
├────6.48 MB (01.39%) -- storage
│    ├──5.25 MB (01.13%) ++ sqlite
│    └──1.22 MB (00.26%) ++ prefixset
└────2.54 MB (00.55%) ++ (7 tiny)




Firefox 14.0a1

841.84 MB (100.0%) -- explicit
├──460.07 MB (54.65%) -- gfx
│  ├──459.87 MB (54.63%) ── font-shaped-words
│  └────0.20 MB (00.02%) ++ (3 tiny)
├──286.23 MB (34.00%) -- window-objects
│  ├──284.74 MB (33.82%) -- top(file:///c:/path.log, id=8)
│  │  └──284.74 MB (33.82%) -- active
│  │     └──284.74 MB (33.82%) -- window(file:///c:/path.log)
│  │        ├──154.40 MB (18.34%) ── dom [2]
│  │        └──130.34 MB (15.48%) -- layout
│  │           ├──130.24 MB (15.47%) ── arenas
│  │           └────0.10 MB (00.01%) ── style-sets
│  └────1.49 MB (00.18%) ++ (3 tiny)
├───57.59 MB (06.84%) ── heap-unclassified
├───28.66 MB (03.40%) -- js
│   ├──16.74 MB (01.99%) -- compartment([System Principal], 0x7a88000)
│   │  ├──10.11 MB (01.20%) ++ gc-heap
│   │  └───6.63 MB (00.79%) ++ (7 tiny)
│   └──11.93 MB (01.42%) ++ (10 tiny)
└────9.29 MB (01.10%) ++ (10 tiny)
Comment 45 Jonathan Kew (:jfkthame) 2012-04-03 02:52:24 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #43)
> What about a page that just updates a counter, e.g. a clock? That's
> effectively a memory leak now?

Yes, but it's a sufficiently small/slow leak that it's probably not critical - as a test, I opened a tab containing four counters in different fonts (so they can't share cached shaped-word entries), continuously updating; by the time they'd reached about 1,500,000 (comparable to a single counter reaching 4 million, or a clock updating every second for over 6 weeks) the gfx/font-shaped-words usage had climbed to 200MB.

Note also that this memory will still be released if we get a memory-pressure notification (even if the page concerned is still open and busy). I'm not sure how good we are at firing this notification when things get tight, but it certainly behaves as expected when clicking Minimize Usage in about:memory.

(Strictly speaking the shaped-words aren't leaked, as they will still be released when the font instance goes away - it's just that this could take an indefinite amount of time, if the same font is constantly being used.)

This situation isn't ideal, obviously - we'd like to restore the timed-expiration behavior. But I think we could live with it for now; I'm much more worried about the potential disruption of backing out the whole of bug 703100 at this stage than about the longevity of shaped-word records.
Comment 46 Jonathan Kew (:jfkthame) 2012-04-03 03:58:26 PDT
Created attachment 611770 [details] [diff] [review]
patch, flush shaped-word caches if they get excessively large

Here is a possible mitigation for the continual cache growth in the case where a page continually shapes new "words" using the same font, so that we're constantly adding entries to the word cache and never discarding it. WDYT? It seems less risky than the timer because it only touches the cache in a single, well-understood place (when we're about to look up a word anyway), and it doesn't even need to enumerate all the existing entries, it just checks the count.

(In terms of behavior, this is inferior to age-based expiration, of course, but it's actually somewhat comparable to the old gfxTextRunWordCache behavior, where the lifetime of a word-cache entry was arbitrarily tied to the lifetime of the particular text-run where the word was first encountered, regardless of whether it's being frequently re-used to build newer runs.)
Comment 47 Jonathan Kew (:jfkthame) 2012-04-03 04:24:43 PDT
Pushed the cache-growth mitigation to inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab9a5eb63487

I suggest we should uplift this to aurora (where we already landed the timer-removal), and to beta if we take attachment 607737 [details] [diff] [review] there as well.
Comment 48 Jonathan Kew (:jfkthame) 2012-04-03 04:32:22 PDT
Comment on attachment 611770 [details] [diff] [review]
patch, flush shaped-word caches if they get excessively large

[Approval Request Comment]
Regression caused by (bug #): 703100, 717175

User impact if declined: potentially unconstrained growth of shaped-word caches, due to disabling the time-based expiration

Testing completed (on m-c, etc.): just landed on inbound, see above

Risk to taking this patch (and alternatives if risky): Small and simple patch with little risk of regression; theoretically could impact performance slightly by making us re-shape text more, but only for pathological cases where a huge number of different words occur in a single font. Alternatives are to accept continued growth of the cache in this case, or to back out all of bug 703100 and followups, which I believe would be substantially riskier.

String changes made by this patch: none
Comment 49 Jonathan Kew (:jfkthame) 2012-04-03 04:33:29 PDT
** Note to inbound sheriffs **

Please do not close bug when merging the latest patch to m-c.
Comment 50 Jonathan Kew (:jfkthame) 2012-04-03 09:31:40 PDT
Created attachment 611846 [details] [diff] [review]
patch for mozilla-beta (ff12): full backout of 703100 and following bugs that depended on it

FTR, I'm attaching a patch for beta that would fully back out bug 703100. This required fixing up merges by hand, especially in gfxFont.h; I have pushed it to tryserver to check whether it builds successfully on all platforms.

Altogether, this backs out the following changesets/bugs:
   99a6a46456ac bug 703100 pt2.1
   62e53dcfcd61 bug 703100 pt2.2
   4bc3a3ee012d bug 703100 pt2.3
   fd151d941e29 bug 703100 pt2.4
   561d06710107 bug 703100 pt2.4.1
   e053ff76503d bug 703100 pt2.5
   102dff1e0bb5 bug 703100 pt3
   433b37e097c7 bug 703100 pt4
   daadf6827faa bug 703100 pt5
   7b612dae71ec bug 703100 pt6
   c0b62edd2917 bug 703100 pt7
   26d7324c8d37 bug 703100 followup
   4117cd7064ee bug 715471
   0a9fd06dea54 bug 716402
   f80d6519ad29 bug 716402 reftest
   6290daa65189 bug 708075
   9c2ca2a79d79 bug 716229
   1b67951de0a2 bug 716229
   74655b354316 bug 717175 bandaid

Of these, 708075 and 716229 were followup issues that are irrelevant when 703100 is backed out; 715471 and 716402 are separate bugs that I backed out because of merge conflicts, though I think 715471 could easily be re-landed without the conflicting chunk.

However, IMO it would be *much* less risky to take the fixes in attachment 607737 [details] [diff] [review] and attachment 611770 [details] [diff] [review] than this backout.
Comment 51 Matt Brubeck (:mbrubeck) 2012-04-03 10:59:13 PDT
https://hg.mozilla.org/mozilla-central/rev/ab9a5eb63487 - flush shaped-word caches if they grow excessively large
Comment 52 Alex Keybl [:akeybl] 2012-04-03 11:24:20 PDT
Comment on attachment 607737 [details] [diff] [review]
patch, remove the word-cache expiration timer

[Triage Comment]
A full backout of bug 703100 does not appear to be feasible from the perspective of risk, so we're going to instead work to mitigate the opportunity for crashes while keeping memory down.
Comment 53 Jonathan Kew (:jfkthame) 2012-04-03 11:41:27 PDT
Pushed expiration-timer-disable and large-cache-flush patches to beta:
https://hg.mozilla.org/releases/mozilla-beta/rev/0cb0a962e548
https://hg.mozilla.org/releases/mozilla-beta/rev/a1f7942cbd93
Comment 54 Jonathan Kew (:jfkthame) 2012-04-03 11:45:24 PDT
And pushed large-cache-flushing patch to aurora:
https://hg.mozilla.org/releases/mozilla-aurora/rev/ac19f9e7096e

FTR, the changeset ID listed in comment 34 for the timer-disabling patch is wrong; the actual changeset was:
https://hg.mozilla.org/releases/mozilla-aurora/rev/7b955f363ee9
Comment 55 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-04-03 11:47:58 PDT
Removing QAWANTED -- please re-add if there is something more QA can do.
Comment 56 Michael 2012-04-05 17:55:55 PDT
I have no idea if this will help anyone.... I am not a developer... but here goes. I have read through these comments (as I often do) and see that there is at least a chance of the bug being elsewhere entirely. There has been a bug that has shown up in Firefox 12 when attempting to build with Intel C++. MSVC doesn't hit it, but the error occurs here in Intel C++ shown below......... the compiler it seems is complaining about a missing ";" someewhere. I hope that I haven't just spammed everyone here, but felt I should at least mention this.. Firefox 11 did not have this issue.



make[6]: Entering directory `/e/Mozilla/beta-build/obju-icc/gfx/2d'  
Factory.cpp  
icl -FoFactory.obj -c -D_HAS_EXCEPTIONS=0 -I../../dist/stl_wrappers  -DMOZ_GFX -DUSE_CAIRO -DSK_A32_SHIFT=24 -DSK_R32_SHIFT=16 -DSK_G32_SHIFT=8 -DSK_B32_SHIFT=0 -DWIN32 -DMOZILLA_INTERNAL_API -D_IMPL_NS_COM -DEXPORT_XPT_API -DEXPORT_XPTC_API -D_IMPL_NS_GFX -D_IMPL_NS_WIDGET -DIMPL_XREAPI -DIMPL_NS_NET -DIMPL_THEBES  -DOSTYPE=\"WINNT6.1\" -DOSARCH=WINNT  -I/e/Mozilla/beta-build/mozilla-beta/gfx/2d -I. -I../../dist/include -I../../dist/include/nsprpub  -Ie:/Mozilla/beta-build/obju-icc/dist/include/nspr -Ie:/Mozilla/beta-build/obju-icc/dist/include/nss        -TP -nologo -W3 -Gy -Fdgenerated.pdb -wd4800 -we4553  -DNDEBUG -DTRIMMED -O2  -Oy -Ie:/Mozilla/beta-build/obju-icc/dist/include/cairo -MD            -FI ../../dist/include/mozilla-config.h -DMOZILLA_CLIENT /e/Mozilla/beta-build/mozilla-beta/gfx/2d/Factory.cpp  
Factory.cpp  
  
e:\Mozilla\beta-build\obju-icc\dist\include\nsExpirationTracker.h(321): error: expected a ";"  
  
      class ExpirationTrackerObserver MOZ_FINAL : public nsIObserver {  
  
                                                                             ^  
  
          detected during instantiation of class "nsExpirationTracker&lt;T, K&gt; [with T=gfxFont, K=3U]" at line 690 of "e:\Mozilla\beta-build\obju-icc\dist\include\gfxFont.h"  
  
  
  
compilation aborted for e:/Mozilla/beta-build/mozilla-beta/gfx/2d/Factory.cpp (code 2)  
  
make[6]: *** [Factory.obj] Error 2
Comment 57 Jonathan Kew (:jfkthame) 2012-04-05 22:00:38 PDT
(In reply to Michael from comment #56)
> I have no idea if this will help anyone.... I am not a developer... but here
> goes. I have read through these comments (as I often do) and see that there
> is at least a chance of the bug being elsewhere entirely. There has been a
> bug that has shown up in Firefox 12 when attempting to build with Intel C++.
> MSVC doesn't hit it, but the error occurs here in Intel C++ shown
> below......... the compiler it seems is complaining about a missing ";"
> someewhere. I hope that I haven't just spammed everyone here, but felt I
> should at least mention this.. Firefox 11 did not have this issue.

Thanks for letting us know - at this point, I'm guessing this indicates an incompatibility with the intel compiler (maybe as a result of bug 704127). Could you please file a separate bug report about this, where we can investigate it further without cluttering this one, as I don't think it's actually related, but it would be good to resolve it somehow.
Comment 58 Sheila Mooney 2012-04-10 10:09:07 PDT
Still no crashes in 12b4 so far.
Comment 59 Sheila Mooney 2012-04-12 10:13:22 PDT
Johnathan, when can we resolve this bug? Still no crashes in 12b4.
Comment 60 Jonathan Kew (:jfkthame) 2012-04-12 10:27:14 PDT
I also didn't see any crashes in recent builds of Aurora and Nightly, only from builds before the fix here landed. So let's call this Fixed.
Comment 61 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-05-10 10:55:06 PDT
I'm not seeing any crash reports for this since Firefox 12.0b3 -- calling this verified.

Note You need to log in before you can comment on or make changes to this bug.