Closed Bug 1488164 Opened 6 years ago Closed 6 years ago

Hang, memory inflation and crash on www.friskyradio.com

Categories

(Core :: General, defect)

Hardware: All
OS: Unspecified
Type: defect
Priority: Not set
Severity: critical

Tracking

RESOLVED INACTIVE
Tracking Status
firefox63 --- fix-optional
firefox64 --- fix-optional

People

(Reporter: mark, Unassigned)

Details

(Keywords: regression, regressionwindow-wanted, topcrash, Whiteboard: [MemShrink:P2])

Attachments

(2 files)

Loading the indicated URL causes Firefox Nightly to hang, inflate memory usage to many GB, and often crash as a result (probably due to OOM).

I'm unsure which component this belongs to, so filing this as untriaged.
Hi Mark!

Could you please specify the OS, the Firefox version, and the Build ID on which you've seen the issue?

I've tried accessing different parts of the URL mentioned in the summary and listened to both the live show and recorded shows, but didn't observe any inflated memory usage or crashes. I tried on both Win 10 x64 and Win 7 with the latest Release 62.0 (Build ID: 20180830143136) and the latest Nightly 64.0a1 (Build ID: 20180905123750).

Could you retest using safe mode (https://goo.gl/AR5o9d), and perhaps a new, clean Firefox profile (https://goo.gl/AWo6h8), to eliminate add-ons or custom settings as a possible cause?

Also, if you encounter a crash, please send us a crash report (more details here: https://support.mozilla.org/en-US/kb/mozillacrashreporter).
Flags: needinfo?(mark)
Status: NEW → UNCONFIRMED
Ever confirmed: false
Whiteboard: [MemShrink]
OS: Windows 7 SP1 x64, fully updated, 16 GB RAM
Browser used: Firefox Nightly x64. The first report was on the 20180903 build.
Updated Nightly today; the problem still occurs: 64.0a1 (2018-09-06) (64-bit)

When the inflation and hang/unresponsiveness occur, about:memory shows a warning at the top of the main-process measurement:

WARNING: the following values are negative or unreasonably large.

    explicit/js-non-window/(3 tiny)/gc-heap/unused-arenas
    js-main-runtime/gc-heap/(2 tiny)
    js-main-runtime/gc-heap/(2 tiny)/unused-arenas
    js-main-runtime-gc-heap-committed/unused/arenas 

This indicates a defect in one or more memory reporters. The invalid values are highlighted. 

The content process measurement is clearly out of bounds:

Web Content (pid 6604)
Explicit Allocations

3,346.05 MB (100.0%) -- explicit
├──1,729.75 MB (51.70%) ── heap-unclassified
├──1,397.44 MB (41.76%) -- window-objects
│  ├──1,395.52 MB (41.71%) -- top(https://www.friskyradio.com/show/artist_of_the_week/01-01-2013, id=4294967297)
│  │  ├────725.84 MB (21.69%) -- active
│  │  │    ├──645.18 MB (19.28%) ++ (197 tiny)
│  │  │    └───80.66 MB (02.41%) ++ window(https://www.friskyradio.com/show/artist_of_the_week/01-01-2013)
│  │  ├────666.59 MB (19.92%) -- js-zone(0xb43f000)
│  │  │    ├──293.72 MB (08.78%) ── type-pool
│  │  │    ├───94.41 MB (02.82%) -- shapes
│  │  │    │   ├──52.26 MB (01.56%) -- gc-heap
│  │  │    │   │  ├──34.16 MB (01.02%) ── tree
│  │  │    │   │  └──18.10 MB (00.54%) ++ (2 tiny)
│  │  │    │   └──42.15 MB (01.26%) -- malloc-heap
│  │  │    │      ├──34.05 MB (01.02%) ── tree-tables
│  │  │    │      └───8.10 MB (00.24%) ++ (2 tiny)
│  │  │    ├───85.94 MB (02.57%) ── unused-gc-things
│  │  │    ├───75.15 MB (02.25%) -- lazy-scripts
│  │  │    │   ├──58.66 MB (01.75%) ── gc-heap
│  │  │    │   └──16.50 MB (00.49%) ── malloc-heap
│  │  │    ├───61.74 MB (01.85%) ++ (13 tiny)
│  │  │    └───55.63 MB (01.66%) -- object-groups
│  │  │        ├──54.71 MB (01.64%) ── gc-heap
│  │  │        └───0.92 MB (00.03%) ── malloc-heap
│  │  └──────3.10 MB (00.09%) ++ cached/window(https://www.mozilla.org/en-US/firefox/64.0a1/whatsnew/?oldversion=63.0a1)
│  └──────1.92 MB (00.06%) ++ top(none)/detached
├────106.13 MB (03.17%) -- js-non-window
│    ├───73.31 MB (02.19%) -- runtime
│    │   ├──39.00 MB (01.17%) ++ (14 tiny)
│    │   └──34.31 MB (01.03%) ++ code
│    └───32.82 MB (00.98%) ++ (4 tiny)
├─────95.08 MB (02.84%) -- heap-overhead
│     ├──57.91 MB (01.73%) ── bin-unused
│     ├──36.08 MB (01.08%) ── bookkeeping
│     └───1.10 MB (00.03%) ── page-cache
└─────17.66 MB (00.53%) ++ (22 tiny)

I'll also upload a saved memory report.
Note: this problem is specific to the URL indicated -- it doesn't seem to happen on the site's main page or on other pages, probably because of the content on this one. Sometimes blocking "slow loading trackers" in Disconnect helps, sometimes it does not; disabling that blocking almost certainly triggers the issue.
The crash occurs on 32-bit Firefox due to OOM (as expected with this kind of inflation), but I switched to 64-bit, where it just eats a few gigabytes and then seems to settle down somewhat.
Flags: needinfo?(mark)
Hi!

Thanks for your feedback. With the details provided I managed to reproduce the issue here, and with Nightly x32 I even got a crash on Win7 x64 (see the attached crash signature). On Nightly x64 it's the same behavior you described: memory consumption jumps to around 3 GB with a single tab open.

I also tried on Win 10 with the latest Nightly x64: memory consumption jumps there too, while on the x32 version only the tab crashes, without a crash report.

Steps to reproduce:
1. Open Nightly
2. Disable Content Blocking
3. Open https://www.friskyradio.com/show/artist_of_the_week/01-01-2013
4. Wait, or press play on one of the playlists on the page (e.g. scroll down to "Artist of the Week 1 January 2013 Monodice")

Mike, could you please take a look at the attached memory report and assign an appropriate component for this issue?
Status: UNCONFIRMED → NEW
Crash Signature: [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC ]
Ever confirmed: true
Flags: needinfo?(mconley)
Hardware: Unspecified → All
3,763.52 MB (100.0%) -- explicit
├──2,061.39 MB (54.77%) ── heap-unclassified
├──1,326.59 MB (35.25%) -- window-objects
│  ├──1,323.53 MB (35.17%) -- top(https://www.friskyradio.com/show/artist_of_the_week/01-01-2013, id=4294967297)
│  │  ├────761.98 MB (20.25%) ++ js-zone(0xb43f000)
│  │  ├────558.54 MB (14.84%) -- active
│  │  │    ├──475.36 MB (12.63%) ++ (207 tiny)
│  │  │    └───83.18 MB (02.21%) ++ window(https://www.friskyradio.com/show/artist_of_the_week/01-01-2013)
│  │  └──────3.01 MB (00.08%) ++ cached/window(https://www.mozilla.org/en-US/firefox/64.0a1/whatsnew/?oldversion=63.0a1)
│  └──────3.07 MB (00.08%) ++ top(none)/detached
├────223.51 MB (05.94%) -- heap-overhead
│    ├──178.56 MB (04.74%) ── bin-unused
│    ├───43.75 MB (01.16%) ── bookkeeping
│    └────1.19 MB (00.03%) ── page-cache
├────123.24 MB (03.27%) ++ js-non-window
└─────28.80 MB (00.77%) ++ (22 tiny)



*whistles* -- in this report there's about 2 GB of heap-unclassified. That's probably where we'll want to start figuring out what's going on. I also wonder if it might be related to all 207 of those tiny windows attached to the main one.

I'm not certain which component to place this in, but I'm going to guess DOM for now. Since this is so easily reproducible, hopefully someone from MemShrink can use a DMD build and find out what all of that heap-unclassified is.
Component: Untriaged → DOM
Flags: needinfo?(mconley)
Product: Firefox → Core
(Moving to General, since nothing here points specifically at DOM. But someone running DMD needs to take a look.)
Component: DOM → General
Looks like we need some memory reporters for the HTML parser at least.
And for XDR.

That wouldn't fix this, but it would make the issue easier to analyze.
Testing requires disabling all tracking protection. Once that is done, some script seems to be adding new iframe elements all the time. I think those are connect.facebook.net iframes.
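For anyone reproducing this, here is a minimal console sketch to confirm and quantify the iframe churn (purely a diagnostic aid, assuming the frames are inserted somewhere under the document element; it is not part of any fix):

let iframeCount = 0;
const iframeObserver = new MutationObserver((mutations) => {
  for (const mutation of mutations) {
    for (const node of mutation.addedNodes) {
      if (node.nodeName === "IFRAME") {
        iframeCount++;
        console.log(`iframe #${iframeCount} added, src=${node.src}`);
      }
    }
  }
});
// Watch the whole document for newly inserted nodes.
iframeObserver.observe(document.documentElement, { childList: true, subtree: true });
// Call iframeObserver.disconnect() to stop logging.

If the connect.facebook.net frames really are added continuously, the counter should keep climbing for as long as the page misbehaves.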
[Tracking Requested - why for this release]:
There are 708 crashes in beta 63, and the signature is ranked #3 among content top-crashers.
Keywords: topcrash
Tracking for 63 given the crash spike.
Top crash, updating the severity.
Severity: normal → critical
(a regression without a regression window isn't too useful)
According to crash data from 63.0a1, this likely regressed in build 20180706224413 -- these would be the changes that went into that build:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=fa376b&tochange=9849ea3937e2e7c97f79b1d2b601f956181b4629
Olli, does this fall into your bucket, or should we get other people involved in this bug? Is it a DOM issue in the end? Thanks.
Flags: needinfo?(bugs)
This smells like some script change (which would mean not a Gecko regression), or frameloader/docshell change.
Fission folks would know about the latter.
Flags: needinfo?(bugs) → needinfo?(nika)
(In reply to Olli Pettay [:smaug] (r- if the bug doesn't explain what the change(s) are about.) from comment #16)
> This smells like some script change (which would mean not a Gecko
> regression), or frameloader/docshell change.
> Fission folks would know about the latter.

As far as I know there haven't been any major changes to docshell/frameloader in that timeframe, other than basic refactorings. qDot might have a better memory of it, though, as he's in the docshell code more than I am nowadays.
Flags: needinfo?(nika) → needinfo?(kyle)
facebook.com is very visible in the URL list of the 63 crashes, as are youtube.com and Gmail. There is a 48.5% correlation at the top of the beta correlations: GFX_ERROR "[Tiling:Client] Failed to allocate a TextureClient" = true [98.18% vs 03.08% if startup_crash = null]. Not sure whether that is helpful or not.
Nothing I'm aware of landed in docshell/frameloader in that range either.
Flags: needinfo?(kyle)
Ideas on what else to look into?
Flags: needinfo?(bugs)
Whiteboard: [MemShrink] → [MemShrink:P2]
Selena, could we have somebody assigned to evaluate this top crash? Given the volume of crashes on beta, it is very likely to be a major stability issue for 63 once we hit release. Thanks!
Flags: needinfo?(sdeckelmann)
Adding a similar signature.
Crash Signature: [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC ] → [@ OOM | large | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC] [@ OOM | unknown | js::AutoEnterOOMUnsafeRegion::crash | js::AutoEnterOOMUnsafeRegion::crash | js::gc::AllocateCellInGC]
Hello Paul - Bug 1472062 seems quite similar to this one. If it is the same underlying issue, is there anything we can do in the 63 cycle to make it better?
Flags: needinfo?(pbone)
Hey Maire -- see comment #16 -- can you have someone take a look? We have a reproducible memory leak here and we're just not sure where exactly it's occurring.
Flags: needinfo?(sdeckelmann) → needinfo?(mreavy)
If anyone who can reproduce this could find the regression range, that would help here a lot!
Flags: needinfo?(bugs)
sfink, could this have something to do with bug 1471949?
(See comment 14)
Flags: needinfo?(sphink)
Looking at other crashes, isn't Bug 1472062 the same issue as this bug?
See Also: → 1472062
(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #24)
> Hey Maire -- see comment #16 -- can you have someone take a look? We have a
> reproducible memory leak here and we're just not sure where exactly it's
> occurring.

Hi Selena. If I understand smaug's report and the follow-on comments (comment 7 and comment 8), the heap-unclassified memory looks to be mostly XDR and HTML5 parser allocations. Henri Sivonen would be a good person to ask -- he could investigate why the parser uses that much memory, but it's likely that the site is just misbehaving.

Hi Henri, I've needinfo'd you to ask whether it's worth taking a look yourself -- or asking someone else to -- at why the parser uses so much memory.

------
The rest of my comments below are my guess at what's likely going on and how severe a problem this likely is when 63 is released (FWIW):

IIUC this has gotten hot because we're concerned that this is a recent regression that will blow up on us once Fx 63 goes to the field, but (after doing a quick test) Fx 56 appears to use around the same amount of memory on this particular page. If someone wants to sanity-check my quick test, that'd be great.

I think it's likely that the signature may have changed from earlier versions -- possibly due to changes to the nursery/gc code?    This particular site is evil and could cause OOM, but the signatures tracked here are not specific to this site.

The XDR/HTML5 Parser stuff may have a bug, but the most likely explanation is that the spot where we happen to hit OOM changed (e.g. due to GC/JS/Nursery code changes).  To check that, we could examine the overall OOM rate for all OOM signatures (assuming we can reliably pull out these signatures, and that they're reliably being reported).

We could then track OOMs across all signatures against browser version and see whether the overall OOM rate has gone up or down. However, OOMs are influenced by a lot of factors -- ironically, a more stable browser, or one whose performance degrades less over time, becomes more likely to OOM (since it didn't crash and the user didn't kill and restart it). And the OOM rate can vary over time.
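As a rough illustration of that kind of comparison, here is a hypothetical sketch against the crash-stats SuperSearch API (the endpoint is real, but the exact parameter names and response fields used here are assumptions to verify against the API documentation, and raw counts would still need normalizing by usage to give a rate):

// Compare overall OOM crash volume across two versions (Node 18+ or any
// environment with fetch). Counts only; not normalized by usage hours.
async function oomCrashCount(version) {
  const params = new URLSearchParams({
    product: "Firefox",
    version,
    signature: "~OOM | ",   // assumed "contains" operator on the signature field
    _results_number: "0",   // we only want the total, not individual crash hits
  });
  const response = await fetch(
    "https://crash-stats.mozilla.org/api/SuperSearch/?" + params
  );
  return (await response.json()).total;
}

Promise.all([oomCrashCount("63.0b"), oomCrashCount("62.0")]).then(
  ([beta, release]) => console.log(`OOM crashes: 63.0b=${beta}, 62.0=${release}`)
);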
Flags: needinfo?(mreavy) → needinfo?(hsivonen)
laszlo, could you or someone else from QA try to find the actual regression?
Assuming we have a regression in Gecko code, finding the regressing bug is the first thing to do.
Flags: needinfo?(laszlo.bialis)
An initial look at attachment 9007441 [details] suggests that the memory usage in the HTML parser is primarily attributable to text node content parsed from a network stream. It's been almost two years since the last memory-management changes to this area of the parser, so if the regression is more recent, chances are we aren't seeing a new parser bug.
In particular, attachment 9007441 [details] suggests that there is a single huge text node that arrives from the network quickly relative to the timers that handle incremental display of slowly received huge text nodes.
If Tracking Protection is disabled, this page loads a 1.1 MB HTML document from Facebook unreasonably often (many times per second?). The document contains a bit of boilerplate followed by an inline script that contains a huge text node as payload. So decoded to UTF-16, the text node is about 2 MB in size.

So the page basically asks the HTML parser to internally collect and then copy to the main thread a 2 MB text node unreasonably frequently.

So not a parser bug but a badly behaved Facebook iframe.
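To put a number on "unreasonably often", here is a minimal console sketch that counts Facebook resource loads over time (an observation aid only; requests made internally by the cross-origin iframe are not all visible from the top document's timeline, so this gives a lower bound):

let fbLoads = 0;
const startTime = performance.now();
new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Subframe document loads show up in the parent's resource timeline
    // (initiatorType "iframe"); count anything fetched from facebook.net/.com.
    if (/facebook\.(net|com)/.test(entry.name)) {
      fbLoads++;
    }
  }
}).observe({ entryTypes: ["resource"] });

setInterval(() => {
  const seconds = (performance.now() - startTime) / 1000;
  console.log(`${fbLoads} Facebook loads in ${seconds.toFixed(0)} s`);
}, 5000);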
Flags: needinfo?(hsivonen)
> IIUC this has gotten hot because we're concerned that this is a recent 
> regression that will blow up on us once Fx 63 goes to the field, but 
> (after doing a quick test) Fx 56 appears to use around the same amount 
> of memory on this particular page. If someone wants to sanity-check my 
> quick test, that'd be great.

For what it's worth, I pulled up the page in the latest Fx 52 ESR and I'm seeing very much the same behavior and issue there, so it's most definitely not a recent regression.
FWIW, https://www.friskyradio.com/show/artist_of_the_week/01-01-2013 also crashed in Chrome (and actually led to a tab which couldn't be closed; all the Chrome processes had to be killed with -9).

The page seems to be doing almost endless loading, fetching the same stuff from Facebook over and over. After quite a long time the loading ends and memory usage stabilizes, at several gigabytes. Chrome's child process then seems to use a bit less memory than Firefox's child process.
(In reply to Olli Pettay [:smaug] (r- if the bug doesn't explain what the change(s) are about.) from comment #8)
> Looks like we need some memory reporters for the HTML parser at least.

Filed bug 1494940.
Hi Olli,

This is the result of mozregression:

Got as far as we can go bisecting nightlies...
Last good revision: d1a54ae63da7ebc4bc1eeb5b613e8ec29bfcb80a (2016-02-09)
First bad revision: ac39fba33c6daf95b2cda71e588ca18e2eb752ab (2016-02-10)
Pushlog:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=d1a54ae63da7ebc4bc1eeb5b613e8ec29bfcb80a&tochange=ac39fba33c6daf95b2cda71e588ca18e2eb752ab

Please note that we couldn't go any further with mozregression because the builds were missing: "CRITICAL: First build d1a54ae63da7 is missing, but mozregression can't find a build before - so it is excluded, but it could contain the regression!"

I considered a revision good when memory usage wouldn't jump above 800 MB, and bad when memory usage went above 2.5 GB.
Flags: needinfo?(laszlo.bialis) → needinfo?(bugs)
(In reply to Marcia Knous [:marcia - needinfo? me] from comment #23)
> Hello Paul - Bug 1472062 seems quite similar to this one. If it is the same
> underlying issue, is there anything we can do in the 63 cycle to make it
> better?

Good call, these are definitely related. They crash in the same place but may have different causes. I want to consider:

 1) As in Bug 1472062, this crash is happening during nursery collection.
 2) The 1,729.75 MB (51.70%) of heap-unclassified memory usage needs looking at.

If we were to fix #1 (I'll reply on that bug) then we'd still have this bug using a lot of memory that we're not accounting for.
Flags: needinfo?(pbone)
(In reply to Maire Reavy [:mreavy] Plz needinfo from comment #28)
> 
> ------
> The rest of my comments below are my guess at what's likely going on and how
> severe a problem this likely is when 63 is released (FWIW):

It could be that Bug 1472062 is the bigger problem for 63. Most of the crashes found by these signatures might not be related to the almost 2 GB of heap-unclassified memory.

> IIUC this has gotten hot because we're concerned that this is a recent
> regression that will blow up on us once Fx 63 goes to the field, but (after
> doing a quick test) Fx 56 appears to use around the same amount of memory on
> this particular page. If someone wants to sanity-check my quick test, that'd
> be great.
> 
> I think it's likely that the signature may have changed from earlier
> versions -- possibly due to changes to the nursery/gc code?    This
> particular site is evil and could cause OOM, but the signatures tracked here
> are not specific to this site.

I think so too.
:pbone do you think this could be a meta bug at this point?
Flags: needinfo?(pbone)
(In reply to Selena Deckelmann :selenamarie :selena use ni? pronoun: she from comment #41)
> :pbone do you think this could be a meta bug at this point?

Hi Selena,

I'm going to make 1472062 the meta-bug for the "OOM while evicting the nursery" part.  And let this bug track the big memory use for HTML parsing:

(In reply to Henri Sivonen (:hsivonen) from comment #34)
> If Tracking Protection is disabled, this page loads a 1.1 MB HTML document
> from Facebook unreasonably often (many times per second?). The document
> contains a bit of boilerplate followed by an inline script that contains a
> huge text node as payload. So decoded to UTF-16, the text node is about 2 MB
> in size.
> 
> So the page basically asks the HTML parser to internally collect and then
> copy to the main thread a 2 MB text node unreasonably frequently.
> 
> So not a parser bug but a badly behaved Facebook iframe.

This is the problem this bug should track.  Can we do anything about it at all?  Maybe add resource limits controls to give users more control?
Flags: needinfo?(pbone)
QA Contact: sdeckelmann
(In reply to Paul Bone [:pbone] from comment #42)
> And let this bug track the big memory use for HTML parsing:

There's no mystery about why the HTML parser makes large allocations: The Facebook HTML contains a large text node and that kind of HTML is loaded again and again from Facebook unreasonably often.

> (In reply to Henri Sivonen (:hsivonen) from comment #34)
> > If Tracking Protection is disabled, this page loads a 1.1 MB HTML document
> > from Facebook unreasonably often (many times per second?). The document
> contains a bit of boilerplate followed by an inline script that contains a
> > huge text node as payload. So decoded to UTF-16, the text node is about 2 MB
> > in size.
> > 
> > So the page basically asks the HTML parser to internally collect and then
> > copy to the main thread a 2 MB text node unreasonably frequently.
> > 
> > So not a parser bug but a badly behaved Facebook iframe.
> 
> This is the problem this bug should track.  Can we do anything about it at
> all?

We could try to figure out if the bad script logic on the page that causes the repeated loading of content from Facebook was triggered by some browser change in the range indicated by comment 38. However, since the page also misbehaves in Chrome (per comment 36), I think we shouldn't put too much effort into it.

> Maybe add resource limits controls to give users more control?

We already provide users control to avoid this problem: The former Tracking Protection part of the current Content Blocking feature set.

I don't think it makes sense to provide controls based on quantities rather than on the list-based purpose (tracking), which is what's actually meaningful to users: it's probably too hard to characterize the combination of "too large a text node, too often", and blocking on either the text node size alone or the network activity frequency alone would probably break too much.
(In reply to Henri Sivonen (:hsivonen) from comment #43)
> > Maybe add resource limits controls to give users more control?
> 
> We already provide users control to avoid this problem: The former Tracking
> Protection part of the current Content Blocking feature set.

Good point.

> I don't think it makes sense to provide controls based on quantities rather
> than on the list-based purpose (tracking), which is what's actually
> meaningful to users: it's probably too hard to characterize the combination
> of "too large a text node, too often", and blocking on either the text node
> size alone or the network activity frequency alone would probably break too
> much.

I think the only thing we could possibly do is say "no tab/page may use more than X memory/CPU/whatever". This feature has been requested already (in some other bug), but it's hard because of DOM/JS/other interactions.

Selena, can we close this?  I would but I'm not sure if it's "cool" to close a bug in someone else's component.
Flags: needinfo?(sdeckelmann)
QA Contact: sdeckelmann
Closing per :pbone. If someone knows what that other bug is, feel free to resolve this as a duplicate; I couldn't find it after a few minutes of searching.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(sdeckelmann)
Resolution: --- → INACTIVE
It's not really a duplicate, though. The other bug is Bug 1472062, but I'm not going to mark this as a duplicate because it's not.
Removing tracking status for 63 as there is nothing actionable here that we could ship in 63.
Flags: needinfo?(sphink)
Flags: needinfo?(bugs)