Closed Bug 1273024 Opened 4 years ago Closed 3 years ago

[e10s] loading https://500px.com/popular and clicking a photo causes Nightly to hang/use large amount of memory, round 2

Categories

(Core :: Layout, defect)

Unspecified
macOS
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
e10s m9+ ---
firefox46 --- unaffected
firefox47 - wontfix
firefox48 + fixed
firefox49 + fixed

People

(Reporter: Gavin, Assigned: kats)

References

Details

(Keywords: crash, Whiteboard: [MemShrink])

Attachments

(3 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #1251527 +++

This bug has reared it's ugly head again.

STR:
1) Start Nightly (I'm currently testing the May 15 build, built off mozilla-central 4a8ed77f6bb573f20980056bf8c1dadd125c1a85) with a new profile
2) load https://500px.com/popular
3) click one of the photos to "zoom in"

Nightly hangs. At first just the content process is non-responsive (switching tabs results in content spinner), but eventually the whole app gets unresponsive too. This has also caused OS X to freak out and pop up the "System out of memory" dialog, which in one case caused all my other open apps to be paused and hang and required a reboot.
I'm not sure whether this is a regression or if the previous bug was just never fixed completely. I tried looking for a regression range but ran into issues with this only be intermittently reproducible :(
No longer depends on: 1251527
I can reproduce this on the latest beta (FIREFOX_47_0b5_BUILD1) and developer edition (f00f9e754ce71fc5efa3e4334afea5cf6ad6c215), but not on release.
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:49.0) Gecko/20100101 Firefox/49.0
Build ID: 20160516030211

I have tested your issue on latest Nightly build (20160516030211) and managed to reproduce it, but only if e10s is enabled. I have opened the link you provided and when clicking on a image to enlarge it, Nightly hangs. I've also performed a regression and below are my results:

Last good revision: 12b082de19be6a96d0f834771fb81dbdd4840721
First bad revision: 664f46e3d8845b5e9e5b78608ddd867ba072237c
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=12b082de19be6a96d0f834771fb81dbdd4840721&tochange=664f46e3d8845b5e9e5b78608ddd867ba072237c

Looks like the following bug has the changes which introduced the regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=1111386
Tracking for 48/49 based on the bad user experience and the fact that it can make the entire system unresponsive.
Paul, can you re-run your regression finding? Bug 1111386 sounds really unlikely.
Flags: needinfo?(paul.pasca)
I did perform another regression on this issue and I've obtained different results. I don't know if this is relevant, but this time I've used OS X 10.11 and last time it was OS X 10.10.

Last good revision: c525c5164cdde373a636d525e53c77f52d8fd693
First bad revision: c99e8abb5e00eaa439a3be7864df0e4135326349
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c525c5164cdde373a636d525e53c77f52d8fd693&tochange=c99e8abb5e00eaa439a3be7864df0e4135326349
Flags: needinfo?(paul.pasca)
I can reproduce:

1) open most popular photo page
2) click on any photo which should load a photo specific web page

result: beach balled nightly for about 15 seconds (old mac air). Once the photo page loads it responds normally.

3) click back button to the most popular page
4) click on the photo again

result: another long hang, then the page loads
Assignee: nobody → gkrizsanits
Sorry Paul but I have to ask you to do this a third time. The regression range you found now is the one during which the original bug (bug 1251527, referenced in comment 0) was caused. But that regression was fixed on March 9th, 2016. Can you re-run your regression check with a starting window that starts after March 9th?
Flags: needinfo?(paul.pasca)
Not sure if this profiler output is extremely helpful but I link it anyway, it reflects the steps described in Comment 7. https://cleopatra.io/#report=dd18c637dee1c72d04b3ddfb60a0889458d90cde

In short if I turn APZ off this problem is gone.

This looks really bad on the latest MBP retina as well. The whole browser becomes unresponsive, so I wonder why the profiler does not show huge chrome process janks.
Assignee: gkrizsanits → bugmail.mozilla
Flags: needinfo?(bugmail.mozilla)
(In reply to Markus Stange [:mstange] from comment #8)
> Sorry Paul but I have to ask you to do this a third time. The regression
> range you found now is the one during which the original bug (bug 1251527,
> referenced in comment 0) was caused. But that regression was fixed on March
> 9th, 2016. Can you re-run your regression check with a starting window that
> starts after March 9th?

Sorry, but the build from 10 March 2016 is bad. If I perform a regression starting from 10 March 2016, I get an error that says that the build was expected to be good.
Flags: needinfo?(paul.pasca)
Blocks: apz-desktop
Attached file Layers dump (obsolete) —
So when I follow the STR (local m-c build) I definitely see slowness and it takes a while for the photo to show up, but it does work and doesn't crash. I see some interesting things in the console output such as:

[Child 23831] WARNING: Failed to create DrawTarget of sufficient size.: file /Users/kats/zspace/mozilla-git/gfx/thebes/gfxContext.cpp, line 1259

I added more logging to this, and the size being requested is 1436 x 999999 which seems rather large indeed. It might be that if you have a system with enough RAM (>5G ?) it actually tries to allocate that buffer, eating up all the RAM and then some. On my machine I guess it fails and goes into the fallback path at [1].

The 999999 is coming from a fixed-position div in the lightbox that pops up when you click on a photo. I'm also attaching a layers dump, which shows a ContainerLayer and a ColorLayer with that height.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxContext.cpp?rev=c992422247b7#1258
Flags: needinfo?(bugmail.mozilla)
If I use the devtools to shrink that fixed-pos element to 1000px instead of 999999px tall the page responds much better, so I'm fairly confident that's the cause of the slowness. We are probably not clamping that height somewhere that we should be. It has an opacity of 0.97 so it's going through some group opacity code - stacktrace attached to the point of the allocation failure, if that's of any use.
This is more useful. The app unit height is 59999940, grepping for that in the file gives you the relevant display items.
Attachment #8754970 - Attachment is obsolete: true
Based on the patch in bug 1251527, I tried commenting out the two lines at [1] and that fixed the problem. But I don't know if that's the right fix - Matt, do you want to take this?

[1] http://mxr.mozilla.org/mozilla-central/source/layout/base/nsDisplayList.cpp?rev=34eda1b91ca3&mark=4916-4917#4916
Assignee: bugmail.mozilla → nobody
Flags: needinfo?(matt.woodrow)
No longer blocks: apz-desktop
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
> Based on the patch in bug 1251527, I tried commenting out the two lines at
> [1] and that fixed the problem. But I don't know if that's the right fix -
> Matt, do you want to take this?
> 
> [1]
> http://mxr.mozilla.org/mozilla-central/source/layout/base/nsDisplayList.
> cpp?rev=34eda1b91ca3&mark=4916-4917#4916

That seems right, setting the visible region to the entirety of the bounds is unlikely to ever be what we want.

We should see if this passes try.
Flags: needinfo?(matt.woodrow)
(In reply to Matt Woodrow (:mattwoodrow) from comment #15)
> We should see if this passes try.

let's see:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec1584aff2da
This seems to be an e10s only issue. As Fx47 will ship with e10s off, this is now a wontfix. Please let me know if this also affects non-e10s users.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #16)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec1584aff2da

Very noisy run on windows, but all the oranges seem to be unrelated intermittents, but it's hard to tell for sure... locally I could not reproduce any of the failures but I'm on OSX. I would risk a landing attempt...

Kats, do you want to file the patch for review and land it? I don't want to steal the bug from you, just had a few extra cycles to test it for you...
Flags: needinfo?(bugmail.mozilla)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #18)
> (In reply to Gabor Krizsanits [:krizsa :gabor] from comment #16)
> > https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec1584aff2da
> 
> Very noisy run on windows, but all the oranges seem to be unrelated
> intermittents, but it's hard to tell for sure... locally I could not
> reproduce any of the failures but I'm on OSX. I would risk a landing
> attempt...

Yeah, I was seeing those same windows oranges on other pushes, I think something to do with the clipboard is busted on windows try. Pretty sure it's unrelated to the patch.

> Kats, do you want to file the patch for review and land it? I don't want to
> steal the bug from you, just had a few extra cycles to test it for you...

Yup, I'll put the patch up for review. Thanks for doing the try push!
Flags: needinfo?(bugmail.mozilla)
Attached patch PatchSplinter Review
Assignee: nobody → bugmail.mozilla
Attachment #8755873 - Flags: review?(matt.woodrow)
Attachment #8755873 - Flags: review?(matt.woodrow) → review+
https://hg.mozilla.org/mozilla-central/rev/8c64c8c5474b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Comment on attachment 8755873 [details] [diff] [review]
Patch

Approval Request Comment
[Feature/regressing bug #]: bug 1231538
[User impact if declined]: some pages use up a ton of memory or are slow to respond because they are eating too much memory
[Describe test coverage new/current, TreeHerder]: not really :/
[Risks and why]: low risk, the code is well understood and the bug appears to have been mostly an oversight.
[String/UUID change made/needed]: none
Attachment #8755873 - Flags: approval-mozilla-aurora?
Comment on attachment 8755873 [details] [diff] [review]
Patch

Reduces e10s memory usage, Aurora48+, has been in Nightly for a few days.
Attachment #8755873 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.