Closed Bug 947523 Opened 11 years ago Closed 10 years ago

[B2G][Browser] The phone will crash if the user keeps zooming in on MoPad webpage

Categories

(Core :: Graphics, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
blocking-b2g koi+
Tracking Status
firefox27 --- wontfix
firefox28 --- wontfix
firefox29 --- wontfix
b2g18 --- unaffected
b2g-v1.1hd --- unaffected
b2g-v1.2 --- fixed
b2g-v1.3 --- wontfix

People

(Reporter: KTucker, Assigned: kats)

References

Details

(5 keywords, Whiteboard: dogfood1.2 [osrestartoom] [MemShrink:P2][u=memory p=5 s= u=1.2])

Attachments

(7 files, 8 obsolete files)

469.84 KB, text/plain
Details
660.86 KB, video/mp4
Details
26.35 KB, text/plain
Details
3.64 KB, patch
kats
: review+
Details | Diff | Splinter Review
4.04 KB, patch
kats
: review+
Details | Diff | Splinter Review
3.38 KB, patch
kats
: review+
Details | Diff | Splinter Review
1.71 KB, patch
kats
: review+
Details | Diff | Splinter Review
Attached file Zoomingincrashlog.txt
Description:
If the user keeps zooming in on an etherpad webpage, the phone will crash.

Repro Steps:
1)  Updated Buri to Build ID: 20131206040203
2)  Tap on the "Browser" icon.
3)  Navigate to https://etherpad.mozilla.org
4)  Tap on "Create new public pad".
5)  Keep zooming in on the webpage as far as the user can go. If the crash doesn't occur keeping zooming in on the webpage while changing the phone's orientation.

Actual:
The device will either crash and reset or the browser will crash.

Expected:
The user can zoom in without a crash occurring.

Environmental Variables
Device: Buri v 1.3.0 Mozilla RIL
Build ID: 20131206040203
Gecko: http://hg.mozilla.org/mozilla-central/rev/1401e4b394ad
Gaia: 8fca2ca67e8a6022fe6ed8cb576e5d59dfb5237f
Platform Version: 28.0a1

Notes:
Repro frequency: 100%
See attached: video clip, logcat
Whiteboard: dogfood1.2
This issue does not occur on Leo v 1.1.0 COM RIL

Environmental Variables
Device: Leo v 1.1.0 COM RIL
Build ID: 20131024041202
Gecko: http://hg.mozilla.org/releases/mozilla-b2g18/rev/e4cb5a852e3d
Gaia: 39b0203fa9809052c8c4d4332fef03bbaf0426fc
Platform Version: 18.1
RIL Version: 01.01.00.019.264 

The browser crash does not occur when zooming in on the page.
Attached video BrowserCrash.mp4
This issue occurs on the Buri v 1.2.0 COM RIL 

Environmental Variables
Device: Buri v 1.2.0 COM RIL
Build ID: 20131204004003
Gecko: http://hg.mozilla.org/releases/mozilla-b2g26_v1_2/rev/758f3fb32dda
Gaia: 8d762f3376318fd6be390432db750ae4904c9ab6
Platform Version: 26.0
RIL Version: 01.02.00.019.102 

The browser or device will crash when zooming in on the MoPad webpage.
Uh oh. That's an OS restart crash, which is really bad.

Can we get a crash report ID for this bug?
blocking-b2g: --- → koi?
Whiteboard: dogfood1.2 → dogfood1.2 [b2g-crash] [osrestartcrash]
Severity: normal → critical
Component: Gaia::Browser → General
QA Contact: gbennett
Attached file 947523_dmesg
During the few hours of reproing this issue I was only able to get the OS crash three times and there was no dump in the Crash Reports folder on the phone. I talked to Kevin and he said that it could be an OOM issue so I snagged a dmesg. Hopefully that should help you guys out. I was able to get the OS crash on today's 1.2 and 1.3, but the attached dmesg is from the build below.

Environmental Variables:
Device: Buri 1.3 mozRIL
BuildID: 20131209053402
Gaia: 1d45d1dc3201059d5c8f2efdeb92c04576d8e161
Gecko: 9f12a9fab080
Version: 28.0a1
Firmware Version: 20131115
Keywords: qawanted
Keywords: crash
Whiteboard: dogfood1.2 [b2g-crash] [osrestartcrash] → dogfood1.2 [osrestartoom]
Does this reproduce on a 1.1 Buri device?
Keywords: qawanted
This does not repro on a 1.1 Buri. The same functionality is seen in Leo 1.1 as it simply doesn't let you zoom in far enough and keeps rubber-banding back to to a set zoom if you try to keep pinching to zoom in.

Environmental Variables:
Device: Buri 1.1 mozRIL
BuildID: 20131209041202
Gaia: 6ff3a607f873320d00cb036fa76117f6fadd010f
Gecko: 05117f42088f
Version: 18.0
Firmware Version: 20131115
Keywords: qawanted
Just wondering, should we also add "crash" in the keyword field?
(In reply to Kevin Hu [:khu] from comment #8)
> Just wondering, should we also add "crash" in the keyword field?

Nope - this is an OOM of the parent process, not a crash.
Whiteboard: dogfood1.2 [osrestartoom] → dogfood1.2 [osrestartoom] [MemShrink]
<6>[  434.353606] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
<6>[  434.353648] [   98]     0    98       78       41   0     -16          -941 ueventd
<6>[  434.353668] [  128]  1000   128      213       15   0     -16          -941 servicemanager
<6>[  434.353686] [  129]     0   129     1009       45   0     -16          -941 vold
<6>[  434.353703] [  132]     0   132      834       29   0     -16          -941 fakeperm
<6>[  434.353719] [  136]     0   136    76681    39295   0       0             0 b2g
<6>[  434.353738] [  137]     0   137     1958       67   0     -16          -941 netd
<6>[  434.353754] [  138]     0   138      180       14   0     -16          -941 debuggerd
<6>[  434.353773] [  139]  1001   139     5093      370   0     -16          -941 rild
<6>[  434.353789] [  140]  1013   140     4665      340   0     -16          -941 mediaserver
<6>[  434.353808] [  141]  1002   141      338       27   0     -16          -941 dbus-daemon
<6>[  434.353826] [  142]  1017   142      437       35   0     -16          -941 keystore
<6>[  434.353843] [  152]  1000   152      481       23   0     -16          -941 mm-pp-daemon
<6>[  434.353861] [  155]  1000   155      763       76   0     -16          -941 mm-qcamera-daem
<6>[  434.353879] [  160]     0   160      297       27   0     -16          -941 gpu_dcvsd
<6>[  434.353896] [  161]     0   161      863        2   0     -16          -941 adbd
<6>[  434.353914] [  187]     0   187      218       40   0     -16          -941 location-mq
<6>[  434.353933] [  188]  1001   188     1508       58   0     -16          -941 qmuxd
<6>[  434.353949] [  196]  1001   196     1549       72   0     -16          -941 netmgrd
<6>[  434.353968] [  197]     0   197     3651       73   0     -16          -941 xtwifi-inet-age
<6>[  434.353984] [  205]  2000   205      198       16   0     -16          -941 sh
<6>[  434.354003] [  208]     0   208     3046       83   0     -16          -941 xtwifi-client
<6>[  434.354021] [  408]  1007   408      176       14   0     -16          -941 logwrapper
<6>[  434.354039] [  409]  1010   409      662       78   0     -16          -941 wpa_supplicant
<6>[  434.354058] [  421]  1000   421      337       27   0     -16          -941 qosmgr
<3>[  434.354074] Out of memory: Kill process 136 (b2g) score 827 or sacrifice child
<3>[  434.354268] Killed process 136 (b2g) total-vm:306724kB, anon-rss:52324kB, file-rss:104856kB

Near the top it says total_vm is 76681 KB, but the "Killed process" line says 306724 KB...
Whiteboard: dogfood1.2 [osrestartoom] [MemShrink] → dogfood1.2 [osrestartoom] [MemShrink:P2]
This issue looks to have started reproducing on the 11/08 1.2 Buri build. On builds before 11/08, the tab would need to be reloaded every so often, but B2G would remain stable.

- Works -
Environmental Variables:
Device: Buri v1.2 MOZ RIL
BuildID: 20131107004003
Gaia: 590eb598aacf1e2136b2b6aca5c3124557a365ca
Gecko: 26f1e160e696
Version: 26.0
RIL Version: 01.02.00.019.102
Firmware Version: 20131115

- Broken -
Environmental Variables:
Device: Buri v1.2 MOZ RIL
BuildID: 20131108004004
Gaia: 4cf40fb30c7b8380ea83ed0d4efe043b8b81737f
Gecko: a886c641a306
Version: 26.0
Firmware Version: 20131115
The regression here is likely on the gecko side, as this involves killing the parent process.

A suspected patch for causing this regression is bug 927633 because is involving IPC.

Can we get a regression range on 1.3 as well? That will help isolate with bug caused this.

Mason - Any ideas if bug 927633 caused this?
Flags: needinfo?(mchang)
blocking+ - we should never cause the full phone to restart.

Also marking security sensitive because a web page should be able to cause the full phone to restart, which could be exploited by a web developer to force phone restarts.
Group: core-security
blocking-b2g: koi? → koi+
Something similar to this issue looks to have started on the first working 1.3 build from 9/19. The OS will crash and display the crash report page on each build up to 9/23. 

From 9/24 and on, there is no crash report screen displayed and appears to instead be an OOM issue... exactly as specified in comment 5 and comment 9.

There are builds between 9/24 and 12/11 where it is really difficult to get a repro and instead the Browser tab will need to be reloaded since the webpage becomes unresponsive many times.

I am making the window be between 9/23 and 9/24 since that is when it looks to have become and OOM issue which is what this bug is about. If the crash is considered relevant to this bug, then it occurred on the first 1.3 build from 9/19. 

- "Works" (OS crashes) -
Environmental Variables:
Device: Buri v1.3 MOZ RIL
BuildID: 20130923040214
Gaia: 3408cc3f6b190c8cd31832fbb8cd2ae571041f29
Gecko: f97307cb4c95
Version: 27.0a1
Firmware Version: 20131115

- Broken (OS restarts with an OOM) -
Environmental Variables:
Device: Buri v1.3 MOZ RIL
BuildID: 20130924040201
Gaia: a22ba4a3a9efd99f94adf9ece8a2b391d4df295b
Gecko: 1fda74e33e06
Version: 27.0a1
Firmware Version: 20131115
I just tried the broken gaia and gecko revisions from comment 15. I was able to reproduce the crash with and without bug 927633. The patch also reduced memory usage by 1.5 mb per process, so if we're getting an OOM, it should just happen sooner without bug 927633.
Flags: needinfo?(mchang)
Andrew - Can you find someone to look into this full phone OOM?
Flags: needinfo?(overholt)
Denial of service is not a security issue.  Especially when user interaction is required to trigger the bug.
Group: core-security
Kyle suggests testing this with skiagl off to narrow out that as a cause => qawanted.

Dale, with Ben out can you tell us how the gaia browser zooming functionality is implemented?  Do you think it could be triggering a gecko crash with graphics changes around the beginning of September?
Flags: needinfo?(overholt) → needinfo?(dale)
Keywords: qawanted
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #18)
> Denial of service is not a security issue.  Especially when user interaction
> is required to trigger the bug.

I disagree with this. DOSs are definitely security issues if it's outside of the user's control. I'm not sure if this particular bug is outside of the user's control however, as it's dependent whether the root cause can happen in other user scenarios that are triggered by visiting a malicious website/app aiming to reboot the phone.

(In reply to Andrew Overholt [:overholt] from comment #19)
> Kyle suggests testing this with skiagl off to narrow out that as a cause =>
> qawanted.

I'm doubtful this is related to Skia GL. This is OOMing the parent process, which shouldn't technically ever be possible on the phone with anything driven by untrusted web content. This is likely a process management regression, which is why this why filed in IPC.
(In reply to Jason Smith [:jsmith] from comment #20)
> (In reply to Andrew Overholt [:overholt] from comment #19)
> > Kyle suggests testing this with skiagl off to narrow out that as a cause =>
> > qawanted.
> 
> I'm doubtful this is related to Skia GL. This is OOMing the parent process,
> which shouldn't technically ever be possible on the phone with anything
> driven by untrusted web content. This is likely a process management
> regression, which is why this why filed in IPC.

If you do want to rule out the extra memory usage due to GL acceleration, just set:
pref("gfx.canvas.azure.accelerated", false);
The browser doesnt do any zooming, its purely implemented inside Gecko/APZC, the graphics changes are very likely to effect it
Flags: needinfo?(dale)
Milan

Please have this assigned. We would like to have a patch for the same by 12/20.
Flags: needinfo?(milan)
Why does this block?  Do we expect that users will commonly try to zoom in like this?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #24)
> Why does this block?  Do we expect that users will commonly try to zoom in
> like this?

I think it's not the likelihood that's caused to be marked as a blocker - instead, it was marked as a blocker because it should never be possible to cause a full reboot of the phone. During 1.01, a lot our partners were pretty adamant on having a no known full phone crashes on release policy, as that ensures a user never sees an unexpected reboot of the phone to think that something is fundamentally broken with the phone.
(In reply to Jason Smith [:jsmith] from comment #25)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #24)
> > Why does this block?  Do we expect that users will commonly try to zoom in
> > like this?
> 
> I think it's not the likelihood that's caused to be marked as a blocker -
> instead, it was marked as a blocker because it should never be possible to
> cause a full reboot of the phone. During 1.01, a lot our partners were
> pretty adamant on having a no known full phone crashes on release policy, as
> that ensures a user never sees an unexpected reboot of the phone to think
> that something is fundamentally broken with the phone.

Another way to explain this - the blocker here is the full phone reboot, not the fact that we're OOMing in the STR in general. If we fixed this bug by causing the OOM to happen in the child process, rather than the parent process, then that would fix the blocking factor on this bug.
I tried to reproduce this issue after setting pref("gfx.canvas.azure.accelerated", false); and I could not reproduce this issue on the latest Buri 1.3 Build ID: 20131217004001

I also checked without setting the preference on the latest Buri 1.3 and could easily reproduce the bug. 

Gaia   dca0a3dcf062ce3e422a9c56d141c14543c816fb
SourceStamp 1f7db4cc788e
BuildID 20131217004001
Version 28.0a2
Keywords: qawanted
(In reply to Sarah Parsons from comment #27)
> I tried to reproduce this issue after setting
> pref("gfx.canvas.azure.accelerated", false); and I could not reproduce this
> issue on the latest Buri 1.3 Build ID: 20131217004001
> 
> I also checked without setting the preference on the latest Buri 1.3 and
> could easily reproduce the bug. 
> 
> Gaia   dca0a3dcf062ce3e422a9c56d141c14543c816fb
> SourceStamp 1f7db4cc788e
> BuildID 20131217004001
> Version 28.0a2

Guess I wasn't right - the OOM is being caused by Skia GL. Still puzzled why the OOM is hitting the parent process rather than the child process.
Component: General → Graphics
Product: Firefox OS → Core
There's only two gfx bugs in the regression range here:

* bug 844819
* bug 934886

Which one caused this regression? Maybe bug 934886?

Peter - What do you think?
Flags: needinfo?(pchang)
(In reply to Jason Smith [:jsmith] from comment #28)
> Guess I wasn't right - the OOM is being caused by Skia GL. Still puzzled why
> the OOM is hitting the parent process rather than the child process.

Perhaps because the browser app runs in the parent process?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #30)
> (In reply to Jason Smith [:jsmith] from comment #28)
> > Guess I wasn't right - the OOM is being caused by Skia GL. Still puzzled why
> > the OOM is hitting the parent process rather than the child process.
> 
> Perhaps because the browser app runs in the parent process?

Well right, but I thought each browser tab is supposed to run in a separate child process. So that means if the child process runs out of memory, then the tab should get killed with the sad face error page.
(In reply to Jason Smith [:jsmith] from comment #29)
> There's only two gfx bugs in the regression range here:
> 
> * bug 844819
> * bug 934886
> 
> Which one caused this regression? Maybe bug 934886?
> 
> Peter - What do you think?
bug 934886 is trying to fix graphic artifact when taking screenshot. I didn't see the related function called (GetThebesSurfaceForDrawTarget) when I zoom in/out the browser pages.
Flags: needinfo?(pchang)
If this is "just" OOM, then anything that causes us to use more memory would cause the regression.  The memory used is no different than that used by non-graphics code.
Flags: needinfo?(milan)
Whiteboard: dogfood1.2 [osrestartoom] [MemShrink:P2] → dogfood1.2 [osrestartoom] [MemShrink:P2][c=memory p= s= u=1.2]
I will investigate why the parent process is getting killed and if there is anything we can do to address it.
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: dogfood1.2 [osrestartoom] [MemShrink:P2][c=memory p= s= u=1.2] → dogfood1.2 [osrestartoom] [MemShrink:P2][u=memory p=5 s= u=1.2]
I've reproduced the browser crash.  I zoomed in a fair amount and ran b2g-info:

                  |     megabytes    |
   NAME  PID NICE  USS  PSS  RSS VSIZE OOM_ADJ USER
    b2g 1531    0 49.5 63.2 77.0 208.1       0 root
Browser 1975    1 23.2 36.9 50.6 104.1       2 app_1975

System memory info:

            Total 179.7 MB
     Used - cache  94.6 MB
  B2G procs (PSS) 100.1 MB
    Non-B2G procs  -5.4 MB
     Free + cache  85.1 MB
             Free  52.4 MB
            Cache  32.6 MB

I then zoomed just a little bit more and the browser tab was killed.  I got the "oops, this is embarrassing" dialog.  The logcat indicated that "memory pressure" was over, so that suggests it was an OOM.

I find it interesting that just a tiny bit of zooming could cause us to consume over 50MB of free memory, though.

I will continue investigating tomorrow.
Adding instrumentation I can see that zooming results in extremely large gralloc buffers getting created.  For example, with a zoom of 3.75 in APZC I see a gralloc buffer request for 2238x13249.  This asks the system for ~110MB of mlocked memory which ends up killing b2g altogether.  Sometimes we get moderately smaller requests for gralloc buffers around ~30MB which only end up killing the browser tab.

I'm trying to trace where this size is getting calculated.  Are we zooming the entire layer and trying to alloc a buffer for the entire thing?  Even if that is the case it seems the zoom is being over-accumulated over time or something.
(In reply to Ben Kelly [:bkelly] from comment #36)
> Adding instrumentation I can see that zooming results in extremely large
> gralloc buffers getting created.  For example, with a zoom of 3.75 in APZC I
> see a gralloc buffer request for 2238x13249.  This asks the system for
> ~110MB of mlocked memory...

Let's get the explanation - going a bit trigger happy on the ni...
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bugmail.mozilla)
Flags: needinfo?(bgirard)
The gralloc buffer request is probably coming from layout/gfx code as it tries to rasterize the requested area. The "requested area" is going to be a function of the resolution and the displayport. In 1.2 the code that sets these values are in TabChild.cpp, at [1], [2] and [3]. We should verify that as the resolution increases the displayport size (in CSS pixels) decreases.

[1] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/ipc/TabChild.cpp;h=8325a51b1091cc8c6a162f139e3d0d860b4ca969;hb=cd74e4ad7d9dd3d699b13f7262ee898e4cf3f628#l1639
[2] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/ipc/TabChild.cpp;h=8325a51b1091cc8c6a162f139e3d0d860b4ca969;hb=cd74e4ad7d9dd3d699b13f7262ee898e4cf3f628#l1682
[3] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=dom/ipc/TabChild.cpp;h=8325a51b1091cc8c6a162f139e3d0d860b4ca969;hb=cd74e4ad7d9dd3d699b13f7262ee898e4cf3f628#l1648
Flags: needinfo?(bugmail.mozilla)
It appears that the display port being set from ProcessUpdateFrame() shrinks with the scale, but not when its set from ProcessUpdateSubFrame().  Here is my instrumentation while zooming the etherpad:

I/Gecko   (12250): ### ### TabChild::ProcessUpdateSubFrame() calling SetDisplayPortForElement() width:598.000000 height:2525.000000
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetResolution() scale:0.339769
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetDisplayPortForElement() width:980.000000 height:1117.816650
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetResolution() scale:0.466176
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetDisplayPortForElement() width:980.000000 height:1117.816650
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetResolution() scale:0.466176
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetDisplayPortForElement() width:980.000000 height:1117.816650
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetResolution() scale:1.090141
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetDisplayPortForElement() width:439.500000 height:835.000000
I/Gecko   (12250): ### ### TabChild::ProcessUpdateSubFrame() calling SetDisplayPortForElement() width:598.000000 height:2525.000000
I/Gecko   (12250): ### ### TabChild::ProcessUpdateSubFrame() calling SetDisplayPortForElement() width:598.000000 height:2525.000000
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetResolution() scale:2.474219
I/Gecko   (12250): ### ### TabChild::ProcessUpdateFrame() calling SetDisplayPortForElement() width:193.500000 height:367.500000

I also started to instrument the ThebesLayerData::Accumulate method and got this in my last run:

I/Gecko   (12250): ### ### ThebesLayerData::Accumulate() OR mVisibleRegion:0/0 aVisibleRect:0/0
I/Gecko   (12250): ### ### ThebesLayerData::Accumulate() OR mVisibleRegion:0/0 aVisibleRect:0/0
I/Gecko   (12250): ### ### ThebesLayerData::Accumulate() OR mVisibleRegion:0/0 aVisibleRect:1408/612
8
I/Gecko   (12250): ### ### ThebesLayerData::Accumulate() OR mVisibleRegion:1408/6128 aVisibleRect:13
88/6126
I/Gecko   (12250): ### ### ThebesLayerData::Accumulate() OR mVisibleRegion:1408/6128 aVisibleRect:69
/6128
I/Gecko   (12250): ### ### ContainerState::PopThebesLayer() calling SetVisibleRegion() width:1477 he
ight:6128

I still can't map the height of 6128 in this case back to the display port values.
Oh... the display port of 598x2525 in ProcessUpdateSubFrame() does map directly to the visible region if you multiply it by the scale:

  598x2525 * 2.474219 = 1480x6247
Well that's not good. Seems like a sub-apzc bug.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(bgirard)
So part of the problem here is that the composition bounds on the child APZC is getting scaled up as we zoom. This doesn't happen for the parent because the parent's composition bounds is clamped to the screen bounds [1]. (We have plans to remove that behaviour but it doesn't affect this bug on 1.2).

The next contributing factor is that the displayport calculation at [2] doesn't account for this increase in composition bounds. For the parent APZC the composition bounds don't change, so the division by the zoom at line [2] has the desired effect of shrinking the displayport as the zoom increases. For the child APZC the division merely cancels out the increase in composition bounds, and so we end up with a constant-sized displayport (in CSS pixels) regardless of zoom.

So the straightforward fix here would be to just divide again by the zoom. However, we can't do that because in general the child APZC doesn't request content repaints (and hence doesn't recalculate its displayport) while the parent is getting zoomed. So if the child last set a displayport of 500x1000 CSS pixels, and then the parent zooms to 5x what it was before, we would want the child to shrink its displayport down to 100x200 CSS pixels but it never gets a chance. The parent sets the resolution without the child knowing about it, and at that point layout tries to render the child's 500x1000 displayport at a 5x resolution which causes the OOM.

For a fix to work we would either need the parent APZC to notify all other APZCs in its subtree that it's about to change the zoom (and then have the children recalculate their displayport) or, in TabChild::UpdateRootFrame, find all the displayports inside a presShell and shrink them by the right factor. I'll try to see if either of these approaches are viable.

[1] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=layout/base/nsDisplayList.cpp;h=7e11b99bedc4b3e80b216e4ba05a8c73f40d6a2b;hb=cd74e4ad7d9dd3d699b13f7262ee898e4cf3f628#l748
[2] http://git.mozilla.org/?p=releases/gecko.git;a=blob;f=gfx/layers/ipc/AsyncPanZoomController.cpp;h=a6878295d0c476f71855003f73b403930e339e7f;hb=cd74e4ad7d9dd3d699b13f7262ee898e4cf3f628#l911
Throwing this over to :kats as he is working on a patch to propagate the zoom info to child APZCs as described in comment 42.

Please let me know if you would like me to test anything.  Thanks!
Assignee: bkelly → bugmail.mozilla
So I tried propagating the zoom info to child APZCs as I described above but ran into a threading problem. Specifically, I need to propagate the new zoom to child APZCs in ScheduleContentRepaint *before* posting the message to the GeckoContentController. However propagating the zoom to child APZCs requires holding the tree lock, and that results in a deadlock scenario because we're already holding the APZC lock there.

In addition this solution felt pretty hacky and brittle so I'm working on a different solution now which modifies how the displayport is stored. In nsDOMWindowUtils::SetDisplayPortForElement I'm going to multiply in the cumulative resolution and in nsLayoutUtils::GetDisplayPort I'm going to divide out the cumulative resolution. This way when the resolution of intermediate presshells change, the retrieved displayport is automatically scaled accordingly. I *think* this should work because the displayport is stored relative to the scroll offset and so scaling it won't move it around improperly, but I'm building the patch now to test it and see how it fares.
Since I don't have a working 1.2 build could you guys apply this patch and see if it fixes the problem without introducing any new regressions? Possible regressions would include the painted content (the displayport) not being positioned correctly, specially when zoomed in or out.

I did some light testing of this patch on master and I couldn't repro the OOM although there does still appear to be a fair amount of memory pressure and I think the HWC is dropping layers or something.
Attachment #8355612 - Flags: feedback?(botond)
Attachment #8355612 - Flags: feedback?(bkelly)
While attempting to build the 1.2 branch with the attached patch applied, I encountered the following compiler errors:

/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp: In function 'void DestroyGfxRect(void*, nsIAtom*, void*, void*)':
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: reference to 'gfx' is ambiguous
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/ipc/chromium/src/chrome/common/transport_dib.h:20: error: candidates are: namespace gfx { }
../../dist/include/mozilla/gfx/Types.h:15: error:                 namespace mozilla::gfx { }
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: 'rect' was not declared in this scope
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: reference to 'gfx' is ambiguous
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/ipc/chromium/src/chrome/common/transport_dib.h:20: error: candidates are: namespace gfx { }
../../dist/include/mozilla/gfx/Types.h:15: error:                 namespace mozilla::gfx { }
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: expected type-specifier before 'gfx'
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: expected '>' before 'gfx'
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: expected '(' before 'gfx'
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: expected primary-expression before '>' token
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:308: error: expected ')' before ';' token
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:309: error: type '<type error>' argument given to 'delete', expected pointer
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp: In member function 'virtual nsresult nsDOMWindowUtils::SetDisplayPortForElement(float, float, float, float, nsIDOMElement*)':
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:391: error: reference to 'gfx' is ambiguous
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/ipc/chromium/src/chrome/common/transport_dib.h:20: error: candidates are: namespace gfx { }
../../dist/include/mozilla/gfx/Types.h:15: error:                 namespace mozilla::gfx { }
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:391: error: expected type-specifier before 'gfx'
/home/botond/dev/mozilla/b2g-hamachi-1.2/gecko/dom/base/nsDOMWindowUtils.cpp:391: error: expected ')' before 'gfx'

Attached is an updated patch that fixes these errors.
Attachment #8355612 - Attachment is obsolete: true
Attachment #8355612 - Flags: feedback?(botond)
Attachment #8355612 - Flags: feedback?(bkelly)
Comment on attachment 8356186 [details] [diff] [review]
Change how the displayport is stored

Review of attachment 8356186 [details] [diff] [review]:
-----------------------------------------------------------------

:tn, what do you think of this change? In a nutshell it changes the displayport stored in layout to be in CSS pixels multiplied by the resolution, rather than just CSS pixels. That way when we increase the resolution on an ancestor layer, the descendants texture sizes automatically shrink and we don't OOM.
Attachment #8356186 - Flags: feedback?(tnikkel)
Comment on attachment 8356186 [details] [diff] [review]
Change how the displayport is stored

My first thought is that we will now have to update every display port under a presshell (and it's child documents) anytime it's resolution changes. Is this feasible/desirable?
Comment on attachment 8356186 [details] [diff] [review]
Change how the displayport is stored

I tested this on my 1.2 hamachi device, and I cannot reproduce the crash with it. I also didn't notice any regressions.
Attachment #8356186 - Flags: feedback+
With this patch I still see Browser crash while zooming the gaia meeting etherpad:

  https://etherpad.mozilla.org/gaia-meeting-notes

I would like to test a bit more with instrumentation to see if its the same OOM issue, but probably won't be able to until tomorrow morning.
(In reply to Ben Kelly [:bkelly] from comment #50)
> With this patch I still see Browser crash while zooming the gaia meeting
> etherpad:
> 
>   https://etherpad.mozilla.org/gaia-meeting-notes

On this page, I see the crash as well with the patch (I saw a phone crash, not a browser crash), though it seemed it required zooming in further to get the crash with the patch than without.
Comment on attachment 8356186 [details] [diff] [review]
Change how the displayport is stored

Talked to kats on irc about this. If we don't change AZPC code too then it will just end up setting the same display port after zooming again, so it won't fix the main problem. We need to at least pair this with an AZPC change. So clearing feedback flag for now.
Attachment #8356186 - Flags: feedback?(tnikkel)
On the buri you can monitor the texture memory usage by looking at the Mlocked field in /proc/meminfo:

  while true; do sleep 1; adb shell cat /proc/meminfo | grep Mlocked; done

This shows how much memory the device has mlocked as ashmem.
Updated patch to shrink the displayport for subframes in APZC code.
Attachment #8356186 - Attachment is obsolete: true
I've been discussing this issue with :tn and :bkelly and thinking about it, and we might need to take a step back here with the approach taken.

The fundamental problem is that when async subframe scrolling is enabled, we need more memory to render content. This is expected behaviour, but it was always unclear just how much more memory we would need, how it would be bounded, and whether we would often run into OOM situations. For the most part we went with a "let's do it and see" approach.

Turns out that with our current implementation (in 1.2+) the memory usage is unbounded, and can easily run into OOM situations. A simple conceptual example is a page that looks something like this:

<body>
 <iframe style="height: 10000px" src="reallylongpage.html"></iframe>
</body>

The iframe is 10,000 pixels tall, and we end up requesting a giant displayport (some multiple of 10000px) for this, because we could scroll either the outer page or the iframe, and need to have enough pre-rendered content to paint all of that. This, of course, will trigger an OOM. Even if the patches I uploaded to this bug worked perfectly we would still probably OOM on this page.

Going forward I think we have to use the visible region of the iframe (i.e. clipped by parent frames and the screen) rather than the full composition bounds as the base for the displayport. However I think that doing this is too risky for 1.2.

For 1.2 I see four main options:
1) Disable subframe apzc. This eliminates the memory requirement and also reduces responsiveness. I suspect this is off the table because of commitments we made to partners.
2) Apply some tweaks to make the OOMs less frequent. This is what I've been trying to do with the above patches, but it's hard to reduce the OOM threshold enough to stop OOMing on this etherpad page because the iframe is huge. It also doesn't guarantee OOMs will not happen.
3) Try some other method to clamp the memory usage, such as putting a hard limit on texture sizes. (I don't know if this can be easily implemented, and it would almost certainly result in scenarios where parts of subframes remain blank.) Another option is to just kill the entire tab if the requested displayport is too big. That limits the crash to a browser tab rather than a phone reboot.
4) Do nothing for 1.2; leave this bug unfixed and ship subframe APZC as-is. Focus efforts on getting a proper fix in for 1.3+

Based on comment 26, I'm thinking of abandoning my current attempt at option #2 and going with option #3 instead. If anybody has other thoughts please let me know.
[3] sounds okay - the blocking factor is the fact that we're OOMing the phone, rather than the browser tab. If we OOM the browser tab here instead, then that's okay enough to remove the blocking factor on the bug.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #55)
> 3) Try some other method to clamp the memory usage, such as putting a hard
> limit on texture sizes. (I don't know if this can be easily implemented, and
> it would almost certainly result in scenarios where parts of subframes
> remain blank.) Another option is to just kill the entire tab if the
> requested displayport is too big. That limits the crash to a browser tab
> rather than a phone reboot.

I think clamping would be preferable to predictably killing the browser tab.  While parts of the screen may not be visible, at least they can zoom back out to a more reasonable level without losing state.  The user doesn't get any feedback or options to proceed if the tab dies.

Would it be possible to make the existing scale threshold dynamic based on the iframe size?

  http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/AsyncPanZoomController.cpp#189

So for very large iframes, perhaps zoom is capped at 2.0 or even 1.0?
(In reply to Ben Kelly [:bkelly] from comment #57)
> I think clamping would be preferable to predictably killing the browser tab.
> While parts of the screen may not be visible, at least they can zoom back
> out to a more reasonable level without losing state.  The user doesn't get
> any feedback or options to proceed if the tab dies.

I was discussing this a bit with bjacob and we came to the conclusion it would be better to kill the tab because it's a much simpler codepath and we lose a lot of information by the time we get to the code that allocates the texture.

> 
> Would it be possible to make the existing scale threshold dynamic based on
> the iframe size?
> 
>  
> http://mxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/
> AsyncPanZoomController.cpp#189
> 
> So for very large iframes, perhaps zoom is capped at 2.0 or even 1.0?

The zoom is always done on the root frame of the page, and it doesn't always know how large the iframes are. So we'd have to plumb all of that in to go with this approach, and even a 2.0 zoom might trigger an OOM if the iframe is large enough. So it still wouldn't completely guard against this case.

I have a set of patches now which crash the tab if the texture size would be too large. I will upload them shortly; there is a rollup at http://people.mozilla.org/~kgupta/tmp/rollup.patch to test with.
I'll test this new patch.

Just to play devil's advocate, maybe we should consider disabling sub-apzc.  I am seeing a lot of jumpiness and general instability on this page in v1.2 compared to v1.1:

  v1.1 on helix:  https://www.dropbox.com/s/hwhgfz4t8nne00m/20140107_apzc_etherpad_v1_1.mp4
  v1.2 on buri:   https://www.dropbox.com/s/g2y35jd5llpnh7o/20140107_apzc_etherpad_v1_2.mp4

It just seems there are other regressions in usability here beyond the OOM.
We still need this patch in order to not blow up in the "transient" state, where the parent has increased its resolution but the child has not yet requested a new displayport. If we paint in this state then we will apply the old (small) displayport with the new (high) resolution which could OOM. So this patch prevents that.
Attachment #8356790 - Flags: review?(tnikkel)
We could probably do without this patch but it should help reduce the occurrence of the tab crashes. This contracts the subframe displayport multipliers if those subframes are getting rendered at a resolution > 1.
Attachment #8356791 - Flags: review?(botond)
If the texture size exceeds 4k then we can't render it anyway so that seems like a sensible limit to hard-code to, and crash if we exceed. We actually deal ok with a page with many small iframes; it's only when the iframe gets big on one or both axes that the total texture area blows up a lot.
Attachment #8356593 - Attachment is obsolete: true
Attachment #8356793 - Flags: review?(botond)
Attached patch Rollup parts 1-4 (obsolete) — Splinter Review
Please let me know if you still see the OOM with this. I didn't see it on my master build for Gaia; I will try to borrow a hamachi and test there as well.
Attachment #8356796 - Flags: feedback?(bkelly)
Comment on attachment 8356789 [details] [diff] [review]
1/4 - Add a GetCumulativeResolution function

IID bump on presshell please.
Attachment #8356789 - Flags: review?(tnikkel) → review+
Comment on attachment 8356790 [details] [diff] [review]
2/4 - Change the displayport storage units

What's the plan for this longer term? Make this be stored in layer pixels? Only land this on 1.2?

We should have a comment explaining why we do this.
Comment on attachment 8356791 [details] [diff] [review]
3/4 - Shrink the displayport multiplier at high zooms

Review of attachment 8356791 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/AsyncPanZoomController.cpp
@@ +898,5 @@
> +ScaleDisplayPortMultiplier(float aMultiplier, const FrameMetrics& aMetrics)
> +{
> +  // This will only have an effect on subframes where the parent frame
> +  // is zoomed in to > 1.0 (for the root frame GetParentResolution() will
> +  // return 1.0). The effect is to scale down the given displayport multiplier

Is this patch meant for trunk too? On trunk it's not necessarily the case that the root frame's GetParentResolution() returns 1.0, at least it won't be after bug 951936 and bug 935219 land, because then the resolution will be on the non-scrollable root layer which will be the parent of the rootmost scrollable layer.

Perhaps we can condition on IsRootForLayersId() instead (or mIsRoot, although see bug 957221)?
Comment on attachment 8356793 [details] [diff] [review]
4/4 - Crash the tab if the texture size is going to exceed the max GL texture size

This feels fairly specific to b2g; some environments (metro when it goes with remote tabs) might not be bound by such a limit.
Attachment #8356791 - Flags: review?(botond) → review+
Comment on attachment 8356790 [details] [diff] [review]
2/4 - Change the displayport storage units

kats tell me this is for 1.2 only. So just add a comment about thus and r+.
Attachment #8356790 - Flags: review?(tnikkel) → review+
Comment on attachment 8356793 [details] [diff] [review]
4/4 - Crash the tab if the texture size is going to exceed the max GL texture size

Review of attachment 8356793 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/TabChild.cpp
@@ +1682,5 @@
> +    if (aContent->GetPrimaryFrame()) {
> +      resolution = aContent->GetPrimaryFrame()->PresContext()->PresShell()->GetCumulativeResolution();
> +    }
> +    if (aMetrics.mDisplayPort.width * resolution.width > 4096 ||
> +        aMetrics.mDisplayPort.height * resolution.height > 4096) {

Should we be multiplying per aMetrics.mDevPixelsPerCSSPixel as well here?
Attachment #8356793 - Flags: review?(botond) → review+
Comment on attachment 8356793 [details] [diff] [review]
4/4 - Crash the tab if the texture size is going to exceed the max GL texture size

Review of attachment 8356793 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/TabChild.cpp
@@ +1686,5 @@
> +        aMetrics.mDisplayPort.height * resolution.height > 4096) {
> +      // The displayport is too big. At this size we can't even turn it into a GL texture
> +      // and all it's going to do is OOM the phone so we are better off if we just crash
> +      // this child process now. See bug 947523.
> +      MOZ_CRASH();

Would it be possible to log something indicating why the crash occurred?  I could see this saving us some investigation time on future bug reports.
(In reply to Timothy Nikkel (:tn) from comment #65)
> IID bump on presshell please.

Will do.

(In reply to Botond Ballo [:botond] from comment #67)
> Is this patch meant for trunk too? On trunk it's not necessarily the case
> that the root frame's GetParentResolution() returns 1.0, at least it won't
> be after bug 951936 and bug 935219 land, because then the resolution will be
> on the non-scrollable root layer which will be the parent of the rootmost
> scrollable layer.
> 
> Perhaps we can condition on IsRootForLayersId() instead (or mIsRoot,
> although see bug 957221)?

The patch is 1.2-only. I have a much bigger change in mind for 1.3 to try and fix this properly.

(In reply to Timothy Nikkel (:tn) from comment #68)
> This feels fairly specific to b2g; some environments (metro when it goes
> with remote tabs) might not be bound by such a limit.

Agreed. I can wrap it in an #ifdef if you prefer but since I only plan on having the code land on b2g-1.2 I don't think it matters.

(In reply to Timothy Nikkel (:tn) from comment #69)
> kats tell me this is for 1.2 only. So just add a comment about thus and r+.

Will do.

(In reply to Botond Ballo [:botond] from comment #70)
> Should we be multiplying per aMetrics.mDevPixelsPerCSSPixel as well here?

Yeah, good catch. I'll add that in.

(In reply to Ben Kelly [:bkelly] from comment #71)
> Would it be possible to log something indicating why the crash occurred?  I
> could see this saving us some investigation time on future bug reports.

Yup, I'll do that.

I was able to consistently reproduce the OOM on the hamachi I borrowed from Botond using a 1.2 build without my patch. The STR basically involved zooming in the page as much as possible using the area to the left of the iframe (because putting focus in the frame brings up the keyboard and that thing is impossible to work with) and then trying to scroll the iframe. Usually it would OOM before I even tried scrolling the iframe.

With my patch the OOM crash was replaced by a tab crash, and I wasn't able to trigger an OOM with any other panning/zooming actions on the page. So unless Ben has any objections I'll address the comments and get it landed.
Comment on attachment 8356796 [details] [diff] [review]
Rollup parts 1-4

I also was unable to get b2g to crash with the patch.  The tab reliably got killed.

I did, however, see some content fail to draw with this patch.  See time 45s to 50s in this movie:

  https://www.dropbox.com/s/vguya4xf6rgeqwh/20140107_apzc_etherpad_v1_2_patch.mov

It appears to occur when you zoom in while simultaneously dragging up a bit causing a scroll.  I could not reproduce without the patch.

On IRC kats indicated this might be explainable given that the patch shrinks the displayport.  Is this regression fixable?

If not, is this regression serious enough to hold back the patch?  Jason, what do you think?

I will try to see if I can reproduce it on any other websites tomorrow.  Maybe its isolated to sites with very large iframes like this etherpad.
Attachment #8356796 - Flags: feedback?(bkelly)
Flags: needinfo?(jsmith)
(In reply to Ben Kelly [:bkelly] from comment #73)
> On IRC kats indicated this might be explainable given that the patch shrinks
> the displayport.  Is this regression fixable?

I believe this regression could be caused by part 3 of this patch series (although I'm not 100% sure). If we take out that part then it might make the regression go away at the expense of killing the tab far more frequently. And if this is regression is a result of my patch then it should be limited to subframes and at high zoom levels.
I think it would be better to OOM the tab more often and not risk the painting regression seen in the video mainly because I'm concerned about the fallout risk that this has other sites. The blocking factor on the bug is to ensure that we don't OOM the phone. If we end up OOMing the tab a lot, but not the phone, then that's still okay.
Flags: needinfo?(jsmith)
I was able to reproduce the painting problem - my hypothesis in comment 74 was wrong. It's caused by part 2; as we zoom in the outer page the displayport (in CSS pixels) of the inner frame will shrink (by design). It remains the same size in layer pixels, but the composition bounds of the inner frame will expand in layer pixels. This results in the displayport shrinking to be smaller than the composition bounds until the APZC for the inner frame triggers a repaint for it.

From the user's point of view the net effect is that if you zoom in on a page with an iframe, the iframe might end up clipped like in Ben's video, but it will repaint properly as soon as you touch it or scroll it.
Would it be possible detect the condition and force a repaint after the zoom completes?  As a user I think having some artifacts during zoom would be ok if they corrected once I stopped touching the screen.  This is similar to how tiled map applications work, so users are probably used to the behavior.
I seem to be able to get the painting defect fairly easily on this iframe example site (see #3):

  http://nunzioweb.com/iframes-example.htm

Sometimes when the painting defect occurs I can no longer interact with the screen.  If I go to the homescreen and then back into the browser, though, touch input starts working again.

I can also cause the tab to get killed reliably by zooming in.  The iframe in question is not particularly large.
(In reply to Ben Kelly [:bkelly] from comment #77)
> Would it be possible detect the condition and force a repaint after the zoom
> completes?  As a user I think having some artifacts during zoom would be ok
> if they corrected once I stopped touching the screen.  This is similar to
> how tiled map applications work, so users are probably used to the behavior.

I tried to do this quickly but was unsuccessful. At this point I'm not going to work on this bug any further - unfortunately there are too many pressing 1.3 issues for me to spend more time on this. We can either land the patchset as-is, which fixes the OOMs, and file follow-up bugs for the regressions. Or we can leave it out.
I think given the options it would be better to prevent the b2g OOM.  So I would vote for landing the patchset in v1.2.  Jason, does this sound ok to you?

I believe bug 957668 is the follow-up to address the memory issues in v1.3.
Flags: needinfo?(jsmith)
See Also: → 957668
Updated versions of patches with comments addressed and reviewer info included; ready to land.
Attachment #8356789 - Attachment is obsolete: true
Attachment #8357242 - Flags: review+
Yup, that sounds fine. Let's land this.
Flags: needinfo?(jsmith)
RyanVM, could you please land this on 1.2? Parts 1-4. Thanks!
Flags: needinfo?(ryanvm)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: