Closed Bug 965945 Opened 6 years ago Closed 6 years ago

[B2G][Browser] Accessing files on github.com causes the phone to oom and restart

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.3+, firefox28 wontfix, firefox29 wontfix, firefox30 fixed, b2g-v1.2 unaffected, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed)

RESOLVED FIXED
1.4 S2 (28feb)
blocking-b2g 1.3+
Tracking Status
firefox28 --- wontfix
firefox29 --- wontfix
firefox30 --- fixed
b2g-v1.2 --- unaffected
b2g-v1.3 --- fixed
b2g-v1.3T --- fixed
b2g-v1.4 --- fixed

People

(Reporter: rkunkel, Assigned: kats)

References

Details

(Keywords: perf, regression, Whiteboard: [c=memory p= s= u=1.3] dogfood1.3)

Attachments

(4 files)

Attached file dmesg
When the user accesses a file shared on github.com and scrolls down to view more of the intended file, the browser app will freeze and become unresponsive. If the user continues to tap on the frozen webpage, the webpage's display will turn black and the phone will eventually restart.

Repro Steps:
1) Update Buri to BuildID: 20140127004002
2) Open the browser app. 
3) Go to url: http://bit.ly/1dbggkX.
4) When the intended webpage loads scroll down.
5) Once the webpage freezes, continue tapping on the screen.

Actual:
Webpage freezes, the screen turns black and the phone eventually restarts.

Expected:
Webpage is displayed without any issues.

Environmental Variables:
Device: Buri 1.3 MOZ
BuildID: 20140127004002
Gaia: 25a45a836a4a21a30f63fa7b544b42e8b781180a
Gecko: c40099a42c1f
Version: 28.0a2
Firmware Version: V1.2-device.cfg

Notes:
The smallest file this issue was reproduced on was 9,500 kb in size.

Repro frequency: 100%
See attached: dmesg, logcat
I was unable to reproduce this issue on the latest v1.2 build. When navigating to the link in Comment #0 the webpage immediately displayed a notification that it had crashed. The device never restarted as the page was never able to load.

Environmental Variables:
Device: Buri 1.2 MOZ
BuildID: 20140128004004
Gaia: 539a25e1887b902b8b25038c547048e691bd97f6
Gecko: d10e1f965d0c
Version: 26.0
Firmware Version: v1.2-device.cfg
Attached file logcat
blocking-b2g: --- → 1.3?
Component: Gaia::Browser → General
Whiteboard: dogfood1.3 → dogfood1.3 [MemShrink]
I'm able to reproduce this issue on the earliest 1.3 build I have available.

Device: Buri v1.3 Mozilla RIL
BuildID: 20130919040201
Gaia: 0dedb5b3789afc638a0c7c67652937fcb30e77d2
Gecko: 803189f35921
Version: 27.0a1
Base Image: V1.2-device.cfg
Attached file about-memory-2.zip
Attaching the about memory report, hopefully this contains some useful information.
Keywords: qawanted
Mike - Can you assign someone from the perf team to investigate this?
blocking-b2g: 1.3? → 1.3+
Flags: needinfo?(mlee)
So to be clear the regression is that this used to just crash the tab and now crashes the phone?

I'll take this for investigation.
Assignee: nobody → khuey
Flags: needinfo?(mlee)
This looks very similar to bug 947523 and is probably the same underlying issue. Bug 957668 is aiming to fix this properly for 1.3+.
Depends on: 957668
Kyle did you unassign yourself because of comment 8? If so any reason why we shouldn't close this as a duplicate of Bug 957668?
Flags: needinfo?(khuey)
Priority: -- → P1
Whiteboard: dogfood1.3 [MemShrink] → [c=memory p= s= u=1.3] dogfood1.3 [MemShrink]
(In reply to Mike Lee [:mlee] from comment #9)
> Kyle did you unassign yourself because of comment 8? If so any reason why we
> shouldn't close this as a duplicate of Bug 957668?

Yes.

Whether we dupe it is up to kats.  I think we should keep it open because we can verify that the fix in bug 957668 fixes this once it is ready.
Flags: needinfo?(khuey)
I would rather leave it as blocking than dupe it.
Let's make bug 957668 the MemShrink-tagged one.
Whiteboard: [c=memory p= s= u=1.3] dogfood1.3 [MemShrink] → [c=memory p= s= u=1.3] dogfood1.3
Ok. Kats, assigning this to you. Please close out as needed once bug 957668 is verified as fixing this.
Assignee: nobody → bugmail.mozilla
Status: NEW → ASSIGNED
Whiteboard: [c=memory p= s= u=1.3] dogfood1.3 → [c=memory p= s= u=1.3] dogfood1.3 [ETA:Feb14]
Target Milestone: --- → 1.4 S1 (14feb)
Whiteboard: [c=memory p= s= u=1.3] dogfood1.3 [ETA:Feb14] → [c=memory p= s= u=1.3] dogfood1.3 [ETA:see dep bug]
I'm going to unblock the dependency here, as discussions offline concluded that patch will not be taken to 1.3.
No longer depends on: 957668
I guess I better come up with an alternate fix for this then.
Whiteboard: [c=memory p= s= u=1.3] dogfood1.3 [ETA:see dep bug] → [c=memory p= s= u=1.3] dogfood1.3 [ETA:2/26]
So I tried to reproduce the problem on latest 1.3 hamachi code. I wasn't able to get the exact behaviour described but I saw one of two things happen when I scrolled the page in question:

1) The browser tab crashed with the "Sorry" message. I could close or reload the tab no problem.

2) The iframe on the page went black. When this happened, b2g-ps showed that the Preallocated process and the homescreen had been killed, and the b2g root process and the browser process were using up large amounts of memory. Further scrolling or interaction at this point usually caused all the b2g processes to die, even though the phone never actually rebooted itself. (Maybe this is a separate bug; once the b2g root process gets OOM-killed it should probably restart the phone).

In the latter case I looked at the about:memory dumps and confirmed what I was expecting, which is that the gralloc memory reporter showed a large (~60MB) increase. This is because the APZ code sets a gigantic displayport (something to be addressed properly in bug 957668) and the parent process merrily tries to allocate a giant gralloc buffer for it. This is actually very similar to a problem I was discussing with bjacob yesterday. I have a patch that addresses this (by killing the child process instead of OOMing the parent) but I will run it by bjacob first.
After discussion with jrmuizel and bjacob the attached patch seems like a good fix for 1.3.

However they pointed out another issue in the flow I described in comment 16. That is, when the OOM killer decides to kill the browser and the root process, instead of killing both at the same time (or so it appears to me) it should kill the browser process first, wait for cleanup to happen and for the root process to free up memory, and only then kill the root process. I will file a follow-up bug for this.
Attachment #8378367 - Flags: review?(bjacob)
Filed bug 974445 as a follow-up for the LMK behaviour.
Comment on attachment 8378367 [details] [diff] [review]
Kill child before it kills the parent via OOM

Review of attachment 8378367 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp
@@ +255,5 @@
>    }
>  
> +  // If the requested size is too big (i.e. exceeds the commonly used max GL texture size)
> +  // then we risk OOMing the parent process. It's better to just deny the allocation and
> +  // kill the child process, which is what the following code does.

I would add a comment: TODO: make that check the actual GL_MAX_TEXTURE_SIZE instead of hardcoding 4096.
Attachment #8378367 - Flags: review?(bjacob) → review+
Comment on attachment 8378367 [details] [diff] [review]
Kill child before it kills the parent via OOM

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): APZ
User impact if declined: load some webpages (those with large iframes or scrollable subframes) can cause the parent process to get OOM-killed
Testing completed: locally
Risk to taking this patch (and alternatives if risky): there is a risk that this patch causes child proceses to get killed more frequently. However this should only happen in cases where we otherwise risk OOMing the parent, so I think it's better.
String or UUID changes made by this patch: none
Attachment #8378367 - Flags: approval-mozilla-b2g28?
Attachment #8378367 - Flags: approval-mozilla-b2g28? → approval-mozilla-b2g28+
https://hg.mozilla.org/mozilla-central/rev/679e69e79fe2
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: 1.4 S1 (14feb) → 1.4 S2 (28feb)
Backed out on m-c for causing C2 failure on B2G emulator:

https://hg.mozilla.org/mozilla-central/rev/660b62608951

Backouts to follow for other trees.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Here is an example failure log: https://tbpl.mozilla.org/php/getParsedLog.php?id=34935180&full=1&branch=mozilla-inbound

The crashtest that is failing is layout/base/crashtests/653133-1.html which sets a displayport 6000 pixels tall and expects to not crash. That sounds a little unreasonable to me but I'll have to look into the history of that test before I can disable it with a clear conscience.
Based on the history of the test I think it should still be a valid test if I drop the displayport height to 4096. Try run with that change:

https://tbpl.mozilla.org/?tree=Try&rev=90968afa290c
See Also: → 965389
Try run is green but I'd like to wait on the results of the investigation in bug 965389 before re-landing, since this might result in the gallery app crashing more frequently.
The gallery app part of bug 965389 doesn't happen on 1.3 so I'm going to go ahead and land this again with the test change. It will likely cause the gallery app to crash on nexus 4 and (less frequently) on hamachi, but that's ok because if it didn't crash it would just show black or OOM the phone which isn't much better. And hopefully we'll have bug 957668 done in the 1.4 timeframe and so it'll get taken care of properly.
https://hg.mozilla.org/mozilla-central/rev/5ac5ee30517d
Status: REOPENED → RESOLVED
Closed: 6 years ago6 years ago
Resolution: --- → FIXED
I had a b2g28 tree handy so I pushed the uplift too:

https://hg.mozilla.org/releases/mozilla-b2g28_v1_3/rev/34278682f4c6
Whiteboard: [c=memory p= s= u=1.3] dogfood1.3 [ETA:2/26] → [c=memory p= s= u=1.3] dogfood1.3
Depends on: 977423
You need to log in before you can comment on or make changes to this bug.