[Flame][Memory] Cannot always boot successfully in 273MB RAM

RESOLVED INVALID

Status

Firefox OS
Performance
P1
blocker
RESOLVED INVALID
4 years ago
4 years ago

People

(Reporter: piwei, Assigned: ting)

Tracking

({memory-footprint, perf, regression})

unspecified
2.0 S6 (18july)
ARM
Gonk (Firefox OS)
memory-footprint, perf, regression

Firefox Tracking Flags

(blocking-b2g:2.0+, firefox31 wontfix, firefox32 fixed, firefox33 fixed, b2g-v1.3 unaffected, b2g-v1.4 affected, b2g-v2.0 fixed, b2g-v2.1 fixed)

Details

(Whiteboard: [MemShrink][273MB-Flame-Support][c=memory p= s= u=2.0])

Attachments

(4 attachments, 1 obsolete attachment)

(Reporter)

Description

4 years ago
Related bug 1008050.

To Repro:
Set Flame memory to 273MB by entering the following commands in a console:
$ adb reboot bootloader
$ fastboot oem mem 273
$ fastboot reboot

Expected: Flame boots up correctly to lockscreen or homescreen.

Actual: Flame doesn't always boot up correctly and gets stuck on the Firefox animation screen.

- Frequency of hitting this issue:
v122 base only = 0 out of 5 attempts
v122 base + v1.4 = 2 out of 5 attempts (1.4 build ID: 20140709075131)
v122 base + v2.0 = 2 out of 5 attempts (2.0 build ID: 20140709105132)
v122 base + v2.1 = 2 out of 5 attempts (2.1 master build ID: 20140709131330)


Notes:
1) This issue will be reliably hit for v1.4, v2.0 and v2.1 after flashing gecko & gaia, or after Factory Resetting phone via Settings. I made sure for each version of Firefox these two conditions (flashing & resetting) are each tested once within those 5 attempts.

2) For testers running into this issue, doing an '$ adb reboot' should reset the device and you should be able to boot up eventually.
(Reporter)

Updated

4 years ago
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(jmitchell)
(Reporter)

Updated

4 years ago
status-b2g-v1.3: --- → unaffected
status-b2g-v1.4: --- → affected
status-b2g-v2.0: --- → affected
status-b2g-v2.1: --- → affected

Updated

4 years ago
Blocks: 1036675
nomming 2.0 (not 1.4 because I believe 2.0 is the focus of these 256/273 mem tests) ; issue is obviously severe, not being able to boot correctly.
blocking-b2g: --- → 2.0?
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
Can we get a logcat?
Whiteboard: [MemShrink]
QA Wanted for a logcat.
Keywords: footprint, perf, qawanted

Updated

4 years ago
QA Whiteboard: [QAnalyst-Triage+]

Updated

4 years ago
Keywords: regression
(Reporter)

Comment 4

4 years ago
Created attachment 8453841 [details]
Flame 2.1 master logcat after flashing

Logcat attached.
(Reporter)

Updated

4 years ago
QA Whiteboard: [QAnalyst-Triage?]
Keywords: qawanted
QA Contact: pcheng
(Reporter)

Updated

4 years ago
Flags: needinfo?(jmitchell)
So this is a regression between 1.3 and 1.4, not 1.4 and 2.0, correct?
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(jmitchell)
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #5)
> So this is a regression between 1.3 and 1.4, not 1.4 and 2.0, correct?

Right.
(Assignee)

Comment 7

4 years ago
These lines from attachment 8453841 [details] seem odd:

  01-02 08:20:59.109   292   292 E GeckoConsole: [JavaScript Error: "TypeError: shell.contentBrowser is null" {file: "chrome://b2g/content/shell.js" line: 78}]
  07-10 03:18:48.460   292   292 E GeckoConsole: [JavaScript Error: "TypeError: this.contentBrowser is null" {file: "chrome://b2g/content/shell.js" line: 332}]

How should I try again if STR at comment 1 doesn't repro, set the memory back to 1024, flash/reset, set memory to 273, and reboot?
(Reporter)

Comment 8

4 years ago
(In reply to Ting-Yu Chou [:ting] from comment #7)
> How should I try again if STR at comment 1 doesn't repro, set the memory
> back to 1024, flash/reset, set memory to 273, and reboot?

Basically you need to either flash or factory reset the phone via Settings with the device ALREADY in 273mb mem. So your order of doing it is incorrect.

To put it simply: Set device to 273mb mem, and go to settings > device information > reset phone.

Updated

4 years ago
blocking-b2g: 2.0? → 2.0+

Updated

4 years ago
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage+][lead-review+]

Comment 9

4 years ago
Thinker, do you have any idea about this bug? How should we go next? Need reproducible environment or something else? Thank you!
Flags: needinfo?(tlee)

Comment 10

4 years ago
Ting-Yu, are you the best person to own this bug? Thank you.
Flags: needinfo?(tchou)
(Assignee)

Comment 11

4 years ago
I am not sure whether I am the best fit, but yes, I am trying to figure out what is going on.
Flags: needinfo?(tchou)
(Assignee)

Comment 12

4 years ago
PiWei, it seems I am not reproducing the issue correctly. Following the STR in comment 8, after factory reset:

1) the device always stops at the "T2Mobile" screen, which does not match the Firefox animation screen you mentioned in comment 1
2) the device stops at the "T2Mobile" screen whether I set 273 or 1024 MB for the device memory, which also does not match the prerequisite

gecko: f6aad75c6ac3d3da06406f3570833f29771e217d
gaia:  9d822c4666c507da8ffe9967df530aa6980acbf8
Flags: needinfo?(pcheng)

Comment 13

4 years ago
Brian, as far as I know, your team is also using 273MB Flame to do MTBF testing, did you encounter this kind of issues? Thanks!
Flags: needinfo?(brhuang)

Comment 14

4 years ago
We did encounter this issue sometimes but it's not always reproducible.
Flags: needinfo?(brhuang)
(Assignee)

Comment 15

4 years ago
As factory reset clears webapps, I tried to repro by:

  1. $ fastboot oem mem 273
  2. $ flash.sh
  3. go back to step 2 if I see FTU

I repeated step 2 10 times but haven't reproduced it yet; I will keep trying tomorrow.
(Assignee)

Comment 16

4 years ago
(In reply to Ting-Yu Chou [:ting] from comment #15)
>   1. $ fastboot oem mem 273
>   2. $ flash.sh
>   3. go back to step 2 if I see FTU

I tried 20 more times today and haven't reproduced it. I also tried "adb reboot" 20 times as below, and still can't repro:

  1. fastboot oem mem 273, and accomplish FTU
  2. $ adb reboot
  3. go to step 2 if I see lock screen

gecko:    deea1599293617731dcc0f2104fc0eb79e1c1c28
gaia :    f0d9c5b39735bcbc63405281642437d4fc186cdb
firmware: v122
platform: 33.0a1

Updated

4 years ago
Flags: needinfo?(tlee)
(Assignee)

Comment 17

4 years ago
After double-checking with :mlien, I can reproduce it now.

QA updates only gecko/gaia on top of base image v122, whereas I used "./flash.sh" to flash my whole local build. If I update only gecko/gaia with "./flash.sh gecko && ./flash.sh gaia", I can reproduce it.

From what :mlien told me, the gonk/kernel of t2m's base image is from v1.3 (needs to be confirmed).

Here's the about-memory report from when the issue occurs. heap-unclassified of the b2g process is really high; I will enable DMD tomorrow to check what it is.

Main Process (pid 2281)
Explicit Allocations

90.63 MB (100.0%) -- explicit
├──72.78 MB (80.30%) ── heap-unclassified
├──10.35 MB (11.43%) -- js-non-window
│  ├───6.61 MB (07.30%) -- zones
│  │   ├──4.70 MB (05.19%) -- zone(0xb2b69c00)
│  │   │  ├──2.49 MB (02.75%) ++ compartment([System Principal])
│  │   │  └──2.21 MB (02.44%) ++ (34 tiny)
│  │   ├──1.71 MB (01.89%) -- zone(0xb6a78000)
│  │   │  ├──1.69 MB (01.86%) -- strings/string(<non-notable strings>)
│  │   │  │  ├──1.12 MB (01.23%) ── malloc-heap
│  │   │  │  └──0.57 MB (00.63%) ── gc-heap
│  │   │  └──0.02 MB (00.03%) ++ (4 tiny)
│  │   └──0.20 MB (00.22%) ++ zone(0xb6a79c00)
│  ├───3.65 MB (04.03%) -- runtime
│  │   ├──2.08 MB (02.30%) ── script-data
│  │   ├──1.08 MB (01.20%) ── atoms-table
│  │   └──0.48 MB (00.53%) ++ (10 tiny)
│  └───0.09 MB (00.10%) ++ gc-heap
├───3.26 MB (03.60%) ++ (19 tiny)
├───2.34 MB (02.58%) -- heap-overhead
│   ├──1.69 MB (01.87%) ── bin-unused
│   └──0.65 MB (00.71%) ++ (3 tiny)
└───1.89 MB (02.09%) -- workers/workers()
    ├──1.40 MB (01.54%) ++ worker(resource://gre/modules/ril_worker.js, 0xafdcd800)
    └──0.50 MB (00.55%) ++ worker(resource://gre/modules/nfc_worker.js, 0xafdc7000)
(Assignee)

Comment 18

4 years ago
Clear NI as I can repro now.
Flags: needinfo?(pcheng)

Updated

4 years ago
Whiteboard: [MemShrink] → [MemShrink][273MB-Flame-Support]

Comment 19

4 years ago
How do I even get it to adb reboot? Can anyone help with that process? I'm pressing the volume button and the up arrow, and reaching the reboot screen. What do I do now?

Updated

4 years ago
See Also: → bug 1038854
(Assignee)

Comment 20

4 years ago
Unreported: 45 blocks in stack trace record 1 of 302
 73,912,320 bytes (73,872,000 requested / 40,320 slop)
 86.58% of the heap (86.58% cumulative);  96.62% of unreported (96.62% cumulative)
 Allocated at
   replace_malloc[libdmd.so +0x3898] 0xb6fce898
   ???[/system/b2g/b2g +0xBC42] 0x13c42
   ???[/system/b2g/b2g +0xC656] 0x14656
   __thread_entry[libc.so +0xCB64] 0xb6f7ab64
   pthread_create[libc.so +0xCCE0] 0xb6f7ace0

This is from the boot animation, which has 45 480x854 PNGs; AnimationFrame::ReadPngFrame() is called with output format HAL_PIXEL_FORMAT_RGBA_8888. Isn't 70MB too much for a boot animation?
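
As a rough cross-check of that figure, the sketch below estimates the size of all frames held decoded at once. The frame count and dimensions are taken from the comment above; the 4 bytes per pixel follows from the RGBA_8888 format. The function name is hypothetical and this is only back-of-the-envelope arithmetic, not the actual allocation logic in the boot animation code.

```javascript
// Estimate the memory needed to keep every decoded frame in RAM.
// decodedAnimationBytes is a hypothetical helper, not a Gecko function.
function decodedAnimationBytes(frameCount, width, height, bytesPerPixel) {
  return frameCount * width * height * bytesPerPixel;
}

// 45 frames of 480x854, 4 bytes per pixel (RGBA_8888).
const bytes = decodedAnimationBytes(45, 480, 854, 4);
const mib = bytes / (1024 * 1024);

console.log(bytes);          // 73785600
console.log(mib.toFixed(2)); // "70.37"
```

The estimate is within about 0.1% of the 73,872,000 requested bytes in the DMD record, which supports the reading that the unclassified heap is the decoded animation frames.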
(Assignee)

Comment 21

4 years ago
As the boot animation eats memory, memory pressure events are fired. If getContentWindow() is called to send the custom event |mozmemorypressure| before shell_start() has appended |systemAppFrame|, it fails with the error:

  E/GeckoConsole( 1216): [JavaScript Error: "TypeError: shell.contentBrowser is null" {file: "chrome://b2g/content/shell.js" line: 80}]

and the getter function contentBrowser() deletes itself on that first access:

  get contentBrowser() {
    delete this.contentBrowser;
    return this.contentBrowser = document.getElementById('systemapp');
  },

Later when shell_start() comes here:

  this.contentBrowser.addEventListener('mozbrowserloadstart', this, true);

it gets another error:

  E/GeckoConsole( 1216): [JavaScript Error: "TypeError: this.contentBrowser is null" {file: "chrome://b2g/content/shell.js" line: 338}]

because the property has already been assigned null. StopBootAnimation() is never called when this happens, so the phone stays on the animation screen.
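
The failure mode can be sketched outside Gecko as follows. This is a minimal stand-alone illustration, assuming a hypothetical `fakeDocument` object in place of the real DOM; only the getter body is copied from the snippet quoted above.

```javascript
// Stand-in for the DOM (hypothetical): nothing is registered until the
// frame is "appended" later on.
const fakeDocument = {
  elements: {},
  getElementById(id) {
    return this.elements[id] || null;
  },
};

const shell = {
  // Same self-replacing getter pattern as in shell.js.
  get contentBrowser() {
    delete this.contentBrowser;
    return (this.contentBrowser = fakeDocument.getElementById('systemapp'));
  },
};

// Early access, e.g. from a memory-pressure handler, before the frame exists.
// The getter removes itself and caches null as a plain data property.
const early = shell.contentBrowser;

// The frame is appended afterwards...
fakeDocument.elements.systemapp = { id: 'systemapp' };

// ...but the getter is gone, so the cached null is returned again.
const late = shell.contentBrowser;

console.log(early, late); // null null
```

The lazy getter is only safe if its first access is guaranteed to happen after the element exists; an early memory-pressure event breaks that assumption.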

I am cooking a patch for this.
(Assignee)

Comment 22

4 years ago
Created attachment 8456751 [details] [diff] [review]
patch

Removed the getter function contentBrowser(), which may set the property to null. The property is now assigned after |systemAppFrame| is appended instead.

I didn't add any error handling in getContentWindow(), since there will still be error messages if someone tries to access |shell.contentBrowser| before it is assigned.
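
The approach described here can be sketched as below. This is an illustration of the idea only, not the contents of the attached patch: `fakeDocument`, `start()` and the shape of the frame object are hypothetical stand-ins for the real DOM and shell_start().

```javascript
// Hypothetical stand-in for the DOM.
const fakeDocument = {
  elements: {},
  appendChild(el) {
    this.elements[el.id] = el;
  },
};

const shell = {
  // Plain property instead of a self-deleting getter.
  contentBrowser: null,

  // Stand-in for shell_start(): assign only after the frame is appended.
  start() {
    const systemAppFrame = { id: 'systemapp', contentWindow: { name: 'system' } };
    fakeDocument.appendChild(systemAppFrame);
    this.contentBrowser = systemAppFrame;
  },

  // No extra error handling, so a too-early call still throws.
  getContentWindow() {
    return this.contentBrowser.contentWindow;
  },
};

// A too-early call still fails loudly, but it no longer poisons later use.
let earlyError = null;
try {
  shell.getContentWindow();
} catch (e) {
  earlyError = e;
}

shell.start();
const win = shell.getContentWindow();

console.log(earlyError instanceof TypeError, win.name); // true system
```

The early error is still reported, which matches the decision above not to add handling in getContentWindow(), while startup can now proceed once the frame exists.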
Attachment #8456751 - Flags: review?(fabrice)
(Assignee)

Updated

4 years ago
Assignee: nobody → tchou
Status: NEW → ASSIGNED
Priority: -- → P1
Whiteboard: [MemShrink][273MB-Flame-Support] → [MemShrink][273MB-Flame-Support][c=memory p= s= u=2.0]

Updated

4 years ago
Severity: normal → blocker
Attachment #8456751 - Flags: review?(fabrice) → review+
(Assignee)

Comment 24

4 years ago
Created attachment 8457809 [details] [diff] [review]
patch

Added r=fabrice to commit message, carry r+.

I found that the try run in comment 23 ran only a few tests, so I added the b2g ICS emulator and it looks good: https://tbpl.mozilla.org/?tree=Try&rev=0409d889dec8.
Attachment #8457809 - Flags: review+
(Assignee)

Updated

4 years ago
Attachment #8456751 - Attachment is obsolete: true
(Assignee)

Comment 25

4 years ago
Created attachment 8457850 [details] [diff] [review]
patch-v1.4
(Assignee)

Comment 26

4 years ago
Created attachment 8457852 [details] [diff] [review]
patch-v2.0
(Assignee)

Comment 27

4 years ago
Please let me know if I should request approval for uplifting.
Keywords: checkin-needed
2.0 blockers are uplifted automatically.
https://hg.mozilla.org/mozilla-central/rev/7ee1774b1703
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2.0 S6 (18july)
https://hg.mozilla.org/releases/mozilla-aurora/rev/5c4008fe0967
status-b2g-v2.0: affected → fixed
status-b2g-v2.1: affected → fixed
status-firefox31: --- → wontfix
status-firefox32: --- → fixed
status-firefox33: --- → fixed
Dear PiWei,
According to the Comment 17, is there any precondition when we verify this issue?
Flags: needinfo?(pcheng)
(Reporter)

Comment 33

4 years ago
(In reply to Shine from comment #32)
> Dear PiWei,
> According to the Comment 17, is there any precondition when we verify this
> issue?

I believe this bug has been fixed, since it was marked fixed. Also, we've increased base memory to 319MB and are no longer using 273MB, so I'm closing this bug as invalid.
Flags: needinfo?(pcheng)
Resolution: FIXED → INVALID