Closed Bug 1109794 Opened 10 years ago Closed 9 years ago

Device not booting up due to b2g process getting killed by LMK on boot up continuously

Categories

(Firefox OS Graveyard :: General, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: anshulj, Assigned: ntroast)

References

Details

(Whiteboard: [MemShrink:P3])

Attachments

(2 files)

The issue is easily reproducible on flame device with the following SHAs

Gecko: 2145ba8738a56c235efc211b461272edede6fb84
Gaia: e04ab7651b1e0c67516e1cef7aa4bc6072529885
The last known good SHAs are below to help narrow down a regression window.

Gecko: bd2404ce8db2ca13b484a7f3c3b3db31239cf904
Gaia: e5d666d6f62480ced56c6d9352f5e12befb5a862
Summary: Device not botting up due to b2g process getting killed by LMK on boot up continuously → Device not booting up due to b2g process getting killed by LMK on boot up continuously
Anshul, this is the pushlog I have for the above SHAs: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a61cacb954c5&tochange=be1f49e80d2d (kinda too big). It would help if you could narrow it down further...
Bhavna, I have narrowed it down to Gaia commits between b384220eb54329397af53ee6819cc13bd7b641f1 and b1edd64173cd48d130f697e0b0b2adf2523ad57f. I am having a hard time bisecting it further, as I get compilation errors if I bisect any more.
Last time I flashed all the images from the pvt server but could not reproduce this issue.
I will use v188 as the base and just flash gecko/gaia to reproduce it; will update later.
I synced the flame B2G source tree based on the manifest provided by Michael to generate local build images.
But with these images I cannot see the same symptom, even with a SIM card and SD card inserted.
  
ftp://ftp.mozilla.org/pub/mozilla.org/b2g/manifests/nightly/2.2.0/2014-12-09-16/source_flame-kk_2014-12-09-16.xml.

b2g process information from b2g-procrank:
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210    99524K   34568K   29833K   26952K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   132276K   52124K   41677K   35940K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   170240K   68852K   60475K   55528K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   171416K   67932K   59484K   54472K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   207848K   79056K   71978K   67788K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   218500K   41032K   34087K   30796K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   219708K   35168K   28432K   26200K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   218240K   42592K   36269K   33952K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   218240K   37752K   32520K   30708K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   217408K   36044K   32798K   31608K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   217472K   40996K   38020K   36728K  /system/b2g/b2g
  (5 seconds...)
  APPLICATION        PID       Vss      Rss      Pss      Uss  cmdline
  b2g                210   216192K   41568K   39178K   38084K  /system/b2g/b2g
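
The samples above are easier to compare once the Uss column is pulled out. The following is a minimal sketch, not part of b2g-procrank itself: the `parse_uss` helper and the `ROW` pattern are hypothetical names, and the row layout is assumed to be exactly the one pasted above (the header and "(5 seconds...)" lines are skipped).

```python
import re

# Matches a procrank data row: name, PID, then Vss/Rss/Pss/Uss in KiB, then cmdline.
# The column layout is assumed from the output pasted in this comment.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+)K\s+(\d+)K\s+(\d+)K\s+(\d+)K\s+(\S+)")


def parse_uss(text, name="b2g"):
    """Return the Uss value (KiB) of every sample row for the process `name`."""
    values = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match and match.group(1) == name:
            values.append(int(match.group(6)))
    return values


# The twelve b2g rows from the comment above, header lines omitted.
SAMPLES = """\
  b2g                210    99524K   34568K   29833K   26952K  /system/b2g/b2g
  b2g                210   132276K   52124K   41677K   35940K  /system/b2g/b2g
  b2g                210   170240K   68852K   60475K   55528K  /system/b2g/b2g
  b2g                210   171416K   67932K   59484K   54472K  /system/b2g/b2g
  b2g                210   207848K   79056K   71978K   67788K  /system/b2g/b2g
  b2g                210   218500K   41032K   34087K   30796K  /system/b2g/b2g
  b2g                210   219708K   35168K   28432K   26200K  /system/b2g/b2g
  b2g                210   218240K   42592K   36269K   33952K  /system/b2g/b2g
  b2g                210   218240K   37752K   32520K   30708K  /system/b2g/b2g
  b2g                210   217408K   36044K   32798K   31608K  /system/b2g/b2g
  b2g                210   217472K   40996K   38020K   36728K  /system/b2g/b2g
  b2g                210   216192K   41568K   39178K   38084K  /system/b2g/b2g
"""

uss = parse_uss(SAMPLES)
peak = max(uss)
print(f"samples={len(uss)} peak={peak}K last={uss[-1]}K")
```

On this run the Uss peaks at the fifth sample and then drops back, which matches the observation that the symptom does not show up on this build.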

Gaia-Rev        4cdeee67b449db90aae9384337311547c280093c
Gecko-Rev       4e5dbde020e7101d711aca751bb1ff3af40e32b4
Build-ID        20141211135752
Version         37.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.rexmax.20141211.135325
FW-Date         Thu Dec 11 13:53:43 CST 2014
Bootloader      L1TC00011880
Hi Anshul,
I did two tests on a 256MB flame, but still CANNOT reproduce this issue, it's weird! My test configs and environment are as follows:

Test 1:
1. Flash v188 as base
2. Update gaia/gecko with the commits below:
   Gecko: 2145ba8738a56c235efc211b461272edede6fb84
   Gaia: e04ab7651b1e0c67516e1cef7aa4bc6072529885
3. No SIM and no sd card.

Test 2:
1. Flash v188 as base
2. Re-sync code with manifest file source_flame-kk_2014-12-09-16.xml and rebuild images.
3. Flash system.img, userdata.img and boot.img
4. No SIM and no sd card

Should I use a different base image or do any pre-setting on the flame?

If you guys can still reproduce this issue, here are some suggestions to narrow down the problem:
1. Use gdb to debug.
2. Use the top command to check whether any b2g thread is busy (maybe a thread has entered an endless loop and keeps allocating memory):
   #adb shell top -m 15 -d 1 -t
3. Use get_about_memory.py to find the memory leak in b2g.
Flags: needinfo?(anshulj)
I have narrowed the issue down to bug 1101158. Reverting bug 1101158 locally fixes the issue.
Flags: needinfo?(anshulj)
Hi Anshul,
We'd like to investigate this issue but could not reproduce it on a Moz build. Could you provide your own build for the flame device?
Best regards,
I've done most of the development related to the LMK so I wanted to investigate this bug, but alas comment 7 points to a bug that has been marked as confidential. I believe that is wrong, since the said bug was fixed and landed on gaia/master [1]. Mozilla requires that all bugs associated with code landed publicly must also be public. Making it so will obviously also make it easier for people to help with it.

[1] https://github.com/mozilla-b2g/gaia/commit/cd0cb6aa5322e6d98633cb034513136e5e470246
Gabriele, just so you know, I don't have access to the bug mentioned in comment #7 either. I can see the change however in the git history by searching for bug 1101158.
(In reply to Anshul from comment #10)
> Gabriele, just so you know, I don't have access to the bug mentioned in
> comment #7 either. I can see the change however in the git history by
> searching for bug 1101158.

Meh, this is really bad :-( I'll try pinging :kgrandon because he seems to have authored the change.
Hi Anshul,
Per the request in comment #8, could you provide us with your own build for the flame device? We'd like to investigate this issue further. Thank you.
Flags: needinfo?(anshulj)
Anshul, try grabbing the stock v188 build from https://developer.mozilla.org/en-US/Firefox_OS/Developer_phone_guide/Flame and shallow flashing a gecko/gaia built from the sha1s that reproduce the issue here on it.
(In reply to Gabriele Svelto [:gsvelto] from comment #11)
> (In reply to Anshul from comment #10)
> > Gabriele, just so you know, I don't have access to the bug mentioned in
> > comment #7 either. I can see the change however in the git history by
> > searching for bug 1101158.
> 
> Meh, this is really bad :-( I'll try pinging :kgrandon because he seems to
> have authored the change.

NI, Kevin here to look into the suspected patch.
Flags: needinfo?(kgrandon)
It seems unlikely to me that bug 1101158 could cause these kinds of symptoms, but I'll look into it. It also seems like no one at moz is able to reproduce this yet.

Anshul - Are you able to get a logcat here during the reboot?
Flags: needinfo?(kgrandon)
Attached file android log
Please find attached the android log as requested.
Flags: needinfo?(anshulj)
Anshul - thanks for the logs. I couldn't immediately see anything spit out by gaia that would be causing this, but there might be something in there more telling of the platform.

Any chance that we've done the suggested steps in comment 13? Does the issue reproduce after the stock build and shallow flash?
Flags: needinfo?(anshulj)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #13)
> Anshul, try grabbing the stock v188 build from
> https://developer.mozilla.org/en-US/Firefox_OS/Developer_phone_guide/Flame
> and shallow flashing a gecko/gaia built from the sha1s that reproduce the
> issue here on it.

With the latest v188 image on flame and shallow flashing gecko/gaia from moz central I am able to reproduce the issue. Once I revert bug 1101158 the flame device boots up fine. So again confirming the fact that bug 1101158 is the offending bug.
Flags: needinfo?(anshulj)
(In reply to Kevin Grandon :kgrandon from comment #17)
> Anshul - thanks for the logs. I couldn't immediately see anything spit out
> by gaia that would be causing this, but there might be something in there
> more telling of the platform.

This sounds like we're hitting some kind of corner case which is causing memory consumption to hit a peak. Looking at the code in your change, however, I can't really tell what might be causing it.
Anshul - are there any actions performed before the device gets stuck in a reboot loop? Is this with manual execution or a marionette test? If there are some actions before the reboot loop, please let us know what they are.

It also sounds like a memory report would be useful if it's possible to get one before the device reboots.
Flags: needinfo?(anshulj)
Kevin, no specific action being taken besides simply trying to boot up the phone.
Flags: needinfo?(anshulj)
Attached file procrank
Please find attached procrank logs for a run on an internal device (not flame, as b2g-procrank doesn't seem to be working on flame for me).
Output of adb shell cat /proc/meminfo on the flame device:

MemTotal:         935400 kB
MemFree:          271200 kB
Buffers:            6152 kB
Cached:            55516 kB
SwapCached:         1052 kB
Active:           571284 kB
Inactive:          39480 kB
Active(anon):     548192 kB
Inactive(anon):     2112 kB
Active(file):      23092 kB
Inactive(file):    37368 kB
Unevictable:        1120 kB
Mlocked:               0 kB
HighTotal:        270336 kB
HighFree:            636 kB
LowTotal:         665064 kB
LowFree:          270564 kB
SwapTotal:        196604 kB
SwapFree:         183960 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        550212 kB
Mapped:            32640 kB
Shmem:                88 kB
Slab:              21528 kB
SReclaimable:       7672 kB
SUnreclaim:        13856 kB
KernelStack:        3368 kB
PageTables:         2924 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:      664304 kB
Committed_AS:     820628 kB
VmallocTotal:     245760 kB
VmallocUsed:       21660 kB
VmallocChunk:      79364 kB

Every time I run it the Active memory keeps going up and MemFree keeps going down until b2g gets killed.
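
The trend described above (Active rising while MemFree falls) can be checked mechanically across successive snapshots. This is a rough sketch only: `parse_meminfo` and `memory_shrinking` are hypothetical helper names, the first snapshot is an excerpt of the values pasted above, and the second snapshot is made up purely to illustrate the comparison; it is not measured data.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of field name -> value in kB."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[key.strip()] = int(parts[0])
    return fields


def memory_shrinking(snapshots):
    """True if MemFree strictly falls and Active strictly rises across snapshots."""
    if len(snapshots) < 2:
        return False
    free = [s["MemFree"] for s in snapshots]
    active = [s["Active"] for s in snapshots]
    falling = all(a > b for a, b in zip(free, free[1:]))
    rising = all(a < b for a, b in zip(active, active[1:]))
    return falling and rising


# Excerpt of the snapshot pasted in this comment.
first = parse_meminfo("""\
MemTotal:         935400 kB
MemFree:          271200 kB
Active:           571284 kB
Active(anon):     548192 kB
""")

# HYPOTHETICAL later snapshot, invented for illustration only.
second = parse_meminfo("""\
MemTotal:         935400 kB
MemFree:          200000 kB
Active:           640000 kB
""")

print(memory_shrinking([first, second]))
```

In practice each snapshot would come from a fresh `adb shell cat /proc/meminfo` call taken a few seconds apart.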
Whiteboard: [MemShrink]
Whiteboard: [MemShrink] → [MemShrink:P3]
Assignee: nobody → ntroast
We've root-caused this to a sinister landmine in our build environment.  I feel queasy.  Thanks for the debug help all!
Status: NEW → RESOLVED
blocking-b2g: 2.2? → ---
Closed: 9 years ago
Resolution: --- → INVALID
Hi Michael,

I am still curious about what kind of landmine in your environment can result in a memory leak in the b2g process based on this Gaia commit. Maybe there are some lessons learned that would help us identify this kind of problem more efficiently next time.

Could you kindly let us know what you found in more detail? Thank you.
Heh, so we cache the b2g_sdk locally because it's very annoying that the Gaia build wants to download it from ftp.mozilla.org every time. We had a subtle bug in our cache such that we were not using the latest b2g_sdk; we've had this bug for years now but just never triggered it until recently. So for some reason, that particular gaia patch was causing the older b2g_sdk to generate a bogus system app zip file, which was triggering the LMK at boot.

The best way to avoid this particular build mismatch between Moz/CAF in the future would be for Mozilla to store the b2g_sdk in a git project so that it can be properly versioned like the rest of the build and then we can stop using our local cache to avoid the ftp download.  I know other partners have requested this in the past as well.
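
One generic way to keep a local cache from silently serving an outdated SDK is to key each cache entry on the exact SDK that was requested. The sketch below is only an illustration of that idea and says nothing about how the actual cache or its bug worked: `cache_path` and `lookup` are hypothetical helpers, the example URLs are placeholders, and it assumes the requested SDK is identified by its download URL.

```python
import hashlib
from pathlib import Path


def cache_path(cache_dir, sdk_url):
    """Derive a cache location that changes whenever the requested SDK URL changes."""
    digest = hashlib.sha256(sdk_url.encode("utf-8")).hexdigest()[:16]
    return Path(cache_dir) / f"b2g_sdk-{digest}"


def lookup(cache_dir, sdk_url):
    """Return the cached SDK path for this exact URL, or None if it must be downloaded."""
    path = cache_path(cache_dir, sdk_url)
    return path if path.exists() else None


# Placeholder URLs, for illustration only.
old_url = "https://example.invalid/b2g_sdk/old.tar.bz2"
new_url = "https://example.invalid/b2g_sdk/new.tar.bz2"

# Different SDK URLs map to different cache entries, so an old download
# cannot be returned when a newer SDK is requested.
print(cache_path("sdk-cache", old_url) != cache_path("sdk-cache", new_url))
```

With this layout a stale entry is simply never found under the new key, so the build falls back to downloading the SDK it actually asked for.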
No longer blocks: CAF-v3.0-FL-metabug