Closed Bug 1092924 Opened 10 years ago Closed 9 years ago

[MTBF][B2G] B2G crashed due to oom or other related issues.

Product/Component: Firefox OS Graveyard :: Vendcom
Type: defect
Platform: ARM / Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking: blocking-b2g: -
Status: RESOLVED INVALID
Reporter: wachen
Assignee: Unassigned
Whiteboard: [mtbf], [POVB]

22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962559] lowmem_shrink: convert oom_adj to oom_score_adj:
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962574] oom_adj 0 => oom_score_adj 0
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962581] oom_adj 1 => oom_score_adj 58
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962588] oom_adj 2 => oom_score_adj 117
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962595] oom_adj 6 => oom_score_adj 352
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962602] oom_adj 8 => oom_score_adj 470
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962609] oom_adj 10 => oom_score_adj 588
22:36:45 The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Is the OOM killer killing b2g, or is it something else? I can't understand the log. Since B2G crashes shouldn't happen randomly, can anyone help investigate this?
MTBF test machine: #7
task: "flame.v2.1.mtbf7.319"
build: #101
device: e472d80e
V2.1 pvt, BuildID: 20141030161200
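About the lowmem_shrink lines in the log above: the kernel converts the legacy oom_adj value (range -17..15) to the newer oom_score_adj scale (range -1000..1000) linearly. A minimal sketch of that conversion, reproducing the table printed in the log (an illustration of the standard kernel formula, not code from this bug):

OOM_ADJUST_MAX = 15       # legacy oom_adj range: -17..15
OOM_DISABLE = -17
OOM_SCORE_ADJ_MAX = 1000  # oom_score_adj range: -1000..1000

def oom_adj_to_oom_score_adj(oom_adj):
    # Mirrors the kernel's linear scaling; C truncation and Python floor
    # division agree for the non-negative values shown in the log.
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX
    return oom_adj * OOM_SCORE_ADJ_MAX // -OOM_DISABLE

for adj in (0, 1, 2, 6, 8, 10):
    print("oom_adj %d => oom_score_adj %d" % (adj, oom_adj_to_oom_score_adj(adj)))
# Prints the same pairs as the lowmem_shrink lines above:
# 0 => 0, 1 => 58, 2 => 117, 6 => 352, 8 => 470, 10 => 588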
Blocks: MTBF-B2G
The log is too large for me to upload here. Please download it from box.com
https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
The full running history of jenkins (including some gaiatest/marionette debugging information, and some immediate logcat info)
https://mozilla.box.com/s/blbo07lj1nycznhzvnjd
Summary: [MTBF][B2G] B2G crashed and restarted → [MTBF][B2G] B2G crashed due to oom or other related issues.
Also, I found that builds #98 and #102 have similar messages.

build#98
19:19:09 11-02 19:19:07.461 I/cat     (  296): <6>[191276.715013] mmc0: Starting deferred resume
19:19:09 11-02 19:19:07.521 I/cat     (  296): <6>[191276.774734] mmc0: Deferred resume completed
19:19:09 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

build#102
09:11:14 11-03 09:11:12.885 I/cat     (  232): <6>[241189.377736] mmc0: Starting deferred resume
09:11:14 11-03 09:11:12.945 I/cat     (  232): <6>[241189.438481] mmc0: Deferred resume completed
09:11:14 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

Devices on other MTBF machines show similar logcat output. However, they don't necessarily die after the "B2G is still running ..." messages.
Hi Ting,

Could you help with this? Thanks.
Flags: needinfo?(tchou)
[Blocking Requested - why for this release]:
Some MTBF testing devices are stopped by this issue.
blocking-b2g: --- → 2.1?
(In reply to Walter Chen[:ypwalter][:wachen] from comment #2)
> The log is too large for me to upload here. Please download it from box.com
> https://mozilla.box.com/s/u5urdo7morigc4ir8tdp

From dmesg I saw only one OOM:

<3>[25841.838754] Out of memory: Kill process 2515 (Messages) score 918 or sacrifice child

(In reply to Walter Chen[:ypwalter][:wachen] from comment #3)
> The full running history of jenkins (including some gaiatest/marionette
> debugging information, and some immediate logcat info)
> https://mozilla.box.com/s/blbo07lj1nycznhzvnjd

B2G process is killed by OOM killer:

11-01 22:36:43.442 I/cat     (  290): <3>[116727.100412] Out of memory: Kill process 206 (b2g) score 154 or sacrifice child
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #7)
> (In reply to Walter Chen[:ypwalter][:wachen] from comment #3)
> > The full running history of jenkins (including some gaiatest/marionette
> > debugging information, and some immediate logcat info)
> > https://mozilla.box.com/s/blbo07lj1nycznhzvnjd
> 
> B2G process is killed by OOM killer:
> 
> 11-01 22:36:43.442 I/cat     (  290): <3>[116727.100412] Out of memory: Kill
> process 206 (b2g) score 154 or sacrifice child

I was wondering if there's a memory leak, but free memory is quite large when b2g gets killed:

11-01 22:36:43.442 I/cat     (  290): <6>[116727.167605] Normal free:62008kB
Ting-Yu, there are logs in my comment 2 and comment 3.
Is it possible for you to look further and see if there are any memory leaks?

https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
Since there's still plenty of free memory, I suspect it is "MC_Thread" allocating an improper size. Note that not only b2g, but also Homescreen, UI tests, and Nuwa are killed by the OOM killer with the same backtrace:

11-01 22:36:43.442 I/cat     (  290): <4>[116727.155281] MC_Thread invoked oom-killer: gfp_mask=0x40d0, order=3, oom_adj=0, oom_score_adj=0
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163044] [<c010bd74>] (unwind_backtrace+0x0/0xf8) from [<c087d748>] (dump_header.isra.10+0x74/0x180)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163066] [<c087d748>] (dump_header.isra.10+0x74/0x180) from [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163087] [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220) from [<c020f6bc>] (out_of_memory+0x2bc/0x334)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163107] [<c020f6bc>] (out_of_memory+0x2bc/0x334) from [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163124] [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c) from [<c0214100>] (__get_free_pages+0x10/0x24)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163143] [<c0214100>] (__get_free_pages+0x10/0x24) from [<c02445b0>] (kmalloc_order_trace+0x24/0xd0)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163160] [<c02445b0>] (kmalloc_order_trace+0x24/0xd0) from [<c0247404>] (__kmalloc+0x204/0x288)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163593] [<c0247404>] (__kmalloc+0x204/0x288) from [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.164194] [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan]) from [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.164731] [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan]) from [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.165268] [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan]) from [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.165801] [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan]) from [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.166320] [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan]) from [<bf069030>] (peProcessMessages+0x38/0x4c [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167048] [<bf069030>] (peProcessMessages+0x38/0x4c [wlan]) from [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167486] [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan]) from [<c0197ae0>] (kthread+0x84/0x90)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167507] [<c0197ae0>] (kthread+0x84/0x90) from [<c0106820>] (kernel_thread_exit+0x0/0x8)

This is not in our codebase; can we have the partner's comments?
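A note on the trigger line above: gfp_mask=0x40d0, order=3 means the wlan driver's kmalloc asked for 2^3 = 8 physically contiguous pages (32 KiB). A higher-order allocation like that can invoke the OOM killer even when total free memory looks healthy (Normal free:62008kB, comment 8) if memory is too fragmented to supply a contiguous 32 KiB block. A minimal sketch for checking this on a device, assuming the standard /proc/buddyinfo layout (my illustration, not from this bug):

# Count free blocks of order >= 3 per zone; if the Normal zone has
# none, an order-3 kmalloc can fail despite plenty of free memory.
def order3_blocks(path="/proc/buddyinfo"):
    with open(path) as f:
        for line in f:
            # e.g. "Node 0, zone   Normal   216   55   189   101 ..."
            _, _, rest = line.partition("zone")
            zone, *counts = rest.split()
            print("%s: %d free blocks of order >= 3"
                  % (zone, sum(int(c) for c in counts[3:])))

order3_blocks()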
Wesly, can you help with this issue?

We suspect that MC_Thread is trying to allocate an improper amount of memory (via kmalloc), which causes B2G to get killed by the OOM killer. Please see comment 10 for reference. That is not an area we take care of; we suspect it is a partner issue. Please let T2M know, as we need someone to take care of this.
Flags: needinfo?(wehuang)
2.1+ because this is blocking Mozilla from hitting the MTBF goal of 100 hours before the 2.1 CC date.
blocking-b2g: 2.1? → 2.1+
Depends on: 1081577
Sorry for catching up late. ni T2M

Hi Youlong: please take a look at this issue. Would you help clarify the design/behavior of "MC_Thread", and whether this symptom is reasonable given that design? See all the comments above for the related investigation we've done. Thanks.
Flags: needinfo?(wehuang) → needinfo?(youlong.jiang)
(In reply to Wesly Huang from comment #13)
> Sorry for catching up late. ni T2M
> 
> Hi Youlong: please take a look at this issue. Would you help clarify the
> design/behavior of "MC_Thread", and whether this symptom is reasonable given
> that design? See all the comments above for the related investigation we've
> done. Thanks.

Wesly,

What is the repro rate for this issue, and have you summarized the regular repro steps? From our view, MC_Thread is QCOM private code; we need you to provide more details so that we can file an SR with QCOM.

Thanks.
Flags: needinfo?(youlong.jiang)
Hi Walter, Ting-Yu: would you help with comment #14 by providing the repro rate and a more general description of this symptom? Thank you.
Flags: needinfo?(wachen)
Flags: needinfo?(tchou)
Per comment 10, I changed the component.
Component: Stability → Vendcom
I don't have a repro rate; for a description, how about this?

The B2G process is killed by the OOM killer when free memory is ~60MB; from the log, it is MC_Thread that invoked the OOM killer. Is it possible that MC_Thread is allocating an improper size? Note that not only B2G, but also Homescreen, UI tests, and Nuwa are killed by the OOM killer with the same backtrace.
Flags: needinfo?(tchou)
Thanks Ting-Yu, I think it's clear and absolutely better than what I can write :)

@Youlong, please take comment #17 and check with QCT, and keep us updated, thank you.
Flags: needinfo?(youlong.jiang)
Repro rate is about 10% among 10+ devices.
Flags: needinfo?(wachen)
Whiteboard: [mtbf]
(In reply to Wesly Huang from comment #18)
> Thanks Ting-Yu, I think it's clear and absolutely better than what I can
> write :)
> 
> @Youlong, please take comment #17 and check with QCT, and keep us updated,
> thank you.

Hi wesly:
We have submitted bug 01808680 to Qualcomm.
To avoid b2g being killed: OOM is triggered by the kernel thread lowmemorykiller. It counts points for every task; the higher the points, the more likely the task is to be killed. /proc/<b2g_pid>/oom_adj is currently 0; you can set it smaller, e.g. -16, to avoid being killed.
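For concreteness, a minimal sketch of the suggested tweak via the legacy /proc/<pid>/oom_adj interface (my illustration, not a patch from this bug; assumes root, and that the main process's comm name is "b2g"):

import os

# Find the b2g main process by comm name and lower its legacy oom_adj;
# -17 (OOM_DISABLE) would exempt it from the OOM killer entirely.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            if f.read().strip() != "b2g":
                continue
        with open("/proc/%s/oom_adj" % pid, "w") as f:
            f.write("-16")
        print("set oom_adj=-16 for pid %s" % pid)
    except OSError:
        pass  # process exited, or permission denied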
Whiteboard: [mtbf] → [mtbf], POVB
Dears,

There is no feedback yet via the QCOM SR path.

Thanks.
Flags: needinfo?(youlong.jiang) → needinfo?(wehuang)
Hi Youlong, please help follow up with QCT for their feedback and keep us posted, thanks.

@Ting-Yu: do you have any other suggestions about this, if there is no good news from QCT in the end? Thank you.
Flags: needinfo?(youlong.jiang)
Flags: needinfo?(wehuang)
Flags: needinfo?(tchou)
(In reply to Wesly Huang from comment #22)
> @Ting-Yu: do you have any other suggestions about this, if there is no good
> news from QCT in the end? Thank you.

No, since we don't know what is triggering the kernel OOM killer when free memory is ~60MB. At the least, I'd expect something like a debug-version image with which QA can help capture a log.
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #23)
> (In reply to Wesly Huang from comment #22)
> > @Ting-Yu: do you have any other suggestions about this, if there is no
> > good news from QCT in the end? Thank you.
> 
> No, since we don't know what is triggering the kernel OOM killer when free
> memory is ~60MB. At the least, I'd expect something like a debug-version
> image with which QA can help capture a log.
QA says: "As we talked, you can adjust oom_score to avoid your key process be killed by OOM killer"
Whiteboard: [mtbf], POVB → [mtbf], [POVB]
Hi! Mike,

Is this case still an issue with V18D image?

--
Keven
Flags: needinfo?(mlien)
Hi Youlong:

As discussed in the meeting, please help check if there is any new update in QCT's SR and let us know here, thanks.
(In reply to Wesly Huang from comment #26)
> Hi Youlong:
> 
> As discussed in the meeting, please help check if there is any new update
> in QCT's SR and let us know here, thanks.

Hi Wesly,

Sorry for the inconvenience; I would like to correct the feedback in comment #24.

QCOM's latest update is "adjust oom_score to avoid key process be killed by OOM killer".

Thanks.
Flags: needinfo?(youlong.jiang)
Gabriele, are there any concerns if we set the oom_score_adj of the B2G process to -1000 to disable OOM killing for it?
Flags: needinfo?(gsvelto)
Flags: needinfo?(mlien)
(In reply to Ting-Yu Chou [:ting] from comment #28)
> Gabriele, are there any concerns if we set the oom_score_adj of the B2G
> process to -1000 to disable OOM killing for it?

I think it might be problematic. Preventing the b2g main process from being killed means that if we start running out of memory we'll start killing other things first, such as the various system daemons (rild, nfcd, vold, wpa_supplicant, etc...). I'm afraid that this would make the phone unusable. Additionally if the b2g process gets OOM killed we should restart it automatically IIRC so as to return the phone to a working state; I don't think we do the same for other system processes though.
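To make that concern concrete, a minimal sketch that lists every process's current oom_score_adj, most killable first; anything ranked above a b2g pinned at -1000 (e.g. daemons sitting at 0) would be sacrificed before it (my illustration, not from this bug):

import os

# Dump oom_score_adj for all processes, highest (most killable) first.
procs = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            name = f.read().strip()
        with open("/proc/%s/oom_score_adj" % pid) as f:
            procs.append((int(f.read()), name, pid))
    except (OSError, ValueError):
        pass

for adj, name, pid in sorted(procs, reverse=True):
    print("%6d  %s (pid %s)" % (adj, name, pid))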
Flags: needinfo?(gsvelto)
It's no longer a blocker for 2.1 MTBF now.
blocking-b2g: 2.1+ → -
We haven't seen this for a really long time; let's close it for now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID