Closed
Bug 1092924
Opened 10 years ago
Closed 9 years ago
[MTBF][B2G] B2G crashed due to oom or other related issues.
Categories
(Firefox OS Graveyard :: Vendcom, defect)
Tracking
(blocking-b2g:-)
RESOLVED
INVALID
blocking-b2g | - |
People
(Reporter: wachen, Unassigned)
References
Details
(Whiteboard: [mtbf], [POVB])
22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962559] lowmem_shrink: convert oom_adj to oom_score_adj: 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962574] oom_adj 0 => oom_score_adj 0 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962581] oom_adj 1 => oom_score_adj 58 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962588] oom_adj 2 => oom_score_adj 117 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962595] oom_adj 6 => oom_score_adj 352 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962602] oom_adj 8 => oom_score_adj 470 22:36:45 11-01 22:36:45.192 I/cat ( 290): <6>[116728.962609] oom_adj 10 => oom_score_adj 588 22:36:45 The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure. ------------------------------------------------------------------------------------------------------------------------------------------------------------------ Is oom killer kiling b2g or what? I can't understand the log. Since B2G crashes shouldn't happen in random cases, can anyone help on investigating here?
Reporter | ||
Comment 1•10 years ago
|
||
mtbf test machine # 7 task: "flame.v2.1.mtbf7.319" build # 101 device: e472d80e V2.1 pvt BuildID:20141030161200
Reporter | ||
Comment 2•10 years ago
|
||
The log is too large for me to upload here. Please download it from box.com https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
Reporter | ||
Comment 3•10 years ago
|
||
The full running history of jenkins (including some gaiatest/marionette debugging information, and some immediate logcat info) https://mozilla.box.com/s/blbo07lj1nycznhzvnjd
Reporter | ||
Updated•10 years ago
|
Summary: [MTBF][B2G] B2G crashed and restarted → [MTBF][B2G] B2G crashed due to oom or other related issues.
Reporter | ||
Comment 4•10 years ago
|
||
Also, I found build # 98 & 102 has similiar message. build#98 19:19:09 11-02 19:19:07.461 I/cat ( 296): <6>[191276.715013] mmc0: Starting deferred resume 19:19:09 11-02 19:19:07.521 I/cat ( 296): <6>[191276.774734] mmc0: Deferred resume completed 19:19:09 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure. build#102 09:11:14 11-03 09:11:12.885 I/cat ( 232): <6>[241189.377736] mmc0: Starting deferred resume 09:11:14 11-03 09:11:12.945 I/cat ( 232): <6>[241189.438481] mmc0: Deferred resume completed 09:11:14 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure. Devices in other mtbf machine are showing similiar logcat information. However, it doesn't necessary die after "B2G is still running ..." messages.
Comment 6•10 years ago
|
||
[Blocking Requested - why for this release]: Some MTBF testing devices are stopped by this issue.
blocking-b2g: --- → 2.1?
Comment 7•10 years ago
|
||
(In reply to Walter Chen[:ypwalter][:wachen] from comment #2) > The log is too large for me to upload here. Please download it from box.com > https://mozilla.box.com/s/u5urdo7morigc4ir8tdp From dmesg I saw only one OOM: <3>[25841.838754] Out of memory: Kill process 2515 (Messages) score 918 or sacrifice child (In reply to Walter Chen[:ypwalter][:wachen] from comment #3) > The full running history of jenkins (including some gaiatest/marionette > debugging information, and some immediate logcat info) > https://mozilla.box.com/s/blbo07lj1nycznhzvnjd B2G process is killed by OOM killer: 11-01 22:36:43.442 I/cat ( 290): <3>[116727.100412] Out of memory: Kill process 206 (b2g) score 154 or sacrifice child
Flags: needinfo?(tchou)
Comment 8•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #7) > (In reply to Walter Chen[:ypwalter][:wachen] from comment #3) > > The full running history of jenkins (including some gaiatest/marionette > > debugging information, and some immediate logcat info) > > https://mozilla.box.com/s/blbo07lj1nycznhzvnjd > > B2G process is killed by OOM killer: > > 11-01 22:36:43.442 I/cat ( 290): <3>[116727.100412] Out of memory: Kill > process 206 (b2g) score 154 or sacrifice child I was wondering if there's any memory leaks, but the free is quite large when b2g gets killed: 11-01 22:36:43.442 I/cat ( 290): <6>[116727.167605] Normal free:62008kB
Reporter | ||
Comment 9•10 years ago
|
||
Ting-Yu, there are logs in my comment 2 and comment 3. If it possible that you can look further to see if there is any memory leaks? https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
Comment 10•10 years ago
|
||
Since there's still plenty free memory, I doubt it is the "MC_Thread" allocating improper size. Note not only b2g, but also homescreen, ui tests, and nuwa are killed by OOM killer with the same backtrace: 11-01 22:36:43.442 I/cat ( 290): <4>[116727.155281] MC_Thread invoked oom-killer: gfp_mask=0x40d0, order=3, oom_adj=0, oom_score_adj=0 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163044] [<c010bd74>] (unwind_backtrace+0x0/0xf8) from [<c087d748>] (dump_header.isra.10+0x74/0x180) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163066] [<c087d748>] (dump_header.isra.10+0x74/0x180) from [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163087] [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220) from [<c020f6bc>] (out_of_memory+0x2bc/0x334) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163107] [<c020f6bc>] (out_of_memory+0x2bc/0x334) from [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163124] [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c) from [<c0214100>] (__get_free_pages+0x10/0x24) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163143] [<c0214100>] (__get_free_pages+0x10/0x24) from [<c02445b0>] (kmalloc_order_trace+0x24/0xd0) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163160] [<c02445b0>] (kmalloc_order_trace+0x24/0xd0) from [<c0247404>] (__kmalloc+0x204/0x288) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.163593] [<c0247404>] (__kmalloc+0x204/0x288) from [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.164194] [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan]) from [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.164731] [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan]) from [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.165268] [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan]) from [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.165801] [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan]) from [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.166320] [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan]) from [<bf069030>] (peProcessMessages+0x38/0x4c [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.167048] [<bf069030>] (peProcessMessages+0x38/0x4c [wlan]) from [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan]) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.167486] [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan]) from [<c0197ae0>] (kthread+0x84/0x90) 11-01 22:36:43.442 I/cat ( 290): <6>[116727.167507] [<c0197ae0>] (kthread+0x84/0x90) from [<c0106820>] (kernel_thread_exit+0x0/0x8) It is not in our codebase, can we have partner's comments?
Reporter | ||
Comment 11•10 years ago
|
||
Wesly, can you help with this issue? We suspect that MC_Thread is trying to allocate unproper memory (kmalloc). It cause that B2G got killed by OOM killer. Please see comment 10 for reference. That is not one of the area we are taking care of. We suspect it is a partner issue. Please let t2m know and we need to have someone taking care of this.
Flags: needinfo?(wehuang)
Comment 12•10 years ago
|
||
2.1+ because this is blocking Mozilla from hitting MTBF goal of 100 hours before 2.1 CC date.
blocking-b2g: 2.1? → 2.1+
Comment 13•10 years ago
|
||
Sorry for catching up late. ni T2M Hi Youlong: please take a look at this issue, would you help clarify the design/behavior of "MC_Thread", and it this symptom is reasonable based on it? See all comments above for related investigation we've done. Thanks.
Flags: needinfo?(wehuang) → needinfo?(youlong.jiang)
Comment 14•10 years ago
|
||
(In reply to Wesly Huang from comment #13) > Sorry for catching up late. ni T2M > > Hi Youlong: please take a look at this issue, would you help clarify the > design/behavior of "MC_Thread", and it this symptom is reasonable based on > it? See all comments above for related investigation we've done. Thanks. wesly - how about rep-rate for this issue, and did you summarize the regular rep-step. from our view, MC_Thread is qcom private code, we need you provide more details and make a SR for qcom. tks.
Flags: needinfo?(youlong.jiang)
Comment 15•10 years ago
|
||
Hi Walter, Ting-Yu: would you help on comment#14 for repro. rate, and a more general description of this symptom? Thank you.
Flags: needinfo?(wachen)
Flags: needinfo?(tchou)
Comment 16•10 years ago
|
||
According to comment 10, I change the component.
Component: Stability → Vendcom
Comment 17•10 years ago
|
||
I don't have reporduce rate, for a description, how about this? B2G process is killed by OOM killer when the free memory is ~60MB, from log it is MC_Thread invoked OOM killer. Is it possible the MC_Thread is allocating improper size? Note not only B2G, but also Homescreen, UI tests, and Nuwa are killed by OOM killer with the same backtrace.
Flags: needinfo?(tchou)
Comment 18•10 years ago
|
||
Thanks Ting-Yu, I thin it's clear and absolutely better then what I can write :) @Youlong, pls take comment#17 and check with QCT, and keep us updated, thank you.
Flags: needinfo?(youlong.jiang)
Reporter | ||
Comment 19•10 years ago
|
||
Repro rate is about 10% among 10 devices+
Flags: needinfo?(wachen)
Updated•10 years ago
|
Whiteboard: [mtbf]
Comment 20•10 years ago
|
||
(In reply to Wesly Huang from comment #18) > Thanks Ting-Yu, I thin it's clear and absolutely better then what I can > write :) > > @Youlong, pls take comment#17 and check with QCT, and keep us updated, thank > you. Hi wesly: We have submitted a bug 01808680 into Qualcomm.avoid to be killed OOM is triged by kernel thread lowmemorykiller.It count points for every task.The higher the point, more an more likely to be killed.The /proc/b2g_pid/oom_adj currently is 0,you can set it smaller like -16 in order to avoid being killed.
Updated•10 years ago
|
Whiteboard: [mtbf] → [mtbf], POVB
Comment 21•10 years ago
|
||
Dears, no feedback on qcom SR path. tks.
Flags: needinfo?(youlong.jiang) → needinfo?(wehuang)
Comment 22•10 years ago
|
||
Hi Youlong, pls help follow w/ QCT for their feedback and keep us posted, thanks. @Ting-Yu: do you have other suggestion about this, if there is no good news from QCT in the end? thank you.
Flags: needinfo?(youlong.jiang)
Flags: needinfo?(wehuang)
Flags: needinfo?(tchou)
Comment 23•10 years ago
|
||
(In reply to Wesly Huang from comment #22) > @Ting-Yu: do you have other suggestion about this, if there is no good news > from QCT in the end? thank you. No, since it is something we don't know triggering kernel OOM killer when free is ~60MB. At least I expect things like debug version image which QA can help to get a log.
Flags: needinfo?(tchou)
Comment 24•10 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #23) > (In reply to Wesly Huang from comment #22) > > @Ting-Yu: do you have other suggestion about this, if there is no good news > > from QCT in the end? thank you. > > No, since it is something we don't know triggering kernel OOM killer when > free is ~60MB. At least I expect things like debug version image which QA > can help to get a log. QA says“As we talked, you can adjust oom_score to avoid your key process be killed by OOM killer”
Updated•9 years ago
|
Whiteboard: [mtbf], POVB → [mtbf], [POVB]
Comment 25•9 years ago
|
||
Hi! Mike, Is this case still an issue with V18D image? -- Keven
Flags: needinfo?(mlien)
Comment 26•9 years ago
|
||
Hi Youlong: As discussed in meeting, please help checked if there is new update in QCT's SR and let us know here, thanks.
Comment 27•9 years ago
|
||
(In reply to Wesly Huang from comment #26) > Hi Youlong: > > As discussed in meeting, please help checked if there is new update in QCT's > SR and let us know here, thanks. hi wesly - sorry for bring un-convenient. I would correct feedback in comment#24 qcom's latest update is "adjust oom_score to avoid key process be killed by OOM killer" tks.
Flags: needinfo?(youlong.jiang)
Comment 28•9 years ago
|
||
Gabriele, are there any concerns if we set the the oom_score_adj of B2G process -1000 to disable OOM?
Flags: needinfo?(gsvelto)
Updated•9 years ago
|
Flags: needinfo?(mlien)
Comment 29•9 years ago
|
||
(In reply to Ting-Yu Chou [:ting] from comment #28) > Gabriele, are there any concerns if we set the the oom_score_adj of B2G > process -1000 to disable OOM? I think it might be problematic. Preventing the b2g main process from being killed means that if we start running out of memory we'll start killing other things first, such as the various system daemons (rild, nfcd, vold, wpa_supplicant, etc...). I'm afraid that this would make the phone unusable. Additionally if the b2g process gets OOM killed we should restart it automatically IIRC so as to return the phone to a working state; I don't think we do the same for other system processes though.
Flags: needinfo?(gsvelto)
Reporter | ||
Comment 31•9 years ago
|
||
not seeing this for a really long time. let's close it for now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
You need to log in
before you can comment on or make changes to this bug.
Description
•