Closed Bug 1092924 Opened 10 years ago Closed 9 years ago

[MTBF][B2G] B2G crashed due to oom or other related issues.

Product/Component: Firefox OS Graveyard :: Vendcom
Type: defect
Platform: ARM / Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking: blocking-b2g: -
Status: RESOLVED INVALID
Reporter: wachen
Assignee: Unassigned
Whiteboard: [mtbf], [POVB]

22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962559] lowmem_shrink: convert oom_adj to oom_score_adj:
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962574] oom_adj 0 => oom_score_adj 0
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962581] oom_adj 1 => oom_score_adj 58
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962588] oom_adj 2 => oom_score_adj 117
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962595] oom_adj 6 => oom_score_adj 352
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962602] oom_adj 8 => oom_score_adj 470
22:36:45 11-01 22:36:45.192 I/cat     (  290): <6>[116728.962609] oom_adj 10 => oom_score_adj 588
22:36:45 The B2G process has restarted after crashing during the tests so Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

------------------------------------------------------------------------------------------------------------------------------------------------------------------

Is the OOM killer killing b2g, or is it something else? I can't understand the log. Since B2G crashes shouldn't happen randomly, can anyone help investigate this?
MTBF test machine: #7
task: "flame.v2.1.mtbf7.319"
build: #101
device: e472d80e
V2.1 pvt, BuildID: 20141030161200
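About the lowmem_shrink lines in the log above: the kernel converts the legacy oom_adj value (range -17..15) to the newer oom_score_adj scale (range -1000..1000) linearly. A minimal sketch of that conversion, reproducing the table printed in the log (an illustration of the standard kernel formula, not code from this bug):

OOM_ADJUST_MAX = 15       # legacy oom_adj range: -17..15
OOM_DISABLE = -17
OOM_SCORE_ADJ_MAX = 1000  # oom_score_adj range: -1000..1000

def oom_adj_to_oom_score_adj(oom_adj):
    # Mirrors the kernel's linear scaling; C truncation and Python floor
    # division agree for the non-negative values shown in the log.
    if oom_adj == OOM_ADJUST_MAX:
        return OOM_SCORE_ADJ_MAX
    return oom_adj * OOM_SCORE_ADJ_MAX // -OOM_DISABLE

for adj in (0, 1, 2, 6, 8, 10):
    print("oom_adj %d => oom_score_adj %d" % (adj, oom_adj_to_oom_score_adj(adj)))
# Prints the same pairs as the lowmem_shrink lines above:
# 0 => 0, 1 => 58, 2 => 117, 6 => 352, 8 => 470, 10 => 588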
Blocks: MTBF-B2G
The log is too large for me to upload here. Please download it from box.com
https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
The full running history of jenkins (including some gaiatest/marionette debugging information, and some immediate logcat info)
https://mozilla.box.com/s/blbo07lj1nycznhzvnjd
Summary: [MTBF][B2G] B2G crashed and restarted → [MTBF][B2G] B2G crashed due to oom or other related issues.
Also, I found that builds #98 and #102 have similar messages.

build#98
19:19:09 11-02 19:19:07.461 I/cat     (  296): <6>[191276.715013] mmc0: Starting deferred resume
19:19:09 11-02 19:19:07.521 I/cat     (  296): <6>[191276.774734] mmc0: Deferred resume completed
19:19:09 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

build#102
09:11:14 11-03 09:11:12.885 I/cat     (  232): <6>[241189.377736] mmc0: Starting deferred resume
09:11:14 11-03 09:11:12.945 I/cat     (  232): <6>[241189.438481] mmc0: Deferred resume completed
09:11:14 B2G is still running but Marionette can't respond due to either a Gecko, Gaia or Marionette error. Above, the 5 most recent errors are listed. Check logcat for all errors if these errors are not the cause of the failure.

Devices on other MTBF machines show similar logcat output. However, they don't necessarily die after the "B2G is still running ..." messages.
Hi Ting,

Could you help with this? Thanks.
Flags: needinfo?(tchou)
[Blocking Requested - why for this release]:
Some MTBF testing devices are stopped by this issue.
blocking-b2g: --- → 2.1?
(In reply to Walter Chen[:ypwalter][:wachen] from comment #2)
> The log is too large for me to upload here. Please download it from box.com
> https://mozilla.box.com/s/u5urdo7morigc4ir8tdp

From dmesg I saw only one OOM:

<3>[25841.838754] Out of memory: Kill process 2515 (Messages) score 918 or sacrifice child

(In reply to Walter Chen[:ypwalter][:wachen] from comment #3)
> The full running history of jenkins (including some gaiatest/marionette
> debugging information, and some immediate logcat info)
> https://mozilla.box.com/s/blbo07lj1nycznhzvnjd

B2G process is killed by OOM killer:

11-01 22:36:43.442 I/cat     (  290): <3>[116727.100412] Out of memory: Kill process 206 (b2g) score 154 or sacrifice child
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #7)
> (In reply to Walter Chen[:ypwalter][:wachen] from comment #3)
> > The full running history of jenkins (including some gaiatest/marionette
> > debugging information, and some immediate logcat info)
> > https://mozilla.box.com/s/blbo07lj1nycznhzvnjd
> 
> B2G process is killed by OOM killer:
> 
> 11-01 22:36:43.442 I/cat     (  290): <3>[116727.100412] Out of memory: Kill
> process 206 (b2g) score 154 or sacrifice child

I was wondering if there's a memory leak, but free memory is quite large when b2g gets killed:

11-01 22:36:43.442 I/cat     (  290): <6>[116727.167605] Normal free:62008kB
Ting-Yu, there are logs in my comment 2 and comment 3.
Is it possible for you to look further and see if there are any memory leaks?

https://mozilla.box.com/s/u5urdo7morigc4ir8tdp
Since there's still plenty of free memory, I suspect it is "MC_Thread" allocating an improper size. Note that not only b2g, but also Homescreen, UI tests, and Nuwa are killed by the OOM killer with the same backtrace:

11-01 22:36:43.442 I/cat     (  290): <4>[116727.155281] MC_Thread invoked oom-killer: gfp_mask=0x40d0, order=3, oom_adj=0, oom_score_adj=0
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163044] [<c010bd74>] (unwind_backtrace+0x0/0xf8) from [<c087d748>] (dump_header.isra.10+0x74/0x180)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163066] [<c087d748>] (dump_header.isra.10+0x74/0x180) from [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163087] [<c087d898>] (oom_kill_process.part.13.constprop.17+0x44/0x220) from [<c020f6bc>] (out_of_memory+0x2bc/0x334)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163107] [<c020f6bc>] (out_of_memory+0x2bc/0x334) from [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163124] [<c0214070>] (__alloc_pages_nodemask+0x91c/0x94c) from [<c0214100>] (__get_free_pages+0x10/0x24)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163143] [<c0214100>] (__get_free_pages+0x10/0x24) from [<c02445b0>] (kmalloc_order_trace+0x24/0xd0)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163160] [<c02445b0>] (kmalloc_order_trace+0x24/0xd0) from [<c0247404>] (__kmalloc+0x204/0x288)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.163593] [<c0247404>] (__kmalloc+0x204/0x288) from [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.164194] [<bf0a90dc>] (peCreateSession+0xe0/0x40c [wlan]) from [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.164731] [<bf0712b0>] (limHandleFTPreAuthRsp+0xe8/0x270 [wlan]) from [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.165268] [<bf081494>] (limProcessAuthFrameNoSession+0x248/0x3c8 [wlan]) from [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.165801] [<bf085ec8>] (limProcessMessages+0x1bfc/0x1db0 [wlan]) from [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.166320] [<bf086324>] (limMessageProcessor+0x21c/0x2e4 [wlan]) from [<bf069030>] (peProcessMessages+0x38/0x4c [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167048] [<bf069030>] (peProcessMessages+0x38/0x4c [wlan]) from [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan])
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167486] [<bf155a40>] (VosMCThread+0x344/0x6c8 [wlan]) from [<c0197ae0>] (kthread+0x84/0x90)
11-01 22:36:43.442 I/cat     (  290): <6>[116727.167507] [<c0197ae0>] (kthread+0x84/0x90) from [<c0106820>] (kernel_thread_exit+0x0/0x8)

This is not in our codebase; can we have the partner's comments?
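A note on the trigger line above: gfp_mask=0x40d0, order=3 means the wlan driver's kmalloc asked for 2^3 = 8 physically contiguous pages (32 KiB). A higher-order allocation like that can invoke the OOM killer even when total free memory looks healthy (Normal free:62008kB, comment 8) if memory is too fragmented to supply a contiguous 32 KiB block. A minimal sketch for checking this on a device, assuming the standard /proc/buddyinfo layout (my illustration, not from this bug):

# Count free blocks of order >= 3 per zone; if the Normal zone has
# none, an order-3 kmalloc can fail despite plenty of free memory.
def order3_blocks(path="/proc/buddyinfo"):
    with open(path) as f:
        for line in f:
            # e.g. "Node 0, zone   Normal   216   55   189   101 ..."
            _, _, rest = line.partition("zone")
            zone, *counts = rest.split()
            print("%s: %d free blocks of order >= 3"
                  % (zone, sum(int(c) for c in counts[3:])))

order3_blocks()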
Wesly, can you help with this issue?

We suspect that MC_Thread is trying to allocate an improper amount of memory (via kmalloc), which causes B2G to get killed by the OOM killer. Please see comment 10 for reference. That is not an area we take care of; we suspect it is a partner issue. Please let T2M know, as we need someone to take care of this.
Flags: needinfo?(wehuang)
2.1+ because this is blocking Mozilla from hitting the MTBF goal of 100 hours before the 2.1 CC date.
blocking-b2g: 2.1? → 2.1+
Depends on: 1081577
Sorry for catching up late. ni T2M

Hi Youlong: please take a look at this issue. Would you help clarify the design/behavior of "MC_Thread", and whether this symptom is reasonable given that design? See all the comments above for the related investigation we've done. Thanks.
Flags: needinfo?(wehuang) → needinfo?(youlong.jiang)
(In reply to Wesly Huang from comment #13)
> Sorry for catching up late. ni T2M
> 
> Hi Youlong: please take a look at this issue. Would you help clarify the
> design/behavior of "MC_Thread", and whether this symptom is reasonable given
> that design? See all the comments above for the related investigation we've
> done. Thanks.

Wesly,

What is the repro rate for this issue, and have you summarized the regular repro steps? From our view, MC_Thread is QCOM private code; we need you to provide more details so that we can file an SR with QCOM.

Thanks.
Flags: needinfo?(youlong.jiang)
Hi Walter, Ting-Yu: would you help with comment #14 by providing the repro rate and a more general description of this symptom? Thank you.
Flags: needinfo?(wachen)
Flags: needinfo?(tchou)
Per comment 10, I changed the component.
Component: Stability → Vendcom
I don't have a repro rate; for a description, how about this?

The B2G process is killed by the OOM killer when free memory is ~60MB; from the log, it is MC_Thread that invoked the OOM killer. Is it possible that MC_Thread is allocating an improper size? Note that not only B2G, but also Homescreen, UI tests, and Nuwa are killed by the OOM killer with the same backtrace.
Flags: needinfo?(tchou)
Thanks Ting-Yu, I think it's clear and absolutely better than what I can write :)

@Youlong, please take comment #17 and check with QCT, and keep us updated, thank you.
Flags: needinfo?(youlong.jiang)
Repro rate is about 10% among 10+ devices.
Flags: needinfo?(wachen)
Whiteboard: [mtbf]
(In reply to Wesly Huang from comment #18)
> Thanks Ting-Yu, I think it's clear and absolutely better than what I can
> write :)
> 
> @Youlong, please take comment #17 and check with QCT, and keep us updated,
> thank you.

Hi wesly:
We have submitted bug 01808680 to Qualcomm.
To avoid b2g being killed: OOM is triggered by the kernel thread lowmemorykiller. It counts points for every task; the higher the points, the more likely the task is to be killed. /proc/<b2g_pid>/oom_adj is currently 0; you can set it smaller, e.g. -16, to avoid being killed.
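For concreteness, a minimal sketch of the suggested tweak via the legacy /proc/<pid>/oom_adj interface (my illustration, not a patch from this bug; assumes root, and that the main process's comm name is "b2g"):

import os

# Find the b2g main process by comm name and lower its legacy oom_adj;
# -17 (OOM_DISABLE) would exempt it from the OOM killer entirely.
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            if f.read().strip() != "b2g":
                continue
        with open("/proc/%s/oom_adj" % pid, "w") as f:
            f.write("-16")
        print("set oom_adj=-16 for pid %s" % pid)
    except OSError:
        pass  # process exited, or permission denied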
Whiteboard: [mtbf] → [mtbf], POVB
Dears,

There is no feedback yet via the QCOM SR path.

Thanks.
Flags: needinfo?(youlong.jiang) → needinfo?(wehuang)
Hi Youlong, please help follow up with QCT for their feedback and keep us posted, thanks.

@Ting-Yu: do you have any other suggestions about this, if there is no good news from QCT in the end? Thank you.
Flags: needinfo?(youlong.jiang)
Flags: needinfo?(wehuang)
Flags: needinfo?(tchou)
(In reply to Wesly Huang from comment #22)
> @Ting-Yu: do you have any other suggestions about this, if there is no good
> news from QCT in the end? Thank you.

No, since we don't know what is triggering the kernel OOM killer when free memory is ~60MB. At the least, I'd expect something like a debug-version image with which QA can help capture a log.
Flags: needinfo?(tchou)
(In reply to Ting-Yu Chou [:ting] from comment #23)
> (In reply to Wesly Huang from comment #22)
> > @Ting-Yu: do you have any other suggestions about this, if there is no
> > good news from QCT in the end? Thank you.
> 
> No, since we don't know what is triggering the kernel OOM killer when free
> memory is ~60MB. At the least, I'd expect something like a debug-version
> image with which QA can help capture a log.
QA says: "As we talked, you can adjust oom_score to avoid your key process be killed by OOM killer"
Whiteboard: [mtbf], POVB → [mtbf], [POVB]
Hi! Mike,

Is this case still an issue with V18D image?

--
Keven
Flags: needinfo?(mlien)
Hi Youlong:

As discussed in the meeting, please help check if there is any new update in QCT's SR and let us know here, thanks.
(In reply to Wesly Huang from comment #26)
> Hi Youlong:
> 
> As discussed in the meeting, please help check if there is any new update
> in QCT's SR and let us know here, thanks.

Hi Wesly,

Sorry for the inconvenience; I would like to correct the feedback in comment #24.

QCOM's latest update is "adjust oom_score to avoid key process be killed by OOM killer".

Thanks.
Flags: needinfo?(youlong.jiang)
Gabriele, are there any concerns if we set the oom_score_adj of the B2G process to -1000 to disable OOM killing for it?
Flags: needinfo?(gsvelto)
Flags: needinfo?(mlien)
(In reply to Ting-Yu Chou [:ting] from comment #28)
> Gabriele, are there any concerns if we set the oom_score_adj of the B2G
> process to -1000 to disable OOM killing for it?

I think it might be problematic. Preventing the b2g main process from being killed means that if we start running out of memory we'll start killing other things first, such as the various system daemons (rild, nfcd, vold, wpa_supplicant, etc...). I'm afraid that this would make the phone unusable. Additionally if the b2g process gets OOM killed we should restart it automatically IIRC so as to return the phone to a working state; I don't think we do the same for other system processes though.
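To make that concern concrete, a minimal sketch that lists every process's current oom_score_adj, most killable first; anything ranked above a b2g pinned at -1000 (e.g. daemons sitting at 0) would be sacrificed before it (my illustration, not from this bug):

import os

# Dump oom_score_adj for all processes, highest (most killable) first.
procs = []
for pid in filter(str.isdigit, os.listdir("/proc")):
    try:
        with open("/proc/%s/comm" % pid) as f:
            name = f.read().strip()
        with open("/proc/%s/oom_score_adj" % pid) as f:
            procs.append((int(f.read()), name, pid))
    except (OSError, ValueError):
        pass

for adj, name, pid in sorted(procs, reverse=True):
    print("%6d  %s (pid %s)" % (adj, name, pid))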
Flags: needinfo?(gsvelto)
It's no longer a blocker for 2.1 MTBF now.
blocking-b2g: 2.1+ → -
We haven't seen this for a really long time; let's close it for now.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID