Closed Bug 886217 Opened 7 years ago Closed 6 years ago

[General] Memory leakage induces a crash on RPC Modem

Categories

(Firefox OS Graveyard :: General, defect, major)

ARM
Gonk (Firefox OS)
defect
Not set
major

Tracking

(blocking-b2g:leo+, b2g18 fixed)

RESOLVED FIXED
1.1 QE4 (15jul)
blocking-b2g leo+
Tracking Status
b2g18 --- fixed

People

(Reporter: leo.bugzilla.gecko, Unassigned)

References

Details

(Keywords: verifyme, Whiteboard: [TD-59414][MemShrink:P2] QARegressExclude)

Attachments

(5 files)

If gecko (the b2g process) contains a memory leak, we will see its RSS continue to grow like this.
  
At some point the process will likely grow large enough that it prevents the kernel from responding to modem RPC requests; the modem will then time out and induce a crash.
blocking-b2g: --- → leo?
Target Milestone: --- → 1.1 QE3 (24jun)
Is this speculation or did you actually reproduce the crash?
(In reply to Jason Smith [:jsmith] from comment #1)
> Is this speculation or did you actually reproduce the crash?

Not yet; we are checking now.

The relevant lines from the attached kernel log:
0) : <6>[2022-06-19 21:21:32 KST][43496.202695] [ pid ]   uid  tgid total_vm      rss   cpu oom_adj oom_score_adj name
1) : <6>[2022-06-19 17:53:18 KST][31001.723033] [  150]     0   150   112034    69527   0       0             0             b2g
2) : <6>[2022-06-19 21:21:32 KST][43496.202922] [  150]     0   150   128515    84430   0       0             0             b2g

b2g's rss size gradually increased from 69527 to 84430.
0) : <6>[2022-06-19 21:21:32 KST][43496.202695] [ pid ]   uid  tgid  total_vm    rss   cpu   oom_adj    oom_score_adj       name
1) : <6>[2022-06-19 17:53:18 KST][31001.723033] [  150]     0   150   112034    69527   0       0             0             b2g
2) : <6>[2022-06-19 21:21:32 KST][43496.202922] [  150]     0   150   128515    84430   0       0             0             b2g

Sorry, reposting with the column widths fixed.
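For reference, the two b2g lines above can be turned into a growth rate. This is only a rough sketch: it assumes the rss column of the oom-killer task dump is counted in 4 KiB pages (not confirmed anywhere in this bug), and the names `PAGE_SIZE`, `parse_task_line` and `rss_growth` are made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumption: rss/total_vm are in 4 KiB pages

# Matches one task row of the dump: [uptime] [pid] uid tgid total_vm rss cpu oom_adj oom_score_adj name
TASK_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"\[\s*(?P<pid>\d+)\]\s+(?P<uid>\d+)\s+(?P<tgid>\d+)\s+"
    r"(?P<total_vm>\d+)\s+(?P<rss>\d+)\s+(?P<cpu>\d+)\s+"
    r"(?P<oom_adj>-?\d+)\s+(?P<oom_score_adj>-?\d+)\s+(?P<name>\S+)"
)


def parse_task_line(line):
    """Return a dict for one task row, or None for headers and other lines."""
    m = TASK_RE.search(line)
    if m is None:
        return None
    row = m.groupdict()
    for key in ("pid", "uid", "tgid", "total_vm", "rss", "cpu"):
        row[key] = int(row[key])
    row["uptime"] = float(row["uptime"])
    return row


def rss_growth(first, second):
    """Return (growth in MiB, growth in MiB per hour) between two rows."""
    delta_mib = (second["rss"] - first["rss"]) * PAGE_SIZE / (1024 * 1024)
    hours = (second["uptime"] - first["uptime"]) / 3600.0
    return delta_mib, delta_mib / hours


# The two b2g rows quoted in this bug.
LINE_1 = "<6>[2022-06-19 17:53:18 KST][31001.723033] [  150]     0   150   112034    69527   0       0             0             b2g"
LINE_2 = "<6>[2022-06-19 21:21:32 KST][43496.202922] [  150]     0   150   128515    84430   0       0             0             b2g"

delta, rate = rss_growth(parse_task_line(LINE_1), parse_task_line(LINE_2))
print("b2g RSS grew %.1f MiB (%.1f MiB/hour)" % (delta, rate))
```

A steady rate over several hours is consistent with a leak but does not prove one; it only tells us how fast the number is moving.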
(In reply to Jason Smith [:jsmith] from comment #1)
> Is this speculation or did you actually reproduce the crash?

Actually, the crash happened while we were running a Marionette script to test functionality on a Leo device.
We are now trying to reproduce it and analyzing it to find the root cause.
If we find additional information, we will attach it here.
I don't think the growth of b2g's RSS necessarily means there is a memory leak in b2g. What about the PSS?

Second, the kernel log shows that this is a case where the oom-killer was invoked. If the PSS of b2g and the apps doesn't increase much compared to normal cases, it may be worth checking for a memory leak in the kernel or for fragmentation.
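To compare RSS against PSS as suggested above, one option is to sum the `Rss:` and `Pss:` fields of `/proc/<pid>/smaps`. A minimal sketch follows; `sum_smaps` and `read_smaps` are hypothetical helper names, the sample text is made up for illustration, and it assumes the kernel on the device exposes `Pss:` in smaps.

```python
def sum_smaps(smaps_text):
    """Sum the Rss and Pss fields (in kB) over all mappings in smaps text."""
    totals = {"Rss": 0, "Pss": 0}
    for line in smaps_text.splitlines():
        parts = line.split()
        # Field lines look like "Rss:      512 kB"; mapping headers are skipped.
        if len(parts) == 3 and parts[2] == "kB":
            field = parts[0].rstrip(":")
            if field in totals:
                totals[field] += int(parts[1])
    return totals


def read_smaps(pid):
    """Read /proc/<pid>/smaps (needs to run on the device, usually as root)."""
    with open("/proc/%d/smaps" % pid) as f:
        return f.read()


# Made-up sample in smaps format, for illustration only.
SAMPLE = """\
00008000-0000a000 r-xp 00000000 b3:15 1234 /system/b2g/b2g
Size:                  8 kB
Rss:                   8 kB
Pss:                   4 kB
40000000-40100000 rw-p 00000000 00:00 0
Size:               1024 kB
Rss:                 512 kB
Pss:                 512 kB
"""

print(sum_smaps(SAMPLE))
```

If RSS grows while PSS stays roughly flat, the growth is mostly in pages shared with other processes rather than in memory private to b2g.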
Gabriele can you keep an eye on this bug while Alan is going to be away for the week?
Flags: needinfo?(gsvelto)
(In reply to Wayne Chang [:wchang] from comment #7)
> Gabriele can you keep an eye on this bug while Alan is going to be away for
> the week?

Yes, I can have a look. Comment 5 suggests that this happens only when using Marionette. If that is the case, I know of a recent Marionette problem that looked like a leak but was in fact caused by the GC being delayed indefinitely, which effectively let b2g consume all memory. I'll try to dig out that bug and see if it's related.
Flags: needinfo?(gsvelto)
We previously had a similar problem that was fixed as part of bug 825802. The patch was uplifted to mozilla-b2g18 five months ago so you might be experiencing a different issue but it's probably worth checking.
We should try to grab an about:memory dump of the processes right before the crash happens. You can find instructions on how to do it in this tutorial:

https://wiki.mozilla.org/B2G/Debugging_OOMs
I'm using this script to get memory reports of b2g when the oom-killer is activated.
I registered the script as a service and ran the automated test via Marionette.
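For readers without the attachment, the trigger side of such a trap can be sketched as follows. This is not the attached oom_trap.sh; it is a hypothetical illustration that assumes the kernel log contains the usual "<process> invoked oom-killer" message, and the sample log lines are made up.

```python
import re

# The kernel logs "<comm> invoked oom-killer: ..." when the oom-killer runs.
OOM_RE = re.compile(r"(?P<comm>\S+) invoked oom-killer")


def oom_invokers(log_lines):
    """Return the process names that triggered the oom-killer, in log order.

    If a process name contains spaces, only its last word is captured.
    """
    names = []
    for line in log_lines:
        m = OOM_RE.search(line)
        if m:
            names.append(m.group("comm"))
    return names


# Made-up kernel log excerpt, for illustration only.
SAMPLE_LOG = [
    "<6>[43495.100000] binder: some unrelated message",
    "<4>[43496.200000] Homescreen invoked oom-killer: gfp_mask=0xd0, order=0",
    "<6>[43496.202695] [ pid ]   uid  tgid total_vm      rss   cpu oom_adj oom_score_adj name",
]

for name in oom_invokers(SAMPLE_LOG):
    # A real trap would collect a memory report at this point.
    print("oom-killer invoked by", name)
```

A real service would feed this from the live kernel log and collect the about:memory dump as soon as a match appears.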
Attached file STR from oom_trap.sh
I made a service from the oom_trap.sh script that I attached earlier.
With that service running, the Marionette test that caused the modem crash was run repeatedly for over 5 hours.
The Leo device didn't show a modem crash, but there were some signs of oom-killer activation.
So I think it's worth digging into.

According to the logs, there is a huge number of PNG images.
I don't think that's normal.

Does anyone have an idea about this?
(In reply to Changbin Park from comment #12)
> Does anyone have an idea about this?

Checking your memory report dumps, the image that shows up duplicated many times is: "...", so this is the same problem we've been investigating as part of bug 851626.

I'll close this one as a duplicate and add a comment there to point to your logs; I'll also CC everybody else there. Thanks for the thorough testing!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 851626
blocking-b2g: leo? → ---
Bug 851626 Comment 93

Before the Marionette test:
│ │ │ │ │ ├──1.03 MB (02.83%) -- huge
│ │ │ │ │ │ ├──0.19 MB (00.52%) ── string(length=23510, "...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.43%) ── string(length=18514, "...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.43%) ── string(length=20438, "...") [4]
│ │ │ │ │ │ ├──0.10 MB (00.27%) ── string(length=9114, "...") [5]
│ │ │ │ │ │ ├──0.09 MB (00.26%) ── string(length=10914, "...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.26%) ── string(length=7074, "...") [6]
│ │ │ │ │ │ ├──0.08 MB (00.21%) ── string(length=8054, "...") [5]
│ │ │ │ │ │ ├──0.05 MB (00.13%) ── string(length=4654, "...") [4]
│ │ │ │ │ │ ├──0.04 MB (00.11%) ── string(length=2522, "...") [5]
│ │ │ │ │ │ ├──0.04 MB (00.11%) ── string(length=3062, "...") [5]
│ │ │ │ │ │ └──0.04 MB (00.11%) ── string(length=3926, "...") [5]

After the Marionette test:
│ │ │ │ │ ├──1.55 MB (03.72%) -- huge
│ │ │ │ │ │ ├──0.23 MB (00.56%) ── string(length=9114, "...") [12]
│ │ │ │ │ │ ├──0.20 MB (00.49%) ── string(length=7074, "...") [13]
│ │ │ │ │ │ ├──0.19 MB (00.45%) ── string(length=23510, "...") [4]
│ │ │ │ │ │ ├──0.19 MB (00.45%) ── string(length=8054, "...") [12]
│ │ │ │ │ │ ├──0.16 MB (00.38%) ── string(length=18514, "...") [4]
│ │ │ │ │ │ ├──0.16 MB (00.38%) ── string(length=20438, "...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=10914, "...") [4]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=2522, "...") [12]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=3062, "...") [12]
│ │ │ │ │ │ ├──0.09 MB (00.23%) ── string(length=3926, "...") [12]
│ │ │ │ │ │ └──0.05 MB (00.11%) ── string(length=4654, "...") [4]
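The two dumps above are easier to compare mechanically. The sketch below parses the `string(length=N, ...) [count]` entries and lists the ones whose count grew. It keys each entry on its length, which assumes distinct strings have distinct lengths (true for this data, not in general); `string_counts` and `grown` are hypothetical names, and the tree glyphs are dropped from the sample lines.

```python
import re

# Captures the string length and the trailing "[count]" of merged entries.
STRING_RE = re.compile(r"string\(length=(\d+),.*\)\s*\[(\d+)\]")


def string_counts(report_text):
    """Map string length -> number of merged entries in a memory report."""
    counts = {}
    for line in report_text.splitlines():
        m = STRING_RE.search(line)
        if m:
            counts[int(m.group(1))] = int(m.group(2))
    return counts


def grown(before, after):
    """Return {length: (before_count, after_count)} for entries that grew."""
    return {
        length: (before.get(length, 0), count)
        for length, count in after.items()
        if count > before.get(length, 0)
    }


BEFORE = """\
0.19 MB (00.52%) ── string(length=23510, "...") [4]
0.16 MB (00.43%) ── string(length=18514, "...") [4]
0.16 MB (00.43%) ── string(length=20438, "...") [4]
0.10 MB (00.27%) ── string(length=9114, "...") [5]
0.09 MB (00.26%) ── string(length=10914, "...") [4]
0.09 MB (00.26%) ── string(length=7074, "...") [6]
0.08 MB (00.21%) ── string(length=8054, "...") [5]
0.05 MB (00.13%) ── string(length=4654, "...") [4]
0.04 MB (00.11%) ── string(length=2522, "...") [5]
0.04 MB (00.11%) ── string(length=3062, "...") [5]
0.04 MB (00.11%) ── string(length=3926, "...") [5]
"""

AFTER = """\
0.23 MB (00.56%) ── string(length=9114, "...") [12]
0.20 MB (00.49%) ── string(length=7074, "...") [13]
0.19 MB (00.45%) ── string(length=23510, "...") [4]
0.19 MB (00.45%) ── string(length=8054, "...") [12]
0.16 MB (00.38%) ── string(length=18514, "...") [4]
0.16 MB (00.38%) ── string(length=20438, "...") [4]
0.09 MB (00.23%) ── string(length=10914, "...") [4]
0.09 MB (00.23%) ── string(length=2522, "...") [12]
0.09 MB (00.23%) ── string(length=3062, "...") [12]
0.09 MB (00.23%) ── string(length=3926, "...") [12]
0.05 MB (00.11%) ── string(length=4654, "...") [4]
"""

for length, (old, new) in sorted(grown(string_counts(BEFORE), string_counts(AFTER)).items()):
    print("length=%d: %d -> %d copies" % (length, old, new))
```

On this data, six of the eleven strings gain copies while the other five stay at 4, so the growth is confined to a specific subset of strings.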
Bug 851626 Comment 94 

Running the gallery-camera stress test on an m-c/master build, the test actually completed; though the b2g parent process had ballooned to the point where only it and the Gallery app could fit in memory at the same time.

After 100 iterations, I see 205 copies of the Marketplace app icon, ~2 per iteration.

(jlebar, this is without your DOMRequest fixes--I'll run that test overnight.)

I know next to nothing about how marionette works, but I wonder how it could be very specifically leaking data: URI icons.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
blocking-b2g: --- → leo+
Whiteboard: [TD-59414]
Target Milestone: 1.1 QE3 (26jun) → 1.1 QE4 (15jul)
The icon duplication might be caused by Marionette.
It does not reproduce in manual testing.
Do you know who the right person is for this one?
Flags: needinfo?(tchung)
(In reply to jongsoo.oh from comment #16)
> The icon duplication might be caused by Marionette.
> It does not reproduce in manual testing.
> Do you know who the right person is for this one?

Marionette knowledge would be best addressed by jonathan griffin.  jgriffin, any known issues?
Flags: needinfo?(tchung) → needinfo?(jgriffin)
(In reply to Tony Chung [:tchung] from comment #17)
> (In reply to jongsoo.oh from comment #16)
> > The icon duplication might be caused by Marionette.
> > It does not reproduce in manual testing.
> > Do you know who the right person is for this one?
> 
> Marionette knowledge would be best addressed by jonathan griffin.  jgriffin,
> any known issues?

For more context: it seems bug 889261 is the actual issue that Leo is tracking. The first comment there notes that they are using Marionette to launch apps.
That is wrong. Bug 885158 is not related to Marionette.
It uses orangutan because of the Marionette icon issue.
There aren't any known Marionette issues related to icon duplication.  It's possible that there is a Marionette bug, or it's possible that there is a Gaia bug which only occurs when events are dispatched very quickly (as would happen during a Marionette test) and not when events happen more slowly as would happen during a manual test.

If you can show us the Marionette test you're using, we may be able to investigate.
Flags: needinfo?(jgriffin)
Whiteboard: [TD-59414] → [TD-59414][MemShrink]
The memory report here looks the same as bug 851626.

See bug 889990 for the analysis we've already done.  It's not clear to me yet how this is Marionette's fault, but given that we only see this with Marionette, I think it's pretty likely.

We're leaking DOM application objects.  If you do a gc/cc dump, you can see exactly how we're leaking them.
Leo, can you attach the Marionette script you're using to reproduce this issue, or describe what it's doing?  I'd like to reproduce it, so I can attempt to fix it, if indeed it's Marionette-related.
If there's some leak when running Marionette tests involving manifests, it may be due to the gaiatest atoms at https://github.com/mozilla/gaia-ui-tests/blob/master/gaiatest/atoms/gaia_apps.js, but I'd need to know if you're using gaiatest or not.
Whiteboard: [TD-59414][MemShrink] → [TD-59414][MemShrink:P2]
We're marking this MemShrink P2 because we're not sure whether this is a marionette-only issue or not.

If this is caused by Marionette, we should leo- this bug.  OTOH if it's not, we should probably make this a P1.

Like jgriffin said in comment 22, we need the test script Leo was running.
Attached file test_scripts.7z
I have attached the Marionette test scripts we used to reproduce this issue.
(In reply to leo.bugzilla.gecko from comment #25)
> Created attachment 780177 [details]
> test_scripts.7z
> 
> I have attached the Marionette test scripts we used to reproduce this issue.

Thank you; is it also possible to attach the lglib files that these tests use, so I can locally reproduce the problem?

Also, which of the attached tests were you running?  All of them in sequence?
Depends on: 897684
If the only relevant leak here is the icon url leak, that's been isolated in bug 897684.
I'm sorry, I thought you only needed the test scripts.
I have now attached the lglib files used for Marionette testing.
I have also attached some of the scripts that we tested. These files are run from git bash and are numbered in the order they are run.
needinfo'ing jgriffin here to help understand whether this is a Marionette-only issue, given the lglib files in the above comment and keeping in mind comment #24.
Flags: needinfo?(jgriffin)
AFAIK, it isn't a Marionette issue; it's caused at least in part by bug 897684 (a gecko bug).  Once that's fixed, we can determine if there are other leaks involved.
Flags: needinfo?(jgriffin)
QA Wanted for what purpose?

We probably don't need verifyme here; the bug isn't fixed yet.
Keywords: verifyme
Talked with Preeti in person: we don't need to do anything here yet for qawanted. When bug 889984 gets uplifted, a retest will be requested on this bug to see if it's fixed.
Keywords: qawanted
Bug 889984 will not be fixed for 1.1. Let's wait for bug 900221 to be uplifted instead. QA, please test whether fixing 900221 fixes this issue (886217) as well. Thanks!!
qawanted for comment 33
Keywords: qawanted
Whiteboard: [TD-59414][MemShrink:P2] → [TD-59414][MemShrink:P2] QARegressExclude
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Keywords: qawanted → verifyme