Closed Bug 1206634 Opened 6 years ago Closed 6 years ago

[MTBF] After 20 hours executing, test cases failed consequently

Categories

(Firefox OS Graveyard :: Stability, defect, P2)

ARM
Gonk (Firefox OS)

Tracking

(blocking-b2g:2.5+)

RESOLVED FIXED
blocking-b2g 2.5+

People

(Reporter: pyang, Assigned: ting)

References

Details

(Whiteboard: [MemShrink:P1])

Attachments

(2 files, 1 obsolete file)

Traceback (most recent call last):
File "/var/jenkins/workspace/flamekk.vmaster.moztwlab01.512/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_client-0.17-py2.7.egg/marionette/runner/mixins/reporting.py", line 256, in gather_debug
debug['source'] = self.marionette.page_source

File "/var/jenkins/workspace/flamekk.vmaster.moztwlab01.512/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_driver-0.13-py2.7.egg/marionette_driver/marionette.py", line 1164, in page_source
return self._send_message("getPageSource", key="value")

File "/var/jenkins/workspace/flamekk.vmaster.moztwlab01.512/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_driver-0.13-py2.7.egg/marionette_driver/decorators.py", line 36, in _
return func(*args, **kwargs)

File "/var/jenkins/workspace/flamekk.vmaster.moztwlab01.512/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_driver-0.13-py2.7.egg/marionette_driver/marionette.py", line 711, in _send_message
self._handle_error(resp)

File "/var/jenkins/workspace/flamekk.vmaster.moztwlab01.512/label/moztwlab-01/mtbf-env/local/lib/python2.7/site-packages/marionette_driver-0.13-py2.7.egg/marionette_driver/marionette.py", line 752, in _handle_error
raise errors.lookup(error)(message, stacktrace=stacktrace)

MarionetteException: MarionetteException: curContainer.frame is null


STR: run mtbf over 20 hours
EXPECT: over 100 hours
ni? Bobby for further information
Flags: needinfo?(bchien)
Flags: needinfo?(bchien)
Severity: normal → major
blocking-b2g: --- → 2.5+
Priority: -- → P1
Blocks: mtbf-2015q3
We can start from adding some logs to listener.js to see where set curContainer.frame null.
ni David for comment on Marrionette Exception. This exception was raised from MTBF testing. please help to give us some feedback.
Assignee: nobody → janus926
Flags: needinfo?(dburns)
Can this be easily created? That error is suggesting that there is no document to be able to get the source.
Flags: needinfo?(dburns)
Attached file about memory report
Add memory report
From the device I reproduced today, I saw frequent memory pressure, USS of B2G process takes >130MB.
(In reply to Paul Yang [:pyang] from comment #5)
> Created attachment 8668852 [details]

There're two big chunks, large memory-file-data and heap unclassified:

181.86 MB (100.0%) -- explicit
├───78.08 MB (42.94%) -- dom
│   ├──78.07 MB (42.93%) -- memory-file-data
│   │  ├──76.35 MB (41.98%) -- large
│   │  │  ├──33.25 MB (18.28%) ── file(length=201666, sha1=189b37fbf1fde69adc714e419c1ce4324a054d93) [152]
│   │  │  ├──12.25 MB (06.73%) ++ (65 tiny)
│   │  │  ├───8.50 MB (04.67%) ── file(length=257119, sha1=c522aa26de553674e51c89e2b397d16a4ce73d51) [34]
│   │  │  ├───5.39 MB (02.96%) ── file(length=73595, sha1=4428ff94c438931de93776bd69ab1eb3f3c98416) [69]
│   │  │  ├───5.13 MB (02.82%) ── file(length=125524, sha1=30aa9e41b906f96dfd7697dc660a245a9d4b339c) [41]
│   │  │  ├───4.63 MB (02.54%) ── file(length=130677, sha1=3a02c9cb99e691c546170ab91ceaf92fd10dc606) [37]
│   │  │  ├───2.77 MB (01.53%) ── file(length=36100, sha1=079300a794680db80b8648ba2284344f6d142990) [71]
│   │  │  ├───2.41 MB (01.32%) ── file(length=55507, sha1=9513ea348f7a6874de089a73e742c631b085a7c7) [44]
│   │  │  └───2.03 MB (01.12%) ── file(length=154687, sha1=df8f2b598e528ddeb88f7968863f26cf796bcde9) [13]
├───38.52 MB (21.18%) ── heap-unclassified
Blocks: TV_Gecko_P2
Attached file about-memory-2.zip (obsolete) —
The device reset while pulling gc/cc logs, but still got the memory report and DMD.

  347.16 MB (100.0%) -- explicit
  ├──188.96 MB (54.43%) -- dom
  │  ├──188.95 MB (54.43%) -- memory-file-data
  │  │  ├──184.15 MB (53.05%) -- large
  │  │  │  ├───85.31 MB (24.57%) ── file(length=188369, name=, path=, type=image/png, sha1=766755f3ce99c5b1d3bbc325257e2b68eca2839c) [455]
  │  │  │  ├───14.88 MB (04.29%) ++ (189 tiny)
  │  │  │  ├───14.58 MB (04.20%) ── file(length=43637, name=, path=, type=image/jpeg, sha1=61efad1b973d9550a950a7d50dab1b572d74ab96) [311]
  │  │  │  ├───14.38 MB (04.14%) ── file(length=156812, name=, path=, type=image/jpeg, sha1=961e0d10b5c7691ec925d503aa59073dbbbae250) [92]
  │  │  │  ├────9.95 MB (02.87%) ── file(length=110450, name=, path=, type=image/png, sha1=a2daae3ebcc80df9a215f88c0abc1aa95acff7f1) [91]
  │  │  │  ├────9.52 MB (02.74%) ── file(length=110175, name=, path=, type=image/png, sha1=e7de7c9506b8d1508b2de9bdba92b07dcc062340) [87]
  │  │  │  ├────7.97 MB (02.30%) ── file(length=146587, name=, path=, type=image/jpeg, sha1=3bbef36280af9025b96372f0049823df8822702f) [51]
  │  │  │  ├────7.34 MB (02.12%) ── file(length=36100, name=, path=, type=image/jpeg, sha1=079300a794680db80b8648ba2284344f6d142990) [188]
  │  │  │  ├────6.34 MB (01.83%) ── file(length=206292, name=, path=, type=image/png, sha1=533a1f920b3159a60360bd2f24e7fbc1dc99e128) [29]
  │  │  │  ├────6.19 MB (01.78%) ── file(length=93846, name=, path=, type=image/jpeg, sha1=40f59205a87e0d561f0662891eed6dc48405165b) [66]
  │  │  │  ├────4.13 MB (01.19%) ── file(length=94304, name=, path=, type=image/jpeg, sha1=0cfb952df1a1293e9c5ac6c2313317c94e5a162e) [44]
  │  │  │  └────3.56 MB (01.03%) ── file(length=188901, name=, path=, type=image/png, sha1=52748a1aa1f3bd3249ad79838a8b83acd0f7372c) [19]
  │  │  └────4.80 MB (01.38%) ── small
  ├───50.56 MB (14.56%) ── heap-unclassified
Kyle, I couldn't find BlobImplFile or similar Blob corresponding to the large memory-file-data from cc/gc edges in attachment 8668852 [details]. There're some from GridView, but I don't know are they related.

Would you shed me some light on how to debug this further? Thanks.
Flags: needinfo?(khuey)
At least some of the blob JS objects show us as [garbage] in the CC log, meaning they were collected when the memory report was taken. Did you use --minimize to GC first? That might help.
Flags: needinfo?(khuey)
Attached file about-memory-3.zip
I've confirmed with Paul that attachment 8668852 [details] was collected with --minimize. Grabbed a new one on the device reproduced today.
Attachment #8674646 - Attachment is obsolete: true
From the attachment 8676014 [details], I can still see [garbage] blob JS objects. But I also noticed this:

  0xb04b0bf0 [rc=4] FragmentOrElement (xhtml) iframe id='browser163' class='browser' (orphan) app://system.gaiamobile.org/index.html

which is also a [garbage].
any progress?
Am still working on it, not much progress though...
I'm still chasing down this one from cc log of attachment 8676014 [details]:

  0xa9c855f0 [rc=13] FragmentOrElement (xhtml) div id='AppWindow_2' class='appWindow render collapsible light browser scrollable inactive' (orphan) app://system.gaiamobile.org/index.html
Depends on: 1217289
Bug 1217289 is filed for the leak of AppWindow_2, not sure is it all the leaks...
I left 3 devices with the patch of bug 1217289 on Friday, and they're still running this morning.
(In reply to Ting-Yu Chou [:ting] from comment #17)
> I left 3 devices with the patch of bug 1217289 on Friday, and they're still
> running this morning.

Unfortunately, there still seems are memory leaks and all the processes get killed when I |./tools/get_about_memory.py -m|.
In DMD, there's a significant portion with this stack, I'll take a look:

  #01: replace_malloc /home/ting/w/fx/os/flame/gecko/memory/replace/dmd/DMD.cpp:1254 (libdmd.so+0x3900)
  #02: nsSegmentedBuffer::AppendNewSegment() /home/ting/w/fx/os/flame/gecko/xpcom/io/nsSegmentedBuffer.cpp:68 (libxul.so+0x3fb4c8)
  #03: nsPipe::GetWriteSegment(char*&, unsigned int&) /home/ting/w/fx/os/flame/gecko/xpcom/io/nsPipe3.cpp:820 (libxul.so+0x3fb640)
  #04: nsPipeOutputStream::WriteSegments(nsresult (*)(nsIOutputStream*, void*, char*, unsigned int, unsigned int, unsigned int*), void*, unsigned int, unsigned int*) /home/ting/w/fx/os/flame/gecko/xpcom/io/nsPipe3.cpp:1628 (libxul.so+0x3fd3f6)
Can this explain the behavior I have spotted in bug 1215452 ? If so, we should mark bug 1215452 as being dependant on that one.
Flags: needinfo?(janus926)
(In reply to Alexandre LISSY :gerard-majax from comment #20)
> Can this explain the behavior I have spotted in bug 1215452 ? If so, we
> should mark bug 1215452 as being dependant on that one.

Do you mean comment 18?
Flags: needinfo?(janus926)
(In reply to Ting-Yu Chou [:ting] from comment #21)
> (In reply to Alexandre LISSY :gerard-majax from comment #20)
> > Can this explain the behavior I have spotted in bug 1215452 ? If so, we
> > should mark bug 1215452 as being dependant on that one.
> 
> Do you mean comment 18?

I mean the whole leaking, can this explain why taking an about:memory on device after 7 days takes 5 minutes instead of a few seconds ?
Flags: needinfo?(janus926)
Whiteboard: [MemShrink]
Maybe, I'm not sure. There're other things to consider, for example, 1) did you use "--minimize" to trigger GC to get rid of dead objects, 2) the heap may be very fragamented to walk through.
Flags: needinfo?(janus926)
According to priority categorization, this is a P2 bug.
Priority: P1 → P2
The nsPipe of the input/output stream from nsServerSocket's client socket are leaked, trying to sort it out how.
The pipe has refcnt 1 from mOriginalInput, which has refcnt 3. Probably from marionette server.
Depends on: 1220012
Whiteboard: [MemShrink] → [MemShrink:P1]
I learned from QA that MTBF now reaches 48 hours with the patch of bug 1217289. Hopefully we can have bug 1220012 reviewed soon to push it further.
Over 40 hours last week. Raptor report [1][2].

[1] Flame: http://mzl.la/1Q520Ti
[2] Aries: http://mzl.la/1Q51Xqw
Do we want to close the bug?
Flags: needinfo?(bchien)
Flags: needinfo?(pyang)
Close per all depends on closed.
Status: NEW → RESOLVED
Closed: 6 years ago
Flags: needinfo?(pyang)
Flags: needinfo?(bchien)
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.