Closed Bug 775227 Opened 12 years ago Closed 5 years ago

OutOfMemory or "out of memory" during mochitest

Categories

(Testing :: General, defect)

x86
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INACTIVE

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [test disabled][leave open])

Attachments

(4 files, 1 obsolete file)

Many of the logs reported in bug 747787 - crash during android mochitests - show "out of memory" or "OutOfMemoryException" in the logcats: https://tbpl.mozilla.org/php/getParsedLog.php?id=12729578&tree=Ionmonkey contains I/GeckoDump( 1710): 82913 INFO TEST-START | /tests/layout/style/test/test_value_cloning.html I/GeckoScreenshot( 1710): rect: -0.316666, 11.000000, 376.549988, 56800.683594 I/GeckoTabs( 1710): Got message: SessionHistory:New I/GeckoTabs( 1710): Got message: SessionHistory:Purge I/GeckoScreenshot( 1710): rect: 207.683334, 13.000000, 513.000000, 507.683350 I/GeckoScreenshot( 1710): rect: 0.000000, 0.000000, 500.000000, 300.000000 I/GeckoScreenshot( 1710): rect: 0.000000, 0.000000, 300.000000, 150.000000 I/dalvikvm-heap( 1710): Clamp target GC heap from 16.109MB to 16.000MB I/dalvikvm-heap( 1710): Forcing collection of SoftReferences for 923572-byte allocation I/dalvikvm-heap( 1710): Clamp target GC heap from 16.081MB to 16.000MB E/dalvikvm-heap( 1710): Out of memory on a 923572-byte allocation. I/WindowManager( 1017): WIN DEATH: Window{445f2f40 org.mozilla.fennec/org.mozilla.fennec.App paused=false} I/ActivityManager( 1017): Process org.mozilla.fennec (pid 1710) has died. I/WindowManager( 1017): WIN DEATH: Window{4463b258 SurfaceView paused=false} https://tbpl.mozilla.org/php/getParsedLog.php?id=12685785&tree=Mozilla-Inbound contains I/GeckoDump( 1716): 30732 INFO TEST-PASS | /tests/layout/style/test/test_parse_rule.html | .a::before {content: 'This is \a'}#a ::before {content: 'FAIL'} - "This is \A " should equal "This is \A " I/GeckoTabs( 1716): Got message: SessionHistory:New I/GeckoTabs( 1716): Got message: SessionHistory:Purge E/GeckoConsole( 1716): [JavaScript Warning: "Unexpected end of file while searching for ';' or '}' to end declaration. Unexpected end of file while searching for closing } of declaration block." {file: "http://mochi.test:8888/tests/layout/style/test/test_parse_rule.html" line: 1}] I/GeckoScreenshot( 1716): rect: 217.683334, 23.000000, 323.000000, 367.683350 I/GeckoScreenshot( 1716): rect: 10.000000, 10.000000, 310.000000, 160.000000 I/GeckoScreenshot( 1716): rect: 0.000000, 0.000000, 300.000000, 150.000000 I/GeckoDump( 1716): 30733 INFO TEST-PASS | /tests/layout/style/test/test_parse_rule.html | #a::before {content: 'This is \a - "This is \A " should equal "This is \A " I/GeckoDump( 1716): 30734 INFO TEST-PASS | /tests/layout/style/test/test_parse_rule.html | Sanity check for rule: div {background: blue} - transparent should not equal rgb(0, 0, 255) I/GeckoTabs( 1716): Got message: SessionHistory:New I/GeckoTabs( 1716): Got message: SessionHistory:Purge I/GeckoScreenshot( 1716): rect: 217.683334, 23.000000, 323.000000, 367.683350 I/GeckoScreenshot( 1716): rect: 10.000000, 10.000000, 310.000000, 160.000000 I/GeckoScreenshot( 1716): rect: 0.000000, 0.000000, 300.000000, 150.000000 I/dalvikvm-heap( 1716): Forcing collection of SoftReferences for 1486108-byte allocation E/dalvikvm-heap( 1716): Out of memory on a 1486108-byte allocation. E/GeckoAppShell( 1716): >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 9 ("GeckoBackgroundThread") E/GeckoAppShell( 1716): java.lang.OutOfMemoryError E/GeckoAppShell( 1716): at java.nio.ByteBuffer.put(ByteBuffer.java:815) E/GeckoAppShell( 1716): at org.mozilla.gecko.gfx.ScreenshotLayer$ScreenshotImage.copyBuffer(ScreenshotLayer.java:138) E/GeckoAppShell( 1716): at org.mozilla.gecko.gfx.ScreenshotLayer$ScreenshotImage.setBitmap(ScreenshotLayer.java:144) E/GeckoAppShell( 1716): at org.mozilla.gecko.gfx.ScreenshotLayer.setBitmap(ScreenshotLayer.java:53) E/GeckoAppShell( 1716): at org.mozilla.gecko.gfx.LayerRenderer.setCheckerboardBitmap(LayerRenderer.java:138) E/GeckoAppShell( 1716): at org.mozilla.gecko.ScreenshotHandler$1.run(GeckoAppShell.java:2338) E/GeckoAppShell( 1716): at android.os.Handler.handleCallback(Handler.java:587) E/GeckoAppShell( 1716): at android.os.Handler.dispatchMessage(Handler.java:92) E/GeckoAppShell( 1716): at android.os.Looper.loop(Looper.java:123) E/GeckoAppShell( 1716): at org.mozilla.gecko.GeckoBackgroundThread.run(GeckoBackgroundThread.java:31) I/WindowManager( 1017): WIN DEATH: Window{4444b6d0 SurfaceView paused=false} I/ActivityManager( 1017): Process org.mozilla.fennec (pid 1716) has died. I/WindowManager( 1017): WIN DEATH: Window{44421f10 org.mozilla.fennec/org.mozilla.fennec.App paused=false}
Blocks: 747787
Bug 766584 shows OOM in: /tests/layout/style/test/test_transitions_computed_values.html, /tests/layout/style/test/test_transitions_and_zoom.html, and /tests/layout/style/test/test_value_storage.html ...all during screenshotting.
Bug shows 775053 OOM in: /tests/dom/tests/mochitest/bugs/test_bug260264_nested.html
Despite the various stack traces showing OOM during a screenshot, the problem may well be elsewhere. Consider: - bug 757468 also had stack traces showing OOM during a screenshot, but the problem was elsewhere - screenshots are one of the largest allocations made -- if there is a memory leak on a smaller allocation somewhere, we would still expect to see most OOM's reported in the screenshot - the pageload-stress test that gave rise to bug 757468 can now run for over 1000 page loads (ie, many screenshots) without OOM - OOM failures during mochitests have persisted despite many significant changes to screenshot implementation
I ran mochitests locally, with TEST_PATH=layout/style/test and monitored heap usage with ddms. After a few minutes, I started seeing logcat messages: I/ActivityManager( 3098): Low Memory: No more background processes. E/GeckoApp(13688): low memory and noted Android killing off many processes (but fennec and sutagent survived). ddms reported the fennec heap size as about 9M (5 alloc/4 free); sutagent about 4M (3/1). ddms sysinfo / memory (PSS in KB graph) shows 75% "unknown". Interesting...where is the memory being used?
The test results in Comment 4 apply to the Galaxy S (Froyo). I repeated the same test on a Galaxy Nexus (ICS): - no low memory warnings in logcat - no processes killed - ddms sysinfo / memory (PSS in KB graph) shows fennec using about 20% of the graph - ddms reports heap size at about 18M (15 alloc/3 free)
I repeated the same test on a local Tegra. There were no memory warnings in the logcat, 130000+ test results were reported, and then the Tegra rebooted!
Summary of additional test results from Galaxy S for layout/style/test: - Low Memory persists for a variety of conditions including: - screenshots disabled - favicon bitmap creation disabled - SUT or ADB devicemanagers - typically, the Android low memory killer kills off a significant number of processes; usually that includes the provider process; sometimes the sutagent is also killed - ** Low Memory does not occur and layout/style/test mochitests run to completion without error if test_transitions_per_property.html is disabled ** Try run without test_transitions_per_property: https://tbpl.mozilla.org/?tree=Try&rev=318ebf99823d ...
I am having trouble demonstrating an improvement to mochitest runs on try (other tests fail), but my local results are consistent, so I would like to check this in to disable test_transitions_per_property while I investigate other mochitest errors.
Attachment #644509 - Attachment is obsolete: true
Attachment #644810 - Flags: review?(jmaher)
Comment on attachment 644810 [details] [diff] [review] disable test_transitions_per_property to avoid OOM Review of attachment 644810 [details] [diff] [review]: ----------------------------------------------------------------- good stuff, make sure we leave this bug open with the whiteboard as: [orange][test disabled][leave open]
Attachment #644810 - Flags: review?(jmaher) → review+
Whiteboard: [orange][test disabled][leave open]
Depends on: 710942
Sorry, that's just Windows; removing again.
No longer depends on: 710942
I have found a few more tests that cause Low Memory / OOM / tegra reboot -- I will try to complete a comprehensive list today.
Blocks: 438871
This is taking longer than anticipated...I have identified about 20 OOM tests now.
With these additional tests disabled, I can run all of the mochitests - in 8 chunks - on my Galaxy S with good reliability (3 straight runs for each chunk without test failure due to OOM). Even so, there are still signs of low memory -- the Android low memory killer often kills off a few processes during the tests, but not so aggressively that sutagent or fennec are killed. Results from Try are inconclusive. I have only managed 6 runs of M3 on try; of these, 3 were green, 1 failed due to reboot, 1 failed possibly due to reboot, and 1 failed for an unrelated reason.
Attachment #646221 - Flags: review?(jmaher)
Comment on attachment 646221 [details] [diff] [review] disable 18 additional tests Review of attachment 646221 [details] [diff] [review]: ----------------------------------------------------------------- ::: testing/mochitest/android.json @@ +33,5 @@ > "content/base/test/websocket_hybi/test_receive-arraybuffer.html": "", > "content/base/test/websocket_hybi/test_receive-blob.html": "", > "content/base/test/websocket_hybi/test_send-arraybuffer.html": "", > "content/base/test/websocket_hybi/test_send-blob.html": "", > + "content/canvas/test/webgl/test_webgl_conformance_test_suite.html": "bug 775227", this specific test has it's own manifest with rules to ignore tests on each platform. I would like somebody who owns this test to be aware of the memory impact this test has.
Attachment #646221 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmahe > this specific test has it's own manifest with rules to ignore tests on each > platform. I would like somebody who owns this test to be aware of the > memory impact this test has. I have opened additional bugs to track progress on re-enabling each test and draw attention to the memory issue. See bugs 778003,6,8,9,10,11,14,16,17,18,19.
Some tests continue to fail on tbpl. I notice this pattern on many M3 runs: 1996 INFO TEST-START | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Infraware/test_event_session_oldvalue.html 1997 INFO TEST-PASS | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Infraware/test_event_session_oldvalue.html | Elided 3 passes or known failures. 1998 INFO TEST-END | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Infraware/test_event_session_oldvalue.html | finished in 1060ms 1999 INFO TEST-START | /tests/dom/imptests/webapps/WebStorage/tests/submiss command timed out: 2400 seconds without output, killing pid 2496
I find it disturbing how out-of-band the information that these tests are disabled on android is. Normally tests are disabled either with javascript that lives inside the test, or by conditionally listing the test in the makefile. Listing these in a central place makes it hard to discover that a particular test is disabled. Could one of the other mechanisms be used instead?
I find the makefiles and inside the tests themselves out of band. The tests themselves read android.json as the primary source for what to run. Personally I find a single file much easier to read when figuring out what is disabled than doing a few different code greps to try to figure out what is disabled. This is really similar to how reftests are done. I will admit, maybe I don't know the most ideal method for doing this. If there are other examples of tests that are disabled maybe I will see the more streamlined way.
The primary source for what to run is the makefiles that list the tests and install them, just like in reftest the primary source is the reftest.list files. (I prefer the reftest.list approach too, but that's not how we do mochitests.) I actually wasn't aware (as style system module owner) that a bunch of the style system tests were disabled on Android until bug 778018 and bug 778019 were filed notifying me that two *more* had been disabled on Android. That I could modify the tests themselves and the list of tests to run without knowing that a bunch of the tests are disabled is what seems "out of band" to me. It would have been trivial for me (or anyone else) to copy a test file to create a new and similar test, add that test, and end up with Android bustage because there was no sign at any point in the process that the test I was copying from was disabled on some platforms (and the copying, or in fact a renaming of a test) wouldn't copy that information.
One additional thought (echoing bug 772740 comment 9, which has a patch to do this): if memory use is the problem, perhaps we should GC between tests so that memory use is more deterministic (and we don't end up disabling, say, the test after the one that's actually responsible for using a lot of memory). (Our GC algorithms aren't tuned for running mochitests, nor should they be; they should be tuned for what our users do.)
Depends on: 778676
Depends on: 778691
Blocks: 778952, 778954
This disables one more test, in android.json. The test is in dom/imptests, identified in bug 772632 as troublesome in general. Personally I'm seeing this test in particular causing failures, so I prefer to disable just this one test, rather than the whole directory, as suggested in bug 772632...at least for now. IMO :dbaron makes a good case for disabling tests using a different mechanism, but I am reluctant to change mid-bug, so this patch also uses android.json, just like the other patches on this bug.
Attachment #647652 - Flags: review?(jmaher)
Comment on attachment 647652 [details] [diff] [review] disable test_event_session_key.html Review of attachment 647652 [details] [diff] [review]: ----------------------------------------------------------------- thanks. regarding the place to disable, I think that is a larger meta issue we need to figure out.
Attachment #647652 - Flags: review?(jmaher) → review+
That last patch didn't have the effect I wanted it to on m-i. I may have confused test_storage_local_key and test_event_session_key...
Enable test_browserElement_inproc_Iconchange.html -- no longer any sign of OOM. Enable test_event_session_key.html, which I seemed to confuse with... Disable test_storage_local_key.html, which is currently failing nearly 100% on tbpl. Once this is done, I still anticipate Android M3 failures in dom/indexedDB/ipc/test_ipc.html -- bug 770046 -- and possibly dom/indexedDB/test/test_event_propagation.html (need to look into that more).
Attachment #649046 - Flags: review?(jmaher)
Comment on attachment 649046 [details] [diff] [review] minor changes to android.json Review of attachment 649046 [details] [diff] [review]: ----------------------------------------------------------------- thanks!
Attachment #649046 - Flags: review?(jmaher) → review+
Last 5 native m3 runs on inbound: https://tbpl.mozilla.org/php/getParsedLog.php?id=14140899&tree=Firefox#error0 { 2330 INFO TEST-START | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_index_js.html 2331 INFO TEST-PASS | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_stor command timed out: 2400 seconds without output, killing pid 36288 } https://tbpl.mozilla.org/php/getParsedLog.php?id=14145833&tree=Firefox#error0 { 2328 INFO TEST-END | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_index_js.html | finished in 592ms 2329 INFO TEST-START | /t command timed out: 2400 seconds without output, killing pid 12344 } https://tbpl.mozilla.org/php/getParsedLog.php?id=14146712&tree=Firefox#error0 { 2331 INFO TEST-END | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_length_js.html | finished in 578ms 2332 INFO TES command timed out: 2400 seconds without output, killing pid 27397 } https://tbpl.mozilla.org/php/getParsedLog.php?id=14150183&tree=Firefox#error0 { 2331 INFO TEST-END | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_length_js.html | finished in 1009ms 2332 INFO TEST-START | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_removeitem_js.html 2333 INFO TEST-PASS | /tests/dom/imptests/webapps/Web command timed out: 2400 seconds without output, killing pid 2952 } https://tbpl.mozilla.org/php/getParsedLog.php?id=14152659&tree=Firefox#error0 { 2331 INFO TEST-END | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_storage_session_length_js.html | finished in 593ms 2332 INFO TEST-START | /tests/dom/imptests/webapps/WebStorage/tests/submissions/Ms2ger/test_s command timed out: 2400 seconds without output, killing pid 54676 }
On native M1: https://tbpl.mozilla.org/php/getParsedLog.php?id=14181377&tree=Mozilla-Inbound#error0 { 49129 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/uniforms/uniform-location.html] Test passed - successfullyParsed is true 49130 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/uniforms/uniform-location.html] All 27 test(s) passed 49131 INFO TEST-INFO | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/uniforms/uniform-samplers-test.html] (WebGL mochitest) Starting test page 49132 INFO TEST-PASS | /tests/content/canvas/test/webgl/test_webgl_conformance_test_suite.html | [conformance/unif command timed out: 2400 seconds without output, killing pid 38107 process killed by signal 9 program finished with exit code -1 elapsedTime=3657.179387 TinderboxPrint: mochitest-plain<br/><em class="testfail">T-FAIL</em> buildbot.slave.commands.TimeoutError: command timed out: 2400 seconds without output, killing pid 38107 }
looking at that specific log we timeout in mochitest at 2400 seconds, then the next step that we talk to the device (at the bottom of the log) is reboot.py and we timeout with 1800 seconds. This is an indicator of a device that needs a PDU reboot or is fully dead. Why we get into this state is unknown. I get into it locally when I get a dhcp renew, I have tested that on the production network that doesn't cause problems. That was a year and a half ago though...maybe the sutagent acts differently. The big thing to know is: 1) is the device pingable? 2) is the sutagent accepting new connections? If both are false, then we should take the device offline.
(In reply to Joel Maher (:jmaher) from comment #38) > looking at that specific log we timeout in mochitest at 2400 seconds, then > the next step that we talk to the device (at the bottom of the log) is > reboot.py and we timeout with 1800 seconds. > > This is an indicator of a device that needs a PDU reboot or is fully dead. > > Why we get into this state is unknown. This seems to be happening quite a bit on test_webgl_conformance_test_suite.html specifically though - are we sure the test isn't a contributing factor? https://tbpl.mozilla.org/php/getParsedLog.php?id=14200997&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=14202596&tree=Mozilla-Inbound
the test could be a factor. I believe it could cause fennec to starve the memory of other applications (i.e. sut, system processes) and that could force us to hang.
Whiteboard: [orange][test disabled][leave open] → [test disabled][leave open]
Dropping off as the assignee here as I am not finding time to check back on these.
Assignee: gbrown → nobody
Bug 885157 removed content/media/test/test_too_many_elements.html.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INACTIVE
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: