Open Bug 1390884 Opened 7 years ago Updated 1 year ago

Chaos mode makes test verification unreliable

Categories: Testing :: General, enhancement, P3

Tracking: Not tracked

People: Reporter: gbrown, Unassigned

References: Blocks 1 open bug

Attachments: 3 files

Test verification tries to efficiently find intermittent test failures by running just-modified tests repeatedly and in various configurations or environments. The initial implementation includes running tests in chaos mode (MOZ_CHAOS_MODE environment variable set).
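
As a rough illustration of what "MOZ_CHAOS_MODE environment variable set" means for a harness, the sketch below copies an environment and adds the variable before a browser would be launched. The helper name `with_chaos_mode` is hypothetical, and formatting the value as a hex string is an assumption based on the 0xfb value mentioned later in this bug; it is not taken from the test-verify patch.

```python
import os


def with_chaos_mode(base_env=None, feature_mask=0xFB):
    """Return a copy of base_env with MOZ_CHAOS_MODE set.

    The input mapping is left unmodified. `with_chaos_mode` is a
    hypothetical helper, not part of the mochitest harness.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the browser reads the value as a hexadecimal feature mask.
    env["MOZ_CHAOS_MODE"] = hex(feature_mask)
    return env


if __name__ == "__main__":
    env = with_chaos_mode({"PATH": "/usr/bin"})
    print(env["MOZ_CHAOS_MODE"])  # 0xfb
```

Returning a copy keeps the non-chaos steps from inheriting the variable by accident.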

Initial runs show many more failures in chaos mode than in regular mode. I want to investigate those failures and determine whether chaos mode is appropriate and practical for test verification.
...but first, as a temporary measure, let's remove chaos mode from test verification, so that we can start using test verification.

I'll set leave-open so this bug stays open for investigation. Hopefully we can restore this code soon.
Attachment #8897860 - Flags: review?(jmaher)
Comment on attachment 8897860 [details] [diff] [review]
remove chaos mode support from test verification

Review of attachment 8897860 [details] [diff] [review]:
-----------------------------------------------------------------

OK, it was a good idea.
Attachment #8897860 - Flags: review?(jmaher) → review+
Keywords: leave-open
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3539f73f0f04
Do not use chaos mode for test verification; r=jmaher
The main issue is seen here:

https://public-artifacts.taskcluster.net/Si-ZnmjIRQCW5cNsu5lRDw/0/public/logs/live_backing.log

[task 2017-09-25T14:57:32.092Z] 14:57:32     INFO - TEST-INFO | started process GECKO(12284)
[task 2017-09-25T14:57:32.136Z] 14:57:32     INFO - GECKO(12284) | *** You are running in chaos test mode. See ChaosMode.h. ***
[task 2017-09-25T14:57:33.289Z] 14:57:33     INFO - GECKO(12284) | 1506351453285	Marionette	INFO	Enabled via --marionette
[task 2017-09-25T14:57:35.114Z] 14:57:35     INFO - GECKO(12284) | 1506351455109	Marionette	INFO	Listening on port 2828
[task 2017-09-25T14:57:35.373Z] 14:57:35     INFO - GECKO(12284) | 1506351455366	Marionette	DEBUG	Register listener.js for window 2147483649
[task 2017-09-25T14:57:35.750Z] 14:57:35     INFO -  Traceback (most recent call last):
[task 2017-09-25T14:57:35.750Z] 14:57:35     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2660, in doTests
[task 2017-09-25T14:57:35.750Z] 14:57:35     INFO -      marionette_args=marionette_args,
[task 2017-09-25T14:57:35.751Z] 14:57:35     INFO -    File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2164, in runApp
[task 2017-09-25T14:57:35.751Z] 14:57:35     INFO -      addons.install(create_zip(self.mochijar))
[task 2017-09-25T14:57:35.752Z] 14:57:35     INFO -    File "/builds/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/addons.py", line 52, in install
[task 2017-09-25T14:57:35.752Z] 14:57:35     INFO -      raise AddonInstallException(e)
[task 2017-09-25T14:57:35.753Z] 14:57:35     INFO -  AddonInstallException: Could not install add-on at '/tmp/tmpxnNgyy.zip': UnknownError: ERROR_FILE_ACCESS: There was an error accessing the filesystem.
[task 2017-09-25T14:57:35.753Z] 14:57:35     INFO -  stacktrace:
[task 2017-09-25T14:57:35.755Z] 14:57:35     INFO -  	WebDriverError@chrome://marionette/content/error.js:239:5
[task 2017-09-25T14:57:35.756Z] 14:57:35     INFO -  	UnknownError@chrome://marionette/content/error.js:537:5
[task 2017-09-25T14:57:35.757Z] 14:57:35     INFO -  	addon.install@chrome://marionette/content/addon.js:101:11
[task 2017-09-25T14:57:35.758Z] 14:57:35     INFO -  	async*GeckoDriver.prototype.installAddon@chrome://marionette/content/driver.js:3326:10
[task 2017-09-25T14:57:35.759Z] 14:57:35     INFO -  	despatch@chrome://marionette/content/server.js:555:20
[task 2017-09-25T14:57:35.760Z] 14:57:35     INFO -  	async*execute@chrome://marionette/content/server.js:529:11
[task 2017-09-25T14:57:35.761Z] 14:57:35     INFO -  	async*onPacket/<@chrome://marionette/content/server.js:504:15
[task 2017-09-25T14:57:35.765Z] 14:57:35     INFO -  	async*onPacket@chrome://marionette/content/server.js:503:8
[task 2017-09-25T14:57:35.767Z] 14:57:35     INFO -  	_onJSONObjectReady/<@chrome://marionette/content/transport.js:501:9
[task 2017-09-25T14:57:35.767Z] 14:57:35    ERROR - Automation Error: Received unexpected exception while running application
[task 2017-09-25T14:57:35.771Z] 14:57:35    ERROR - 
[task 2017-09-25T14:57:35.772Z] 14:57:35     INFO - Stopping web server
[task 2017-09-25T14:57:35.773Z] 14:57:35     INFO - GECKO(12284) | 1506351455742	addons.xpi	WARN	Failed to install /tmp/tmpxnNgyy.zip from file:///tmp/tmpxnNgyy.zip to /tmp/tmpfIuYKv.mozrunner/extensions/staged/mochikit@mozilla.org.xpi: Unix error 4 during operation pump (Interrupted system call) ((unknown module)) No traceback available
[task 2017-09-25T14:57:35.774Z] 14:57:35     INFO - Stopping web socket server
Chaos mode has a variety of features (see ChaosMode.h). Test verification seems to remain reliable if only some of those features are enabled:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2451d0577730036c61e1e70750546346da50a055&filter-tier=1&filter-tier=2&filter-tier=3

I'd like to land this, move test-verify to tier 2, then come back here later to work out the remaining issues and expand chaos mode support.
Attachment #8912348 - Flags: review?(jmaher)
Comment on attachment 8912348 [details] [diff] [review]
add back limited (3) chaos mode steps

Review of attachment 8912348 [details] [diff] [review]:
-----------------------------------------------------------------

very cool
Attachment #8912348 - Flags: review?(jmaher) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/87291aa18bf0
Enable limited test chaos mode in test-verify; r=jmaher
Priority: -- → P3
Assignee: gbrown → nobody
The leave-open keyword is set and there has been no activity for 6 months.
:gbrown, maybe it's time to close this bug?
Flags: needinfo?(gbrown)
I hope to get to this in 2019.
Flags: needinfo?(gbrown)

The leave-open keyword is set and there has been no activity for 6 months.
:gbrown, maybe it's time to close this bug?

Flags: needinfo?(gbrown)

No, still comment 11!

Flags: needinfo?(gbrown)
Keywords: leave-open

https://treeherder.mozilla.org/#/jobs?repo=try&revision=679135ad6119f92398515e26f5b936fea93a1ab8

With the TimerScheduling chaos feature enabled, I see reftest crashes on shutdown:

[task 2020-02-19T23:23:07.007Z] 23:23:07     INFO - Assertion failure: rc != 0 (destroyed timer off its target thread!), at /builds/worker/workspace/build/src/xpcom/threads/TimerThread.cpp:443
[task 2020-02-19T23:23:22.728Z] 23:23:22     INFO - #01: nsThread::ProcessNextEvent(bool, bool*) [xpcom/threads/nsThread.cpp:1220]
[task 2020-02-19T23:23:22.728Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.729Z] 23:23:22     INFO - #02: NS_ProcessNextEvent(nsIThread*, bool) [xpcom/threads/nsThreadUtils.cpp:481]
[task 2020-02-19T23:23:22.729Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.730Z] 23:23:22     INFO - #03: mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:303]
[task 2020-02-19T23:23:22.730Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.731Z] 23:23:22     INFO - #04: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:315]
[task 2020-02-19T23:23:22.731Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.732Z] 23:23:22     INFO - #05: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:291]
[task 2020-02-19T23:23:22.732Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.733Z] 23:23:22     INFO - #06: nsThread::ThreadFunc(void*) [xpcom/threads/nsThread.cpp:466]
[task 2020-02-19T23:23:22.733Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.858Z] 23:23:22     INFO - #07: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:204]
[task 2020-02-19T23:23:22.858Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.859Z] 23:23:22     INFO - #08: libpthread.so.0 + 0x76db
[task 2020-02-19T23:23:22.859Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.859Z] 23:23:22     INFO - #09: libc.so.6 + 0x12188f
[task 2020-02-19T23:23:22.859Z] 23:23:22     INFO - 
[task 2020-02-19T23:23:22.860Z] 23:23:22     INFO - #10: ??? (???:???)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=83d04c3f961887d9bf79918a0c0b716656f95b46

With TimerScheduling disabled but all other chaos features enabled, there are no failures in this limited test run.

Now, with all chaos modes enabled I see TimerThread assertions, especially in Windows mochitests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d0fc96086d619fbd100cd2ca4ecc08f2ee446555

With all but TimerScheduling enabled (0xfb), all is well:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0398f96ae62876627cf5eec4512b8c959725a0bd
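
The 0xfb value can be read as a bitmask with one bit cleared. A minimal sketch, assuming eight single-bit chaos features with TimerScheduling at 0x4: the feature names and values below are recalled from ChaosMode.h rather than quoted in this bug, so treat them as an assumption and check the header. The helper `mask_without` is hypothetical.

```python
# Assumed feature bits; verify against ChaosMode.h before relying on them.
CHAOS_FEATURES = {
    "ThreadScheduling": 0x01,
    "NetworkScheduling": 0x02,
    "TimerScheduling": 0x04,
    "IOAmounts": 0x08,
    "HashTableIteration": 0x10,
    "ImageCache": 0x20,
    "TaskDispatching": 0x40,
    "TaskRunning": 0x80,
}


def mask_without(*excluded):
    """Return the OR of every assumed feature bit except the named ones."""
    mask = 0
    for name, bit in CHAOS_FEATURES.items():
        if name not in excluded:
            mask |= bit
    return mask


if __name__ == "__main__":
    print(hex(mask_without()))                   # 0xff
    print(hex(mask_without("TimerScheduling")))  # 0xfb
```

Under these assumed values, clearing only the 0x4 bit from 0xff gives 0xfb, which matches the value used above.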

Assignee: nobody → whole.grains
Status: NEW → ASSIGNED
Pushed by whole.grains@protonmail.com:
https://hg.mozilla.org/integration/autoland/rev/1c6aa40b84a2
Enable all test-verify chaos modes except TimerScheduling; r=jmaher
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/25368 for changes under testing/web-platform/tests
Keywords: leave-open
Upstream PR merged by moz-wptsync-bot

Reviewing recent TV* runs on autoland, I think everything is working as expected: most TV runs pass, most TV failures occur in the first, non-chaos steps, and the TV failures during chaos mode steps seem reasonable.

Still leaving open for follow-up on TimerScheduling.
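
To make the "non-chaos steps first, chaos mode steps after" ordering concrete, here is a small sketch of a verification loop that reports which step produced the first failure. The step names and repeat counts are hypothetical, not the actual test-verify configuration, and `run_once` stands in for whatever launches a single test run.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Step:
    name: str
    repeats: int
    chaos: bool


# Hypothetical step list: plain runs first, chaos mode runs afterwards.
STEPS = (
    Step("repeat in one browser", 10, False),
    Step("repeat in one browser, chaos mode", 10, True),
)


def verify(run_once: Callable[[bool], bool],
           steps: Sequence[Step] = STEPS) -> Optional[Step]:
    """Run each step in order; return the first step with a failing run.

    run_once(chaos) must return True when the test passes.
    Returns None if every run in every step passes.
    """
    for step in steps:
        for _ in range(step.repeats):
            if not run_once(step.chaos):
                return step
    return None


if __name__ == "__main__":
    # A test that only fails under chaos mode is caught by the second step.
    failed = verify(lambda chaos: not chaos)
    print(failed.name if failed else "all steps passed")
```

Running the plain steps first means a test that is simply broken fails early, and only tests that pass normally go on to the chaos mode steps.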

Keywords: leave-open
Assignee: gbrown → nobody
Status: ASSIGNED → NEW
Severity: normal → S3