Closed Bug 1800035 Opened 2 years ago Closed 4 months ago

Intermittent Windows Marionette Return code: 10

Categories

(Testing :: Marionette Client and Harness, defect, P5)

x86_64
Windows 10
defect
Points:
2

Tracking

(firefox132 fixed)

RESOLVED FIXED
132 Branch
Tracking Status
firefox132 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: jdescottes)

References

Details

(Keywords: intermittent-failure, Whiteboard: [webdriver:m12][retriggered])

Attachments

(1 file)

Filed by: csabou [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=394617514&repo=mozilla-central
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/KBV8B0OfTMmWo3ZbCco2Sg/runs/0/artifacts/public/logs/live_backing.log


[task 2022-10-27T23:09:43.867Z] 23:09:43     INFO -  DEBUG: Starting phase profile-before-change
[task 2022-10-27T23:09:43.873Z] 23:09:43     INFO -  DEBUG: Spinning the event loop
[task 2022-10-27T23:09:43.874Z] 23:09:43     INFO -  DEBUG: Completed blocker EnvironmentAddonBuilder for phase AddonManager: Waiting to start provider shutdown.
[task 2022-10-27T23:09:43.874Z] 23:09:43     INFO -  DEBUG: Completed blocker PageActions: purging unregistered actions from cache for phase profile-before-change
[task 2022-10-27T23:09:43.874Z] 23:09:43     INFO -  DEBUG: Completed blocker Places Expiration: shutdown for phase Places Connection shutdown
[task 2022-10-27T23:09:43.875Z] 23:09:43     INFO -  DEBUG: Completed blocker JSON store: writing data for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.876Z] 23:09:43     INFO -  DEBUG: Completed blocker JSON store: writing data for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.876Z] 23:09:43     INFO -  DEBUG: Completed blocker JSON store: writing data for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.876Z] 23:09:43     INFO -  DEBUG: Completed blocker JSON store: writing data for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.877Z] 23:09:43     INFO -  DEBUG: Completed blocker RemoteSettingsClient - finish IDB access. for phase profile-before-change
[task 2022-10-27T23:09:43.877Z] 23:09:43     INFO -  DEBUG: Completed blocker Remote Settings profile-before-change for phase profile-before-change
[task 2022-10-27T23:09:43.877Z] 23:09:43     INFO -  DEBUG: Completed blocker DownloadAutoSaveView: writing data for phase profile-before-change
[task 2022-10-27T23:09:43.878Z] 23:09:43     INFO -  DEBUG: Completed blocker Search service: shutting down for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.878Z] 23:09:43     INFO -  DEBUG: Completed blocker places.sqlite#1: waiting for shutdown for phase Sqlite.sys.mjs: wait until all connections are closed
[task 2022-10-27T23:09:43.878Z] 23:09:43     INFO -  DEBUG: Completed blocker OS.File: flush I/O queued before profileBeforeChange for phase profile-before-change
[task 2022-10-27T23:09:43.906Z] 23:09:43     INFO -  DEBUG: Completed blocker PlacesUtils wrapped connection closing as part of Places shutdown for phase Places Connection shutdown
[task 2022-10-27T23:09:43.914Z] 23:09:43     INFO -  DEBUG: Completed blocker PlacesUtils wrapped connection must be closed before Sqlite.sys.mjs for phase Sqlite.sys.mjs: wait until all clients have completed their task
[task 2022-10-27T23:09:43.915Z] 23:09:43     INFO -  DEBUG: Completed blocker SitePermsAddonProvider for phase AddonManager: Waiting for providers to shut down.
[task 2022-10-27T23:09:43.915Z] 23:09:43     INFO -  DEBUG: Completed blocker GMPProvider for phase AddonManager: Waiting for providers to shut down.
[task 2022-10-27T23:09:43.915Z] 23:09:43     INFO -  DEBUG: Completed blocker XPIProvider for phase AddonManager: Waiting for providers to shut down.
[task 2022-10-27T23:09:43.915Z] 23:09:43     INFO -  DEBUG: Completed blocker AddonManager: shutting down. for phase profile-before-change
[task 2022-10-27T23:09:43.916Z] 23:09:43     INFO -  DEBUG: Completed blocker CrashMonitor: Writing notifications to file after receiving profile-before-change and awaiting all checkpoints written for phase IOUtils: waiting for profileBeforeChange IO to complete
[task 2022-10-27T23:09:43.922Z] 23:09:43     INFO -  DEBUG: Completed blocker content-prefs.sqlite#0: waiting for shutdown for phase Sqlite.sys.mjs: wait until all connections are closed
[task 2022-10-27T23:09:43.924Z] 23:09:43     INFO -  DEBUG: Completed blocker Closing ContentPrefService2 connection. for phase Sqlite.sys.mjs: wait until all clients have completed their task
[task 2022-10-27T23:09:43.924Z] 23:09:43     INFO -  DEBUG: Completed blocker places.sqlite#0: waiting for shutdown for phase Sqlite.sys.mjs: wait until all connections are closed
[task 2022-10-27T23:09:43.924Z] 23:09:43     INFO -  DEBUG: Completed blocker PlacesUtils read-only connection closing as part of Places shutdown for phase Places Connection shutdown
[task 2022-10-27T23:09:43.924Z] 23:09:43     INFO -  DEBUG: Completed blocker PlacesUtils read-only connection must be closed before Sqlite.sys.mjs for phase Sqlite.sys.mjs: wait until all clients have completed their task
[task 2022-10-27T23:09:43.927Z] 23:09:43     INFO -  DEBUG: Completed blocker Sqlite.sys.mjs shutdown blocker for phase profile-before-change
[task 2022-10-27T23:09:43.937Z] 23:09:43     INFO -  DEBUG: Finished phase profile-before-change
[task 2022-10-27T23:09:43.958Z] 23:09:43     INFO -  DEBUG: Starting phase profile-before-change-telemetry
[task 2022-10-27T23:09:43.964Z] 23:09:43     INFO -  DEBUG: Spinning the event loop
[task 2022-10-27T23:09:44.247Z] 23:09:44     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-10-27T23:09:44.300Z] 23:09:44     INFO -  DEBUG: Adding blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-10-27T23:09:44.322Z] 23:09:44     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-10-27T23:09:44.328Z] 23:09:44     INFO -  DEBUG: Completed blocker Waiting for ping task for phase TelemetryController: Waiting for pending ping activity
[task 2022-10-27T23:09:44.340Z] 23:09:44     INFO -  DEBUG: Completed blocker TelemetryController: shutting down for phase profile-before-change-telemetry
[task 2022-10-27T23:09:44.341Z] 23:09:44     INFO -  DEBUG: Finished phase profile-before-change-telemetry
[task 2022-10-27T23:09:44.342Z] 23:09:44     INFO -  DEBUG: Starting phase xpcom-will-shutdown
[task 2022-10-27T23:09:44.342Z] 23:09:44     INFO -  DEBUG: Spinning the event loop
[task 2022-10-27T23:09:44.346Z] 23:09:44     INFO -  DEBUG: Completed blocker OS.File: flush pending requests, warn about unclosed files, shut down service. for phase xpcom-will-shutdown
[task 2022-10-27T23:09:44.346Z] 23:09:44     INFO -  DEBUG: Finished phase xpcom-will-shutdown
[task 2022-10-27T23:09:44.349Z] 23:09:44     INFO -  DEBUG: Starting phase web-workers-shutdown
[task 2022-10-27T23:09:44.349Z] 23:09:44     INFO -  DEBUG: Spinning the event loop
[task 2022-10-27T23:09:44.350Z] 23:09:44     INFO -  DEBUG: Finished phase web-workers-shutdown
[task 2022-10-27T23:09:46.561Z] 23:09:46    ERROR - Return code: 10
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - TinderboxPrint: marionette<br/>1013/<em class="testfail">1</em>/36
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - gecko.log not found
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - TinderboxPrint: marionette<br/>1013/<em class="testfail">1</em>/36
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - Marionette exited with return code 10: WARNING
[task 2022-10-27T23:09:46.561Z] 23:09:46  WARNING - # TBPL WARNING #
[task 2022-10-27T23:09:46.561Z] 23:09:46  WARNING - setting return code to 1
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - Running post-action listener: _package_coverage_data
[task 2022-10-27T23:09:46.561Z] 23:09:46     INFO - Running post-action listener: _resource_record_post_action
[task 2022-10-27T23:09:46.562Z] 23:09:46     INFO - Running post-action listener: process_java_coverage_data
[task 2022-10-27T23:09:46.562Z] 23:09:46     INFO - [mozharness: 2022-10-27 23:09:46.562432Z] Finished run-tests step (success)
[task 2022-10-27T23:09:46.562Z] 23:09:46     INFO - Running post-run listener: _resource_record_post_run
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - Total resource usage - Wall time: 2775s; CPU: 42%; Read bytes: 306483200; Write bytes: 6099653632; Read time: 0; Write time: 149
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: CPU usage<br/>41.7%
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: I/O read bytes / time<br/>306,483,200 / 0
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: I/O write bytes / time<br/>6,099,653,632 / 149
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: CPU idle<br/>12,960.9 (58.3%)
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: CPU system<br/>2,466.9 (11.1%)
[task 2022-10-27T23:09:46.762Z] 23:09:46     INFO - TinderboxPrint: CPU user<br/>6,772.3 (30.5%)
[task 2022-10-27T23:09:46.775Z] 23:09:46     INFO - install - Wall time: 10s; CPU: 25%; Read bytes: 73728; Write bytes: 394252288; Read time: 0; Write time: 25
[task 2022-10-27T23:09:46.787Z] 23:09:46     INFO - run-tests - Wall time: 2766s; CPU: 42%; Read bytes: 305488896; Write bytes: 5612545024; Read time: 0; Write time: 122
[task 2022-10-27T23:09:47.431Z] 23:09:47  WARNING - returning nonzero exit status 1
[taskcluster 2022-10-27T23:09:47.514Z]    Exit Code: 1
[taskcluster 2022-10-27T23:09:47.514Z]    User Time: 31.25ms
[taskcluster 2022-10-27T23:09:47.514Z]  Kernel Time: 15.625ms
[taskcluster 2022-10-27T23:09:47.514Z]    Wall Time: 47m43.1632606s
[taskcluster 2022-10-27T23:09:47.514Z]       Result: FAILED
[taskcluster 2022-10-27T23:09:47.514Z] === Task Finished ===
[taskcluster 2022-10-27T23:09:47.514Z] Task Duration: 47m43.1666893s
[taskcluster 2022-10-27T23:09:47.611Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/json" and expiry 2023-01-25T22:02:42.793Z
[taskcluster 2022-10-27T23:09:47.772Z] Uploading artifact public/test_info/marionette_errorsummary.log from file build\blobber_upload_dir\marionette_errorsummary.log with content encoding "gzip", mime type "text/plain" and expiry 2023-01-25T22:02:42.793Z
[taskcluster 2022-10-27T23:09:48.602Z] Uploading artifact public/test_info/marionette_raw.log from file build\blobber_upload_dir\marionette_raw.log with content encoding "gzip", mime type "text/plain" and expiry 2023-01-25T22:02:42.793Z
[taskcluster 2022-10-27T23:09:48.888Z] Uploading artifact public/test_info/report.html from file build\blobber_upload_dir\report.html with content encoding "gzip", mime type "text/html; charset=utf-8" and expiry 2023-01-25T22:02:42.793Z
[taskcluster 2022-10-27T23:09:48.995Z] Uploading artifact public/test_info/resource-usage.json from file build\blobber_upload_dir\resource-usage.json with content encoding "gzip", mime type "application/json" and expiry 2023-01-25T22:02:42.793Z
[taskcluster 2022-10-27T23:09:49.481Z] Uploading link artifact public/logs/live.log to artifact public/logs/live_backing.log with expiry 2023-01-25T22:02:42.793Z
[taskcluster:error] exit status 1

The Marionette harness will exit with an code of 10 when tests were failing. Interestingly I see:

[task 2022-10-27T23:09:42.639Z] 23:09:42     INFO -  passed: 1013
[task 2022-10-27T23:09:42.639Z] 23:09:42     INFO -  failed: 1
[task 2022-10-27T23:09:42.639Z] 23:09:42     INFO -  todo: 36 (skipped: 30)
[task 2022-10-27T23:09:42.640Z] 23:09:42     INFO -  FAILED TESTS

But there is no failure listed in the parsed log. No idea what's broken here given that we haven't changed anything in that code for quite some time. Maybe there was a Python upgrade recently for test workers which might have caused that behavior?

Flags: needinfo?(jmaher)

the base win10 image hasn't changed. As this is intermittent and on ASAN only, this makes me wonder if we have some timing issue. For successful runs I see test summary of 1012/0/36 (pass/fail/skip-todo), but for the failures I see (1013/1/36). I don't know if that means anything.

Quite likely the browser could have changed some timing in shutdown.

This job (win10/asan-marionette) fails 12.5% of the time in the last month, it is broken down in these bugs:

bug id frequency
1782045 10
1791299 9
1800035 6
1797552 2
1792855 2
1691876 2
1790942 2
1795485 1
1798158 1
1413849 1
1780602 1
1669403 1
1727152 1

while this bug isn't the most frequent, it might be the most confusing.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #2)

successful runs I see test summary of 1012/0/36 (pass/fail/skip-todo), but for the failures I see (1013/1/36). I don't know if that means anything.

That's interesting. Why do we have one time 1012 tests and another time 1014 tests that have been running? It might be good to check and compare different logs from passing and failing jobs to see which tests are run on top in failing cases. But I would defer that until the failure rate is high enough.

No new conclusion from running additional jobs. It's still the above changelog which isn't helpful. I doubt the change to devtools might have caused that.

Whenever checking the full log there is a single failure present for:

[task 2022-11-15T00:12:54.928Z] 00:12:54 INFO - TEST-FAIL | testing/marionette/harness/marionette_harness/tests/unit/test_expectedfail.py TestFail.test_fails | took 135ms

But as you can see this one is expected. So not sure what's causing us to fail with an exit code of 10 at all.

OS: Unspecified → Windows 10
Hardware: Unspecified → x86_64
Summary: Intermittent Marionette Return code: 10 → Intermittent Windows ASAN Marionette Return code: 10

Maybe the following errors for Telemetry during shutdown might not be related to this particular error in Marionette but might warrant investigation and separate bugs to be filed.

https://treeherder.mozilla.org/logviewer?job_id=399246634&repo=mozilla-central&lineNumber=155239

Chris, could someone from the Telemetry team please have a look? Thanks.

Flags: needinfo?(chutten)

I actually filed bug 1805153 and moved the needinfo over there.

Flags: needinfo?(chutten)
Product: Testing → Remote Protocol
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Summary: Intermittent Windows ASAN Marionette Return code: 10 → Intermittent Windows Marionette Return code: 10
Duplicate of this bug: 1830534
Duplicate of this bug: 1838357
Duplicate of this bug: 1839987
Duplicate of this bug: 1884112

This seems to be a mozharness related issue given that it surfaces around log parsing for tinderbox. We have two lines of code that could cause the return code of 10:

https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1664 (run_command)
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/script.py#1870 (get_output_from_command)

I assume that the failure might come from parsing the log files and checking for failures.

Status: REOPENED → NEW
Component: Marionette Client and Harness → Applications: MozharnessCore
Product: Testing → Release Engineering

This actually happens when the Marionette harness sends a TEST-START but misses to send a TEST-PASS or TEST-FAIL as it can be seen here for the test layout/base/tests/marionette/test_accessiblecaret_selection_mode.py AccessibleCaretSelectionModeTestCase.test_handle_tilt_when_carets_overlap_each_other_textarea_rtl:

https://treeherder.mozilla.org/logviewer?job_id=471272624&repo=autoland&lineNumber=132339

[task 2024-08-22T00:05:46.635Z] 00:05:46     INFO -  TEST-START | layout/base/tests/marionette/test_accessiblecaret_selection_mode.py AccessibleCaretSelectionModeTestCase.test_handle_tilt_when_carets_overlap_each_other_textarea_rtl
[..]
[task 2024-08-22T00:05:48.292Z] 00:05:48     INFO -  1724285148291	Marionette	TRACE	[20] MarionetteCommands actor destroyed for window id 12884901939
[task 2024-08-22T00:05:48.295Z] 00:05:48     INFO -  1724285148295	Marionette	DEBUG	53 <- [1,97,null,{"value":null}]
[task 2024-08-22T00:05:48.296Z] 00:05:48     INFO -  TEST-START | layout/base/tests/marionette/test_accessiblecaret_selection_mode.py AccessibleCaretSelectionModeTestCase.test_long_press_changes_focus_from_content_to_contenteditable

When this job fails it's actually always about this test. When I check the test file this specific test is generated via parameterization. As it looks like the test works fine and should have passed but whether we fail a check or miss to actually print the status.

Component: Applications: MozharnessCore → Marionette Client and Harness
Product: Release Engineering → Testing
Flags: needinfo?(jdescottes)

Interestingly, when forcing a random error in the test, the failure is correctly picked up by the harness.

However when forcing an error with self.assertEqual(target_content, ...), then we have the same issue: the job fails but the error is not reported. The error message that should be printed in this case is:

TEST_END: FAIL, expected PASS - AssertionError: ' فيرفكس موزيلا فيرفكس' != 'AAA'
-  فيرفكس موزيلا فيرفكس
+ AAA

(AAA is just a random string I am expecting here to make the test fail). I wonder if on windows the test harness fails to process the arabic characters in the error message.

Flags: needinfo?(jdescottes)

Note that other failures for this test are tracked under Bug 1855083

See Also: → 1855083

Just to confirm: this is a windows only issue triggered by the fact that we are unable to print the failure message containing arabic characters. I am trying at the moment to force PYTHONIOENCODING=utf-8 in the environment to see if it helps. https://treeherder.mozilla.org/jobs?repo=try&revision=7331676ff70d01f5a3c2d27cc7c4801d04c0612b

Hi Alex,

Henrik suggested reaching out to you about such issues.

Here we have an issue where Python on windows CI jobs is not able to print a failure message containing unicode characters, for instance

TEST_END: FAIL, expected PASS - AssertionError: ' فيرفكس موزيلا فيرفكس' != 'AAA'

With this log missing, the log parser is unable to detect the proper failure.

What I tried above was to force env["PYTHONIOENCODING"] = "utf-8" in testing/mozharness/scripts/marionette.py (changeset) and this seems to fix the issue, although the printed error message is a bit harder to read:

AssertionError: ' \u0641\u064a\u0631\u0641\u0643\u0633 \u0645\u0648\u0632\u064a\u0644\u0627 \u0641\u064a\u0631\u0641\u0643\u0633' != 'AAA'

(which reminds me of https://bugzilla.mozilla.org/show_bug.cgi?id=1902981 which I filed a few months ago)

So it seems that there is a problem when trying to print unicode characters on Windows CI. Do you have a suggestion on how to fix it properly (potentially different from what I did in the patch above?)

Talking with :jgraham, there is also a suggestion to use sys.stdout.reconfigure to properly set the encoding

Flags: needinfo?(ahochheiden)

I think env["PYTHONIOENCODING"] = "utf-8" seems like the best solution.

The sys.stdout.reconfigure(encoding='utf-8') should also work, though.

There's also this you could play around with:

import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

(But you might need set ar_SA.UTF-8 specifically for that test with that approach? Not sure).

Disclaimer: I've dealt with problems like this before, but I'm by no means an expert.

Flags: needinfo?(ahochheiden)

Hi Alex. I’m not an expert with UTF-8 either, but I’m curious about what the default encoding is on Windows. Do we have any custom settings for mozbuild? If not, could it be a solution to set the encoding via mozbuild, ensuring that it works not only on CI systems but also for users running tests locally on their machines?

Flags: needinfo?(ahochheiden)

On Windows I'd expect the default encoding to either be cp-1252 or utf-8 (depending on the OS version and Python version, but maybe also locale?).

Setting it to utf-8 in the MozillaBuild profile would have side effects on tools that do use the system encoding (I assumed this, but checked with :glandium to confirm).

I don't think there's a better way than explicitly setting the encoding to handle a particular case.

Flags: needinfo?(ahochheiden)

Thanks for all the input, I think for now we should go with the initial approach at least to get the failures correctly classified from now on.

Error messages with unicode characters failed to be correctly printed on windows.
Forcing PYTHONIOENCODING to utf-8 avoids the issue.

Assignee: nobody → jdescottes
Status: NEW → ASSIGNED

Aryx: quick question related to this bug. Here the reason the issue was hard to investigate was that we had a TEST-START without a corresponding TEST-PASS - or any other line indicating the final result for the test.

In general do you think this is something we should try to detect in logs, and trigger failures in those cases? Or are there tests for which it is expected to have TEST-START but not necessarily any result?

Flags: needinfo?(aryx.bugmail)

If there is a TEST-START marker, there should also be a marker to identify the end of the logging block for the test if the test suite continues to be executed - if Firefox gets terminated/crashed and testing does not resume, there should be no such marker. End markers are e.g. TEST-OK, TEST-SKIP, TEST-UNEXPECTED-FAIL, TEST-FAIL.
In such a case of a crash of the application, a fatal error is supposed to get logged by the test suite and provide information.
In case of Marionette, it's a test step (TEST-PASS) which does not get logged but the same concept applies: if test execution continues after failure, an end marker for the failed test should be logged.

Flags: needinfo?(aryx.bugmail)
Pushed by jdescottes@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/573b8b270102 [mozharness] Force PYTHONIOENCODING to utf-8 in testing/mozharness/scripts/marionette.py r=webdriver-reviewers,jgraham
Status: ASSIGNED → RESOLVED
Closed: 2 years ago4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 132 Branch
Points: --- → 2
Whiteboard: [retriggered] → [webdriver:m12][retriggered]

All the recent failure classifications were incorrect and got re-classified. There is only one issue on esr115 which we are going to ignore.

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: