Closed Bug 1785893 Opened 2 years ago Closed 2 months ago

Intermittent /mozilla/tests/webdriver/new_session/profile_root.py | single tracking bug

Categories

(Remote Protocol :: Marionette, defect, P5)

Default
defect

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: jmaher, Unassigned)

References

(Regression)

Details

(Keywords: intermittent-failure, intermittent-testcase, regression)

Attachments

(2 obsolete files)

No description provided.

Additional information about this bug's failures and frequency patterns can be found by running: ./mach test-info failure-report --bug 1785893

I'm still not sure how to handle single tracking bugs when there are multiple underlying issues. :( Nearly all the failures here are actually related to bug 1770042, which got duped to this one, and that is a regression caused by bug 1766125. But should we mark this bug as a regression? I don't know.

Here is the exact failure message:

[task 2022-05-18T20:22:37.772Z] 20:22:37     INFO - STDOUT: >           geckodriver(config=config, extra_args=extra_args)
[task 2022-05-18T20:22:37.773Z] 20:22:37     INFO - STDOUT: E           Failed: DID NOT RAISE <class 'Exception'>
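For context, the "DID NOT RAISE" message comes from the usual pytest.raises pattern. A minimal illustrative sketch (fixture names are simplified and not the actual wdspec test code):

import pytest

def test_profile_root_missing(config, geckodriver, tmp_path):
    # The profile root is intentionally never created, so starting
    # geckodriver with it is expected to fail.
    extra_args = ["--profile-root", str(tmp_path / "missing-path")]
    # pytest reports "Failed: DID NOT RAISE <class 'Exception'>" when the
    # call below unexpectedly succeeds instead of raising.
    with pytest.raises(Exception):
        geckodriver(config=config, extra_args=extra_args)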
Regressed by: 1766125

:jgraham, since you are the author of the regressor, bug 1766125, could you take a look?

For more information, please visit auto_nag documentation.

Flags: needinfo?(james)

I spent some time trying to reproduce this before, failed, and judged that the frequency was too low to be worth spending a lot of additional time trying to track down what is presumably some kind of race condition.

Flags: needinfo?(james)

Yeah not sure why P3 was set for this intermittent failure. Single tracking bugs should also get a P5 assigned by default.

Priority: P3 → P5

James, I saw this interesting failure:
https://treeherder.mozilla.org/logviewer?job_id=392030052&repo=try&lineNumber=92541

[task 2022-10-02T13:08:34.283Z] 13:08:34     INFO - STDOUT: Running command: ['/builds/worker/fetches/geckodriver', '--port', '45898', '-vv', '--profile-root', '/tmp/wdspec-ah4aw5os/pytest/test_profile_root_missing0/missing-path']
[task 2022-10-02T13:08:34.287Z] 13:08:34     INFO - STDOUT: /builds/worker/fetches/geckodriver: error: Unable to write to temporary directory; consider --profile-root with a writeable directory

So it seems that a previously created temporary folder already exists here for this affected test? In the case of a non-writable folder geckodriver only prints the usage but does not raise an error. We should probably do so?

Flags: needinfo?(james)

The usage! macro causes that function to return Err, which should cause geckodriver to terminate. On the test side, that should raise an exception. This happens before the server is started, so there shouldn't be a race condition there. tmp_path is supposed to be unique per test, and if there was a problem with that we wouldn't see this message (which is what we expect). So my best guess is the wait-for-connection code is sometimes returning when it shouldn't. Maybe there's a race between shutting down a previous geckodriver instance and the current one? We could try passing in a random server port (or making the server port system-allocated and allowing clients to read it, which would be better but not trivial to ship in a backwards compatible way).

Flags: needinfo?(james)

Ah, I see. I missed that the macro causes an error to be raised. wdspec tests already use a different port for geckodriver, so I don't think this is a race with another instance. Also, the preceding test ends successfully.

To harden this specific test I would suggest that we simply add another assert before calling geckodriver to make sure that the temporary profile path doesn't exist (see the sketch after the try link below). Maybe that will help us get closer to the problem. I'll push a try build with this change and maybe we can reproduce it.

https://treeherder.mozilla.org/jobs?repo=try&revision=75753ba4ad9ab23894ea5373fbc67e175bda3fce
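The extra pre-condition mentioned above could look roughly like this, placed right before the geckodriver call in the test (illustrative only):

    # Fail early and clearly if a stale profile root is left over from a
    # previous run.
    assert not profile_root.exists(), "profile root unexpectedly exists already"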

Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Attachment #9297787 - Attachment is obsolete: true

And we got a failure, and it shows that the additional assertion didn't fail, so the profile path indeed doesn't exist beforehand:
https://treeherder.mozilla.org/logviewer?job_id=392610363&repo=try&lineNumber=152356

So for some reason wptrunner doesn't detect the error raised by geckodriver.

Assignee: hskupin → nobody
Status: ASSIGNED → NEW
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
Assignee: hskupin → nobody
Status: ASSIGNED → NEW

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #20)

I pushed another try build with the debug tbpl logging enabled in CI:
https://treeherder.mozilla.org/jobs?repo=try&revision=6bc4417662cf4a592c6255c476d9ef9c55be89be

Hm, none of the Wd jobs are failing, but I also cannot see any indication of debug-level logs from wptrunner in the logs. Using your suggested mach logger level doesn't work given that in CI we do not run via mach. But using --log-raw-level=debug doesn't help either.

James, do you have any further ideas? Maybe something is wrong with the --log-raw-level argument?

Flags: needinfo?(james)

We always log raw logs at debug level. If you look at, say, https://firefoxci.taskcluster-artifacts.net/WSz0nqBrSzmaVKyGhPMPXA/1/public/test_info/wpt_raw.log you can find the debug level logs. To get the human readable output to include debug level logs you need to pass in --log-tbpl-level=debug, because we're using the TBPL log formatter in CI for compatibility with regexp-based log parsing.

Flags: needinfo?(james)

But --log-tbpl-level=debug didn't work because of ValueError: Options specified for unused formatter(s) (['tbpl']) have no effect:
https://treeherder.mozilla.org/logviewer?job_id=392817348&repo=try&lineNumber=1673

So to me it means that we do not use the TBPL formatter, or I'm missing something else.

Flags: needinfo?(james)

Ah, so we actually just output the logs to stdout and then mozharness has its own byzantine processing of them.

https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/base/config.py#373-381 suggests there's a log_level config option that needs to be set to the numeric value of logging.DEBUG which seems to be 10. So probably you can do that in https://searchfox.org/mozilla-central/source/testing/mozharness/configs/web_platform_tests/prod_config.py
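A minimal sketch of that, assuming the log_level key is picked up by mozharness as described above (the exact key handling is an assumption here):

import logging

config = {
    # ...existing web_platform_tests prod config entries...
    # mozharness wants the numeric value; logging.DEBUG is 10
    "log_level": logging.DEBUG,
}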

Or, alternatively, as I mentioned before all the logging ends up in the wpt_raw.log artifact, so you can also just look in that one (and even reformat it using the tbpl or mach formatters), without even rerunning the jobs.

Flags: needinfo?(james)

(In reply to James Graham [:jgraham] from comment #28)

Or, alternatively, as I mentioned before all the logging ends up in the wpt_raw.log artifact, so you can also just look in that one (and even reformat it using the tbpl or mach formatters), without even rerunning the jobs.

As James pointed out on Element the following mozlog script can be used to do the reformatting:
https://searchfox.org/mozilla-central/source/testing/mozbase/mozlog/mozlog/scripts/format.py
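For reference, a small sketch of doing the same reformatting with the mozlog APIs directly (assuming a local copy of the wpt_raw.log artifact; the formatter choice is just an example):

import sys

from mozlog import reader
from mozlog.formatters.tbplformatter import TbplFormatter

formatter = TbplFormatter()
with open("wpt_raw.log") as log_file:
    for data in reader.read(log_file):
        # Each line in the raw log is a structured (JSON) log entry; the
        # formatter turns it back into human-readable TBPL output.
        formatted = formatter(data)
        if formatted:
            sys.stdout.write(formatted)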

Looks like I forgot to add the last related try build as a comment and now I cannot find it anymore. I pushed a new try build with the --log-raw-level=debug entry for the mozharness script.

https://treeherder.mozilla.org/jobs?repo=try&revision=83d30ce9dd6245db4c6fb6ad6035f58358b06dff

Moving bug to Remote Protocol::Marionette component per bug 1815831.
Component: geckodriver → Marionette
Product: Testing → Remote Protocol

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #43)

Looks like I forgot to add the last related try build as a comment and now I cannot find it anymore. I pushed a new try build with the --log-raw-level=debug entry for the mozharness script.

https://treeherder.mozilla.org/jobs?repo=try&revision=83d30ce9dd6245db4c6fb6ad6035f58358b06dff

None of the triggered wdspec jobs failed for this issue. As such I triggered a new try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=4b6ba9c2e551a4e5eb0f862848874f69a661d471

(In reply to Henrik Skupin [:whimboo][⌚️UTC+1] from comment #52)

None of the triggered wdspec jobs failed for this issue. As such I triggered a new try build:
https://treeherder.mozilla.org/jobs?repo=try&revision=4b6ba9c2e551a4e5eb0f862848874f69a661d471

I re-triggered around 110 wdspec tests on various platforms and build types but was not able to get this failure reproduced even once.

Maybe we have to land some extra log output for such a failure that includes the user permissions as set on the folder (a rough idea is sketched below). That might give us a hint as to why we cannot use it.
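A hypothetical example of the kind of diagnostic output meant here (Python side, names are illustrative; the actual extra logging might rather live in geckodriver itself):

import os
import stat

def describe_permissions(path):
    # e.g. "drwxr-xr-x uid=1000 gid=1000" for the given folder
    st = os.stat(path)
    return "%s uid=%d gid=%d" % (stat.filemode(st.st_mode), st.st_uid, st.st_gid)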

Duplicate of this bug: 1828041

The last failure here is from April 20th. So it looks like the patch on bug 1824841 fixed this intermittent test failure. I'll check back later this week whether that is still true and then mark the bug as WFM.

I can no longer see a problem with the test test_profile_root_missing, so this high-frequency failure is gone now.

But there is still a low-frequency one remaining for test_profile_root, which runs before it. Here we fail because the address is still in use:
https://treeherder.mozilla.org/logviewer?job_id=413563326&repo=autoland&lineNumber=99193-99197

[task 2023-04-24T13:32:42.413Z] 13:32:42     INFO - STDOUT: Running command: ['/builds/worker/fetches/geckodriver', '--port', '45314', '-vv', '--profile-root', '/tmp/wdspec-5mg9t7yw/pytest/test_profile_root0/geckodriver-test']
[task 2023-04-24T13:32:42.414Z] 13:32:42     INFO - Test harness output was not a valid structured log message
[task 2023-04-24T13:32:42.414Z] 13:32:42     INFO - /builds/worker/fetches/geckodriver: error: Address in use (os error 98)
[task 2023-04-24T13:32:42.415Z] 13:32:42     INFO - geckodriver 0.33.0 (94f7e9ac107e 2023-04-20 21:50 +0000)
[task 2023-04-24T13:32:42.415Z] 13:32:42     INFO - WebDriver implementation for Firefox

This is strange given that we try to find a free port beforehand, but since we do not hold it, some other process might grab it in the meantime (see the sketch below). There is probably nothing to worry about right now given that it happens very rarely.
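For illustration, the usual "find a free port" approach and why it can race (a generic sketch, not the actual wptrunner code):

import socket

def get_free_port():
    # Ask the OS for any free port, then close the socket again. Between
    # closing it here and geckodriver binding the port later, another
    # process can grab the same port, which yields "Address in use".
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    port = sock.getsockname()[1]
    sock.close()
    return port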

The currently remaining failure should indeed be related to bug 1571426.

Depends on: 1571426
See Also: 1571426

(In reply to Henrik Skupin [:whimboo][⌚️UTC+2] from comment #84)

The currently remaining failure should indeed be related to bug 1571426.

That one has been fixed now, and since it landed there have been no more failures.

Closing as WFM for now. Let's reopen if failures start to happen again.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → WORKSFORME
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

The fix for bug 1853120 landed for Firefox 120 and was never uplifted. It might help on the esr115 branch, but given that there has been only a single failure so far I don't think we should consider an uplift.

Status: REOPENED → RESOLVED
Closed: 1 year ago → 5 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---

(In reply to Treeherder Bug Filer from comment #98)

New failure instance: https://treeherder.mozilla.org/logviewer?job_id=474132409&repo=mozilla-esr115

This failure is most likely based on the following line in the logs:
https://treeherder.mozilla.org/logviewer?job_id=474132409&repo=mozilla-esr115&lineNumber=51783

/builds/worker/fetches/geckodriver: error: Unable to write to temporary directory; consider --profile-root with a writeable directory

It means that geckodriver fails to create the Firefox profile because the tmp folder is not writeable? Let's see if that happens again or if it was just a random issue.

Status: REOPENED → RESOLVED
Closed: 5 months ago → 2 months ago
Resolution: --- → INCOMPLETE