Open Bug 1924507 Opened 4 months ago Updated 2 days ago

Intermittent TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/classic/new_session/unhandled_prompt_behavior.py | single tracking bug

Categories

(Remote Protocol :: Marionette, defect, P5)


Tracking

(Not tracked)

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure)

This specific test timeout, which only happens on macOS, started around October 13th and by October 15th was producing around 5 failures per day for shippable builds. Checking the logs shows a delay in starting Firefox, which means the final-ui-startup notification is sent out roughly 2s late. In working cases the notification is received after at most 2.5s, but for the jobs that time out it now takes between 4s and 6s.

We didn't change this test or any behavior in the Remote Agent (Marionette). So I wonder if there was any recent macOS-specific change that might have caused some slowness during startup. Stephen, do you maybe have an idea? Thanks.

Flags: needinfo?(spohl.mozilla.bugs)

It looks like the issue goes back even further, and we have a huge difference in test duration between different Wd4 jobs on macOS for opt/shippable builds. Two Treeherder jobs that ran next to each other show that.

While one takes 129s, the older one takes 163s.

Overall the tests are running much closer to their time limit these days, which means some other factor has come into play as well. But that is hard to figure out.

Joel, any idea where this huge difference could originate from? Are the Mac workers also shared between different jobs, which could cause Firefox to run slowly?

Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(jmaher)

Locally the tests take 80s for me on macOS, and on wpt.fyi they take only 60s on Linux.
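For reference, a minimal sketch of how that local timing could be measured, assuming a mozilla-central checkout with a finished build; the time_wpt_run helper and the exact mach invocation are illustrative rather than taken from this report:

# Minimal sketch (assumption): time a local run of the affected wpt test via mach.
# Assumes a mozilla-central checkout with a completed build; run from the repo root.
import subprocess
import time

TEST = "testing/web-platform/tests/webdriver/tests/classic/new_session/unhandled_prompt_behavior.py"

def time_wpt_run(headless: bool = False) -> float:
    """Run the test once through `./mach wpt` and return the wall-clock duration in seconds."""
    cmd = ["./mach", "wpt", TEST]
    if headless:
        cmd.append("--headless")
    start = time.monotonic()
    subprocess.run(cmd, check=False)
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"headful run took {time_wpt_run(False):.1f}s")
    print(f"headless run took {time_wpt_run(True):.1f}s")

Comparing the headful and headless numbers locally would also help confirm the observation further down that headless runs are noticeably faster.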

Yes, the Mac workers are shared with all unittests and perftests, though not with builders and signers. Nothing has changed with our task -> pool mapping for a long time. We did force a cleanup of these directories (September 24-26):

/opt/worker/downloads
/Users/cltbld/Library/Caches/Mozilla/updates/opt/worker/tasks/

If for some reason these tests were depending on data persisting in those directories between tasks, then I could see that being affected.

I have noticed that failing tasks have:

[task 2024-10-22T12:54:06.429Z] 12:54:06     INFO - PID 3551 | 1729601646428	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:06.429Z] 12:54:06     INFO - PID 3551 | 1729601646428	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:06.556Z] 12:54:06     INFO - PID 3551 | 1729601646555	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:06.556Z] 12:54:06     INFO - PID 3551 | 1729601646555	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:06.791Z] 12:54:06     INFO - PID 3551 | 1729601646791	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:06.792Z] 12:54:06     INFO - PID 3551 | 1729601646791	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:07.025Z] 12:54:07     INFO - PID 3551 | 1729601647025	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:07.026Z] 12:54:07     INFO - PID 3551 | 1729601647025	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:07.256Z] 12:54:07     INFO - PID 3551 | 1729601647255	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:07.256Z] 12:54:07     INFO - PID 3551 | 1729601647255	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:07.495Z] 12:54:07     INFO - PID 3551 | 1729601647495	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:07.496Z] 12:54:07     INFO - PID 3551 | 1729601647495	geckodriver::marionette	TRACE	Retrying in 100ms
[task 2024-10-22T12:54:07.731Z] 12:54:07     INFO - PID 3551 | 1729601647731	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
[task 2024-10-22T12:54:07.732Z] 12:54:07     INFO - PID 3551 | 1729601647731	geckodriver::marionette	TRACE	Retrying in 100ms

It is likely that these messages always exist but are only printed out when we have a failure, so we only see them on the failed (timed-out) tasks.

For macOS, I went back on m-c to October 10th, and osx/shippable was taking in the range of 130-165 seconds. I did a handful of retriggers and most of the tasks failed; in fact, some passing ones finished faster. I think this is intermittent, but it could be machine dependent. I did notice that headless was a lot faster, even today.

Retriggering off the tip of m-c, I see ranges of 146-182s on the retriggers (180+s for the original jobs). Not as many retriggers timed out on the tip, but the overall runtimes seemed longer.

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC -8) from comment #5)

> Yes, the Mac workers are shared with all unittests and perftests, though not with builders and signers. Nothing has changed with our task -> pool mapping for a long time. We did force a cleanup of these directories (September 24-26):

Ok, so it could be that another test job runs in parallel and claims too many resources, so that Firefox takes longer to start up on our side due to the limited resources available.

> If for some reason these tests were depending on data persisting in those directories between tasks, then I could see that being affected.

No, we do not have any such dependency. It is just that Firefox takes longer than expected to start up.

> I have noticed that failing tasks have:

> [task 2024-10-22T12:54:06.429Z] 12:54:06     INFO - PID 3551 | 1729601646428	geckodriver::browser	TRACE	Failed to open /var/folders/p4/_gm58y794639164h1xnkpyyw000014/T/tmpqvlzhgva.mozrunner/MarionetteActivePort
> [task 2024-10-22T12:54:06.429Z] 12:54:06     INFO - PID 3551 | 1729601646428	geckodriver::marionette	TRACE	Retrying in 100ms
> [...]

> It is likely that these messages always exist but are only printed out when we have a failure, so we only see them on the failed (timed-out) tasks.

No, these messages are emitted in all cases, not only for failing tasks. They just mean that geckodriver tries to read the Marionette port of the started Firefox process but it isn't available yet. The number of these retry lines does, however, demonstrate a startup delay.
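To put a rough number on that delay, the retry lines can simply be summed: each "Retrying in 100ms" entry means geckodriver slept another 100ms waiting for the MarionetteActivePort file, so the sum gives a lower bound on the wait. A minimal sketch, assuming only the log format from the excerpt above; the estimate_wait_ms helper is hypothetical and not part of any existing tooling:

# Minimal sketch (assumption): estimate how long geckodriver waited for the
# Marionette port by summing its "Retrying in <n>ms" TRACE lines in a task log.
import re
import sys

RETRY_RE = re.compile(r"geckodriver::marionette\s+TRACE\s+Retrying in (\d+)ms")

def estimate_wait_ms(log_path: str) -> int:
    """Return the total time (in ms) geckodriver spent sleeping between retries."""
    total = 0
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = RETRY_RE.search(line)
            if match:
                total += int(match.group(1))
    return total

if __name__ == "__main__":
    waited = estimate_wait_ms(sys.argv[1])
    print(f"geckodriver retried for at least ~{waited / 1000:.1f}s before Marionette was ready")

On a healthy run this value should stay well under a second; the excerpt above alone already accounts for close to a second of retries.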

> For macOS, I went back on m-c to October 10th, and osx/shippable was taking in the range of 130-165 seconds. I did a handful of retriggers and most of the tasks failed; in fact, some passing ones finished faster. I think this is intermittent, but it could be machine dependent. I did notice that headless was a lot faster, even today.

Maybe this means that graphics-related initialization could be involved, and that it causes longer startup delays especially in a shared worker environment?

Markus, do you have an idea?

Flags: needinfo?(mstange.moz)

Because other tests are affected as well, I moved the needinfo request over to bug 1927416.

Depends on: 1927416
Flags: needinfo?(mstange.moz)
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Status: REOPENED → RESOLVED
Closed: 2 months ago → 15 days ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---