Closed Bug 1716452 Opened 4 years ago Closed 4 years ago

Intermittent TEST-UNEXPECTED-TIMEOUT | toolkit/components/extensions/test/mochitest/test_ext_webrequest_auth.html | application timed out after 370 seconds with no output

Categories

(WebExtensions :: General, defect, P5)

Tracking

(firefox-esr78 unaffected, firefox89 unaffected, firefox90 unaffected, firefox91 fixed, firefox92 fixed)

RESOLVED FIXED
92 Branch

People

(Reporter: intermittent-bug-filer, Assigned: rpl)

References

(Regression)

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed:patch])

Attachments

(3 files, 1 obsolete file)

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=342754204&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/QivaseasQc6rmZ5gni9nSQ/runs/0/artifacts/public/logs/live_backing.log


[task 2021-06-14T22:55:32.941Z] 22:55:32     INFO - TEST-PASS | toolkit/components/extensions/test/mochitest/test_ext_webrequest_auth.html | authHandler.onAuthRequired called with 840 http://mochi.test:8888/tests/toolkit/components/passwordmgr/test/mochitest/authenticate.sjs?realm=auth_nonblocking_forwardAuth&user=auth_nonblocking_forwardAuth&pass=auth_nonblocking_forwardAuth result undefined 
[task 2021-06-14T22:55:32.941Z] 22:55:32     INFO - Buffered messages finished
[task 2021-06-14T22:55:32.942Z] 22:55:32    ERROR - TEST-UNEXPECTED-TIMEOUT | toolkit/components/extensions/test/mochitest/test_ext_webrequest_auth.html | application timed out after 370 seconds with no output
[task 2021-06-14T22:55:32.942Z] 22:55:32    ERROR - Force-terminating active process(es).
[task 2021-06-14T22:55:32.942Z] 22:55:32     INFO - Determining child pids from psutil...
[task 2021-06-14T22:55:32.953Z] 22:55:32     INFO - [5328, 5350, 5415, 5448, 5482, 5512, 5545, 5566, 5598, 5631, 5659, 5696, 5729, 5761, 5787, 5826, 5853, 5875, 5906, 5931, 5956, 5981, 6006, 6031, 6056, 6081, 6106, 6131, 6157, 6182, 6230]
[task 2021-06-14T22:55:32.954Z] 22:55:32     INFO - ==> process 5267 launched child process 5287

Looks like this started after Bug 1709838 landed. Retriggers.
Christoph, could you have a look at it? Thank you.

Flags: needinfo?(ckerschb)
Regressed by: 1709838
Has Regression Range: --- → yes

Set release status flags based on info from the regressing bug 1709838

On it ...

Assignee: nobody → ckerschb
Status: NEW → ASSIGNED
Flags: needinfo?(ckerschb)

Hey Cosmin, I evaluated test_ext_webrequest_auth.html in detail and I don't think that Bug 1709838 is causing the intermittent failure.

In more detail:
Within Bug 1709838, we enable https-first in Private Windows, which upgrades connections from http to https in PBM, but ...

  • only if the port is the default port; in other words, https-first does not apply to http://mochi.test:8888/ as used in the test case
  • only for connections in PBM, which is not the case in this test.

When running the test I also couldn't find any upgrades or any other indicators that https-first would be interfering with it. Maybe some other patch caused this intermittent failure to appear?

Assignee: ckerschb → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(csabou)

Hi Christoph, maybe it's something that surfaced along with those changes.
The thing is that before those changes there are 11 green jobs for test_ext_webrequest_auth (log), and after the patch landed there are 8/21 timeouts, as you can see from this TH link. Added 20 more retriggers to see if it will fail on the previous push.

Flags: needinfo?(csabou)

After 20 more retriggers, this is still the first push where it fails.

Flags: needinfo?(ckerschb)

(In reply to Cosmin Sabou [:CosminS] from comment #7)

After 20 more retriggers, this is still the first push where it fails.

Hey Shane, I don't know how to move forward with this intermittent test failure. I left some analysis in comment 5, but I can't reproduce the error. Since you wrote the test, I was wondering if you could take a look; maybe with joint forces we can get the problem resolved - thank you!

Flags: needinfo?(ckerschb) → needinfo?(mixedpuppy)

There is probably some minor timing change that affects this test on Linux. The log shows that onAuthRequired is called, but otherwise there is not much logging in this test.

I would add some logging between each await to figure out which promise is failing; that might tell us something. Then maybe consider doing a Promise.all on these promises: https://searchfox.org/mozilla-central/rev/683046a2d37e1e2a4fd68e7749c1effd1aee2d38/toolkit/components/extensions/test/mochitest/test_ext_webrequest_auth.html#109-117
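
A minimal sketch of what that debugging could look like inside the test task; the promise names below are placeholders for illustration, not the actual identifiers used in test_ext_webrequest_auth.html:

    // Placeholder promise names standing in for the awaits in the linked range.
    info("waiting for the onBeforeRequest promise");
    await onBeforeRequestPromise;
    info("waiting for the onAuthRequired promise");
    await onAuthRequiredPromise;
    info("waiting for the cleanup promise");
    await cleanupPromise;

    // Alternatively, collapse the sequential awaits into a single one, after the
    // info() logging has narrowed down which promise never resolves:
    await Promise.all([onBeforeRequestPromise, onAuthRequiredPromise, cleanupPromise]);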

Failing that, I'd disable the test for Linux debug builds, since those seem to account for the bulk of the timeouts.

Flags: needinfo?(mixedpuppy)
Assignee: nobody → abutkovits
Status: NEW → ASSIGNED
Assignee: abutkovits → nobody
Status: ASSIGNED → NEW
Crash Signature: [@ profiler_thread_sleep()]

Hey Shane,
have you ever been able to trigger it locally? I tried adding some debugging logging and triggering the failure in a try push with a TV job, but unfortunately I was unable to trigger it in that TV job, even though the job ran 10 times.
This makes me suspect there may be some interaction with a test case running before this one; without that preceding test, this one does not fail even when executed multiple times and in chaos mode (as the TV job does).

Below is some more context on a couple of details I noticed while looking at the logs of this intermittent failure.


Looking at a bunch of the intermittent failures tracked in this bug, I noticed that all screenshots captured when the failure is triggered show the authentication dialog open, while the test is expected to intercept the request that opens the authentication dialog and replace the dialog with a mock object to verify the behaviors this test covers.

I'm unable to trigger this locally and so I didn't dig further into it, but looking at the AuthRequestor class (defined and used internally in WebRequest.jsm) I noticed that AuthRequestor's _getForwardPrompt has a try..catch with an empty catch block as part of its own fallback mechanism. So there is a chance that when the test fails it is somehow triggering an exception, the fallback mechanism is opening the authentication dialog as a side effect, and the empty catch block is preventing us from seeing an error log when that happens.

Triggering the failure with some additional debugging logs in those two _getForwardPrompt try...catch blocks may help shed some light on this intermittent failure.

To be fair, it may be reasonable to still report the error (using Cu.reportError) when running a debug build; that may give us more details to work with if this kind of unexpected issue keeps happening in the future.
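
For illustration, a rough sketch of that debug-only reporting; the helper names and the surrounding structure of _getForwardPrompt are paraphrased assumptions, not the actual WebRequest.jsm code:

    // Sketch only; identifiers are hypothetical.
    try {
      // Attempt to forward the auth prompt through the channel's own
      // notification callbacks (the path the test expects to exercise).
      prompt = getForwardedAuthPrompt(channel);
    } catch (e) {
      // Previously swallowed silently; in debug builds, surface the error so
      // failures like this intermittent leave a trace in the test log.
      if (AppConstants.DEBUG) {
        Cu.reportError(e);
      }
      // ...fall through to the fallback, which shows the real auth dialog.
    }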

Flags: needinfo?(mixedpuppy)
Assignee: nobody → lgreco
Status: NEW → ASSIGNED

Adding leave-open to prevent this issue from being closed once D119644 lands; the purpose of that patch is to give us some more details about the failure while the test runs on the build infrastructure along with the other tests in the same chunk, but it doesn't fix the intermittent failure yet.

(Also clearing the needinfo assigned to Shane; we have already discussed this and agreed to proceed with the proposed approach described in comment 15.)

Flags: needinfo?(mixedpuppy)
Keywords: leave-open
Pushed by luca.greco@alcacoop.it: https://hg.mozilla.org/integration/autoland/rev/d5e14c4d6f96 Ensure WebRequest.jsm AuthRequestor _getForwardPrompt does report errors in debug builds. r=mixedpuppy

This patch fixes the intermittent failures triggered by test_ext_webrequest_auth.html by making sure that
the "http-on-modify-request" observer added by the test tasks in this test file ignores any http
request that isn't related to the mochi.test URL expected to trigger the authentication dialog.

With the previous version of the chrome script executed by these two test tasks, there was a chance that the observer
would set the mocked nsIAuthPromptProvider/nsIAuthPrompt2 notification callbacks on the wrong channel, if that
http channel was also related to the mochi.test host. After that the observer was removed, and so it would never
set notificationCallbacks on the channel it was actually meant for (which made the test trigger the real
authentication prompt that can be seen in the failure screenshots).

As part of this patch we are also reverting the additional debugging logs added in a previous patch attached to
this issue.
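
For reference, a hedged sketch of the shape of that change in the test's chrome script; the constant, the exact URL check, and the mock object name are assumptions for illustration, not the actual patch:

    // Sketch of the fix described above; identifiers are hypothetical.
    const EXPECTED_URL =
      "http://mochi.test:8888/tests/toolkit/components/passwordmgr/test/mochitest/authenticate.sjs";

    const observer = {
      observe(subject, topic) {
        if (topic !== "http-on-modify-request") {
          return;
        }
        const channel = subject.QueryInterface(Ci.nsIHttpChannel);
        // Ignore any request that isn't the one expected to trigger the auth
        // dialog, instead of mocking the first mochi.test channel that shows up.
        if (!channel.URI.spec.startsWith(EXPECTED_URL)) {
          return;
        }
        // Install the mocked nsIAuthPromptProvider/nsIAuthPrompt2 callbacks on
        // the intended channel, then stop observing.
        channel.notificationCallbacks = mockNotificationCallbacks;
        Services.obs.removeObserver(observer, "http-on-modify-request");
      },
    };
    Services.obs.addObserver(observer, "http-on-modify-request");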

Removing the leave-open keyword; we are pushing the fix attached to this issue to autoland as D119880, and I'm pretty confident it should prevent these intermittent failures from now on.

Keywords: leave-open
Pushed by luca.greco@alcacoop.it: https://hg.mozilla.org/integration/autoland/rev/9078c74116cd Fix test_ext_webrequest_auth.html intermittent failures. r=mixedpuppy
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch

The trend of the intermittent failures tracked by this bug also seems to confirm that the last patch we landed (comment 23) is preventing the unexpected timeout failures.

Removing disable-recommended from the whiteboard to reflect that.

Whiteboard: [stockwell disable-recommended] → [stockwell]
Attachment #9229565 - Attachment is obsolete: true

The patch landed in nightly and beta is affected.
:rpl, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(lgreco)
Whiteboard: [stockwell] → [stockwell fixed:patch]

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #26)

The patch landed in nightly and beta is affected.
:rpl, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

The patch with the fix would not graft cleanly on beta as-is without landing both D119644 and D119880 (because the fix in D119880 also reverts the additional debugging logging added in D119644).

It may not be worth it if the intermittency rate on beta isn't that high, but if it is high enough to be annoying then I'll be more than happy to attach a separate test-only patch that only applies the fix to the test (which also wouldn't require an explicit uplift request).

Andreaa, wdyt?

Flags: needinfo?(lgreco) → needinfo?(apavel)

(In reply to Luca Greco [:rpl] [:luca] [:lgreco] from comment #28)

(In reply to Release mgmt bot [:sylvestre / :calixte / :marco for bugbug] from comment #26)

The patch landed in nightly and beta is affected.
:rpl, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

The patch with the fix would not graft cleanly on beta as-is without landing both D119644 and D119880 (because the fix in D119880 also reverts the additional debugging logging added in D119644).

It may not be worth it if the intermittency rate on beta isn't that high, but if it is high enough to be annoying then I'll be more than happy to attach a separate test-only patch that only applies the fix to the test (which also wouldn't require an explicit uplift request).

Andreaa, wdyt?

Failure rate is not high at the moment: https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2021-07-12&endday=2021-07-19&tree=mozilla-beta&bug=1716452.
We'll keep an eye on it and add a ni if it reaches a high frequency.

Thank you!

Flags: needinfo?(apavel)

IMO, we should go ahead with the test-only uplift patch as we'll be living with these failures on ESR91 for the next year otherwise.

Flags: needinfo?(lgreco)

This attachment contains the fix landed on Nightly, turned into a single test-only patch that can be applied cleanly on 91 beta as-is (obtained by folding both D119644 and D119880 into a single patch and then double-checking that it applies without conflicts on 91 beta as expected).

(In reply to Ryan VanderMeulen [:RyanVM] from comment #31)

IMO, we should go ahead with the test-only uplift patch as we'll be living with these failures on ESR91 for the next year otherwise.

+1 That's definitely a good point.

I've just attached the test-only patch, which can be grafted onto 91 beta, in comment 32.

Flags: needinfo?(lgreco)