Closed Bug 1318389 Opened 7 years ago Closed 7 years ago

Intermittent browser/modules/test/browser_CaptivePortalWatcher.js | This test exceeded the timeout threshold. It should be rewritten or split up. If that's not possible, use requestLongerTimeout(N), but only as a last resort. -

Categories

(Firefox :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 54
Tracking Status
firefox52 --- fixed
firefox-esr52 --- fixed
firefox53 --- fixed
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: nhnt11)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(3 files)

Looks like this test was recently added and has pretty much immediately gone intermittent. Can we work out why and/or fix this?
Flags: needinfo?(nhnt11)
(based on timing I suspect bug 1315969)
this bug spiked yesterday, :gijs, can you find someone to look at this?  We have had a <30 failures/week trend, until this week.
Flags: needinfo?(nhnt11) → needinfo?(gijskruitbosch+bugs)
(In reply to Joel Maher ( :jmaher) from comment #10)
> this bug spiked yesterday, :gijs, can you find someone to look at this?  We
> have had a <30 failures/week trend, until this week.

Nihanth? I don't have cycles myself, unfortunately. Joel, do you have a more precise regression range? That would probably also help...
Flags: needinfo?(nhnt11)
Flags: needinfo?(jmaher)
Flags: needinfo?(gijskruitbosch+bugs)
in doing retriggers, I was not able to figure out a root cause, nor a point where we increased in frequency :(
Flags: needinfo?(jmaher)
Let's split up this file which contains many tests that do plenty of time-consuming UI operations.
Flags: needinfo?(nhnt11)
Comment on attachment 8838186 [details]
Bug 1318389 - Move captive portal test utility functions to head.js.

https://reviewboard.mozilla.org/r/113154/#review114642

I mean, I guess, but have we actually determined this is going to fix the timeouts? IOW, are the timeouts actually because we're not managing to run everything within the timeout limit, or are we just getting stuck in the middle of a test sometimes?

::: browser/base/content/test/captivePortal/browser_CaptivePortalWatcher.js:34
(Diff revision 1)
>      let win1 = yield openWindowAndWaitForFocus();
>      let win2 = yield openWindowAndWaitForFocus();

This doesn't make a lot of sense, seems like you could just open two windows without focusing them?

::: browser/base/content/test/captivePortal/head.js:37
(Diff revision 1)
> +function* cleanUp() {
> +  RecentWindow.getMostRecentBrowserWindow = RecentWindow._getMostRecentBrowserWindowCopy;
> +  delete RecentWindow._getMostRecentBrowserWindowCopy;
> +  window.CaptivePortalWatcher.init();
> +}

the setup function should just register a cleanup function that does this. That also avoids storing stuff on a global/public object instead of just locally.
Attachment #8838186 - Flags: review?(gijskruitbosch+bugs) → review+
Comment on attachment 8838187 [details]
Bug 1318389 - Move single run captive portal test cases to separate test file.

https://reviewboard.mozilla.org/r/113156/#review114644

::: browser/base/content/test/captivePortal/browser.ini:7
(Diff revision 1)
> +[browser_CaptivePortalWatcher_1.js]
> +skip-if = os == "win" # Bug 1313894

Please rename this to something identifying what kind of tests are in this file. I don't really know what 'single run testcases' means, so a comment in the new test wouldn't go amiss. Otherwise I assume you're just moving code around so this looks fine. If there are actual changes please clarify on the bug.
Attachment #8838187 - Flags: review?(gijskruitbosch+bugs) → review+
Comment on attachment 8838186 [details]
Bug 1318389 - Move captive portal test utility functions to head.js.

https://reviewboard.mozilla.org/r/113154/#review114642

Yeah, this failure is specifically a "This test exceeded the timeout threshold." issue. The cleanUp function finishes and then mochitest complains that the test file took too long.
Comment on attachment 8838186 [details]
Bug 1318389 - Move captive portal test utility functions to head.js.

https://reviewboard.mozilla.org/r/113154/#review114642

> This doesn't make a lot of sense, seems like you could just open two windows without focusing them?

A window is focused when it is opened, we simply ensure we wait for the focus to complete to avoid intermittent timing issues since we need to do a manual focus operation after opening the two windows.

> the setup function should just register a cleanup function that does this. That also avoids storing stuff on a global/public object instead of just locally.

Good point, thanks!
Comment on attachment 8838187 [details]
Bug 1318389 - Move single run captive portal test cases to separate test file.

https://reviewboard.mozilla.org/r/113156/#review114644

> Please rename this to something identifying what kind of tests are in this file. I don't really know what 'single run testcases' means, so a comment in the new test wouldn't go amiss. Otherwise I assume you're just moving code around so this looks fine. If there are actual changes please clarify on the bug.

Hmm, that was only named singleRunTestCases to distinguish from the other testcases that were run once for captive-portal-login-success and captive-portal-login-abort. Instead of naming the file something specific, I've renamed the array to just "testcases" which is much simpler. The array of tests in the first file is still called testCasesForBothSuccessAndAbort, and there's a comment. I think this makes sense.
(In reply to Nihanth Subramanya [:nhnt11] from comment #24)
> Comment on attachment 8838187 [details]
> Bug 1318389 - Move single run captive portal test cases to separate test
> file.
> 
> https://reviewboard.mozilla.org/r/113156/#review114644
> 
> > Please rename this to something identifying what kind of tests are in this file. I don't really know what 'single run testcases' means, so a comment in the new test wouldn't go amiss. Otherwise I assume you're just moving code around so this looks fine. If there are actual changes please clarify on the bug.
> 
> Hmm, that was only named singleRunTestCases to distinguish from the other
> testcases that were run once for captive-portal-login-success and
> captive-portal-login-abort. Instead of naming the file something specific,
> I've renamed the array to just "testcases" which is much simpler. The array
> of tests in the first file is still called testCasesForBothSuccessAndAbort,
> and there's a comment. I think this makes sense.

Maybe for inside the file, but ideally we should avoid having "test_foo1.js" and "test_foo2.js" files if at all possible. Is there really nothing distinctive about the tests you're putting in the second file that you can use to name this in a better way?
this is failing 50%+ of the time on linux* debug e10s.  I see r+ in here, but no landing, can we land today/tomorrow?  I would prefer to have this disabled given the high failure rate if we cannot land these patches
Flags: needinfo?(nhnt11)
(In reply to :Gijs (away until Feb 27) from comment #25)
> Maybe for inside the file, but ideally we should avoid having "test_foo1.js"
> and "test_foo2.js" files if at all possible. Is there really nothing
> distinctive about the tests you're putting in the second file that you can
> use to name this in a better way?

I thought about it and honestly can't think of a true distinction that isn't coincidental. In the future when adding tests, I intend for them to be "appended". The reason for splitting them is purely to prevent the test run from exceeding the timeout. There's precedent for the file naming in the popupNotifications tests.

I'll admit this argument isn't very strong :(
But in the interest of silencing this failure I'm going to land this as is.
Flags: needinfo?(nhnt11)
Pushed by nhnt11@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/3fcbccbc04b8
Move captive portal test utility functions to head.js. r=Gijs
https://hg.mozilla.org/integration/autoland/rev/e8836935d34b
Move single run captive portal test cases to separate test file. r=Gijs
Whiteboard: [stockwell fixed]
https://hg.mozilla.org/mozilla-central/rev/3fcbccbc04b8
https://hg.mozilla.org/mozilla-central/rev/e8836935d34b
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 54
(In reply to Nihanth Subramanya [:nhnt11] from comment #36)
> But in the interest of silencing this failure I'm going to land this as is.

This was the right the decision.

> (In reply to :Gijs (away until Feb 27) from comment #25)
> > Maybe for inside the file, but ideally we should avoid having "test_foo1.js"
> > and "test_foo2.js" files if at all possible. Is there really nothing
> > distinctive about the tests you're putting in the second file that you can
> > use to name this in a better way?
> 
> I thought about it and honestly can't think of a true distinction that isn't
> coincidental. In the future when adding tests, I intend for them to be
> "appended". The reason for splitting them is purely to prevent the test run
> from exceeding the timeout. There's precedent for the file naming in the
> popupNotifications tests.

However, other people making similar decisions as the only reason to make decisions (in spite of counterarguments) is a bad idea.

The tests as-is have descriptive names for each of their tasks, so it seems like we can use those to help with naming the test files. Please can you file a followup for this?
Flags: needinfo?(nhnt11)
Assignee: nobody → nhnt11
The change - comment 37 - seemed to mostly fix this failure, but now it has come back, with quite a few failures the last couple of days.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I used this as a testcase for my mochitest retriggering work (bug 1343327), I tried to see if I could reproduce the failure by running the test in isolation. I could, easily:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6f2ad538f1c9939d57591e6f06a7a6e7750bd118&selectedJob=83496495

(don't be fooled by the fact that the job passed, that's a seperate bug)

Since the test was retriggered with a high debugging level, there's a fair amount of interesting debug information in the log:

https://treeherder.mozilla.org/logviewer.html#?job_id=83496495&repo=try&lineNumber=2904

In particular I'm seeing a pile of assertions like this:

[task 2017-03-13T22:18:43.263116Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_TRUE(standardURL) failed: file /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
[task 2017-03-13T22:18:43.266667Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 107
[task 2017-03-13T22:18:43.269168Z] GECKO(1279) | [Child 1334] WARNING: 'NS_FAILED(aResult)', file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 614
[task 2017-03-13T22:18:43.274259Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_TRUE(standardURL) failed: file /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
[task 2017-03-13T22:18:43.276297Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 107
[task 2017-03-13T22:18:43.278148Z] GECKO(1279) | [Child 1334] WARNING: 'NS_FAILED(aResult)', file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 614
[task 2017-03-13T22:18:43.310829Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_TRUE(standardURL) failed: file /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
[task 2017-03-13T22:18:43.313371Z] GECKO(1279) | [Child 1334] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 107
[task 2017-03-13T22:18:43.315077Z] GECKO(1279) | [Child 1334] WARNING: 'NS_FAILED(aResult)', file /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp, line 614
(In reply to William Lachance (:wlach) (use needinfo!) from comment #52)
> I used this as a testcase for my mochitest retriggering work (bug 1343327),
> I tried to see if I could reproduce the failure by running the test in
> isolation. I could, easily:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=6f2ad538f1c9939d57591e6f06a7a6e7750bd118&selectedJob=8
> 3496495
> 
> (don't be fooled by the fact that the job passed, that's a seperate bug)
> 
> Since the test was retriggered with a high debugging level, there's a fair
> amount of interesting debug information in the log:
> 
> https://treeherder.mozilla.org/logviewer.
> html#?job_id=83496495&repo=try&lineNumber=2904
> 
> In particular I'm seeing a pile of assertions like this:
> 
> [task 2017-03-13T22:18:43.263116Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_TRUE(standardURL) failed: file
> /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
> [task 2017-03-13T22:18:43.266667Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 107
> [task 2017-03-13T22:18:43.269168Z] GECKO(1279) | [Child 1334] WARNING:
> 'NS_FAILED(aResult)', file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 614
> [task 2017-03-13T22:18:43.274259Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_TRUE(standardURL) failed: file
> /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
> [task 2017-03-13T22:18:43.276297Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 107
> [task 2017-03-13T22:18:43.278148Z] GECKO(1279) | [Child 1334] WARNING:
> 'NS_FAILED(aResult)', file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 614
> [task 2017-03-13T22:18:43.310829Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_TRUE(standardURL) failed: file
> /home/worker/workspace/build/src/caps/nsPrincipal.cpp, line 229
> [task 2017-03-13T22:18:43.313371Z] GECKO(1279) | [Child 1334] WARNING:
> NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004005: file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 107
> [task 2017-03-13T22:18:43.315077Z] GECKO(1279) | [Child 1334] WARNING:
> 'NS_FAILED(aResult)', file
> /home/worker/workspace/build/src/extensions/cookie/nsPermissionManager.cpp,
> line 614

I'm quite unsure what these assertions are about. I think that this test is still simply taking too long to run. I'm going to look at this today.
on a passing run of linux64-debug-e10s, we have:
[task 2017-03-13T06:30:07.270511Z] 06:30:07     INFO - TEST-OK | browser/base/content/test/captivePortal/browser_CaptivePortalWatcher.js | took 87731ms

and we fail at 90 seconds, so this test needs to be split up or reqeustLongerTimeout().  Thanks for looking into this :nhnt11
Whiteboard: [stockwell fixed] → [stockwell needswork]
I wonder what accounts for the longer runtime though? Looking at the log, it appears that the `test_detectedWithNoBrowserWindow_LongRecheck` subtest is taking about 15 seconds for the second case, which seems quite excessive to me.
most tests run in a 'range', if this is <3 seconds from the timeout, that seems acceptable in the 'range'.  I suspect finding one root cause and fixing it will temporarily put this off for a month or two as something else will slow things down.
(In reply to Joel Maher ( :jmaher) from comment #57)
> most tests run in a 'range', if this is <3 seconds from the timeout, that
> seems acceptable in the 'range'.  I suspect finding one root cause and
> fixing it will temporarily put this off for a month or two as something else
> will slow things down.

Ok apparently Kyle has a dashboard for tracking test times. :) Using that, it looks like there hasn't been a radical shift in the timing of this test, just a gentle climb around March 7th or 8th:

https://people-mozilla.org/~klahnakoski/testfailures/test.html#search=browser_CaptivePortalWatcher.js&sampleMax=2017-03-14&sampleMin=2017-03-01

So probably just increasing the timeout would be ok.
Comment on attachment 8847767 [details] [diff] [review]
use requestLongerTimeout

Review of attachment 8847767 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the patch! I took some time so that I could think about whether we want to un-split the test if we're just going to requestLongerTimeout() anyway, but having thought about it, I think we should leave it split; it's better for the long run, and I will at some point think about how we can organize these tests better.

::: browser/base/content/test/captivePortal/browser_CaptivePortalWatcher.js
@@ +1,3 @@
>  "use strict";
>  
> +// Bug 1318389 - this just takes a long time in debug

Nit: Can we say something a bit more descriptive like:

// Bug 1318389 - This test does a lot of window and tab manipulation,
//               causing it to take a long time on debug.
Attachment #8847767 - Flags: review?(nhnt11) → review+
Flags: needinfo?(nhnt11)
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b4e1fb56018b
Intermittent browser/modules/test/browser_CaptivePortalWatcher.js, requestLongerTimeout.  This test does a lot of window and tab manipulation causing it to take a long time on debug. r=nhnt11
thanks for the review, adjusted the comment and ensured the commit message reflected this as well.
Whiteboard: [stockwell needswork] → [stockwell fixed]
I don't think the requestLongerTimeout has helped, based on comment #71. :-(
Flags: needinfo?(nhnt11)
it has, we had so many early on and the last 4 days there have been no instances.
Flags: needinfo?(nhnt11)
You need to log in before you can comment on or make changes to this bug.