Closed Bug 1128311 Opened 9 years ago Closed 7 years ago

Intermittent browser_bug1108547.js | Test timed out

Categories

(Core :: Networking: DNS, defect, P3)

x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox50 --- wontfix
firefox51 --- fixed
firefox52 --- fixed
firefox53 --- fixed

People

(Reporter: philor, Assigned: ehsan.akhgari)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [necko-active])

Attachments

(1 file)

      No description provided.
This just looks like a terrible test machine. The test starts at 16:10:20, then at 16:11:07 it actually opens a new window and starts issuing TEST-PASS notifications. As soon as it returns to the event loop the timeout timer executes and declares that the test has timed out.
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
this has been picking up frequency the last few weeks, the majority are win8 opt or win7-vm debug- all in e10s mode specifically.

looking a failure case in a log [0], I see some error messages:
01:46:49     INFO - TEST-UNEXPECTED-FAIL | dom/html/test/browser_bug1108547.js | Test timed out - 
01:46:49     INFO - MEMORY STAT | vsize 571MB | vsizeMaxContiguous 474MB | residentFast 210MB | heapAllocated 70MB
01:46:49     INFO - TEST-OK | dom/html/test/browser_bug1108547.js | took 90145ms
01:46:49     INFO - Not taking screenshot here: see the one that was previously logged
01:46:49     INFO - TEST-UNEXPECTED-FAIL | dom/html/test/browser_bug1108547.js | Found a tab after previous test timed out: http://mochi.test:8888/browser/dom/html/test/file_bug1108547-3.html - 
01:46:49     INFO - ++DOCSHELL 07E0D000 == 3 [pid = 5476] [id = {1eec7932-4e9b-40cd-b812-542b4880e74e}]
01:46:49     INFO - ++DOMWINDOW == 5 (07E10800) [pid = 5476] [serial = 74] [outer = 00000000]
01:46:49     INFO - TEST-UNEXPECTED-FAIL | unknown test url | uncaught exception - TypeError: BrowserTestUtils is null at onNewTabOpened2@chrome://mochitests/content/browser/dom/html/test/browser_bug1108547.js:97:5
01:46:49     INFO - addTab@chrome://browser/content/tabbrowser.xml:2276:13
01:46:49     INFO - Tester_waitForWindowsState@chrome://mochikit/content/browser-test.js:298:20
01:46:49     INFO - Tester.prototype.nextTest<@chrome://mochikit/content/browser-test.js:569:5
01:46:49     INFO - TaskImpl_run@resource://gre/modules/Task.jsm:319:42
01:46:49     INFO - promise callback*TaskImpl_handleResultValue@resource://gre/modules/Task.jsm:396:7
01:46:49     INFO - TaskImpl_run@resource://gre/modules/Task.jsm:327:15
01:46:49     INFO - TaskImpl_handleResultValue@resource://gre/modules/Task.jsm:401:7
01:46:49     INFO - TaskImpl_run@resource://gre/modules/Task.jsm:327:15
01:46:49     INFO - TaskImpl@resource://gre/modules/Task.jsm:277:3
01:46:49     INFO - asyncFunction@resource://gre/modules/Task.jsm:252:14
01:46:49     INFO - timeoutFn@chrome://mochikit/content/browser-test.js:819:9
01:46:49     INFO - setTimeout handler*SimpleTest_setTimeoutShim@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:625:12
01:46:49     INFO - timeoutFn@chrome://mochikit/content/browser-test.js:797:13
01:46:49     INFO - setTimeout handler*Tester_execTest@chrome://mochikit/content/browser-test.js:781:9
01:46:49     INFO - Tester.prototype.nextTest</<@chrome://mochikit/content/browser-test.js:657:7
01:46:49     INFO - SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<@chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:744:59
01:46:49     INFO - JavaScript error: chrome://mochitests/content/browser/dom/html/test/browser_bug1108547.js, line 97: TypeError: BrowserTestUtils is null
01:46:49     INFO - checking window state
01:46:49     INFO - ++DOMWINDOW == 6 (08626800) [pid = 5476] [serial = 75] [outer = 07E10800]
01:46:49     INFO - Console message: [JavaScript Error: "TypeError: BrowserTestUtils is null" {file: "chrome://mochitests/content/browser/dom/html/test/browser_bug1108547.js" line: 97}]


:ehsan, can you take a look at this, normally this runs in <15seconds, so I don't think we need to make the timeout longer.



[0] https://archive.mozilla.org/pub/firefox/tinderbox-builds/mozilla-inbound-win32-debug/1482647857/mozilla-inbound_win7_vm-debug_test-mochitest-e10s-browser-chrome-5-bm138-tests1-windows-build40.txt.gz
Flags: needinfo?(ehsan)
Win8 is hardware, and it has hardly ever happened twice on the same machine; Win7 is AWS instances, where "the same one" is an even sketchier concept, but again it's not just a few consistently failing.

A much better bet would be to bet on some other previous test (either newly added, or newly moved into the same chunk) intermittently leaving the browser in offline mode.
The only tests running before this are dom/html/test/browser_bug1081537.js and dom/html/test/browser_DOMDocElementInserted.js and neither is doing anything related to the offline mode.  I also set a breakpoint on nsIOService::SetOffline to verify that this isn't called with a true argument in either processes during the test...
in looking at this, I don't see any patterns on machines or other information about what tests are running prior to this.  The test runs very stable for me locally.

What would trigger the browser to get offline?  Would some event at the OS level do it?  A preference?  something else?
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) from comment #325)
> What would trigger the browser to get offline?  Would some event at the OS
> level do it?  A preference?  something else?

Who can say...  I pushed a try job <https://treeherder.mozilla.org/#/jobs?repo=try&revision=824359c7864ba41e9f5ab16d983782a3462ef132> which crashes the browser when someone puts us in offline mode in the hopes of getting a stack trace...  Will report back when I know more.
Flags: needinfo?(ehsan)
On my try push I got a few oranges, but the assertion wasn't triggered, which means we are not really in offline mode!

Seeing the error page however indicates that something is canceling the channel with NS_ERROR_OFFLINE...  Usually that status code is used when we detect the IO service being in offline mode, but I'll add some printfs to the places where we raise this error to see if I can find out anything more.
Here is where the NS_ERROR_OFFLINE comes from: <http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/netwerk/dns/nsDNSService2.cpp#758>

https://archive.mozilla.org/pub/firefox/try-builds/eakhgari@mozilla.com-0976748241e7c1f4bfb3f25c65ca79c530ab82b2/try-win32-debug/try_win7_vm-debug_test-mochitest-e10s-browser-chrome-5-bm128-tests1-windows-build468.txt.gz
15:46:04     INFO - #01: mozilla::net::nsSocketTransport::ResolveHost() [netwerk/base/nsSocketTransport2.cpp:1097]
15:46:04     INFO - 
15:46:04     INFO - #02: mozilla::net::nsSocketTransport::OnSocketEvent(unsigned int,nsresult,nsISupports *) [netwerk/base/nsSocketTransport2.cpp:1873]
15:46:04     INFO - 
15:46:04     INFO - #03: mozilla::net::nsSocketEvent::Run() [netwerk/base/nsSocketTransport2.cpp:87]
15:46:04     INFO - 
15:46:04     INFO - #04: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:1213]
15:46:04     INFO - 
15:46:04     INFO - #05: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/glue/nsThreadUtils.cpp:381]
15:46:04     INFO - 
15:46:04     INFO - #06: mozilla::net::nsSocketTransportService::Run() [netwerk/base/nsSocketTransportService2.cpp:956]
15:46:04     INFO - 
15:46:04     INFO - #07: nsThread::ProcessNextEvent(bool,bool *) [xpcom/threads/nsThread.cpp:1213]
15:46:04     INFO - 
15:46:04     INFO - #08: NS_ProcessNextEvent(nsIThread *,bool) [xpcom/glue/nsThreadUtils.cpp:381]
15:46:04     INFO - 
15:46:04     INFO - #09: mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate *) [ipc/glue/MessagePump.cpp:338]
15:46:04     INFO - 
15:46:04     INFO - #10: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:238]
15:46:04     INFO - 
15:46:04     INFO - #11: MessageLoop::RunHandler() [ipc/chromium/src/base/message_loop.cc:232]
15:46:04     INFO - 
15:46:04     INFO - #12: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:212]
15:46:04     INFO - 
15:46:04     INFO - #13: nsThread::ThreadFunc(void *) [xpcom/threads/nsThread.cpp:469]
15:46:04     INFO - 
15:46:04     INFO - #14: _PR_NativeRunThread [nsprpub/pr/src/threads/combined/pruthr.c:397]
15:46:04     INFO - 
15:46:04     INFO - #15: pr_root [nsprpub/pr/src/md/windows/w95thred.c:95]
15:46:04     INFO - 
15:46:04     INFO - #16: ucrtbase.DLL + 0x3d5ef
15:46:04     INFO - 
15:46:04     INFO - #17: kernel32.dll + 0x53c45
15:46:04     INFO - 
15:46:04     INFO - #18: ntdll.dll + 0x637f5
15:46:04     INFO - 
15:46:04     INFO - #19: ntdll.dll + 0x637c8

Looking at accesses to mResolver, the only thing that can have happened is we calling Shutdown() here: <http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/netwerk/dns/nsDNSService2.cpp#965>.  This is a regression from bug 909050: when we close the last private window, the DNS resolver clearing may race with name resolution.  Clearly this comment is a lie <http://searchfox.org/mozilla-central/rev/51aa673e28802fe6ea89f4793721fc55334a6ac8/netwerk/dns/nsDNSService2.cpp#956> which doesn't help.

It seems to me that the approach of shutting down the host resolver and re-initing is way too heavy of a hammer.  All we really need to do is flush the DNS cache, I think.
Assignee: nobody → ehsan
Blocks: 909050
Component: DOM → Networking: DNS
Flags: needinfo?(ehsan)
Keywords: regression
Whiteboard: [necko-active]
(In reply to :Ehsan Akhgari from comment #329)
> I'm testing a patch for this on the try server:
> <https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=850532ac39d630b9a1c1cb1cef304de50b22532c>

It seems that the fix works!
Comment on attachment 8822754 [details] [diff] [review]
Only flush the DNS cache when the last private browsing window is closed

Review of attachment 8822754 [details] [diff] [review]:
-----------------------------------------------------------------

thank you!
Attachment #8822754 - Flags: review?(mcmanus) → review+
Pushed by eakhgari@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a2059d25a0ae
Only flush the DNS cache when the last private browsing window is closed; r=mcmanus
https://hg.mozilla.org/mozilla-central/rev/a2059d25a0ae
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
Please request Aurora (and Beta if the risk isn't too high) approval on this when you get a chance.
Flags: needinfo?(ehsan)
(In reply to OrangeFactor Robot from comment #337)
> 14 failures in 563 pushes (0.025 failures/push) were associated with this
> bug in the last 7 days.  
> 
> Repository breakdown:
> * mozilla-aurora: 9
> * mozilla-beta: 3
> * mozilla-central: 2
> 
> Platform breakdown:
> * windows8-64: 7
> * windows7-32-vm: 7
> 
> For more details, see:
> https://brasstacks.mozilla.com/orangefactor/
> ?display=Bug&bugid=1128311&startday=2017-01-02&endday=2017-01-08&tree=all

For the record, I looked at the mozilla-central ones here, and they are both coming from revisions before this got merged to central.
Flags: needinfo?(ehsan)
Comment on attachment 8822754 [details] [diff] [review]
Only flush the DNS cache when the last private browsing window is closed

Approval Request Comment
[Feature/Bug causing the regression]: Bug 909050, shipped in Firefox 26. :(
[User impact if declined]: When closing a private browsing window, DNS queries can return an error and prevent the respective content being loaded.  If the content being loaded is a top-level page, we mistakenly show an error page stating that the browser is in offline mode; otherwise random parts of the page may fail loading.
[Is this code covered by automated tests?]: No.  It is "covered" by this intermittent test failure going away though.
[Has the fix been verified in Nightly?]: Yes, see the previous comment.
[Needs manual test from QE? If yes, steps to reproduce]: 
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]:
[Why is the change risky/not risky?]:
[String changes made/needed]:
Attachment #8822754 - Flags: approval-mozilla-beta?
Attachment #8822754 - Flags: approval-mozilla-aurora?
Sorry, submitted my comment too eagerly...

(In reply to :Ehsan Akhgari (in Taipei, laggy response time) from comment #339)
> [Needs manual test from QE? If yes, steps to reproduce]: No.
> [List of other uplifts needed for the feature/fix]: None.
> [Is the change risky?]: No.
> [Why is the change risky/not risky?]: It is pretty well understood, and arguably the fix landed in bug 909050 wasn't the right thing to do in the first place.  I'm pretty comfortable uplifting this all the way to beta.
> [String changes made/needed]: Nada.
Comment on attachment 8822754 [details] [diff] [review]
Only flush the DNS cache when the last private browsing window is closed

fix possible dns errors when closing a pb window, for aurora52 and beta51
Attachment #8822754 - Flags: approval-mozilla-beta?
Attachment #8822754 - Flags: approval-mozilla-beta+
Attachment #8822754 - Flags: approval-mozilla-aurora?
Attachment #8822754 - Flags: approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: