Closed Bug 1117979 Opened 5 years ago Closed 3 years ago

Intermittent test_location_error.js | xpcshell return code: 0 | run_test/< - [run_test/< : 39] 1 == 0

Categories

(Firefox :: General, defect, P3)

x86_64
Windows 8.1
defect
Points:
2

Tracking

()

RESOLVED WORKSFORME
Firefox 38
Iteration:
38.1 - 26 Jan
Tracking Status
firefox35 --- unaffected
firefox36 --- fixed
firefox37 --- fixed
firefox38 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: RyanVM, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intermittent-failure)

Attachments

(1 file, 2 obsolete files)

Two failures within minutes of each other on two different trees (spanning different Gecko revisions no less) = external dependencies maybe?

11:59:59 INFO - TEST-START | toolkit/components/search/tests/xpcshell/test_location_error.js
12:00:04 WARNING - TEST-UNEXPECTED-FAIL | toolkit/components/search/tests/xpcshell/test_location_error.js | xpcshell return code: 0
12:00:04 INFO - TEST-INFO took 4735ms
12:00:04 INFO - >>>>>>>
12:00:04 INFO - (xpcshell/head.js) | test MAIN run_test pending (1)
12:00:04 INFO - TEST-PASS | toolkit/components/search/tests/xpcshell/test_location_error.js | run_test - [run_test : 8] true == true
12:00:04 INFO - PROCESS | 5452 | *** Search: SearchService.init
12:00:04 INFO - PROCESS | 5452 | *** Search: metadata init: starting
12:00:04 INFO - (xpcshell/head.js) | test pending (2)
12:00:04 INFO - (xpcshell/head.js) | test MAIN run_test finished (2)
12:00:04 INFO - running event loop 

...

12:00:04 INFO - TEST-PASS | toolkit/components/search/tests/xpcshell/test_location_error.js | run_test/< - [run_test/< : 34] 0 == 0
12:00:04 WARNING - TEST-UNEXPECTED-FAIL | toolkit/components/search/tests/xpcshell/test_location_error.js | run_test/< - [run_test/< : 39] 1 == 0
12:00:04 INFO - C:/slave/test/build/tests/xpcshell/tests/toolkit/components/search/tests/xpcshell/test_location_error.js:run_test/<:39
12:00:04 INFO - resource://gre/components/nsSearchService.js:onSuccess:3982
12:00:04 INFO - resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:Handler.prototype.process:870
12:00:04 INFO - resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:this.PromiseWalker.walkerLoop:749
12:00:04 INFO - resource://gre/modules/Promise.jsm -> resource://gre/modules/Promise-backend.js:this.PromiseWalker.scheduleWalkerLoop/<:691
12:00:04 INFO - C:\slave\test\build\tests\xpcshell\head.js:_do_main:184
12:00:04 INFO - C:\slave\test\build\tests\xpcshell\head.js:_execute_test:476
12:00:04 INFO - -e:null:1
12:00:04 INFO - null:null:0
12:00:04 INFO - exiting test
12:00:04 INFO - <<<<<<<
Flags: needinfo?(mhammond)
A fairly trivial test-only fix.

This test is supposed to see a DNS error but instead is seeing a timeout.  I speculate that this could happen due to the DNS request taking longer than ~2s to fail, causing the symptoms we see in the failure.  The solution is (hopefully!) to increase the geoip timeout for this test.

Gavin, note that I also changed the server name - the test originally had:

  // from server-locations.txt, we choose a URL without a cert.
  let url = "https://nocert.example.com:443";

with the intention being that the test would see a certificate error.  It turns out that the locations in server-locations.txt are *not* running for xpcshell tests, so what was actually happening is a DNS failure (the test infra doesn't complain about external DNS request, only about external socket connections).  

So I changed it to:

  // We configure for a server that's going to fail a DNS lookup.
  let url = "https://xpcshell-test-does-not-exist.mozilla.org";

which I think it somewhat clearer and doesn't change the semantics.  Note I've manually verified a certificate error *does* cause failure in the way we expect, but I haven't yet worked out how to sanely have an xpcshell test connect to a https server (at all, let alone one with a bad cert - httpd.js isn't any help with https)
Assignee: nobody → mhammond
Status: NEW → ASSIGNED
Flags: needinfo?(mhammond)
Attachment #8545703 - Flags: review?(gavin.sharp)
(In reply to Mark Hammond [:markh] from comment #3)
> (the test infra doesn't complain about external DNS request, only
> about external socket connections).  

I wonder how often this burns us in other tests.
Flags: needinfo?(nfroyd)
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #4)
> (In reply to Mark Hammond [:markh] from comment #3)
> > (the test infra doesn't complain about external DNS request, only
> > about external socket connections).  
> 
> I wonder how often this burns us in other tests.

No idea; then again, it's not clear how much win we really got out of disabling non-local network connections in the first place.

Bug 995417 comment 4 initially discouraged blocking things at the DNS level, maybe it's worth revisiting that assumption.
Flags: needinfo?(nfroyd)
Blocks: 1119246
Filed bug 1119246 for discussing DNS blocking.
Comment on attachment 8545703 [details] [diff] [review]
0001-Bug-1117979-fix-orange-by-increasing-the-geoip-timeo.patch

This seems like a fine bandaid, but doesn't really help if the DNS request ends up taking 21 seconds.

Wouldn't an invalid URI be a more reliable way to test the same error case?
Attachment #8545703 - Flags: review?(gavin.sharp)
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #7)
> Comment on attachment 8545703 [details] [diff] [review]
> 0001-Bug-1117979-fix-orange-by-increasing-the-geoip-timeo.patch
> 
> This seems like a fine bandaid, but doesn't really help if the DNS request
> ends up taking 21 seconds.
> 
> Wouldn't an invalid URI be a more reliable way to test the same error case?

Fair enough - I felt it was nice to actually hit the network stack during this test, but I agree avoiding this can-o-worms is more worthwhile.
Attachment #8545703 - Attachment is obsolete: true
Attachment #8546316 - Flags: review?(gavin.sharp)
Attachment #8546316 - Flags: review?(gavin.sharp) → review+
Comment on attachment 8546316 [details] [diff] [review]
0001-Bug-1117979-fix-orange-by-not-relying-on-DNS-lookup-.patch

Review of attachment 8546316 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/search/tests/xpcshell/test_location_error.js
@@ +20,5 @@
>      removeCacheFile();
>    });
>  
> +  // An unknown URL scheme will cause our error handler to be hit.
> +  let url = "unknown-scheme://something";

So this line is causing an exception in the XHR code rather than onerror().  Gavin, do you have other ideas on how to cause onerror() without relying on dns failure or other parts of the network stack?
Attachment #8546316 - Flags: feedback?(gavin.sharp)
I suppose you could avoid blocking on DNS by making a request to 127.0.0.1?
Attachment #8546316 - Flags: feedback?(gavin.sharp)
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #13)
> I suppose you could avoid blocking on DNS by making a request to 127.0.0.1?

The only problem I have with that is that due to xpcshell tests running concurrently, there's no way I know of to ensure the port I choose isn't going to actually be in-use via httpd.js being used by another test (or indeed, by anything else on the machine).

But I wonder if I can use httpd.js (which selects an unused port) and return an invalid content-length or something...
Flags: needinfo?(mhammond)
Flags: firefox-backlog+
Iteration: --- → 38.1 - 26 Jan
Flags: qe-verify?
Hi Mark, can you provide a point value.
Flags: needinfo?(mhammond)
Points: --- → 2
Flags: needinfo?(mhammond)
I thought I'd already uploaded this :(

The change is:

-  // this will cause an "unknown host" error, but not report an external
-  // network connection in the tests (note that the hosts listed in
-  // server-locations.txt are *not* loaded for xpcshell tests...)
-  let url = "https://nocert.example.com:443";
+  // using a port > 2^32 causes an error to be reported.
+  let url = "http://localhost:111111111";
+

which causes our onerror handler to be called.
Attachment #8546316 - Attachment is obsolete: true
Attachment #8552781 - Flags: review?(gavin.sharp)
Attachment #8552781 - Flags: review?(gavin.sharp) → review+
https://hg.mozilla.org/mozilla-central/rev/1b46ee069096
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 38
Flags: qe-verify? → qe-verify-
Mark, this one has been continuing to happen fairly consistently. Do you think it's worth re-opening the bug for further investigation?
Flags: needinfo?(markh)
(In reply to Nigel Babu [:nigelb] from comment #77)
> Mark, this one has been continuing to happen fairly consistently. Do you
> think it's worth re-opening the bug for further investigation?

Yeah, it's pretty difficult to argue the bug truly is fixed :(
Status: RESOLVED → REOPENED
Flags: needinfo?(markh)
Resolution: FIXED → ---
Florian, do you think you could have a look at this?
Assignee: markh → nobody
Flags: needinfo?(florian)
I'm not sure if what this test still needs is a domain name that will fail locally - but if so, you should be able to use anything now that ends in .onion.. (facebook.onion e.g. :))

thanks to 1228457
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
I see no OrangeFactor Robot comment here, so I assume this means the failure hasn't occurred in a long while.
Status: REOPENED → RESOLVED
Closed: 5 years ago3 years ago
Flags: needinfo?(florian)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.