Intermittent dom/security/test/csp/test_upgrade_insecure.html | upgrade all requests on toplevel http where default-src is not specified (result: websocket-unexpected-error) - didn't expect -1, but got it

RESOLVED INVALID

Status

()

defect
P3
normal
RESOLVED INVALID
3 years ago
8 months ago

People

(Reporter: aryx, Unassigned)

Tracking

({intermittent-failure})

unspecified
Points:
---

Firefox Tracking Flags

(firefox52 disabled, firefox-esr52 disabled, firefox53 disabled, firefox54 disabled)

Details

(Whiteboard: [domsecurity-active][stockwell fixed])

Attachments

(2 attachments, 3 obsolete attachments)

Two times in a row on OS X 10.10 opt:

https://treeherder.mozilla.org/logviewer.html#?job_id=6385219&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=6382811&repo=autoland

04:04:31     INFO - TEST-PASS | dom/security/test/csp/test_upgrade_insecure.html | upgrade all requests on toplevel https (result: img-redir-ok) 
04:04:31     INFO - TEST-PASS | dom/security/test/csp/test_upgrade_insecure.html | upgrade all requests on toplevel https (result: iframe-ok) 
04:04:31     INFO - TEST-PASS | dom/security/test/csp/test_upgrade_insecure.html | upgrade all requests on toplevel https (result: nested-img-ok) 
04:04:31     INFO - Buffered messages finished
04:04:31     INFO - TEST-UNEXPECTED-FAIL | dom/security/test/csp/test_upgrade_insecure.html | upgrade all requests on toplevel http (result: websocket-unexpected-error) - didn't expect -1, but got it
04:04:31     INFO -     SimpleTest.isnot@SimpleTest/SimpleTest.js:282:5
04:04:31     INFO -     checkResults@dom/security/test/csp/test_upgrade_insecure.html:129:3
04:04:31     INFO -     receiveMessage@dom/security/test/csp/test_upgrade_insecure.html:152:3
04:04:31     INFO -     EventListener.handleEvent*@dom/security/test/csp/test_upgrade_insecure.html:150:1
Duplicate of this bug: 1317135
https://treeherder.mozilla.org/logviewer.html#?job_id=39129066&repo=mozilla-inbound

Putting in P1 ("this quarter") because it could be a regression we'll want to squelch before more is built on top of it. 

Kate: can you take a quick look and see if this is fallout of the HSTS pinning stuff you added?
Flags: needinfo?(kmckinley)
Priority: -- → P1
Whiteboard: [domsecurity-active]
I don't believe it is related to HSTS priming.

* upgrade-insecure-requests takes precedence over HSTS Priming, and confirmed running in non-e10s that no priming requests are sent, and the problem persists
* priming doesn't touch ws/wss URIs
* forcing priming to off makes no difference
* looking at the javascript console and network panel, I see the connection succeeding securely, but onerror handler is invoked
* attempting to use the javascript console to create a new WebSocket generates the same issue - the request is made to the endpoint, and connects, but then returns an error saying it doesn't connect.

The timing is really close to when the reduced timeouts, landed, so it's hard to say 100% that it doesn't relate, but I backed out that reduced timeout patch and still have WebSocket test fails on OSX for this and other WebSocket tests, such as dom/base/test/test_websocket_basic.html.
Flags: needinfo?(kmckinley)
Priority: P1 → P3
Whiteboard: [domsecurity-active] → [domsecurity-backlog1]
this is a very consistent failure on osx 10.10 opt, I am not sure there is more information I can provide- some weeks this crosses our 30 failures/week threshold (to make it a priority intermittent to focus on), other weeks it dips right below that number.

:ckerschb, can you find someone to look into this, ideally this quarter.  Also if there is anything I can do to help get more information or narrow this down, please let me know!
Flags: needinfo?(ckerschb)
I was able to reproduce that problem locally the other day. I'll have a look. Thanks for bringing such things to my attention.
Assignee: nobody → ckerschb
Status: NEW → ASSIGNED
Flags: needinfo?(ckerschb)
Priority: P3 → P1
Whiteboard: [domsecurity-backlog1] → [domsecurity-active]
interesting, bug 1183300 seems to be failing in the same pattern, I suspect both are actually the same bug as it is the same test case :)
Obviously I can't reproduce that error locally anymore :-(

Baku, any idea what might go wrong here? Let me provide some background: The test tries to create a websocket using a scheme of ws:// [0] which should then be upgraded to wss:// using CSPs upgrade-insecure-requests directive. Interestingly the test fails because it receives 'websocket-unexpected-error' which means socket.onopen() is not called but rather websocket.onerror() [1]. At the moment I can't tell why it's failing; could it be that there is something wrong with the .py file [2]?

Alternatively, should we add some logging information to onerror? If so, what would be useful to log in your opinion? Something like:
> dump("  xxx mySocket.onerror: (e.message): " + e.message + "\n");
> dump("  xxx mySocket.onerror: (mySocket.url): " + mySocket.url + "\n");

[0] https://dxr.mozilla.org/mozilla-central/source/dom/security/test/csp/file_upgrade_insecure.html#52
[1] https://dxr.mozilla.org/mozilla-central/source/dom/security/test/csp/file_upgrade_insecure.html#62
[2] https://dxr.mozilla.org/mozilla-central/source/dom/security/test/csp/file_upgrade_insecure_wsh.py#1
Flags: needinfo?(amarchesini)
Other websocket tests use |ws.close();| I have no idea whether it's important to call .close() or not, but probably worth a try.
following up here, this is a very frequent failure- would it be possible to make some progress on this bug this week?
> Alternatively, should we add some logging information to onerror? If so,
> what would be useful to log in your opinion? Something like:
> > dump("  xxx mySocket.onerror: (e.message): " + e.message + "\n");
> > dump("  xxx mySocket.onerror: (mySocket.url): " + mySocket.url + "\n");

No idea why we have this failure. But I would add some debugging information. If it's so easy to reproduce, we can just push a patch to treeherder and fix the issue when we have the first results.
Flags: needinfo?(amarchesini)
it sounds like there is work being done here, I will follow up next week- Given the frequency, I imagine this is easy to reproduce- can we fix this bug this week?  If not, I would recommend disabling this test until we have more resources to work on this bug.
Flags: needinfo?(amarchesini)
going 1.5 weeks out again, lets disable this for now on osx&&!debug until there is time to hack on this.
Posted patch disable test for osx opt. (obsolete) — Splinter Review
Attachment #8840957 - Flags: review?(gbrown)
Attachment #8840957 - Flags: review?(gbrown) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a46ca34a6bde
Intermittent dom/security/test/csp/test_upgrade_insecure.html. disabled on osx. r=gbrown
Whiteboard: [domsecurity-active] → [domsecurity-active][stockwell disabled]
:baku anything to add here?  I see the ni and would like to clear it but thought I would check in.
(In reply to Joel Maher ( :jmaher) from comment #36)
> :baku anything to add here?  I see the ni and would like to clear it but
> thought I would check in.

Baku actually provided his thoughts within in comment 23. Every day I am trying to reproduce locally - and the day will come. I should actually check in some logging in case the test fails. Probably that's the best way forward. I'll do that this week.
Flags: needinfo?(amarchesini)
Thanks Christoph, hopefully more logging will help out here.
Joel, it's cumbersome debugging that test. I tried locally many times but it didn't go anywhere. I suppose we should add that debugging information which hopefully helps us figuring out what's going on. Agreed?
Attachment #8840957 - Attachment is obsolete: true
Attachment #8846554 - Flags: review?(jmaher)
Comment on attachment 8846554 [details] [diff] [review]
bug_1316305_debug_upgrade_insecure_requests.patch

Review of attachment 8846554 [details] [diff] [review]:
-----------------------------------------------------------------

this is great, will we need to enable the test again though by reverting:
https://bug1316305.bmoattachments.org/attachment.cgi?id=8840957

happy to review when the test is enabled- or correct me if I am wrong.
Attachment #8846554 - Flags: review?(jmaher) → review-
(In reply to Joel Maher ( :jmaher) from comment #40)
> this is great, will we need to enable the test again though by reverting:
> https://bug1316305.bmoattachments.org/attachment.cgi?id=8840957
> 
> happy to review when the test is enabled- or correct me if I am wrong.

Oh, I thought the test is only disabled on mac. We are still facing the issue on Linux as well, right? It's up to you. If you want me to I am happy to revert Bug 8840957. As a first attempt we could also leave it disabled on mac and see where this path leads. Let me know what you prefer!
Flags: needinfo?(jmaher)
(In reply to Christoph Kerschbaumer [:ckerschb] from comment #41)
> (In reply to Joel Maher ( :jmaher) from comment #40)
> > this is great, will we need to enable the test again though by reverting:
> > https://bug1316305.bmoattachments.org/attachment.cgi?id=8840957
> > 
> > happy to review when the test is enabled- or correct me if I am wrong.
> 
> Oh, I thought the test is only disabled on mac. We are still facing the
> issue on Linux as well, right? It's up to you. If you want me to I am happy
> to revert Bug 8840957. As a first attempt we could also leave it disabled on
> mac and see where this path leads. Let me know what you prefer!

Oh, I see lately it only happenend on mac according to comment 34. Well, in that case, let's revert it.
Flags: needinfo?(jmaher)
the ended up as a osx opt issue only, you can see we don't have anymore instances after disabling on osx:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1316305&startday=2017-02-20&endday=2017-03-13&tree=all

I am happy to turn this on, or push to try with what you have and do 50 retriggers.
Joel, sorry for the confusion. You are right - we need to enable the test again. Here we go - thanks!
Attachment #8846554 - Attachment is obsolete: true
Attachment #8846573 - Flags: review?(jmaher)
Comment on attachment 8846573 [details] [diff] [review]
bug_1316305_debug_upgrade_insecure_requests.patch

Review of attachment 8846573 [details] [diff] [review]:
-----------------------------------------------------------------

looking forward to getting more debugging information
Attachment #8846573 - Flags: review?(jmaher) → review+
Someone please land the patch but leave the bug open - we are just adding debug information - thanks!
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/61774c8f5fd3
Add debug information for test_upgrade_insecure_requests. r=jmaher
Keywords: checkin-needed
Baku, it seems that the logging does not provide as much details as I would have hoped. The only thing it tells us is that websocket.onerror is called and that the websocket was not upgraded from ws to wss, which theoretically means that this code [1] potentially did not even run, otherwise the websocket.url in the onerror debug statement should print wss instead of ws.

Potentially our test environment gets confused if too many websockets are around. One idea, as you suggested on IRC is to explicitly close the websocket. I suppose that is how it's done right?

Is there any other debug information which we could print which might be helpful?

[1] https://hg.mozilla.org/mozilla-central/annotate/1b9293be51637f841275541d8991314ca56561a5/dom/base/WebSocket.cpp#l1599
Attachment #8849079 - Flags: review?(amarchesini)
Comment on attachment 8849079 [details] [diff] [review]
bug_1316305_debug_uir_websockets.patch

Review of attachment 8849079 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry, this is not what I meant. My idea is that you have to call close() when the test is finished. if you receive onclose() the ws has been already disconnected.

::: dom/security/test/csp/file_upgrade_insecure.html
@@ +66,5 @@
>        dump("  xxx mySocket.onerror: (e.message): " + e.message + "\n");
>        window.parent.postMessage({result: "websocket-unexpected-error"}, "*");
>      };
> +    mySocket.onclose = function(e) {
> +      mySocket.close();

move this to line 59.

::: dom/security/test/csp/file_upgrade_insecure_meta.html
@@ +62,5 @@
>      mySocket.onerror = function(e) {
>        window.parent.postMessage({result: "websocket-unexpected-error"}, "*");
>      };
> +    mySocket.onclose = function(e) {
> +      mySocket.close();

this to line 61.
Attachment #8849079 - Flags: review?(amarchesini) → review-
(In reply to Andrea Marchesini [:baku] from comment #51)
> Sorry, this is not what I meant. My idea is that you have to call close()
> when the test is finished. if you receive onclose() the ws has been already
> disconnected.

Makes sense now. I just browsed through DXR and did what other tests did and they did it that way, hence the confusion. Thanks!
Attachment #8849079 - Attachment is obsolete: true
Attachment #8849104 - Flags: review?(amarchesini)
See Also: → 1347773
Attachment #8849104 - Flags: review?(amarchesini) → review+
Pushed by mozilla@christophkerschbaumer.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/85e011992bf8
Explicilty call .close() for websocket in test. r=baku
I had marked this as disabled before, I see this patch landed for cleaning up the state here- I assume this is fixed and we can remove the leave-open?
Whiteboard: [domsecurity-active][stockwell disabled] → [domsecurity-active][stockwell fixed]
(In reply to Joel Maher ( :jmaher) from comment #54)
> I had marked this as disabled before, I see this patch landed for cleaning
> up the state here- I assume this is fixed and we can remove the leave-open?

We still don't know what causes the intermittent. We are kind of guessing and see what works :-( Apparently it's still not fixed. I suggest we should keep the 'leave-open' for now.

According to comment 34, 49 and finally comment 56, the error rate went slightly down - not sure if our code improvements caused that or if it's just coincidence.
The test has been disabled on a few variations of build for now, see Bug 1183300. It's quite hard to reproduce locally. Putting this in the backlog for now :-(
Assignee: ckerschb → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
The leave-open keyword is there and there is no activity for 6 months.
:ckerschb, maybe it's time to close this bug?
Flags: needinfo?(ckerschb)
(In reply to Release mgmt bot [:sylvestre / :calixte] from comment #62)
> The leave-open keyword is there and there is no activity for 6 months.
> :ckerschb, maybe it's time to close this bug?

Yes, I guess it makes sense to mark this one as resolved given that the test has been disabled within Bug 1183300.
Flags: needinfo?(ckerschb)
Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.