Closed Bug 1798642 Opened 2 years ago Closed 2 years ago

High frequency OSX websocket related failures

Categories

(Core :: Networking: WebSockets, defect, P3)

defect

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox-esr102 --- unaffected
firefox106 --- unaffected
firefox107 --- unaffected
firefox108 --- wontfix

People

(Reporter: imoraru, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [fixed by bug 1797816 comment 15][necko-triaged][necko-monitor])

We started to notice that there are frequent multiple failures related to websocket.
Here are some examples:

See Also: → 1793376
See Also: → 1776026
See Also: → 1777090
See Also: → 1777028
Summary: High frequency websocket related failures → High frequency OS X websocket related failures
Summary: High frequency OS X websocket related failures → High frequency OSX websocket related failures

Hi Frederik! Can you please take a look in to this?
Thank you!

Flags: needinfo?(fbraun)
Component: DOM: Security → Networking: WebSockets
Flags: needinfo?(fbraun) → needinfo?(valentin.gosu)
See Also: → 1784824
See Also: → 1793439

I tried to debug this a bit by running websocket tests with log enabled in this try run.
The log shows that after the ssl tunnel established, the websocket server returns some weird data.

[task 2022-11-03T18:58:31.332Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable 128fa0300 ReadSegments returned [rv=0 read=509 sock-cond=0 again=1]
[task 2022-11-03T18:58:31.332Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp   writing transaction request stream
[task 2022-11-03T18:58:31.332Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpTransaction::ReadSegments 13a681900
[task 2022-11-03T18:58:31.333Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable 128fa0300 ReadSegments returned [rv=0 read=0 sock-cond=0 again=1]
[task 2022-11-03T18:58:31.333Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=13a681900 status=804b000a progress=0]
[task 2022-11-03T18:58:31.334Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::ResumeRecv [this=128fa0300]
[task 2022-11-03T18:58:31.334Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Main Thread]: D/nsWebSocket WebSocketChannel::GetInterface() 128f5d500
[task 2022-11-03T18:58:31.335Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable [this=128fa0300]
[task 2022-11-03T18:58:31.335Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpTransaction::WriteSegments 13a681900
[task 2022-11-03T18:58:31.336Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnectionMgr::ShouldThrottle trans=13a681900
[task 2022-11-03T18:58:31.336Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Main Thread]: D/nsWebSocket WebSocketChannelParent::GetInterface() 128fedd60
[task 2022-11-03T18:58:31.337Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable 128fa0300 trans->ws rv=0 n=0 socketin=80470002
[task 2022-11-03T18:58:31.337Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnection::CloseTransaction[this=128fa0300 trans=13a681900 reason=80470002]
[task 2022-11-03T18:58:31.338Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp   closing associated mTransaction
[task 2022-11-03T18:58:31.338Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpTransaction::Close [this=13a681900 reason=0]
[task 2022-11-03T18:58:31.339Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpConnectionMgr::RemoveActiveTransaction t=13a681900 tabid=d(1) thr=0
[task 2022-11-03T18:58:31.339Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp Active transactions -[0,0,0,0]
[task 2022-11-03T18:58:31.340Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpTransaction::ParseHead [count=1]
[task 2022-11-03T18:58:31.340Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpResponseHead::ParseVersion [version=]
[task 2022-11-03T18:58:31.340Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp looks like a HTTP/0.9 response
[task 2022-11-03T18:58:31.341Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp response status line needs default reason phrase
[task 2022-11-03T18:58:31.341Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: E/nsHttp Have status line [version=9 status=200 statusText=OK]
[task 2022-11-03T18:58:31.342Z] 18:58:31     INFO - GECKO(31929) | [Parent 31929: Socket Thread]: V/nsHttp nsHttpTransaction::Close 13a681900 0 Byte 0.9 Response

The normal case should be that we receive 101 response from the server as below:

 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable 141e4cd00 ReadSegments returned [rv=0 read=509 sock-cond=0 again=1]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp   writing transaction request stream
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpTransaction::ReadSegments 1387abb00
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketWritable 141e4cd00 ReadSegments returned [rv=0 read=0 sock-cond=0 again=1]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=1387abb00 status=804b000a progress=0]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpConnection::ResumeRecv [this=141e4cd00]
 0:08.04 GECKO(40178) [Parent 40178: Main Thread]: D/nsWebSocket WebSocketChannel::GetInterface() 141d1ca00
 0:08.04 GECKO(40178) [Parent 40178: Main Thread]: D/nsWebSocket WebSocketChannelParent::GetInterface() 138fe9d00
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpConnection::OnSocketReadable [this=141e4cd00]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpTransaction::WriteSegments 1387abb00
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpConnectionMgr::ShouldThrottle trans=1387abb00
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: E/nsHttp nsHttpTransaction::OnSocketStatus [this=1387abb00 status=804b0006 progress=225]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpTransaction::WritePipeSegment 1387abb00 written=175
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: E/nsHttp nsHttpTransaction::ProcessData [this=1387abb00 count=175]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpTransaction::ParseHead [count=175]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: E/nsHttp nsHttpTransaction::ParseLine [HTTP/1.1 101 Switching Protocols]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: V/nsHttp nsHttpResponseHead::ParseVersion [version=HTTP/1.1 101 Switching Protocols]
 0:08.04 GECKO(40178) [Parent 40178: Socket Thread]: E/nsHttp Have status line [version=11 status=101 statusText=Switching Protocols]

Note that in both cases, Firefox received 509 bytes, but in the failed case, the data was obviously wrong.
Still not sure which part is wrong.

Looking at Bug 1784824 - Intermittent toolkit/content/tests/chrome/test_about_networking.html | single tracking bug which seems to also be failing because of the websocket server,

I see this in the log:

[task 2022-07-20T08:57:35.143Z] 08:57:35     INFO - TEST-START | toolkit/content/tests/chrome/test_about_networking.html
[task 2022-07-20T08:57:36.215Z] 08:57:36     INFO -   could not open connection to the real server
[task 2022-07-20T08:58:00.178Z] 08:58:00     INFO - GECKO(1983) | 2022-07-20 08:58:00.177 firefox[1983:28845] Persistent UI failed to open file file:///Users/cltbld/Library/Saved%20Application%20State/org.mozilla.nightly.savedState/window_1.data: No such file or directory (2)
[task 2022-07-20T09:02:36.197Z] 09:02:36     INFO - TEST-INFO | started process screencapture
[task 2022-07-20T09:02:36.326Z] 09:02:36     INFO - TEST-INFO | screencapture: exit 0
[task 2022-07-20T09:02:36.326Z] 09:02:36     INFO - TEST-UNEXPECTED-FAIL | toolkit/content/tests/chrome/test_about_networking.html | Test timed out. - 

That string is coming from here
I assume that just the websocket server isn't up, and ssltunnel is returning an error?

I'll take a look if we can check exactly what the WSserver is doing.

See Also: → 1777030
See Also: → 1781672
See Also: → 1784995
See Also: → 1777102
See Also: → 1767662
See Also: → 1769016

The issue has also been observed with test tasks which got added to pushes from October and failed for these. Either the machine environment or an artifact used for test execution changed.

Only one failure in the last ~12 hours. Resolved by fixing the reboots in bug 1797816 comment 15.

Please close the bug if no further work is planned.

Thanks! Glad to see this resolved. Most of the bugs in the see also list also seem to show reduced failure rates.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(valentin.gosu)
Resolution: --- → WORKSFORME
Whiteboard: [stockwell disable-recommended] → [fixed by bug 1797816 comment 15]
See Also: 1769016

We are still seeing a lot of failures on OSX websocket tests. Could you take a look?
Should we reopen bug 1797816 or is this a different problem?

Status: RESOLVED → REOPENED
Flags: needinfo?(aerickson)
Resolution: WORKSFORME → ---
See Also: → 1801420

The latest intermittent failures robot posts are from 3 months ago. Is it still a problem?

The configuration issue has been resolved (unless there are workers that haven't been updated - which could happen).

Michelle, as an FYI. In Bug 1797816 we had a g-w config issue that caused the hosts to not exit and reboot after running one job (they're run forever). Not sure if r8 009 is still quarantined, but may want to note in inventory.

Flags: needinfo?(aerickson) → needinfo?(mgoossens)

We're still seeing some websocket tests intermittently failing only on OSX:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1777102&startday=2023-02-13&endday=2023-02-19&tree=all
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1801420&startday=2023-02-13&endday=2023-02-19&tree=all

No failures since the 16th as far as I can tell. Will ping back next week if we see any more happening.

-009 isn't quarantined anymore and looking at recent jobs it is working properly
Let's hope we get no more failures.

Flags: needinfo?(mgoossens)
Whiteboard: [fixed by bug 1797816 comment 15][stockwell disable-recommended] → [fixed by bug 1797816 comment 15]
Severity: -- → S3
Priority: -- → P3
Whiteboard: [fixed by bug 1797816 comment 15] → [fixed by bug 1797816 comment 15][necko-triaged][necko-monitor]
Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.