WebSockets on a WebWorker silently drops packets and receives corrupt data

RESOLVED DUPLICATE of bug 1320744

Status

Product: Core
Component: Networking: WebSockets
Reported: a year ago
Modified: 8 months ago

People

(Reporter: Malcolm Stagg, Assigned: michal)

Tracking

Version: 53 Branch
Points: ---

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [necko-active])

Attachments

(2 attachments)

(Reporter)

Description

a year ago
Created attachment 8828608 [details]
webworkers_lost_data.zip

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393

Steps to reproduce:

Tested on 50.1.0 release and 53.0a1 nightly running on Windows 10.

1) Install node.js on the local machine
2) Extract the attached zip file and run "node wsserver.js"
3) Launch Firefox and open http://localhost:12345/wsclient.html
4) Observe the console log messages
5) Open http://localhost:12345/wsclient_worker.html
6) Observe console log messages


Actual results:

In http://localhost:12345/wsclient.html:

Connected to localhost:12345
Received 10000 packets for a total of 10000000 bytes
Connection closed

In http://localhost:12345/wsclient_worker.html:

Connected to localhost:12345
Byte 411000: Expected 120, got 24
Received 412 packets for a total of 411000 bytes
Connection closed

(packet offset of the data corruption changes randomly)


Expected results:

Both wsclient.html (without WebWorkers) and wsclient_worker.html (with WebWorkers) should have received 10000 packets and 10000000 bytes.

This appears to repro most often when a lot of data is being sent quickly. You can adjust the configuration in wsserver.js to see the effects of a larger packet delay, etc. With the current wsserver.js configuration, this repros every time in my testing.
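The attached test pages are not shown inline, but the reported failure ("Byte 411000: Expected 120, got 24") is consistent with a client that checks every received byte against a known repeating pattern. A minimal sketch of such a check — the packet size, pattern, and function names are illustrative assumptions, not the attachment's actual code:

```javascript
// Hypothetical sketch of the byte-sequence check a client like
// wsclient_worker.html could perform. The attached zip is not shown, so
// PACKET_LENGTH, the pattern, and these names are assumptions.
// 10000 packets * 1000 bytes = the 10000000 bytes reported in the log.
const PACKET_LENGTH = 1000;

// Build the packet at a given index: each byte is its global stream offset
// modulo 256, so any dropped or corrupted frame shifts the pattern visibly.
function makePacket(packetIndex) {
  const bytes = new Uint8Array(PACKET_LENGTH);
  for (let i = 0; i < PACKET_LENGTH; i++) {
    bytes[i] = (packetIndex * PACKET_LENGTH + i) % 256;
  }
  return bytes;
}

// Return the global offset of the first mismatching byte in a received
// packet, or -1 if the packet matches the expected pattern.
function firstMismatch(buffer, packetIndex) {
  const bytes = new Uint8Array(buffer);
  for (let i = 0; i < bytes.length; i++) {
    const offset = packetIndex * PACKET_LENGTH + i;
    if (bytes[i] !== offset % 256) return offset;
  }
  return -1;
}
```

Under this assumed pattern, the expected value at global byte 411000 is 411000 % 256 = 120, matching the "Expected 120" in the log above; receiving 24 instead would mean the bytes came from a different position in the stream.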

Updated

a year ago
Component: Untriaged → Networking: WebSockets
Product: Firefox → Core
Comment 1

a year ago
Hmm, this is probably in the DOM queueing somewhere, but Michal, do you want to see if anything looks amiss and route appropriately?
Assignee: nobody → michal.novotny
Whiteboard: [necko-active]

Comment 2

10 months ago
@Michal, did you get a chance to triage this bug? This affects us in real traffic as well.
Flags: needinfo?(mcmanus)
Flags: needinfo?(mcmanus) → needinfo?(michal.novotny)
(Assignee)

Comment 3

9 months ago
I can't reproduce it. I also tried increasing packetLength to 10000, but it always succeeds on my computer.
Flags: needinfo?(michal.novotny)
(Reporter)

Comment 4

9 months ago
@Michal, thanks for looking at this. If it helps at all I recently tried this on Ubuntu and was also unable to reproduce. We have only reproduced this with Firefox on Windows in our testing. Is that what you tried?
(Assignee)

Comment 5

9 months ago
I tested it on Fedora. Unfortunately, I don't have a Windows development machine.
(Reporter)

Comment 6

9 months ago
@Michal, FYI I just tried the latest Firefox nightly build on Windows and this bug appears to be fixed. I'm still hitting it on the latest release build, though.
(Reporter)

Comment 7

9 months ago
Quick update: it looks like it is also fixed in 54.0b13, which should ship next week. Thanks, Michal and all.

Comment 8

9 months ago
@Malcolm, were you able to find the patch that fixed this issue?
(Reporter)

Comment 9

9 months ago
@Antony, no. I took a bit of a look but haven't found any changes that are clearly related.
(Reporter)

Comment 10

8 months ago
This is still broken in the Firefox 54.0 release build running on Windows 10, unfortunately. The 55.0b1 beta is working for us though.
(Reporter)

Comment 11

8 months ago
@Michal, or anyone else familiar with the Firefox release process:

Strangely enough, betas installed from here: 
https://download-installer.cdn.mozilla.net/pub/firefox/releases/54.0b1/win32/en-US/Firefox%20Setup%20Stub%2054.0b1.exe
do *not* repro. 

Betas installed from here:
https://ftp.mozilla.org/pub/firefox/releases/54.0b1/win32/en-US/Firefox%20Setup%2054.0b1.exe
*do* repro.

Release builds installed from either download-installer.cdn.mozilla.net or ftp.mozilla.org/pub/firefox/releases *do* repro.

This is true going back to 50.0 or earlier. 

Would this give any hint about where the problem lies? Thanks!
(Reporter)

Comment 12

8 months ago
Ah, my apologies for the incorrect information in the last message. I see that Firefox Setup Stub *.exe always updates to 55.0b1 automatically, so I was not comparing the same versions.

Now, 54.0b13 repros this bug when installed from either 
https://download-installer.cdn.mozilla.net/pub/firefox/releases/54.0b13/win32/en-US/Firefox%20Setup%2054.0b13.exe (no "Stub")
or
https://ftp.mozilla.org/pub/firefox/releases/54.0b13/win32/en-US/Firefox%20Setup%2054.0b13.exe

However, I had installed 54.0b13 over a week ago from Firefox%20Setup%20Stub%2054.0b13.exe, when it was the newest version, without hitting the bug.

FYI, we are not getting a repro of the bug with 55.0b1, installed from any of the sources.

Comment 13

8 months ago
@Malcolm, is it possible to run the mozregression tool to narrow down which patch fixed this issue?
See http://mozilla.github.io/mozregression/ for details. Thanks!
Flags: needinfo?(mstagg)
(Reporter)

Comment 14

8 months ago
Hi Shian-Yow,

Thanks for the link! I wasn't aware of that tool - that is really useful. I ran it twice to make sure, with the same result:

2017-06-15T10:59:46: INFO : Narrowed inbound regression window from [a8bb5c8e, bf0f8972] (3 revisions) to [5182b2c4, bf0f8972] (2 revisions) (~1 steps left)
2017-06-15T10:59:46: DEBUG : Starting merge handling...
2017-06-15T10:59:46: DEBUG : Using url: https://hg.mozilla.org/mozilla-central/json-pushes?changeset=bf0f897261395ed2488f8fef72cf5353da1e105e&full=1
2017-06-15T10:59:47: DEBUG : Found commit message:
Bug 1350814 - Replace use of int64_t for microseconds by TimeUnit in AudioSink. r=kikuo

MozReview-Commit-ID: 3diOpJu7g8i

2017-06-15T10:59:47: INFO : The bisection is done.
2017-06-15T10:59:47: INFO : Stopped

From looking over that changeset ( https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a8bb5c8e0e80e87e21ad63997f8881991b5347cf&tochange=bf0f897261395ed2488f8fef72cf5353da1e105e ), it looks very possible it was fixed by this commit:

Shih-Chiang Chien <schien@mozilla.com>
Mon, 20 Mar 2017 17:29:07 +0800 (2 months ago)

Bug 1320744 - Part 2, Allow ChannelEventQueue to perform flush on multiple threads. r=mayhemer

MozReview-Commit-ID: Egu2mvwFTUF 

Thanks for your help,

Malcolm

Comment 15

8 months ago
Thanks for this info, Malcolm! 

Normally mozregression can keep bisecting to find the exact commit. Your bisection seems to have stopped at the merge commit in the mozilla-central tree and didn't bisect further into the mozilla-inbound tree. I tried the following commands on my Ubuntu machine, manually passing the last 2 revisions from your result, and it kept bisecting deeper into the mozilla-inbound tree. Maybe you can check whether it works on Windows.

$ mozregression --find-fix -b 5182b2c4 -g bf0f8972
**********
You should use a config file. Please use the --write-config command line flag to help you create one.
**********

 0:00.39 WARNING: You are using mozregression version 2.3.12, however version 2.3.13 is available.
 0:00.39 WARNING: You should consider upgrading via the 'pip install --upgrade mozregression' command.
 0:00.39 INFO: 5182b2c4 is not a release, assuming it's a hash...
 0:00.39 INFO: bf0f8972 is not a release, assuming it's a hash...
 0:00.39 INFO: Getting mozilla-inbound builds between bf0f8972 and 5182b2c4
 0:04.90 INFO: Testing good and bad builds to ensure that they are really good and bad...
 0:04.90 INFO: Downloading build from: https://queue.taskcluster.net/v1/task/P-2HG-lyTASX_gUECkIOwQ/runs/0/artifacts/public%2Fbuild%2Ftarget.tar.bz2
===== Downloaded 100% =====
 0:29.11 INFO: Running mozilla-inbound build built on 2017-03-28 11:36:30.966000, revision 490b80ca
 0:41.16 INFO: Launching /tmp/tmp6s74iY/firefox/firefox
 0:41.16 INFO: Application command: /tmp/tmp6s74iY/firefox/firefox -profile /tmp/tmpcgvrCm.mozrunner
 0:41.17 INFO: application_buildid: 20170328110610
 0:41.17 INFO: application_changeset: 490b80ca23d23f750fdf5517651d6c1759b8afbc
 0:41.17 INFO: application_name: Firefox
 0:41.17 INFO: application_repository: https://hg.mozilla.org/integration/mozilla-inbound
 0:41.17 INFO: application_version: 55.0a1

Comment 16

8 months ago
SC, according to comment 14, this bug was possibly fixed by your patch in bug 1320744. Do you think it's relevant?
Flags: needinfo?(schien)
Comment 17

8 months ago
It is possible that my modification to ChannelEventQueue changed the behavior of WebSocket and accidentally fixed this bug. However, it would be better to bisect the inbound revisions further as confirmation.

Malcolm, can you do one more run of verification following comment #15, since you have the test environment set up nicely? :)
Flags: needinfo?(schien)
(Reporter)

Comment 18

8 months ago
Hi Shian-Yow and Shih-Chiang,

Thanks for your help!

I ran the tool with those options, and the bisection into mozilla-inbound resulted in:

 2:59.83 INFO: Narrowed inbound regression window from [4fd1e9b7, 490b80ca] (3 revisions) to [13eb5319, 490b80ca] (2 revisions) (~1 steps left)
 2:59.84 INFO: No more inbound revisions, bisection finished.
 2:59.84 INFO: First good revision: 490b80ca23d23f750fdf5517651d6c1759b8afbc
 2:59.84 INFO: Last bad revision: 13eb53195041d2adcd3f7ddf467480b1a94d1f53
 2:59.84 INFO: Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=13eb53195041d2adcd3f7ddf467480b1a94d1f53&tochange=490b80ca23d23f750fdf5517651d6c1759b8afbc

This appears to cover mostly the same range of commits.

I'll attach the full mozregression output to the bug for your reference.
Flags: needinfo?(mstagg)
(Reporter)

Comment 19

8 months ago
Created attachment 8878624 [details]
Output from mozregression --find-fix -b 5182b2c4 -g bf0f8972
Comment 20

8 months ago
Thanks, Malcolm. I've confirmed that bug 1320744 part 2 fixed this issue.

Before that patch, an assertion is triggered which looks like a ChannelEvent out-of-order issue:
>Assertion failure: enqueue == true || mEventQueue.IsEmpty() (Should always enqueue if ChannelEventQueue not empty), at /mozilla-central/obj-firefox.noindex/dist/include/mozilla/net/ChannelEventQueue.h:129
>#01: mozilla::net::WebSocketChannelChild::RecvOnBinaryMessageAvailable(nsCString const&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x59259a]
>#02: mozilla::net::PWebSocketChild::OnMessageReceived(IPC::Message const&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x8d5afb]
>#03: mozilla::dom::PContentChild::OnMessageReceived(IPC::Message const&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0xa1fadf]
>#04: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x67e8cb]
>#05: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x67c7c4]
>#06: mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x67d924]
>#07: mozilla::ipc::MessageChannel::MessageTask::Run()[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x67e0ac]
>#08: nsThread::ProcessNextEvent(bool, bool*)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x120a5b]
>#09: NS_ProcessPendingEvents(nsIThread*, unsigned int)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x11df73]
>#10: nsBaseAppShell::NativeEventCallback()[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x2a06591]
>#11: nsAppShell::ProcessGeckoEvents(void*)[/mozilla-central/obj-firefox.noindex/dist/NightlyDebug.app/Contents/MacOS/XUL +0x2a6c89e]
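The assertion checks the invariant that an event may bypass the queue only when the queue is empty; delivering a message directly while older messages are still queued reorders them, which on a worker would show up as exactly the kind of corrupt stream reported here. A simplified, single-threaded JavaScript model of that invariant — the real ChannelEventQueue is multi-threaded C++, so this is only an illustration with made-up names:

```javascript
// Single-threaded model of the ordering invariant behind the assertion
// above: an event may be delivered directly only when nothing is pending;
// otherwise it must queue behind the events that arrived before it.
class SimpleEventQueue {
  constructor(deliver) {
    this.pending = [];
    this.suspended = false;
    this.deliver = deliver;
  }
  runOrEnqueue(event) {
    if (this.suspended || this.pending.length > 0) {
      this.pending.push(event);  // older events must go out first
    } else {
      this.deliver(event);       // queue empty: direct delivery is safe
    }
  }
  suspend() { this.suspended = true; }
  resume() {
    this.suspended = false;
    while (this.pending.length > 0) this.deliver(this.pending.shift());
  }
}

// Messages that arrive while the queue is suspended still come out in order.
const received = [];
const q = new SimpleEventQueue(msg => received.push(msg));
q.suspend();
q.runOrEnqueue("packet 1");
q.runOrEnqueue("packet 2");
q.resume();
q.runOrEnqueue("packet 3");
// received is now ["packet 1", "packet 2", "packet 3"]
```

If runOrEnqueue delivered directly while pending was non-empty (the condition the assertion guards against), a later message could overtake queued ones, consistent with the out-of-order behavior described above.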
Status: UNCONFIRMED → RESOLVED
Last Resolved: 8 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1320744