Intermittent test_websocket1.html,test_websocket2.html,test_websocket3.html,test_websocket4.html | Test timed out. || test 34 custom server code | test 34 custom server reason

RESOLVED FIXED in Firefox 47

Status

()

Core
DOM: Workers
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: KWierso, Assigned: baku)

Tracking

({intermittent-failure})

47 Branch
mozilla48
intermittent-failure
Points:
---

Firefox Tracking Flags

(firefox47 fixed, firefox48 fixed)

Details

Attachments

(1 attachment)

(Reporter)

Comment 1

2 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=23252125&repo=mozilla-inbound
Summary: Intermittent test_websocket2.html | Test timed out. → Intermittent test_websocket2.html,test_websocket4.html | Test timed out.
(Reporter)

Comment 2

2 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=23279784&repo=mozilla-inbound
Summary: Intermittent test_websocket2.html,test_websocket4.html | Test timed out. → Intermittent test_websocket1.html,test_websocket2.html,test_websocket4.html | Test timed out.
(Reporter)

Comment 3

2 years ago
https://treeherder.mozilla.org/logviewer.html#?job_id=23284391&repo=mozilla-inbound
Summary: Intermittent test_websocket1.html,test_websocket2.html,test_websocket4.html | Test timed out. → Intermittent test_websocket1.html,test_websocket2.html,test_websocket4.html | Test timed out. || test 34 custom server code | test 34 custom server reason
Flags: needinfo?(amarchesini)
Summary: Intermittent test_websocket1.html,test_websocket2.html,test_websocket4.html | Test timed out. || test 34 custom server code | test 34 custom server reason → Intermittent test_websocket1.html,test_websocket2.html,test_websocket3.html,test_websocket4.html | Test timed out. || test 34 custom server code | test 34 custom server reason
(Assignee)

Comment 4

2 years ago
Created attachment 8730005 [details] [diff] [review]
mutex.patch

The reason why we have this intermittent failure is that WebSocketChannel uses ChannelEventQueue from different threads. With this patch, I made ChannelEventQueue thread-safe, basically adding a mutex, removing ShouldEnqueue+Enqueue and adding RunOrEnqueue().

RunOrEnqueue() uses a mutex to check if we can run the ChannelEvent immediately or if we have to store it until a pending operation is completed.

This patch fixes another 'potential' error: PrependEvents() adds new events in the queue, but what about if this is called by a callback scheduled by ForceFlush()? Without this patch we use an index to schedule callbacks in order, but if we change the array in the meantime, that index points to wrong callbacks.
Assignee: nobody → amarchesini
Flags: needinfo?(amarchesini)
Attachment #8730005 - Flags: review?(jduell.mcbugs)

Comment 5

2 years ago
20 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 15
* fx-team: 3
* mozilla-central: 2

Platform breakdown:
* windows7-32: 13
* osx-10-10: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1254730&startday=2016-03-07&endday=2016-03-13&tree=all
(Assignee)

Updated

2 years ago
Attachment #8730005 - Flags: review?(jduell.mcbugs) → review?(michal.novotny)
Comment on attachment 8730005 [details] [diff] [review]
mutex.patch

Review of attachment 8730005 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, r+ when you keep somehow the assertions.

::: netwerk/ipc/ChannelEventQueue.h
@@ +206,3 @@
>  }
>  
> +// Ensures that RunOrEnqueue() will collecting events during its lifetime

will be collecting

::: netwerk/protocol/ftp/FTPChannelChild.cpp
@@ -382,5 @@
> -  if (mEventQ->ShouldEnqueue()) {
> -    mEventQ->Enqueue(
> -      new FTPDataAvailableEvent(this, channelStatus, data, offset, count));
> -  } else {
> -    MOZ_RELEASE_ASSERT(!mDivertingToParent,

We're losing this check in the new code (also at other places in this patch). Given that it is release assert I assume it's important and should be kept. Maybe RunOrEnqueue could have an argument assertWhenNotQueued or something like that.
Attachment #8730005 - Flags: review?(michal.novotny) → review+

Comment 8

2 years ago
Is this something we need on other branches too?
> +  // @param aCallback - the ChannelEvent
> +  // @param aAssertWhenNotQueued - this optional param will be used in an
> +  //   assertion when the event is executed directly.
> +  inline void RunOrEnqueue(ChannelEvent* aCallback,
> +                           bool aAssertWhenNotQueued = true);

Description of the argument and the logic is correct, but the from the name I would expect that if the argument is true then the assertion is raised if the event is not queued. Maybe aAssertionWhenNotQueued would be better name?

And all the places below should IMO use the assertion argument as well.

> -  MOZ_RELEASE_ASSERT(mEventQ->ShouldEnqueue());
> -
> -  mEventQ->Enqueue(new HttpFlushedForDiversionEvent(this));
> +
> +  mEventQ->RunOrEnqueue(new HttpFlushedForDiversionEvent(this));

 
> -  if (mEventQ->ShouldEnqueue()) {
> -    mEventQ->Enqueue(new FTPFlushedForDiversionEvent(this));
> -  } else {
> -    MOZ_CRASH();
> -  }
> +  mEventQ->RunOrEnqueue(new FTPFlushedForDiversionEvent(this));


> -  if (mEventQ->ShouldEnqueue()) {
> -    mEventQ->Enqueue(new StopRequestEvent(this, channelStatus, timing));
> -  } else {
> -    MOZ_ASSERT(!mDivertingToParent, "ShouldEnqueue when diverting to parent!");
> -
> -    OnStopRequest(channelStatus, timing);
> -  }
> +  mEventQ->RunOrEnqueue(new StopRequestEvent(this, channelStatus, timing));

Comment 11

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/db5ef9e43187
https://hg.mozilla.org/mozilla-central/rev/a8ae52b3b96c
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox48: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
(Assignee)

Comment 12

2 years ago
Comment on attachment 8730005 [details] [diff] [review]
mutex.patch

Approval Request Comment
[Feature/regressing bug #]: WebSocket in workers
[User impact if declined]: a race condition can block the scheduling of events.
[Describe test coverage new/current, TreeHerder]: none
[Risks and why]: The patch makes ChannelEventQueue a thread-safe object. It's a small patch but it uses mutex.
[String/UUID change made/needed]: none

This approval request is for both m-i commits.
Attachment #8730005 - Flags: approval-mozilla-aurora?
Wes, Carsten: Can you please confirm whether this fixes the intermittent or not? It's a pretty big code change and I want to be sure this helps the situation. Thanks!
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
(Reporter)

Comment 14

2 years ago
Possibly too soon to know for sure, but there's been less occurrences since this landed: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1254730
Flags: needinfo?(wkocher)
Flags: needinfo?(cbook)
Comment on attachment 8730005 [details] [diff] [review]
mutex.patch

This seems to be helping the intermittent failure, Aurora47+
Attachment #8730005 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Updated

2 years ago
status-firefox47: --- → affected

Comment 17

2 years ago
11 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 5
* mozilla-inbound: 3
* mozilla-central: 2
* ash: 1

Platform breakdown:
* windows7-32: 7
* osx-10-10: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1254730&startday=2016-03-14&endday=2016-03-20&tree=all
You need to log in before you can comment on or make changes to this bug.