Connectivity issues with Hello: frequent server errors (perhaps regional)

Status: RESOLVED WORKSFORME
Priority: --
Severity: critical
Reported: 4 years ago
Modified: 4 years ago

People

(Reporter: mreavy, Assigned: tarek)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
As of 1/27 ~5-6pm EST we're seeing a number of connection failures and lost calls.  This is affecting at least two people, both in the Philadelphia area, and is being seen on local inbound builds, Aurora/dev-edition, and Beta builds.

Classes of problems:
* Inability to call from machine to machine
**  Calling between two machines on the same desk, we're seeing significant failures to connect.  We get a notification on the room-creator side that someone has entered the room, but we either see no attempt to connect (it stays on the self-image), or after a while it tries to connect and eventually fails with "Something went wrong".  On the link-clicker (standalone) side, we may see "you're the only one here", or it may attempt (after a minute or three) to connect and then fail after some longer period.
**  Something similar was seen when calling oneself in the same browser, which normally works (create a room, then copy the link and paste it into a tab in the same browser).
**  Direct calling has similar problems.  Notifications of calls weren't coming through, or if they did, it took a long time to connect after answering the call (up to one minute to receive audio and video on both sides).
  
* Dropped calls:
**  When calling to/from people in other areas, typically we were able to connect (sometimes with no video on one side for a while at first).  However, most/all of these calls failed unexpectedly within a few minutes to 10 minutes (with "Something went wrong").  

Looking in the browser console and web console, we saw a number of variants of socket/wss errors referencing the TokBox servers:
Firefox can't establish a connection to the server at wss://media005-nyj.tokbox.com/rumorwebsocketsv2.
The connection to wss://media014-nyj.tokbox.com/rumorwebsocketsv2 was interrupted while the page was loading.
The connection to wss://media008-nyj.tokbox.com/rumorwebsocketsv2 was interrupted while the page was loading.
"Rumor.Socket: Rumor Socket Disconnected: Connectivity loss was detected as it was too long since the socket received the last PONG message"
1501 "Session.subscribe :: InvalidStreamID" sdk.js:910:14 and "OT.exception :: title: undefined (1501) msg: Session.subscribe :: InvalidStreamID" sdk.js:910:14
"WebSocketConnection:null:Timed out while waiting for the Rumor socket to connect" OT.exception :: title: Connect Failed (1006) msg: WebSocketConnection:null:Timed out while waiting for the Rumor socket to connect." "Failed to complete connection" Object { code: 1006, message: "WebSocketConnection:null:Timed out while...

The relevant errors are http://logs.glob.uno/?c=mozilla%23loop#c28534 and http://logs.glob.uno/?c=mozilla%23loop#c28542 and http://logs.glob.uno/?c=mozilla%23loop#c28599
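
A minimal sketch for probing those media hosts from an affected network (Python 3 standard library only; hostnames copied from the errors above). It only checks that a TCP + TLS connection to port 443 succeeds and does not perform the WebSocket upgrade the SDK does, so a passing probe doesn't rule out a higher-level failure.

import socket
import ssl

# Media hosts named in the console errors above.
HOSTS = [
    "media005-nyj.tokbox.com",
    "media008-nyj.tokbox.com",
    "media014-nyj.tokbox.com",
]

context = ssl.create_default_context()
for host in HOSTS:
    try:
        # Plain TCP connect followed by a TLS handshake; no wss:// upgrade.
        with socket.create_connection((host, 443), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                print("%s: TLS handshake OK (%s)" % (host, tls.version()))
    except OSError as exc:
        print("%s: FAILED (%s)" % (host, exc))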

Comment 1

4 years ago
We were unable to reproduce this at TokBox, nor when I tried many calls with Nils.

We see failures in our logs for one of the sessions you reported, but do not see any spike in failures system wide.

We are going to check if there were any connectivity issues in any of our data centers. I'll update once we have this information.
As bobm mentioned in Bug 1126410, this morning I inadvertently launched a loadtest pointing to the production server.  Although it was a relatively light load (30 users), folks immediately began seeing significant connection failures.  After an hour I realized this had happened and terminated the test.  Throughout the day, however, people continued to report on IRC that they were experiencing intermittent connectivity and media quality issues.

Wes discovered this afternoon that one of the two loads-tool slaves (loads-slave3) had hung and was still hitting our prod server.  While on a call with msander, I rebooted the slave, and we didn't see any connectivity or media-quality issues on 2 separate calls.  If there are no further issues with Hello by tomorrow morning, then this was likely the cause.
I know nothing of the load test code, but would it be possible to add a check on the server URL and ask the instigator whether they really do want to run the tests against production?
The loadtests currently don't target production, so one would need to specifically target the prod server to do that (by default it queries the stage server).

I believe it's possible to add a blacklist of servers that a given loadtest shouldn't hit, yes, or at least the public DNS name of loop.

Richard, how did you run this against prod? Was it a temporary Amazon DNS entry? (If so, there is not much I can do.)
Flags: needinfo?(rpappalardo)
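
A rough sketch of the kind of guard being discussed here: a check in the loadtest setup that refuses to target a blacklisted (production) hostname unless the instigator explicitly opts in. The helper name, the production hostname, and the environment-variable override are illustrative assumptions, not the actual loads-tool API.

import os
from urllib.parse import urlparse

# Assumed public DNS name of the production Loop server; adjust as needed.
BLACKLISTED_HOSTS = {"loop.services.mozilla.com"}

def check_target(server_url):
    """Abort unless the target host is safe or the override is set."""
    host = urlparse(server_url).hostname
    if host in BLACKLISTED_HOSTS and os.environ.get("ALLOW_PROD_LOADTEST") != "1":
        raise SystemExit(
            "Refusing to load-test %r: it is on the production blacklist. "
            "Set ALLOW_PROD_LOADTEST=1 if you really mean it." % host)
    return server_url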
Also, it seems that we may have a problem with Tokbox being unresponsive / slow when too many queries are sent there. See bug 1125777.
Assignee: nobody → alexis+bugs
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1126410
(In reply to Alexis Metaireau (:alexis) from comment #4)
> Richard, how have you run this against prod? was it a temporary amazon DNS?

We usually run a single make test against prod to verify the deployment.  I temporarily modified the Makefile, but neglected to switch it back.  The following day, when I kicked off another round of tests (intending to target stage), the test launched against prod.  As Mark suggested, adding something like a warning in the Makefile might help to prevent user error.
Flags: needinfo?(rpappalardo)
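
As a sketch of the Makefile warning mentioned above: a small pre-flight script invoked by the test target could prompt for confirmation whenever the configured server looks like production. The script name, the SERVER_URL argument, and the hostname markers are all assumptions for illustration.

# preflight.py (hypothetical), e.g. invoked as: python preflight.py "$(SERVER_URL)"
import sys
from urllib.parse import urlparse

PROD_HINTS = ("loop.services.mozilla.com", "prod")  # assumed markers for production

def main(server_url):
    host = urlparse(server_url).hostname or server_url
    if any(hint in host for hint in PROD_HINTS):
        # Query the instigator before letting the run continue.
        answer = input("Target %s looks like PRODUCTION. Type 'yes' to continue: " % host)
        if answer.strip().lower() != "yes":
            sys.exit("Aborting: refusing to run tests against production.")

if __name__ == "__main__":
    main(sys.argv[1])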
It seems to be working for me this morning (I was one of the people in Philadelphia affected in the report).  Connections are basically instant, with the same browsers/builds used last night.

I suspect that either rpapa's load test, or the HTTP connection limit Alexis referred to, or perhaps the combination of the two, was at fault.  It might be useful to find out whether this was really the case by using a stage server to duplicate the problem, since per rpapa's comment the load test was a "light" one (~30 users) and caused almost immediate problems.

Comment 9

4 years ago
(In reply to Alexis Metaireau (:alexis) from comment #6)
> 
> *** This bug has been marked as a duplicate of bug 1126410 ***


I don't think that's correct. The subject of Bug 1126410 is Loop server failures: requests sent to the loop server were timing out and/or producing 500-class messages.

The subject of this bug is a series of highly reproducible and ongoing websockets failures clearly associated with the TokBox infrastructure. These errors are of the form:

 * Firefox can't establish a connection to the server at wss://media005-nyj.tokbox.com/rumorwebsocketsv2. [1]

 * The connection to wss://media008-nyj.tokbox.com/rumorwebsocketsv2 was interrupted while the page was loading. [2]

 * The connection to wss://media014-nyj.tokbox.com/rumorwebsocketsv2 was interrupted while the page was loading. [3]


I want to emphasize as strongly as possible that these were not three discrete failures; these were three samples from an ongoing, highly-reproducible series of failures that lasted, minimally, for most of an hour.

This disturbs me on two fronts.

First, we had a clear, highly reproducible, and nearly total outage for at least some portion of the network (potentially the entire eastern half of the US), and no alarms were raised.

Second, and this is far more worrying: even *knowing* that we had this issue, we can't seem to locate any information pertaining to the failure for a root cause analysis. I really, really hope this is simply an issue of looking harder.

I'm reopening this bug to track the problem we saw with the machines in the subdomain tokbox.com (see the above messages), leaving 1126410 for the machines in the subdomain mozilla.com. It's not inconceivable that the problems are related, but it seems unlikely (given the multi-hour delay between the mozilla.com problems and the tokbox.com problems).

Mike: you said you'd check for data center connectivity issues. Did those turn anything up?

___
[1] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28542
[2] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28599
[3] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28621
Status: RESOLVED → REOPENED
Flags: needinfo?(msander)
Resolution: DUPLICATE → ---

Comment 11

4 years ago
We did not have any connectivity issues in any of our data centers yesterday.

There was also no difference in loop connect times before/after 2pm yesterday.

Today we will look specifically at loop connectivity and publish/subscribe failure rates by hour.
Flags: needinfo?(msander)
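
As a rough illustration of the "failure rates by hour" analysis described above, hourly buckets could be computed from connection records like this (the record format is an assumption, not TokBox's actual schema):

from collections import defaultdict
from datetime import datetime

def failure_rates_by_hour(records):
    """records: iterable of (datetime, succeeded) pairs, succeeded=True for a good connect."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for ts, ok in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        totals[hour] += 1
        if not ok:
            failures[hour] += 1
    return {hour: failures[hour] / float(totals[hour]) for hour in sorted(totals)}

# e.g. failure_rates_by_hour([(datetime(2015, 1, 27, 14, 5), False), ...])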

Comment 12

4 years ago
(In reply to Adam Roach [:abr] from comment #9)
> (In reply to Alexis Metaireau (:alexis) from comment #6)
>  * Firefox can't establish a connection to the server at
> wss://media005-nyj.tokbox.com/rumorwebsocketsv2. [1]
> 
>  * The connection to wss://media008-nyj.tokbox.com/rumorwebsocketsv2 was
> interrupted while the page was loading. [2]
> 
>  * The connection to wss://media014-nyj.tokbox.com/rumorwebsocketsv2 was
> interrupted while the page was loading. [3]
>
> [1] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28542
> [2] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28599
> [3] http://logs.glob.uno/?c=mozilla%23loop&s=27+Jan+2015&e=27+Jan+2015#c28621

[2] & [3] may be a red herring. Those errors appear when you refresh the page during a call.

Comment 13

4 years ago
There was a spike in failures during the 2-3pm PST hour for Fx36 & 37. This traces back to a single user in Pennsylvania (jessup?). All other delays were in single-digit seconds and were not out of the norm.
Assigning Tarek here since he's working with them on how to detect problems and how to react.
Assignee: alexis+bugs → tarek
I'm closing this as WORKSFORME for now, since we haven't seen any related problems lately. If it happens again, we'll reopen with more detailed information. Thanks.
Status: REOPENED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WORKSFORME