Closed
Bug 1256244
Opened 8 years ago
Closed 8 years ago
V1.0 regression - too many UNEXPECTED_SERVER_RESPONSE (2001) exceptions seen desktop client side
Categories
(Hello (Loop) :: Client, defect, P1)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: RT, Assigned: standard8)
References
Details
(Whiteboard: [btpp-fix-now][47])
User Story
Brief from Erik: We've started getting a lot of UNEXPECTED_SERVER_RESPONSE (2001) exceptions. My last report script had a bug in it; more accurate output is below. Of the 2001s, 148 (26%) were thrown by clickers, 405 by built-ins. For comparison, 12 (10%) were thrown by clickers and 109 by built-ins between 2/27 and 3/6. So we're having a lot of 2001s, with a much higher proportion from built-ins than usual. Looks a lot like a regression in 45.

Computing furthest-state histogram for 2016-03-04... No common states between built-in and clicker. This may be due to timing slop. Going with max state.
refresh 20
leave 2
join **** 101
waiting * 22
starting 20
receiving 7
sending ** 43
sendrecv ** 45
success *************************************************************************************** 1769
Exceptions within sendrecv:
2001 ***************** 8
1554 **************************** 13
1553 *********** 5
4010 ****** 3
1013 ***************** 8
1500 ************* 6
1006 **** 2
119 sessions span midnight (5.86495810744%).

Computing furthest-state histogram for 2016-03-05...
refresh 0
leave 0
join ***** 93
waiting * 24
starting 15
receiving 9
sending * 25
sendrecv * 37
success ***************************************************************************************** 1648
Exceptions within sendrecv:
1553 ************* 5
1554 ******************************** 12
4010 **************** 6
1006 ******** 3
2001 ***** 2
1500 ***** 2
1013 ****************** 7
108 sessions span midnight (5.83468395462%).

Computing furthest-state histogram for 2016-03-06...
refresh 1
leave 0
join ****** 146
waiting * 41
starting 13
receiving 11
sending ** 53
sendrecv * 44
success ************************************************************************************** 1961
Exceptions within sendrecv:
1553 *********** 5
1013 ******************** 9
4010 *********** 5
1500 **** 2
1554 ************************************ 16
2001 *************** 7
111 sessions span midnight (4.88986784141%).

Computing furthest-state histogram for 2016-03-07... No common states between built-in and clicker. This may be due to timing slop. Going with max state.
refresh 0
leave 0
join ****** 111
waiting * 21
starting 10
receiving 9
sending ** 40
sendrecv ** 47
success *************************************************************************************** 1606
Exceptions within sendrecv:
1554 ******************************* 15
1500 ********** 5
1013 ************************* 12
4010 **** 2
1006 ****** 3
2001 ********** 5
1553 ********** 5
115 sessions span midnight (6.23644251627%).

Computing furthest-state histogram for 2016-03-08...
refresh 0
leave 0
join ******* 177
waiting * 41
starting 17
receiving 13
sending ** 58
sendrecv ********** 234
success **************************************************************************** 1719
Exceptions within sendrecv:
2001 *************************** 65
1013 ************************************************************** 146
1500 ** 7
1553 2
1554 *** 9
1006 * 3
4010 2
95 sessions span midnight (4.20540061974%).

Computing furthest-state histogram for 2016-03-09...
refresh 0
leave 0
join ****** 169
waiting ** 58
starting 14
receiving 8
sending ** 74
sendrecv ********************** 636
success ***************************************************************** 1820
Exceptions within sendrecv:
2001 ************************************* 240
1013 ******************************************************** 360
1500 ** 16
1554 * 11
4010 6
1006 2
1553 1
115 sessions span midnight (4.13817920115%).

Computing furthest-state histogram for 2016-03-10... No common states between built-in and clicker. This may be due to timing slop. Going with max state.
refresh 0
leave 0
join ****** 156
waiting * 43
starting 20
receiving 9
sending ** 54
sendrecv ************************ 602
success *************************************************************** 1570
Exceptions within sendrecv:
1500 ***** 35
2001 ***************************************** 248
1013 ************************************************** 303
1006 3
1554 * 8
1553 4
4010 1
94 sessions span midnight (3.8304808476%).
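For reference, a bar chart like the one above can be rendered with a few lines of code. This is only an illustrative sketch (the function name, scaling rule, and bar width are assumptions, not the actual hello_stats implementation):

```python
# Hypothetical sketch of rendering a furthest-state star histogram.
# Each bar is scaled so the largest count gets a fixed maximum width;
# the real hello_stats scaling may differ.

def render_histogram(counts, max_width=87):
    """counts: ordered list of (label, count) pairs; returns the chart text."""
    biggest = max(count for _, count in counts) or 1
    lines = []
    for label, count in counts:
        stars = "*" * round(count / biggest * max_width)
        # Omit the star column entirely for zero-length bars, matching
        # the "refresh 0" style seen in the report above.
        parts = [label] + ([stars] if stars else []) + [str(count)]
        lines.append(" ".join(parts))
    return "\n".join(lines)
```

For example, `render_histogram([("refresh", 0), ("join", 10), ("success", 20)], max_width=10)` yields a three-line chart with a full-width bar for `success`.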
Attachments
(1 file, 1 obsolete file)
708.70 KB,
image/png
Details
No description provided.
Reporter
Updated•8 years ago
Rank: 5
User Story: (updated)
Priority: -- → P1
Reporter
Comment 1•8 years ago
Can you please clarify at which point in the flow the exception occurs, and whether the user experience is affected? That will help set a priority for this.
Flags: needinfo?(standard8)
Flags: needinfo?(erik)
Assignee
Comment 2•8 years ago
Ok, I've dug into the logs. I took a random sampling of about 100 exception reports (some of these were from the same rooms/duplicates, so it was probably around 50 distinct reports, maybe less).

For the numbers referenced below:
- 2001 = unexpected server response
- 1013 = peer-to-peer connection failed

From this, I can see that:
- The 2001 exceptions on the link clicker side only seem to occur after a 1013. I think this has been ongoing for a while.
- The 2001 exceptions on the link generator side occur on one of two distinct occasions:
a) The link generator is the only person in the room. At some stage the 2001 exception occurs; OpenTok sees this as a publishing failure.
b) There have been two people in the room. The link clicker exits the room, so the link generator sees the clicker's session being destroyed, and for some reason this then triggers the 2001.

Since a) wouldn't affect the graphs, I suspect b) is most likely here. As far as I can tell there's no degradation in user experience - it seems that the SDK automatically re-publishes.

Erik, are we able to filter out exceptions after the first person has left the room (if we aren't doing this already)? If so, does that improve the statistics?
Flags: needinfo?(standard8)
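The filtering Mark asks about could look something like the following. This is a hedged sketch only: the event names, tuple shapes, and function name are hypothetical, not the real log schema or hello_stats code.

```python
# Hypothetical sketch: drop exception events that occur while fewer than
# two participants are in the room (e.g. after the link clicker has left),
# so they don't count against a two-person session in the stats.

def filter_post_leave_exceptions(events):
    """events: ordered list of (event_type, payload) tuples for one room.

    Returns the events with exceptions removed whenever fewer than two
    participants are present at the time the exception is logged."""
    participants = 0
    kept = []
    for event_type, payload in events:
        if event_type == "join":
            participants += 1
        elif event_type == "leave":
            participants -= 1
        if event_type == "exception" and participants < 2:
            continue  # exception outside a two-person session: ignore it
        kept.append((event_type, payload))
    return kept
```

With a generator join, a clicker join, a clicker leave, and then a 2001, the 2001 would be dropped; a 1013 thrown while both were still in the room would be kept.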
Assignee
Comment 3•8 years ago
Also, need-info to Mike to make him aware of this - Mike, I'll send you additional info in email directly.
Flags: needinfo?(msander)
Comment 4•8 years ago
Mark, when a person leaves a room, the session ends; that's how I've defined sessions for metrics purposes. So, the only exception that could be construed as post-leave would occur as part of the leave event itself. If I understand (b) correctly, those shouldn't concern us. That said, the event after a leave would begin a new session, as long as 2 people meet again in the room at some point, so it's possible this later session has your (b) exception right off the bat.
Flags: needinfo?(erik)
Comment 5•8 years ago
HOWEVER, those later sessions would not be reported as part of the exception stats I generated for this bug, unless those sessions themselves never got all the way to Success. I suspect that tamps them down past the point of diminishing returns.
Comment 6•8 years ago
I got some exceptions related to this in the console (screenshot attached), and it put me into the situation where video is lost, audio stays on, and the invitation screen comes up.
Assignee
Updated•8 years ago
Assignee: nobody → standard8
Assignee
Comment 7•8 years ago
I've been able to do some debugging around this, and I've found what appears to be the main issue:

- On the link clicker, when a room transitions to "SESSION_CONNECTED" (i.e. connected to Tokbox's servers) from any other state, it attempts to start browser sharing.
- The browser sharing process adds a new listener to the privileged chrome process to get the tab information, and it sets the tab sharing into a pending state. When the listener is added, there's an initial response that we use to start browser sharing with the appropriate window id. This window id gets passed to the Tokbox SDK with a publish request.
- There are currently two ways to get to the SESSION_CONNECTED state:
1) When the link generator opens the room, they transition through gathering, joining (and more) to SESSION_CONNECTED.
2) When a link clicker is in the room at the same time as the link generator, the room state will be HAS_PARTICIPANTS; then, when the clicker leaves the room, the room state reverts to SESSION_CONNECTED.
- It's the latter case that is the issue. Reverting to the session-connected state triggers another start-browser-sharing request, and another listener is added. At this point, the Tokbox SDK complains:

OT.Publisher State Change Failed: 'PublishingToSession' cannot transition to 'PublishingToSession' sdk.js:32172:5
OT.exception :: title: Unexpected Server Response (2001) msg: undefined 2001
Session.publish :: Unexpected server response. Try this operation again later. 2001
Session.publish :: Session.publish :: Unexpected server response. Try this operation again later.

- When the link clicker joins again (without the link generator having closed the room), everything works, but the SDK appears to be setting up multiple peer connections/streams, i.e. one for the tab sharing, plus one for each time the link clicker has exited the room.
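The shape of the fix implied by this analysis can be modeled with a small state machine. This is an illustrative sketch only, not the actual Loop client code (the class and method names are invented): the idea is to start browser sharing on the first transition into SESSION_CONNECTED and guard against restarting it when the room merely reverts to that state.

```python
# Hypothetical model of the duplicate-publish guard: browser sharing (and
# its Tokbox publish request) runs once, not on every re-entry into
# SESSION_CONNECTED after a link clicker leaves.

class RoomStore:
    def __init__(self):
        self.state = "INIT"
        self._sharing_started = False
        self.publish_requests = 0  # stands in for calls into the Tokbox SDK

    def set_state(self, new_state):
        self.state = new_state
        # Guard: only the first SESSION_CONNECTED transition starts sharing.
        if new_state == "SESSION_CONNECTED" and not self._sharing_started:
            self._start_browser_sharing()

    def _start_browser_sharing(self):
        self._sharing_started = True
        self.publish_requests += 1

room = RoomStore()
room.set_state("SESSION_CONNECTED")  # generator opens the room: one publish
room.set_state("HAS_PARTICIPANTS")   # clicker joins
room.set_state("SESSION_CONNECTED")  # clicker leaves: no duplicate publish
```

Without the `_sharing_started` guard, the third transition would issue a second publish, which matches the 'PublishingToSession' cannot transition to 'PublishingToSession' failure described above.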
Assignee
Comment 8•8 years ago
From the statistics perspective, I've discussed with Erik how the graph is assessing the furthest state for sendrecv versus success. Here's an outline of our logs in these error cases:

- Generator join
- Clicker join ***
- Clicker leaves
- Other status messages from generator +++
- <2001 exception from generator>
- Clicker joins again ***

The *** marks where we would expect a session of two people in the room to be classified as starting. However, the stats code is actually starting the second session from +++. So, for the cases where a room has more than one session, any session after the first one is being reported as having an exception, leading to the reduced success rates in our graphs.

We should fix the stats, but it probably wouldn't revert us to our normal rates until we fix the underlying issue, as I've seen cases where the 2001 comes in after the next clicker join.
Comment 9•8 years ago
Assignee
Updated•8 years ago
Attachment #8730975 -
Attachment is obsolete: true
Assignee
Comment 10•8 years ago
Summary time:
- There's an issue in our code which tries to restart tab sharing; this may cause extra connections/higher load, but doesn't seem to functionally affect things. I've moved this fix out to bug 1257009.
- The stats code is mis-reporting failures outside of the session.
- The Tokbox code appears to detect a duplicate publish, but then allows it anyway. Maybe that's intentional, but I think the errors shouldn't be presented if that's the case. I'm going to leave the needinfo to Mike to see if there's an SDK issue here.
Reporter
Comment 11•8 years ago
Thanks Mark. Lowering the priority now, since given the analysis this does not seem like a must-fix for 1.2.
Rank: 5 → 20
Priority: P1 → P2
Comment 12•8 years ago
Just chatted with Mark again. To clarify https://bugzilla.mozilla.org/show_bug.cgi?id=1256244#c8 about the stats:

* A stats "session segment" is not a "session of 2 people"; it's just a way of dividing up the stream of events in a room into chunks over which we can establish furthest states.
* Starting a session segment at +++ is expected, as specced at https://github.com/erikrose/hello_stats/blob/b42e0c265c63fc30b7bfed6c4299a8fe1e135433/hello_stats/sessions.py#L271-L272. However, we should not allow states or exceptions to figure in the stats until a second participant joins. We could either move the boundaries of session segments or ignore those states or exceptions. In any case, the events between +++ and *** are needed to establish that the generator is still in the room.
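The "ignore states or exceptions until a second participant joins" option could be sketched as a filter over a session segment. This is a hedged illustration only; the event dictionaries and function name are hypothetical, not the real hello_stats schema.

```python
# Hypothetical sketch: keep all of a session segment's events (so the
# generator's continued presence is still established), but only let events
# at or after the second participant's join figure in furthest-state stats.

def countable_events(segment):
    """segment: ordered list of event dicts with 'type' and 'who' keys.

    Yields only the events that should count toward furthest states and
    exception stats, i.e. those from the moment two participants are present."""
    participants = set()
    counting = False
    for event in segment:
        if event["type"] == "join":
            participants.add(event["who"])
            if len(participants) >= 2:
                counting = True
        elif event["type"] == "leave":
            participants.discard(event["who"])
        if counting:
            yield event
```

With this, a 2001 exception logged between +++ (generator alone) and *** (clicker rejoins) would be read for participant tracking but excluded from the stats.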
Comment 13•8 years ago
Adam, could you take a look at this and see if there is something we'd like to change in the SDK to handle this case better?
Flags: needinfo?(msander) → needinfo?(adam)
Comment 14•8 years ago
Sure, I have filed a ticket internally for us and we'll investigate it.
Flags: needinfo?(adam)
Assignee
Comment 15•8 years ago
This is probably already fixed, but we need to watch it as 1.2 goes out next week.
Rank: 20 → 19
Priority: P2 → P1
Whiteboard: [btpp-fix-now][47]
Assignee
Comment 16•8 years ago
So bug 1257009 has now been released. We've seen a very small improvement (probably 3%) in the rate of calls completing successfully. However, we're still seeing about 20% of conversations not completing successfully after two people have connected to the room.

My local analysis over a week shows that in about 65% of these cases we're seeing a 1013 (P2P connection failed) error being logged, and in another 30% of cases we're seeing 2001s (unexpected server response). Since these all started around FF 45, our best guess is that it is something to do with having turned on screen sharing by default. Unfortunately our stats don't tell us which connection is failing. Tokbox is currently looking into the errors as well.
Assignee
Updated•8 years ago
Rank: 19 → 10
Updated•8 years ago
Rank: 10 → 22
Assignee
Comment 17•8 years ago
As of Wednesday 13th July, bug 1285916 has been deployed to the standalone. It appears to be reducing the rate at which we were seeing some of these exceptions. We're still waiting on more data, though.
Assignee
Comment 18•8 years ago
Short summary of the investigations:
- 2001 is a generic error returned by Tokbox when a client calls the server to do something. This seems to happen in a lot of failure cases (e.g. connection failure) and is coupled with other exceptions.
- The connection rates we were seeing on the graphs were also masked by other stats issues.
- We found a small percentage of issues (~ < 5%) that looked to be a Tokbox issue on their server; these have been reported to Tokbox.
- For most of the rest of the issues we would have needed more detailed metrics, which we weren't able to get due to Loop being shut down.

Hence closing this bug, as there is no more to do now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX