Closed Bug 1256244 Opened 8 years ago Closed 8 years ago

V1.0 regression - too many UNEXPECTED_SERVER_RESPONSE (2001) exceptions seen desktop client side

Categories

(Hello (Loop) :: Client, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: RT, Assigned: standard8)

References

Details

(Whiteboard: [btpp-fix-now][47])

User Story

Brief from Erik:
We're started getting a lot of UNEXPECTED_SERVER_RESPONSE (2001) exceptions. My last report script had a bug in it. More accurate output is below.

Of the 2001s, 148 (26%) were thrown by clickers, 405 by built-ins. For comparison, 12 (10%) were thrown by clickers and 109 by built-ins between 2/27 and 3/6. So we're having a lot of 2001s, with a much higher proportion from built-ins than usual. Looks a lot like a regression in 45.



Computing furthest-state histogram for 2016-03-04...
No common states between built-in and clicker. This may be due to timing slop. Going with max state.
  refresh  20
    leave  2
     join **** 101
  waiting * 22
 starting  20
receiving  7
  sending ** 43
 sendrecv ** 45
  success *************************************************************************************** 1769

Exceptions within sendrecv:
     2001 ***************** 8
     1554 **************************** 13
     1553 *********** 5
     4010 ****** 3
     1013 ***************** 8
     1500 ************* 6
     1006 **** 2
119 sessions span midnight (5.86495810744%).


Computing furthest-state histogram for 2016-03-05...
  refresh  0
    leave  0
     join ***** 93
  waiting * 24
 starting  15
receiving  9
  sending * 25
 sendrecv * 37
  success ***************************************************************************************** 1648

Exceptions within sendrecv:
     1553 ************* 5
     1554 ******************************** 12
     4010 **************** 6
     1006 ******** 3
     2001 ***** 2
     1500 ***** 2
     1013 ****************** 7
108 sessions span midnight (5.83468395462%).


Computing furthest-state histogram for 2016-03-06...
  refresh  1
    leave  0
     join ****** 146
  waiting * 41
 starting  13
receiving  11
  sending ** 53
 sendrecv * 44
  success ************************************************************************************** 1961

Exceptions within sendrecv:
     1553 *********** 5
     1013 ******************** 9
     4010 *********** 5
     1500 **** 2
     1554 ************************************ 16
     2001 *************** 7
111 sessions span midnight (4.88986784141%).


Computing furthest-state histogram for 2016-03-07...
No common states between built-in and clicker. This may be due to timing slop. Going with max state.
  refresh  0
    leave  0
     join ****** 111
  waiting * 21
 starting  10
receiving  9
  sending ** 40
 sendrecv ** 47
  success *************************************************************************************** 1606

Exceptions within sendrecv:
     1554 ******************************* 15
     1500 ********** 5
     1013 ************************* 12
     4010 **** 2
     1006 ****** 3
     2001 ********** 5
     1553 ********** 5
115 sessions span midnight (6.23644251627%).


Computing furthest-state histogram for 2016-03-08...
  refresh  0
    leave  0
     join ******* 177
  waiting * 41
 starting  17
receiving  13
  sending ** 58
 sendrecv ********** 234
  success **************************************************************************** 1719

Exceptions within sendrecv:
     2001 *************************** 65
     1013 ************************************************************** 146
     1500 ** 7
     1553  2
     1554 *** 9
     1006 * 3
     4010  2
95 sessions span midnight (4.20540061974%).


Computing furthest-state histogram for 2016-03-09...
  refresh  0
    leave  0
     join ****** 169
  waiting ** 58
 starting  14
receiving  8
  sending ** 74
 sendrecv ********************** 636
  success ***************************************************************** 1820

Exceptions within sendrecv:
     2001 ************************************* 240
     1013 ******************************************************** 360
     1500 ** 16
     1554 * 11
     4010  6
     1006  2
     1553  1
115 sessions span midnight (4.13817920115%).


Computing furthest-state histogram for 2016-03-10...
No common states between built-in and clicker. This may be due to timing slop. Going with max state.
  refresh  0
    leave  0
     join ****** 156
  waiting * 43
 starting  20
receiving  9
  sending ** 54
 sendrecv ************************ 602
  success *************************************************************** 1570

Exceptions within sendrecv:
     1500 ***** 35
     2001 ***************************************** 248
     1013 ************************************************** 303
     1006  3
     1554 * 8
     1553  4
     4010  1
94 sessions span midnight (3.8304808476%).

Attachments

(1 file, 1 obsolete file)

      No description provided.
Rank: 5
User Story: (updated)
Priority: -- → P1
Can you please clarify at which point in the flow the exception occurs and if user experience gets affected to help set a priority for this?
Flags: needinfo?(standard8)
Flags: needinfo?(erik)
Ok, I've dug into the logs. I took a random sampling of about 100 exception reports (some of these were from the same rooms/duplicates, so it was probably around 50 distinct reports, maybe less).

For numbers referenced below:

- 2001 = unexpected server response
- 1013 = peer-to-peer connection failed

From this, I can see that:

- The 2001 exceptions on the link clicker side, only seem to occur after a 1013. I think this has been ongoing for a while.

- The 2001 exceptions on the link generator side, occur in one of two distinct occasions:

a) The link generator is the only person in the room. At some stage the 2001 exception occurs, OpenTok sees this as a publishing failure.

b) There have been two people in the room. The link clicker exits the room, and so the link generator sees the clicker's session being destroyed, and for some reason, this then triggers the 2001.

Since a) wouldn't affect the graphs, I suspect b) is most likely here. As far as I can tell there's no degradation in user experience - it seems that the sdk automatically re-publishes.

Erik, are we able to filter out exceptions after the first person has left the room (if we aren't doing this already)? If so, does that improve the statistics?
Flags: needinfo?(standard8)
Also, need-info to Mike to make him aware of this - Mike, I'll seen you additional info in email direct.
Flags: needinfo?(msander)
Mark, when a person leaves a room, the session ends; that's how I've defined sessions for metrics purposes. So, the only exception that could be construed as post-leave would occur as part of the leave event itself. If I understand (b) correctly, those shouldn't concern us.

That said, the event after a leave would begin a new session, as long as 2 people meet again in the room at some point, so it's possible this later session has your (b) exception right off the bat.
Flags: needinfo?(erik)
HOWEVER, those later sessions would not be reported as part of the exception stats I generated for this bug, unless those sessions themselves never got all the way to Success. I suspect that tamps them down past the point of diminishing returns.
I got some exceptions related to this in the console (screenshot attached), and it put me into the situation where video is lost, audio stays on, and the invitation screen comes up.
Assignee: nobody → standard8
I've been able to do some debugging around this, and I've found what appears to be the main issue:

- On the link clicker, when a room transitions to "SESSION_CONNECTED" (i.e. connected to Tokbox's servers) from any other state, it attempts to start browser sharing.

- The browser sharing process will add a new listener to the privileged chrome process to get the tab information, and it sets the tab sharing into a pending state. When the listener is added, there's an initial response that we use to start browser sharing with the appropriate window id. This window id gets passed to the Tokbox SDK with a publish request.

- To get to the SESSION_CONNECTED state, there's currently two ways:

1) When the link generator opens the room, they transition through gathering, joining (and more) to SESSION_CONNECTED.

2) When a link clicker is in the room at the same time as the link generator, the room state will be HAS_PARTICIPANTS, but then, when the clicker leaves the room, the room state reverts to SESSION_CONNECTED.

- Its the latter case that is the issue. Reverting to the fact the session is connected, triggers another start browser sharing request, and another listener is added. At this time, the Tokbox sdk complains:

OT.Publisher State Change Failed:  'PublishingToSession' cannot transition to 'PublishingToSession'sdk.js:32172:5
OT.exception :: title: Unexpected Server Response (2001) msg: undefined
2001 Session.publish :: Unexpected server response. Try this operation again later.
2001 Session.publish :: Session.publish :: Unexpected server response. Try this operation again later.

- When the link clicker joins again (without the link generator having closed the room), everything works, but the SDK appears to be setting up multiple peer connections/streams i.e. one for the tab sharing, plus one for each time the link clicker has exited the room.
From the statistics perspective, I've discussed with Erik how the graph is assessing the furthest state for sendrecv versus success. Here's an outline of our logs in these error cases:

- Generator join 
- Clicker join ***
- Clicker leaves
- Other status messages from generator +++
- <2001 exception from generator>
- Clicker joins again ***

The *** represents where we would expect to classify a session of two people in the room to start from. However, the stats code is actually starting the second session from +++.

So, for the cases where a room has more than one session, we're seeing any session after the first one being reported as having an exception, leading to the reduced success rates of our graphs.

We should fix the stats, but it probably wouldn't revert us to our normal rates until we fix the issue, as I've seen cases where the 2001 comes in after the next clicker join.
Depends on: 1257009
Attachment #8730975 - Attachment is obsolete: true
Summary time:

- There's an issue in our code which tries to restart tab sharing, this may cause extra connections/higher load, but doesn't seem to functionally affect things. I've moved this fix our to bug 1257009.

- The stats code is miss-reporting failures outside of the session.

- The Tokbox code appears to detect a duplicate publish, but then allows it anyway. Maybe that's intentional, but I think the errors shouldn't be presented if that's the case.

I'm going to leave the needinfo to Mike to see if there's an SDK issue here.
Thanks Mark, lowering the priority now since given the analysis it does not seem like a must fix for 1.2
Rank: 5 → 20
Priority: P1 → P2
Just chatted with Mark again. To clarify https://bugzilla.mozilla.org/show_bug.cgi?id=1256244#c8 about the stats:

* A stats "session segment" is not a "session of 2 people"; it's just a way of dividing up the stream of events in a room into chunks over which we can establish furthest states.
* Starting a session segment at +++ is expected, as specced at https://github.com/erikrose/hello_stats/blob/b42e0c265c63fc30b7bfed6c4299a8fe1e135433/hello_stats/sessions.py#L271-L272. However, we should not allow states or exceptions to figure in the stats until a second participant joins. We could either move the boundaries of session segments or ignore those states or exceptions. In any case, those events between +++ and *** are needed to establish that the generator is still in the room.
Adam, could you take a look at this and see if there is something we'd like to change in the SDK to handle this case better?
Flags: needinfo?(msander) → needinfo?(adam)
Sure, I have filed a ticket internally for us and we'll investigate it.
Flags: needinfo?(adam)
This is probably already fixed, but we need to watch it as 1.2 goes out next week.
Rank: 20 → 19
Priority: P2 → P1
Whiteboard: [btpp-fix-now][47]
So bug 1257009 has now been released. We've seen a very small improvement (probably 3%) in the calls complete successfully case.

However, we're still seeing about 20% of conversations not completing successfully after two people have connected to the room.

My local analysis over a week shows that in about 65% of these cases, we're seeing a 1013 (P2P connection failed) error being logged, there's another 30% of cases where we're seeing 2001s (unexpected server response).

Since these all started around FF 45, our best guess is that it is something to do with having turned on screen sharing by default.

Unfortunately our stats don't tell us which connection is failing.

Tokbox is currently looking into the errors as well.
Rank: 19 → 10
Depends on: 1274237
Rank: 10 → 22
Depends on: 1281252
Depends on: 1282765
Depends on: 1285517
Depends on: 1285916
As of Wednesday 13th July, bug 1285916 has been deployed to the standalone. It appears to be reducing the rate that we were seeing some of these exceptions. We're still waiting on more data though.
Short summary of investigations:

- 2001 is a generic error returned by Tokbox when a client calls to the server to do something. This seems to happen in a lot of failure cases (e.g. connection failure) and is coupled with other exceptions.
- The connection rates we were seeing on the graphs were also masked by other stats issues.
- We found a small percentage of issues (~ < 5%) that looked to be a Tokbox issue on their server, and have been reported to Tokbox.
- Most of the rest of issues we would have needed more detailed metrics, which we weren't able to get due to Loop being shut down.

Hence closing this bug as no more to do now.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: