1119574 - Way too many Loop calls fail after being answered

Reporter

Description

•

9 years ago

I wrote a script to analyze the Loop sever websockets logs for a one-week period spanning 2014-12-30 21:11:20 UTC to 2015-01-05 23:57:06 UTC, specifically with an eye towards determining how many calls reach the point that the called user accepts the call, but which don't ultimately end up succeeding.

> Of 5175 answered calls...
>   connected:           2936 (56.73%) (avg post-answer-delay = 5.38 seconds)
>   answered-elsewhere:  6 (0.11%): Successful call (user logged into 2 or more browsers)
>   busy:                2 (0.03%): ??? (some kind of multi-client race?)
>   cancel:              1 (0.01%): Caller canceled simultaneous with answer
>   closed:              572 (11.05%): Gave up after connection was likely
>   gave-up-too-soon:    127 (2.45%): Gave up while connection was still likely (<6 seconds)
>   media-fail:          13 (0.25%): Media (ICE) Failure
>   timeout:             1518 (29.33%): Caller and callee waited 30 seconds, call never set up

The fact that somewhere between 40% and 45% of *answered* calls ultimately fail is a really, really bad sign.

I do note with some skepticism that the media failure numbers reported via websocket are both unrealistically low and don't match the ICE failure reports we collect (our ICE failures on Loop are running about 80 or so a day between both rooms and direct calls, or somewhere in the 5% - 10% range).

This is a bug to investigate the causes for these failure rates being so high, as well as to chase down the discrepancy between the two ice failure rate metrics. Once the root causes are determined, we'll open other bugs to address the causes.

Eric Rescorla (:ekr)

Comment 1

•

9 years ago

1. Can you post the logs for these?
2. Can we get the ICE logs that correspond to these failures?

Mark Banner (:standard8)

Assignee

Comment 2

•

9 years ago

I believe we're actually getting some of the values wrong, e.g. I know the desktop is sending media-fail at a potentially wrong time, and I think timeout is actually meaning multiple things.

I need to check a couple of things, so I'm not going to be able to give a full description today, but I'll try and get my thoughts down on Monday.

Flags: needinfo?(standard8)

Mark Banner (:standard8)

Assignee

Updated

•

9 years ago

Depends on: 1120365

Mark Banner (:standard8)

Assignee

Updated

•

9 years ago

Depends on: 1120375

Mark Banner (:standard8)

Assignee

Comment 3

•

9 years ago

(In reply to Adam Roach [:abr] from comment #0)
Can you post your script that you use for the logs please? I think it might help with understanding how some of these are generated.

> >   busy:                2 (0.03%): ??? (some kind of multi-client race?)

This is either the user was already on a call (there's probably a small chance of that even on the limited number of users - especially if were were testing in that timeslot) *or* there's a couple of bugs like bug 1031439 and bug 1118518. 

> >   gave-up-too-soon:    127 (2.45%): Gave up while connection was still likely (<6 seconds)

Adam, was this your script that determined this? Without knowing what its from, its hard to give an assessment.

> >   media-fail:          13 (0.25%): Media (ICE) Failure

Assuming you're taking this direct from the "Reason" field on desktop, then it appears we have some issues:

- On desktop (and standalone), we're somehow sending this to mean the wrong thing. It currently sends media-fail only when a user hangs up a call whilst it is still connecting. I've filed bug 1120365 on this.
- On FxOS, I looked through the code, and I could see this only happening when gUM fails, not when ICE failed (as far as I could tell, not being experienced with that code).

> >   timeout:             1518 (29.33%): Caller and callee waited 30 seconds, call never set up

Is this a count of the raw "timeout" reason as sent by the server in the terminate message? If so, then this can actually mean a bunch of things:

- The server did not receive a call websocket connection response to the push notification
- The user didn't accept the call
- Doing the call (connecting -> connected) timed out, e.g. media/ice failed in some way or took too long.

I've just filed bug 1120375 on this, as I think it would be worth splitting that up for this bug, and it would also help with client side fixes.

Adam Roach [:abr]

Reporter

Comment 4

•

9 years ago

(In reply to Mark Banner (:standard8) from comment #3)
> (In reply to Adam Roach [:abr] from comment #0)
> Can you post your script that you use for the logs please? I think it might
> help with understanding how some of these are generated.

Sure, but it's some really ugly hacked-together perl. :)

> > >   busy:                2 (0.03%): ??? (some kind of multi-client race?)
> 
> This is either the user was already on a call (there's probably a small
> chance of that even on the limited number of users - especially if were were
> testing in that timeslot) *or* there's a couple of bugs like bug 1031439 and
> bug 1118518.

Pay special attention to the leader line here: "Of 5175 answered calls..."

For this analysis, I'm discarding calls *unless* they enter the "connecting" state. This means that we're getting an "accept" from the called party, and we're *also* getting a "terminated / busy" from the called party.

If we look at one of these calls, you'll see three hello exchanges -- one for the calling party, and two for the called party:

 "messageType":"hello"
 
 "messageType":"hello"
  "state":"init",
 
 "messageType":"progress"
  "state":"alerting",
 
 "messageType":"hello"
 
 "messageType":"progress"
  "state":"alerting",
 
 "messageType":"hello"
  "state":"init",
 
 "messageType":"progress"
  "state":"connecting",
 
 "messageType":"action",
  "event":"accept",
 
 "messageType":"progress"
  "state":"connecting",
 
 "messageType":"hello"
 
 "messageType":"hello"
  "state":"connecting",

 "messageType":"action"
  "event":"terminate",
  "reason":"busy",
  
 "messageType":"progress"
  "reason":"busy",
  "state":"terminated",
 
 "messageType":"progress"
  "reason":"busy",
  "state":"terminated",
 
 "messageType":"progress"
  "reason":"busy",
  "state":"terminated",
 

> > >   gave-up-too-soon:    127 (2.45%): Gave up while connection was still likely (<6 seconds)
> 
> Adam, was this your script that determined this? Without knowing what its
> from, its hard to give an assessment.

This is a judgement call. Basically, when the calling party cancels a call before the called party answers, it's going to be for one of two reasons: either (a) they decided they didn't want to communicate after call, or (b) they still want to communicate but have decided that the system isn't going to work. The first is normal behavior, while the second is a flaw. Since there's no way to measure intent, we have to make reasonable guesses here.

I drew a line at 6 seconds, since the average time to success was 5.38 seconds. Anyone who abandons before this mark left a call that still had a reasonable chance of being established.

> > >   media-fail:          13 (0.25%): Media (ICE) Failure
> 
> Assuming you're taking this direct from the "Reason" field on desktop, then
> it appears we have some issues:
> 
> - On desktop (and standalone), we're somehow sending this to mean the wrong
> thing. It currently sends media-fail only when a user hangs up a call whilst
> it is still connecting. I've filed bug 1120365 on this.

Good catch. Thanks for filing the bug.

> - On FxOS, I looked through the code, and I could see this only happening
> when gUM fails, not when ICE failed (as far as I could tell, not being
> experienced with that code).

Given the volume of FxOS calls, this seems unlikely to impact our statistics at the moment. However, we should get a bug open on it so that it remains correct. Can I have you take care of that?
 
> > >   timeout:             1518 (29.33%): Caller and callee waited 30 seconds, call never set up
> 
> Is this a count of the raw "timeout" reason as sent by the server in the
> terminate message?

No; as before, this is only those timeout events that occurred *after* a user answered the call.

> I've just filed bug 1120375 on this, as I think it would be worth splitting
> that up for this bug, and it would also help with client side fixes.

That may be useful for other reasons, but I can guarantee that every timeout in the statistics I list above is what you call "timeout-connect" in your proposal.

Adam Roach [:abr]

Reporter

Comment 5

•

9 years ago

Attached file post-accept-2.pl — Details

This is the script Mark asks for. It's not pretty, but it works.

Mark Banner (:standard8)

Assignee

Updated

•

9 years ago

Depends on: 1122486

Mark Banner (:standard8)

Assignee

Comment 6

•

9 years ago

So having done some looking at the bugs this depends on, I think bug 1122486 could be accounting for some of the timeouts, gave-up-too-soon, and closed. What proportion is hard to tell until we can start rolling out the fixes.

I would expect the ICE failures to be either "timeouts" or maybe the "gave-up-too-soon" ones as well, depending on the user interactions. I don't think it'd account for many of the "closed" ones.

Flags: needinfo?(standard8)

OS: Mac OS X → All

Hardware: x86 → All

Eric Rescorla (:ekr)

Comment 7

•

9 years ago

(In reply to Mark Banner (:standard8) from comment #6)
> So having done some looking at the bugs this depends on, I think bug 1122486
> could be accounting for some of the timeouts, gave-up-too-soon, and closed.
> What proportion is hard to tell until we can start rolling out the fixes.
> 
> I would expect the ICE failures to be either "timeouts" or maybe the
> "gave-up-too-soon" ones as well, depending on the user interactions. I don't
> think it'd account for many of the "closed" ones.

Why don't these get separately reported? ICE failures are reported up by the
PC API?

Mark Banner (:standard8)

Assignee

Comment 8

•

9 years ago

(In reply to Eric Rescorla (:ekr) from comment #7)
> (In reply to Mark Banner (:standard8) from comment #6)
> > I would expect the ICE failures to be either "timeouts" or maybe the
> > "gave-up-too-soon" ones as well, depending on the user interactions. I don't
> > think it'd account for many of the "closed" ones.
> 
> Why don't these get separately reported? ICE failures are reported up by the
> PC API?

We haven't hooked up to the sdk correctly - this is bug 1120365 I previously mentioned.

Ian Bicking (:ianbicking)

Updated

•

8 years ago

Rank: 21

Priority: -- → P2

Ian Bicking (:ianbicking)

Updated

•

8 years ago

Assignee: nobody → standard8

Rank: 21 → 25

Mark Banner (:standard8)

Assignee

Comment 9

•

8 years ago

Bug 1256244 comment 18 has a very brief summary. Closing now that Hello is being shut down.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WONTFIX

Bugzilla

Quick Search

Way too many Loop calls fail after being answered

Categories

(Hello (Loop) :: General, defect, P2)

Tracking

(Not tracked)

People

(Reporter: abr, Assigned: standard8)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Updated

Updated

Comment 3

Comment 4

Comment 5

Updated

Comment 6

Comment 7

Comment 8

Updated

Updated

Comment 9

Attachment

General

Description

File Name

Content Type