Closed
Bug 1112789
Opened 10 years ago
Closed 8 years ago
Add Hello direct call setup metrics
Categories
(Cloud Services Graveyard :: Metrics: Pipeline, defect, P5)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: abr, Assigned: kparlante)
Attachments
(1 file: 66.05 KB, image/png)
As a follow-on to Bug 1085252, we'd like to be able to extract a variety of information from the call progress logs, once this server deploys. An initial list of questions we'd like to be able to answer is:
1) When a call transitions to "terminated," what state was it in immediately before doing so? Ideally this would be a pie chart over some relatively small time-local window (e.g., one week), along with a time-series line graph that shows which state the failure occurred in, as a percentage of all failures. Roughly, something like the attached image. I'd like us to include "connected" and "terminated" in the list; even though these should be "0%" by definition, I'd like to include them as a double-check that the server isn't doing something it shouldn't.
2) For calls that transition to "terminated" from "alerting", what are the reason codes? Ideally, the same two kinds of graphs as for (1).
3) For calls that transition to "terminated" from "connecting" or "half-connected", what are the reason codes? Ideally, the same two kinds of graphs as for (1).
4) For calls that include an "accept" event, what percentage eventually enter the "connected" state? This would be a simple time series line graph.
5) For calls that include an "accept" event and fail, how long is it between the "accept" event and the "terminate" event? Ideally, this is a bucketed distribution graph, similar to what we currently have on the telemetry website; i.e., a bar graph of the number of calls that failed <1 second after "accept"; number of calls that failed 1-2 seconds after, etc. Due to the supervisory timers we run, I would expect this to top out at 10 seconds: having a single bucket to catch "> 10 seconds" is probably sufficient.
6) Same as (5), except for calls that include an "accept" event and eventually *succeed*.
Note that all of the preceding analyses should disregard "terminated" messages with a reason of "answered-elsewhere".
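A minimal sketch of how question (1) might be computed, in Python. The record shape `(call_id, timestamp, state, reason)` and the sample reason strings are assumptions made for illustration; the actual call progress log format is not described in this bug. Only the state names and the "answered-elsewhere" exclusion come from the text above.

```python
from collections import Counter, defaultdict

# Per the note above, these terminations are excluded from every analysis.
IGNORED_REASONS = {"answered-elsewhere"}

# States named in this bug. They are always reported, even at 0%, so that
# "connected" and "terminated" can serve as a sanity check on the server.
REPORTED_STATES = ["alerting", "connecting", "half-connected",
                   "connected", "terminated"]


def prior_state_percentages(events):
    """Return {state: percent of failures that terminated from that state}.

    `events` is an iterable of (call_id, timestamp, state, reason) tuples
    in any order. This shape is hypothetical, not the real log schema.
    """
    by_call = defaultdict(list)
    for call_id, ts, state, reason in events:
        by_call[call_id].append((ts, state, reason))

    counts = Counter()
    for history in by_call.values():
        history.sort(key=lambda e: e[0])  # order each call's events by time
        for prev, cur in zip(history, history[1:]):
            if cur[1] == "terminated" and cur[2] not in IGNORED_REASONS:
                counts[prev[1]] += 1  # state immediately before termination

    total = sum(counts.values())
    extra = sorted(s for s in counts if s not in REPORTED_STATES)
    return {s: (100.0 * counts[s] / total if total else 0.0)
            for s in REPORTED_STATES + extra}
```

Running this over one week of events would give the pie chart values; running it per day over the same events would give the points for the time-series graph.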
Comment 1•10 years ago
(In reply to Adam Roach [:abr] from comment #0)
> Created attachment 8538103 [details]
> Graphs for Call Failure States
>
> As a follow-on to Bug 1085252, we'd like to be able to extract a variety of
> information from the call progress logs, once this server deploys. An
> initial list of questions we'd like to be able to answer are:
>
> 1) When a call transitions to "terminated," what state was it in immediately
> before doing so? Ideally this would be a pie chart over some relatively
> small time-local window (e.g., one week), along with a time-series line
> graph that shows which state the failure occurred in, as a percentage of all
> failures. Roughly, something like the attached image. I'd like us to include
> "connected" and "terminated" in the list; even though these should be "0%"
> by definition, I'd like to include them as a double-check that the server
> isn't doing something it shouldn't.
>
> 2) For calls that transition to "terminated" from "alerting", what are the
> reason codes? Ideally, the same two kinds of graphs as for (1)
>
> 3) For calls that transition to "terminated" from "connecting" or
> "half-connected", what are the reason codes? Ideally, the same two kinds of
> graphs as for (1)
>
> 4) For calls that include an "accept" event, what percentage eventually
> enter the "connected" state? This would be a simple time series line graph.
>
> 5) For calls that include an "accept" event and fail, how long is it between
> the "accept" event and the "terminate" event? Ideally, this is a bucketed
> distribution graph, similar to what we currently have on the telemetry
> website; i.e., a bar graph of the number of calls that failed <1 second
> after "accept"; number of calls that failed 1-2 seconds after, etc. Due to
> the supervisory timers we run, I would expect this to top out at 10 seconds:
> having a single bucket to catch "> 10 seconds" is probably sufficient.
What about the reason code for this? And some way to filter this to the ICE logs?
> 6) Same as (5), except for calls that include an "accept" event and
> eventually *succeed*.
>
> Note that all of the preceding analyses should disregard "terminated"
> messages with a reason of "answered-elsewhere".
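A rough sketch of the bucketing in (5), extended to keep the reason code alongside each bucket, in Python. The input shape `(accept_ts, terminate_ts, reason)` and the sample reason strings are assumptions for illustration, not the real log schema. Delays are assumed to be non-negative, and a delay of exactly 10 seconds is placed in the overflow bucket.

```python
from collections import Counter


def failure_delay_histogram(failed_calls, max_seconds=10):
    """Count failed calls per (delay bucket, reason code).

    `failed_calls` is an iterable of (accept_ts, terminate_ts, reason)
    tuples with timestamps in seconds. This shape is hypothetical.
    """
    hist = Counter()
    for accept_ts, terminate_ts, reason in failed_calls:
        if reason == "answered-elsewhere":
            continue  # excluded from all analyses per the bug description
        delay = terminate_ts - accept_ts
        if delay >= max_seconds:
            label = f">={max_seconds}s"  # single overflow bucket
        else:
            lower = int(delay)  # one-second-wide buckets: 0-1s, 1-2s, ...
            label = f"{lower}-{lower + 1}s"
        hist[(label, reason)] += 1
    return hist
```

Summing the counts over all reasons for a bucket gives the plain bar graph from (5); keeping the reason in the key allows the same data to be split per reason code.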
Comment 2•9 years ago
Moving to a more actively triaged component.
Component: Operations: Metrics/Monitoring → Metrics: Pipeline
Priority: -- → P5
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Product: Cloud Services → Cloud Services Graveyard