Closed Bug 1188917 Opened 9 years ago Closed 8 years ago

Hello can get in an unresponsive state

Categories

(Hello (Loop) :: Client, defect, P3)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: abr, Unassigned)

Details

(Whiteboard: [quality][investigation])

We've had a report of Hello getting into a state where it refused to display a user's list of rooms (the user was not logged in to FxA). Attempts to create rooms also failed.

Throughout this series of failures, the console log showed no errors. No failure dialogs were displayed to the user.

The problem resolved on browser restart.

As a first step for approaching this bug, I would propose that we add detailed instrumentation to log success and failure steps when retrieving a user's list of rooms. To determine the potential impact of this issue, we should subsequently make sure that any detected errors are sent to the network (e.g., logged to the loop server or telemetry).
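To make the instrumentation proposal a bit more concrete, here is a minimal sketch of per-step logging. Every name in it (`StepLog`, `runInstrumented`, the step names) is hypothetical and not taken from the Hello/Loop code; the real steps would be asynchronous, and the sketch uses synchronous callbacks only to stay short.

```typescript
// Hypothetical sketch: none of these names come from the Hello/Loop code base.
type StepResult = { step: string; ok: boolean; detail: string; at: number };

class StepLog {
  readonly entries: StepResult[] = [];

  // Record the outcome of one named step.
  record(step: string, ok: boolean, detail: string = ""): void {
    this.entries.push({ step, ok, detail, at: Date.now() });
  }

  // The failed steps are what would later be sent to the loop server or telemetry.
  failures(): StepResult[] {
    return this.entries.filter((e) => !e.ok);
  }
}

// Run the named steps in order, logging every outcome and stopping at the first failure.
function runInstrumented(
  steps: Array<[string, () => void]>,
  log: StepLog,
): boolean {
  for (const [name, fn] of steps) {
    try {
      fn();
      log.record(name, true);
    } catch (err) {
      log.record(name, false, err instanceof Error ? err.message : String(err));
      return false;
    }
  }
  return true;
}

// Example: the second step fails, so the third is never attempted.
const roomListLog = new StepLog();
const roomListOk = runInstrumented(
  [
    ["registerWithLoopServer", () => {}],
    ["fetchRoomList", () => { throw new Error("request timed out"); }],
    ["renderRoomList", () => {}],
  ],
  roomListLog,
);
console.log(roomListOk, roomListLog.failures().map((f) => f.step));
```

The point of logging successes as well as failures is that a silent hang (no error, no result) shows up as a step that started the sequence but was never followed by the next one.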
I'd like to propose an alternative first step, or at least an early one: implement UX for what happens when a server is down.

Currently, we have two failure points:

- No connection to push server
- No connection to loop server

The first of these is currently most severe on startup: if we can't get a link to the push server, we retry using a back-off algorithm. Until that succeeds, we won't register with the loop server (iirc, we can't, as the push server is a required param).
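As an illustration of the startup dependency, here is a rough sketch of a capped exponential back-off gating registration. The exponential shape, the delay values, and all names are assumptions for illustration; the actual back-off used by the client may differ.

```typescript
// Hypothetical sketch: the delay values and names are assumptions, not the Loop client's.
function backoffDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 60000): number {
  // Double the delay after each failed attempt, capped at maxMs.
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

type StartupState = "waiting-for-push" | "registered";

// Simulate startup: keep retrying the push connection; registration with the
// loop server only happens once the push connection has succeeded.
function simulateStartup(
  pushSucceedsOnAttempt: number,
  maxAttempts: number,
): { delays: number[]; state: StartupState } {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (attempt === pushSucceedsOnAttempt) {
      return { delays, state: "registered" };
    }
    delays.push(backoffDelayMs(attempt));
  }
  // Still backing off: the client has not registered and the user sees nothing.
  return { delays, state: "waiting-for-push" };
}
```

With these assumed values, a push server that stays down leaves the client in `waiting-for-push` with retries eventually spaced a minute apart, which is the window in which the user currently gets no feedback.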

If the loop server is unavailable on startup, we'll hit the same issue as with the push server, because we won't be able to get the push server URL.

If the loop server is available on startup, but we lose the ability to talk to it later, we display "something went wrong" when the user tries to do things.

If the push server goes down, the user just won't know about it.

So in both of the startup cases, and in one of the already-running cases, we're displaying zero UX to the user, which is why I think we should prioritise that slightly over logging how often we're hitting this.
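The four cases described above can be written down as a small lookup. This is only a restatement of the current behaviour as described in this comment; the type and function names are made up for the sketch.

```typescript
// Hypothetical names; the behaviour encoded here is only what this comment describes.
type Phase = "startup" | "running";
type Server = "push" | "loop";

// What the user currently sees for each failure case; null means no UX at all.
function currentFeedback(phase: Phase, down: Server): string | null {
  if (phase === "running" && down === "loop") {
    // Shown only when the user tries to do something.
    return "something went wrong";
  }
  return null;
}

const allPhases: Phase[] = ["startup", "running"];
const allServers: Server[] = ["push", "loop"];

// List every phase/server combination where the user gets no indication.
function silentCases(): string[] {
  const out: string[] = [];
  for (const p of allPhases) {
    for (const s of allServers) {
      if (currentFeedback(p, s) === null) {
        out.push(`${p}/${s}`);
      }
    }
  }
  return out;
}

console.log(silentCases()); // the cases that need new UX
```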

I would reference the bug but I can't find it at the moment.
Need to break this down into smaller bugs across different areas, one being instrumentation.
Rank: 29
Flags: needinfo?(standard8)
Priority: -- → P2
Whiteboard: [quality][investigation]
Starting to break this down a bit:

- Bug 1194622 has implemented a loading indicator; this should at least let users know that there's something up.
- Bug 1203138 for giving the user an indication that the connection is down.

Adam: for reporting, I think we should use telemetry. If there are network issues, then it's more likely that the failure will still get reported via telemetry. However, I'm also a bit concerned about whether we can report this well enough to rule out things like the computer going to sleep and killing the connection (which would confuse the data).

Thoughts?
Depends on: 1194622, 1203138
Flags: needinfo?(standard8) → needinfo?(adam)
The issue with telemetry -- at least using the normal interface -- is that we're not going to get a coherent log like we do with the loop server.

We can define a different record type, like we do with ICE failure reports, which fixes that issue; however, we need someone to write the analysis job for it if we go that way. It's also going to require additional scrutiny from the data stewards, since it's a new kind of data collection rather than incremental tweaks to an existing data collection service.
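For the sake of discussion, a separate record type could look roughly like the sketch below: one self-contained report carrying the ordered sequence of events, so the analysis job sees a coherent log rather than isolated counters. The record shape, the `reportType` value, and the function names are all invented here; this is not an existing Telemetry or loop-server schema.

```typescript
// Hypothetical record shape; not an existing Telemetry or Loop schema.
interface ConnectionEvent {
  seq: number; // position in the sequence, so ordering survives serialisation
  name: string;
  ok: boolean;
}

interface ConnectionFailureReport {
  reportType: string;
  events: ConnectionEvent[];
}

// Bundle the ordered events into a single report.
function buildReport(events: Array<{ name: string; ok: boolean }>): ConnectionFailureReport {
  return {
    reportType: "loop-connection-failure",
    events: events.map((e, i) => ({ seq: i, name: e.name, ok: e.ok })),
  };
}

// Only submit a report when at least one step actually failed.
function shouldSubmit(report: ConnectionFailureReport): boolean {
  return report.events.some((e) => !e.ok);
}

const sampleReport = buildReport([
  { name: "pushConnect", ok: true },
  { name: "loopRegister", ok: true },
  { name: "fetchRoomList", ok: false },
]);
console.log(shouldSubmit(sampleReport), JSON.stringify(sampleReport));
```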
Flags: needinfo?(adam)
FYI, I think bug 1207300 should have fixed the main cause for this. Bug 1210501 is tracking ongoing work to improve the UX. We can leave this one open to decide on what other logging we want to end up with.
Depends on: 1207300, 1210501
Rank: 29 → 39
Priority: P2 → P3
Support for Hello/Loop has been discontinued.

https://support.mozilla.org/kb/hello-status

Hence closing the old bugs. Thank you for your support.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE