Closed Bug 788965 Opened 13 years ago Closed 13 years ago

Add message pair to check provider health

Categories

(Firefox Graveyard :: SocialAPI, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: markh, Unassigned)

Details

Felipe and I were chatting on IRC about the error handling and recovery code, and specifically about worker errors happening *after* the worker has successfully loaded. It is almost impossible for us to detect that case programmatically. We felt it might be worthwhile to add a message specifically for this case: Firefox would periodically send a "health-check-request" message (but with a better name) to the worker, and if the worker fails to respond with an "ok" message we consider it to be in error and reload the entire provider.

This would also be valuable for the content, but (a) it's a little harder for us to do, as we don't currently send messages to the sidebar etc., and (b) it's probably less important there, as errors are more likely to be obvious because the content is visible. Plus, we could encourage providers to implement the "health-check-request" message internally by having the worker check its own content (it will already have ports established between the worker and the content), so a well-written provider could effectively check the health of the worker *and* the content all in one go.
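As a rough illustration of the provider side of this idea, here is a minimal sketch of a worker handler that replies "ok" only when its content checks also pass. The `Message` shape, `ContentProbe` type, and `makeHealthHandler` function are hypothetical names made up for this example and are not part of the SocialAPI; only the two topic strings come from the proposal above.

```typescript
// Hypothetical message shape; only the topic strings come from the proposal.
interface Message {
  topic: string;
  data?: unknown;
}

// Hypothetical stand-in for "ask a content page, over its existing port,
// whether it is still working". Returns true when the content looks fine.
type ContentProbe = () => boolean;

// Build a worker-side handler that answers a health check only when every
// registered content probe reports success.
function makeHealthHandler(
  probes: ContentProbe[]
): (msg: Message) => Message | null {
  return (msg: Message): Message | null => {
    if (msg.topic !== "health-check-request") {
      return null; // not a health check; some other handler deals with it
    }
    for (const probe of probes) {
      try {
        if (!probe()) {
          return null; // content reported a problem: send no reply
        }
      } catch (_err) {
        return null; // a probe that throws also counts as unhealthy
      }
    }
    return { topic: "ok" }; // worker and content both look healthy
  };
}
```

Staying silent on failure, rather than sending an explicit error message, matches the proposal: the browser only has to watch for a missing "ok".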
Could you describe what error cases this would help us resolve?
It should help resolve any error which happens after initialization and which causes the worker to become non-responsive. This could be as simple as a logic error which causes internal state to get screwed up such that attempting to respond to any message causes an exception. The only other alternative seems to be to assume that once a worker is loaded it can *never* get into an error state, which seems somewhat optimistic.
I also meant to add - it would also help cover the case where a logic error or problem caused the worker to respond very slowly (eg, a "leak" causing, say, some internal array to get huge) - so if the worker is alive but is taking 10 seconds to respond to a message, we can also declare it unhealthy. This might end up being deemed unnecessary, which is fine, but we thought it worth capturing while discussing error detection and recovery scenarios.
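To make the "alive but too slow" case concrete, here is a minimal sketch of how the browser side might classify a single health check. The `PingRecord` type, the `classifyPing` function, and the 5000 ms timeout used in the usage lines are assumptions chosen for illustration; nothing in this bug specifies an actual threshold.

```typescript
type PingState = "healthy" | "unhealthy" | "pending";

// Hypothetical record of one health-check round trip.
interface PingRecord {
  sentAtMs: number;           // when the request was sent
  repliedAtMs: number | null; // when the reply arrived, or null if none yet
}

// Classify one ping. A reply that arrives after the timeout counts as
// unhealthy, just like no reply at all once the timeout has elapsed.
function classifyPing(
  ping: PingRecord,
  nowMs: number,
  timeoutMs: number
): PingState {
  if (ping.repliedAtMs !== null) {
    const elapsed = ping.repliedAtMs - ping.sentAtMs;
    return elapsed <= timeoutMs ? "healthy" : "unhealthy";
  }
  // No reply yet: still waiting unless the timeout has already passed.
  return nowMs - ping.sentAtMs > timeoutMs ? "unhealthy" : "pending";
}

// Usage with an assumed 5000 ms timeout:
const quick = classifyPing({ sentAtMs: 0, repliedAtMs: 200 }, 300, 5000);
const slow = classifyPing({ sentAtMs: 0, repliedAtMs: 10000 }, 10000, 5000);
```

Under these assumptions `quick` is "healthy", while `slow`, the 10-second reply from the comment above, is "unhealthy" even though the worker did eventually answer.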
Perhaps "social.is-alive" and "social.is-alive-response" for the message names?
If I understand, the issues that we're trying to resolve are:
- non-responsive worker (bug 756588)
- memory leaks
- logic errors resulting in failure to respond

Outside of those, I'm not certain that a pingback will tell us much about the health of a worker. I'm not saying I'm against a pingback; I'm just having trouble coming up with situations where the pingback would indicate an issue outside the above, and with whether it would be effective at detecting the above.

For the non-responsive worker situation, there was a fix (bug 771977) to stop workers in the hidden window, but is there a way to detect that the script was stopped? I think that would be better than a pingback. For memory leaks, I wonder if the APIs used by about:memory would be a more reliable mechanism to track whether it is the worker leaking or something else. Not sure that is even possible.

I guess the problem I really have is, what does it really mean if there is not a response, and how do we know that doing anything would resolve that problem?
Let's reconsider this when we revisit the general topic.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Product: Firefox → Firefox Graveyard