Closed Bug 804910 Opened 12 years ago Closed 12 years ago

handle worker load failures

Categories

(Firefox Graveyard :: SocialAPI, defect)

x86
macOS
defect
Not set
normal

Tracking

(firefox17 fixed, firefox18 fixed)

RESOLVED FIXED
Firefox 19
Tracking Status
firefox17 --- fixed
firefox18 --- fixed

People

(Reporter: dougt, Assigned: Felipe)

References

Details

(Whiteboard: [Fx17][qa-])

Attachments

(4 files, 4 obsolete files)

Today the Facebook Social API has been down for all of us.  I initially thought that my Firefox build was just busted.  The sidebar just had about 1/3 of the content that I'd expect.  The Facebook button shows some text about "Not logged in".  I asked around internally a bit, and found out that, indeed, the facebook service was down for some reason.

What I am concerned about is the general user experience in this case is pretty bad.  My first thoughts were ugh... firefox is busted.

I am wondering if there is some way that we can detect this "service is down" and instead of showing some sort of partial ui, we could, instead, do something "better".
I was actually hoping bug 766616 would handle this case, but it doesn't seem to handle the worker itself not having appropriate content.  We should harden the code that landed in that bug to check the worker (a) returned 200 and (b) has an onconnect global after load.

(IIUC, the current outage is simply that the worker doesn't load - all other content does)
Yeah, this has been on my mind.  Part of the problem for us, as Mark said, is that content has not become unavailable, but rather we're receiving broken content.  Simple checks that the worker (if one is configured) passes messages is good.  We need to think about what kinds of processes we can add to the api that would make the system more resilient to failures.  I chatted briefly with Florian about this, and will be thinking more about additional things we can do.
bug 788965 was also an idea for this that we've discussed in the past
Summary: Service downtime UI → handle worker load failures
Attached patch frameworker errors.patch (obsolete) — Splinter Review
port and frameworker portion of error handling, with notifications on error.  frontend needs to handle the notifications and decide what to do with that.
Attachment #675422 - Flags: feedback?(felipc)
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

driveby

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", "noScript");
> +        return;
> +      }
> +

As an additional (or alternative) check, how about checking there is an onconnect callable in the sandbox after the code is evaluated?

@@ +217,5 @@
>        } catch (e) {
>          Cu.reportError("FrameWorker: Error evaluating worker script for " + worker.name + ": " + e + "; " +
>              (e.lineNumber ? ("Line #" + e.lineNumber) : "") +
>              (e.stack ? ("\n" + e.stack) : ""));
> +        Services.obs.notifyObservers(null, "frameworkerError", null);

It might turn out to be useful (to tests at least) to always notify observers when a frameworker loads (or fails to load), and possibly include the origin of the worker?

@@ +329,5 @@
> +    if (failure && aStatus != Components.results.NS_BINDING_ABORTED) {
> +      aRequest.cancel(Components.results.NS_BINDING_ABORTED);
> +      this.setErrorMessage(aWebProgress);
> +    }
> +  },

What will this pick up that checking for onconnect() would not?

@@ +348,5 @@
> +  onSecurityChange: function SPL_onSecurityChange() {},
> +
> +  setErrorMessage: function(aWebProgress) {
> +    Services.obs.notifyObservers(null, "frameworkerError", "networkError");
> +  }

param unused?

@@ +408,5 @@
>        // No "port-create" here - client ports are created explicitly.
> +      case "port-error":
> +        // onconnect failed, we cannot connect the port, the worker has
> +        // become invalid
> +        Services.obs.notifyObservers(null, "frameworkerError", "onconnect failure");

Maybe "port-connection-error" (as it isn't also covering the "simple" case of an onmessage handler failing for only 1 single message, for example)
Whiteboard: [Fx17]
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, let me see what I can do with it. As Mark mentioned, I'm still not convinced that the network error listeners would catch something that the try { onconnect() } catch (e) { ... } wouldn't
Attachment #675422 - Flags: feedback?(felipc) → feedback+
Ok, I did a little more testing, we do not need to have the webprogresslistener so long as we don't care what kind of error causes us problems.  

Without webprogresslistener we know:

- body is empty
- Cu.evalInSandbox fails
- onconnect fails

With webprogresslistener we could know the http error codes which might indicate to us a way to react.  e.g. if we get a 404, should we simply disable?

I'd be fine with ignoring webprogresslistener for now, but we should think a little for 19/20 whether we want to enhance the error handling.
(In reply to Shane Caraveo (:mixedpuppy) from comment #8)
> I'd be fine with ignoring webprogresslistener for now, but we should think a
> little for 19/20 whether we want to enhance the error handling.

I think that makes perfect sense - if we do end up with a recovery strategy that depends on the specific http errors then that would seem a good time to capture them.
Attached patch v2 (obsolete) — Splinter Review
Attachment #675422 - Attachment is obsolete: true
Attachment #676403 - Flags: feedback?(mhammond)
Attachment #676403 - Flags: feedback?(felipc)
Comment on attachment 676403 [details] [diff] [review]
v2

ugh.  wrong patch
Attachment #676403 - Flags: feedback?(mhammond)
Attachment #676403 - Flags: feedback?(felipc)
Attachment #676403 - Attachment is obsolete: true
Attached patch v2 (obsolete) — Splinter Review
Attachment #676406 - Flags: feedback?(mhammond)
Attachment #676406 - Flags: feedback?(felipc)
Comment on attachment 676406 [details] [diff] [review]
v2

Review of attachment 676406 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!  Just one minor comment

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", worker.url);
> +        return;
> +      }
> +

I'm still inclined to think this is a premature optimization - that we could just check for onconnect after evaling the text - or even just skip both and allow the onconnect exception handler to fire in both cases - we should hit the problem rarely and having one clear error path sounds like a win.  That said though, this should work fine for that case, so I'll leave that to the judgement of others.
Attachment #676406 - Flags: feedback?(mhammond) → feedback+
Attached patch v2 with testsSplinter Review
Attachment #676406 - Attachment is obsolete: true
Attachment #676406 - Flags: feedback?(felipc)
Attachment #676471 - Flags: feedback?(felipc)
Attachment #676471 - Flags: feedback?(felipc) → feedback+
This is Shane's v2 patch with:
 - removed XPCOMUtils import
 - frameworkerError -> social:frameworker-error
 - changed "load" listener to DOMContentLoaded
Attachment #676493 - Attachment is patch: true
Attached patch UI part (obsolete) — Splinter Review
And this is the UI part that listens for the notification and displays the error in the sidebar. The reset basically does:
  Social.enabled = false
  Social.enabled = true

We could get much more fancy with the worker/UI reload but this is a very simple method that get us a long way into error recovering

The patch is a bit longer than it should because the sidebar handling got a bit complicated with the automatic unloading. If we set the sidebar to display the error page, then the user hides it, and it gets unloaded after 10sec.. we still need to properly show the error if the sidebar is unhidden (loaded) again
> -    if (failure) {
> +    if (failure && !Social.errorState == "frameworker-error") {

should be

> +    if (failure && Social.errorState != "frameworker-error") {
Attachment #676495 - Flags: review?(jaws)
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

Original patch was Shane's, I just updated it with a small name tweak that was necessary for the UI part.
Attachment #676493 - Flags: review+
On the try run, the UI part seems to have caused a leak on the test browser_social_sidebar.js, https://tbpl.mozilla.org/?tree=Try&rev=d212ca145101
I've been looking at it for a while but I haven't yet figured out what is causing it, as I can't reproduce the problem on my machine

I'm imagining it's something similar to bug 803255 but it's a guess so far

I triggered a new try run to see if I can get more data: https://tbpl.mozilla.org/?tree=Try&rev=dd7e4bc6a9cb

Shane's patch alone does not trigger that leak (tried in this run: https://tbpl.mozilla.org/?tree=Try&rev=2f9de0b80f16), so I imagine it's either caused by setting the error page to the sidebar when it's invisible, or the errorState variable needs to get cleared somehow during the test
Yeah, I've confirmed that this is bug 803255 striking again. And it only happens when
browser_social_sidebar.js and browser_frameworker.js runs together. I've made a small change to the UI patch that was not strictly necessary but help avoid the bug: it doesn't try to set the error UI URL if there is no provider, and browser_frameworker.js runs without a provider, and with that we avoid bug 803255.

posted to try:
https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c
Attached patch UI part v2Splinter Review
The change is wrapping the frameworker-error observer in `if (Social.provider)`. Also included the fix from comment 18.

I won't be around the morning to land this so if it gets the review on time for landing in central and beta during the day I appreciate if someone could land it
Attachment #676495 - Attachment is obsolete: true
Attachment #676495 - Flags: review?(jaws)
Attachment #676580 - Flags: review?(jaws)
Attachment #676580 - Flags: review?(gavin.sharp)
> posted to try:
> https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c

fwiw both patches needs to be pushed together. Shane's patch doesn't show up in tbpl here as it was already present on try as part of a different head
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/social/FrameWorker.jsm
@@ -208,1 @@
>          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);

the declaration of scriptText was removed in this patch. how is this going to work?
Attachment #676493 - Flags: review-
Comment on attachment 676580 [details] [diff] [review]
UI part v2

Review of attachment 676580 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/browser-social.js
@@ +1042,5 @@
> +
> +      case "frameworker-error":
> +        sbrowser.setAttribute("src", "about:socialerror?mode=workerFailure");
> +        break;
> +    }

nit: add a default branch which uses Cu.reportError for unsupported types. this can be done in a follow-up though.
Attachment #676580 - Flags: review?(jaws) → review+
(In reply to Jared Wein [:jaws] from comment #24)
> Comment on attachment 676493 [details] [diff] [review]
> v2 with tests + small changes
> 
> Review of attachment 676493 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/social/FrameWorker.jsm
> @@ -208,1 @@
> >          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);
> 
> the declaration of scriptText was removed in this patch. how is this going
> to work?

 It wasn't removed, it was moved to earlier in the function
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Sorry, I didn't see that |scriptText| got moved higher up in the script.
Attachment #676493 - Flags: review-
Attachment #676580 - Flags: review?(gavin.sharp)
The two patches rolled up together for the branches. Green on try last night and going green on inbound

[Approval Request Comment]
Bug caused by (feature/regressing bug #): social api
User impact if declined: no error recovery from failure to load worker
Testing completed (on m-c, etc.): try and inbound
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch: none
Attachment #676778 - Flags: review+
Attachment #676778 - Flags: approval-mozilla-beta?
Attachment #676778 - Flags: approval-mozilla-aurora?
Comment on attachment 676778 [details] [diff] [review]
Rollup patch for aurora/beta

We discussed this in triage yesterday - risky to be taking this much code churn on beta, but the alternative of not handling this error case seems worse. a=me for Firefox 17/18.
Attachment #676778 - Flags: approval-mozilla-beta?
Attachment #676778 - Flags: approval-mozilla-beta+
Attachment #676778 - Flags: approval-mozilla-aurora?
Attachment #676778 - Flags: approval-mozilla-aurora+
Flags: in-testsuite+
Assignee: nobody → felipc
https://hg.mozilla.org/mozilla-central/rev/9104ce0e7396
https://hg.mozilla.org/mozilla-central/rev/08f97e0fa9da
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 19
Please remove [qa-] whiteboard tag and add verifyme keyword if there's some QA testing needed here. Otherwise we will skip verification.
Whiteboard: [Fx17] → [Fx17][qa-]
Product: Firefox → Firefox Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: