Last Comment Bug 804910 - handle worker load failures
: handle worker load failures
Status: RESOLVED FIXED
[Fx17][qa-]
:
Product: Firefox
Classification: Client Software
Component: SocialAPI (show other bugs)
: unspecified
: x86 Mac OS X
: -- normal (vote)
: Firefox 19
Assigned To: :Felipe Gomes (needinfo me!)
:
Mentors:
: 805203 (view as bug list)
Depends on: 788920
Blocks:
  Show dependency treegraph
 
Reported: 2012-10-23 21:36 PDT by Doug Turner (:dougt)
Modified: 2012-12-04 13:50 PST (History)
10 users (show)
felipc: in‑testsuite+
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed
fixed


Attachments
frameworker errors.patch (8.66 KB, patch)
2012-10-25 20:12 PDT, Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16]
felipc: feedback+
Details | Diff | Review
v2 (29.39 KB, patch)
2012-10-29 17:24 PDT, Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16]
no flags Details | Diff | Review
v2 (5.43 KB, patch)
2012-10-29 17:28 PDT, Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16]
markh: feedback+
Details | Diff | Review
v2 with tests (8.32 KB, patch)
2012-10-29 21:57 PDT, Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16]
felipc: feedback+
Details | Diff | Review
v2 with tests + small changes (8.17 KB, patch)
2012-10-29 23:15 PDT, :Felipe Gomes (needinfo me!)
felipc: review+
Details | Diff | Review
UI part (7.67 KB, patch)
2012-10-29 23:19 PDT, :Felipe Gomes (needinfo me!)
no flags Details | Diff | Review
UI part v2 (7.64 KB, patch)
2012-10-30 06:59 PDT, :Felipe Gomes (needinfo me!)
jaws: review+
Details | Diff | Review
Rollup patch for aurora/beta (15.57 KB, patch)
2012-10-30 14:32 PDT, :Felipe Gomes (needinfo me!)
felipc: review+
gavin.sharp: approval‑mozilla‑aurora+
gavin.sharp: approval‑mozilla‑beta+
Details | Diff | Review

Description Doug Turner (:dougt) 2012-10-23 21:36:30 PDT
Today the Facebook Social API has been down for all of us.  I initially thought that my Firefox build was just busted.  The sidebar just had about 1/3 of the content that I'd expect.  The Facebook button shows some text about "Not logged in".  I asked around internally a bit, and found out that, indeed, the facebook service was down for some reason.

What I am concerned about is the general user experience in this case is pretty bad.  My first thoughts were ugh... firefox is busted.

I am wondering if there is some way that we can detect this "service is down" and instead of showing some sort of partial ui, we could, instead, do something "better".
Comment 1 Mark Hammond [:markh] 2012-10-23 21:46:30 PDT
I was actually hoping bug 766616 would handle this case, but it doesn't seem to handle the worker itself not having appropriate content.  We should harden the code that landed in that bug to check the worker (a) returned 200 and (b) has an onconnect global after load.

(IIUC, the current outage is simply that the worker doesn't load - all other content does)
Comment 2 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-23 22:36:39 PDT
Yeah, this has been on my mind.  Part of the problem for us, as Mark said, is that content has not become unavailable, but rather we're receiving broken content.  Simple checks that the worker (if one is configured) passes messages is good.  We need to think about what kinds of processes we can add to the api that would make the system more resilient to failures.  I chatted briefly with Florian about this, and will be thinking more about additional things we can do.
Comment 3 :Felipe Gomes (needinfo me!) 2012-10-24 00:33:23 PDT
bug 788965 was also an idea for this that we've discussed in the past
Comment 4 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-10-25 11:33:24 PDT
*** Bug 805203 has been marked as a duplicate of this bug. ***
Comment 5 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-25 20:12:39 PDT
Created attachment 675422 [details] [diff] [review]
frameworker errors.patch

port and frameworker portion of error handling, with notifications on error.  frontend needs to handle the notifications and decide what to do with that.
Comment 6 Mark Hammond [:markh] 2012-10-25 20:45:37 PDT
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

driveby

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", "noScript");
> +        return;
> +      }
> +

As an additional (or alternative) check, how about checking there is an onconnect callable in the sandbox after the code is evaluated?

@@ +217,5 @@
>        } catch (e) {
>          Cu.reportError("FrameWorker: Error evaluating worker script for " + worker.name + ": " + e + "; " +
>              (e.lineNumber ? ("Line #" + e.lineNumber) : "") +
>              (e.stack ? ("\n" + e.stack) : ""));
> +        Services.obs.notifyObservers(null, "frameworkerError", null);

It might turn out to be useful (to tests at least) to always notify observers when a frameworker loads (or fails to load), and possibly include the origin of the worker?

@@ +329,5 @@
> +    if (failure && aStatus != Components.results.NS_BINDING_ABORTED) {
> +      aRequest.cancel(Components.results.NS_BINDING_ABORTED);
> +      this.setErrorMessage(aWebProgress);
> +    }
> +  },

What will this pick up that checking for onconnect() would not?

@@ +348,5 @@
> +  onSecurityChange: function SPL_onSecurityChange() {},
> +
> +  setErrorMessage: function(aWebProgress) {
> +    Services.obs.notifyObservers(null, "frameworkerError", "networkError");
> +  }

param unused?

@@ +408,5 @@
>        // No "port-create" here - client ports are created explicitly.
> +      case "port-error":
> +        // onconnect failed, we cannot connect the port, the worker has
> +        // become invalid
> +        Services.obs.notifyObservers(null, "frameworkerError", "onconnect failure");

Maybe "port-connection-error" (as it isn't also covering the "simple" case of an onmessage handler failing for only 1 single message, for example)
Comment 7 :Felipe Gomes (needinfo me!) 2012-10-26 23:59:45 PDT
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, let me see what I can do with it. As Mark mentioned, I'm still not convinced that the network error listeners would catch something that the try { onconnect() } catch (e) { ... } wouldn't
Comment 8 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-29 15:19:13 PDT
Ok, I did a little more testing, we do not need to have the webprogresslistener so long as we don't care what kind of error causes us problems.  

Without webprogresslistener we know:

- body is empty
- Cu.evalInSandbox fails
- onconnect fails

With webprogresslistener we could know the http error codes which might indicate to us a way to react.  e.g. if we get a 404, should we simply disable?

I'd be fine with ignoring webprogresslistener for now, but we should think a little for 19/20 whether we want to enhance the error handling.
Comment 9 Mark Hammond [:markh] 2012-10-29 15:22:43 PDT
(In reply to Shane Caraveo (:mixedpuppy) from comment #8)
> I'd be fine with ignoring webprogresslistener for now, but we should think a
> little for 19/20 whether we want to enhance the error handling.

I think that makes perfect sense - if we do end up with a recovery strategy that depends on the specific http errors then that would seem a good time to capture them.
Comment 10 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-29 17:24:05 PDT
Created attachment 676403 [details] [diff] [review]
v2
Comment 11 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-29 17:25:21 PDT
Comment on attachment 676403 [details] [diff] [review]
v2

ugh.  wrong patch
Comment 12 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-29 17:28:06 PDT
Created attachment 676406 [details] [diff] [review]
v2
Comment 13 Mark Hammond [:markh] 2012-10-29 17:37:45 PDT
Comment on attachment 676406 [details] [diff] [review]
v2

Review of attachment 676406 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!  Just one minor comment

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", worker.url);
> +        return;
> +      }
> +

I'm still inclined to think this is a premature optimization - that we could just check for onconnect after evaling the text - or even just skip both and allow the onconnect exception handler to fire in both cases - we should hit the problem rarely and having one clear error path sounds like a win.  That said though, this should work fine for that case, so I'll leave that to the judgement of others.
Comment 14 Shane Caraveo (:mixedpuppy) [on leave 5/16-7/16] 2012-10-29 21:57:58 PDT
Created attachment 676471 [details] [diff] [review]
v2 with tests
Comment 15 :Felipe Gomes (needinfo me!) 2012-10-29 23:15:05 PDT
Created attachment 676493 [details] [diff] [review]
v2 with tests + small changes

This is Shane's v2 patch with:
 - removed XPCOMUtils import
 - frameworkerError -> social:frameworker-error
 - changed "load" listener to DOMContentLoaded
Comment 16 :Felipe Gomes (needinfo me!) 2012-10-29 23:19:40 PDT
Created attachment 676495 [details] [diff] [review]
UI part

And this is the UI part that listens for the notification and displays the error in the sidebar. The reset basically does:
  Social.enabled = false
  Social.enabled = true

We could get much more fancy with the worker/UI reload but this is a very simple method that get us a long way into error recovering

The patch is a bit longer than it should because the sidebar handling got a bit complicated with the automatic unloading. If we set the sidebar to display the error page, then the user hides it, and it gets unloaded after 10sec.. we still need to properly show the error if the sidebar is unhidden (loaded) again
Comment 17 :Felipe Gomes (needinfo me!) 2012-10-29 23:23:55 PDT
Sent to try at: https://tbpl.mozilla.org/?tree=Try&rev=d212ca145101
Comment 18 :Felipe Gomes (needinfo me!) 2012-10-29 23:33:16 PDT
> -    if (failure) {
> +    if (failure && !Social.errorState == "frameworker-error") {

should be

> +    if (failure && Social.errorState != "frameworker-error") {
Comment 19 :Felipe Gomes (needinfo me!) 2012-10-29 23:42:40 PDT
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

Original patch was Shane's, I just updated it with a small name tweak that was necessary for the UI part.
Comment 20 :Felipe Gomes (needinfo me!) 2012-10-30 05:15:33 PDT
On the try run, the UI part seems to have caused a leak on the test browser_social_sidebar.js, https://tbpl.mozilla.org/?tree=Try&rev=d212ca145101
I've been looking at it for a while but I haven't yet figured out what is causing it, as I can't reproduce the problem on my machine

I'm imagining it's something similar to bug 803255 but it's a guess so far

I triggered a new try run to see if I can get more data: https://tbpl.mozilla.org/?tree=Try&rev=dd7e4bc6a9cb

Shane's patch alone does not trigger that leak (tried in this run: https://tbpl.mozilla.org/?tree=Try&rev=2f9de0b80f16), so I imagine it's either caused by setting the error page to the sidebar when it's invisible, or the errorState variable needs to get cleared somehow during the test
Comment 21 :Felipe Gomes (needinfo me!) 2012-10-30 06:54:22 PDT
Yeah, I've confirmed that this is bug 803255 striking again. And it only happens when
browser_social_sidebar.js and browser_frameworker.js runs together. I've made a small change to the UI patch that was not strictly necessary but help avoid the bug: it doesn't try to set the error UI URL if there is no provider, and browser_frameworker.js runs without a provider, and with that we avoid bug 803255.

posted to try:
https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c
Comment 22 :Felipe Gomes (needinfo me!) 2012-10-30 06:59:26 PDT
Created attachment 676580 [details] [diff] [review]
UI part v2

The change is wrapping the frameworker-error observer in `if (Social.provider)`. Also included the fix from comment 18.

I won't be around the morning to land this so if it gets the review on time for landing in central and beta during the day I appreciate if someone could land it
Comment 23 :Felipe Gomes (needinfo me!) 2012-10-30 07:05:58 PDT
> posted to try:
> https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c

fwiw both patches needs to be pushed together. Shane's patch doesn't show up in tbpl here as it was already present on try as part of a different head
Comment 24 (Away 6/25-7/4) Jared Wein [:jaws] (reviews and needinfo disabled until back) 2012-10-30 08:58:45 PDT
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/social/FrameWorker.jsm
@@ -208,1 @@
>          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);

the declaration of scriptText was removed in this patch. how is this going to work?
Comment 25 (Away 6/25-7/4) Jared Wein [:jaws] (reviews and needinfo disabled until back) 2012-10-30 08:58:58 PDT
Comment on attachment 676580 [details] [diff] [review]
UI part v2

Review of attachment 676580 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/browser-social.js
@@ +1042,5 @@
> +
> +      case "frameworker-error":
> +        sbrowser.setAttribute("src", "about:socialerror?mode=workerFailure");
> +        break;
> +    }

nit: add a default branch which uses Cu.reportError for unsupported types. this can be done in a follow-up though.
Comment 26 :Felipe Gomes (needinfo me!) 2012-10-30 12:40:03 PDT
(In reply to Jared Wein [:jaws] from comment #24)
> Comment on attachment 676493 [details] [diff] [review]
> v2 with tests + small changes
> 
> Review of attachment 676493 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/social/FrameWorker.jsm
> @@ -208,1 @@
> >          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);
> 
> the declaration of scriptText was removed in this patch. how is this going
> to work?

 It wasn't removed, it was moved to earlier in the function
Comment 28 (Away 6/25-7/4) Jared Wein [:jaws] (reviews and needinfo disabled until back) 2012-10-30 13:46:11 PDT
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Sorry, I didn't see that |scriptText| got moved higher up in the script.
Comment 29 :Felipe Gomes (needinfo me!) 2012-10-30 14:32:27 PDT
Created attachment 676778 [details] [diff] [review]
Rollup patch for aurora/beta

The two patches rolled up together for the branches. Green on try last night and going green on inbound

[Approval Request Comment]
Bug caused by (feature/regressing bug #): social api
User impact if declined: no error recovery from failure to load worker
Testing completed (on m-c, etc.): try and inbound
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch: none
Comment 30 :Gavin Sharp [email: gavin@gavinsharp.com] 2012-10-30 18:35:33 PDT
Comment on attachment 676778 [details] [diff] [review]
Rollup patch for aurora/beta

We discussed this in triage yesterday - risky to be taking this much code churn on beta, but the alternative of not handling this error case seems worse. a=me for Firefox 17/18.
Comment 34 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2012-12-04 13:50:56 PST
Please remove [qa-] whiteboard tag and add verifyme keyword if there's some QA testing needed here. Otherwise we will skip verification.

Note You need to log in before you can comment on or make changes to this bug.