The default bug view has changed. See this FAQ.

handle worker load failures

RESOLVED FIXED in Firefox 17

Status

()

Firefox
SocialAPI
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: dougt, Assigned: Felipe)

Tracking

unspecified
Firefox 19
x86
Mac OS X
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(firefox17 fixed, firefox18 fixed)

Details

(Whiteboard: [Fx17][qa-])

Attachments

(4 attachments, 4 obsolete attachments)

(Reporter)

Description

5 years ago
Today the Facebook Social API has been down for all of us.  I initially thought that my Firefox build was just busted.  The sidebar just had about 1/3 of the content that I'd expect.  The Facebook button shows some text about "Not logged in".  I asked around internally a bit, and found out that, indeed, the facebook service was down for some reason.

What I am concerned about is the general user experience in this case is pretty bad.  My first thoughts were ugh... firefox is busted.

I am wondering if there is some way that we can detect this "service is down" and instead of showing some sort of partial ui, we could, instead, do something "better".
I was actually hoping bug 766616 would handle this case, but it doesn't seem to handle the worker itself not having appropriate content.  We should harden the code that landed in that bug to check the worker (a) returned 200 and (b) has an onconnect global after load.

(IIUC, the current outage is simply that the worker doesn't load - all other content does)
Yeah, this has been on my mind.  Part of the problem for us, as Mark said, is that content has not become unavailable, but rather we're receiving broken content.  Simple checks that the worker (if one is configured) passes messages is good.  We need to think about what kinds of processes we can add to the api that would make the system more resilient to failures.  I chatted briefly with Florian about this, and will be thinking more about additional things we can do.
(Assignee)

Comment 3

5 years ago
bug 788965 was also an idea for this that we've discussed in the past
Duplicate of this bug: 805203
Summary: Service downtime UI → handle worker load failures
Created attachment 675422 [details] [diff] [review]
frameworker errors.patch

port and frameworker portion of error handling, with notifications on error.  frontend needs to handle the notifications and decide what to do with that.
Attachment #675422 - Flags: feedback?(felipc)
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

driveby

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", "noScript");
> +        return;
> +      }
> +

As an additional (or alternative) check, how about checking there is an onconnect callable in the sandbox after the code is evaluated?

@@ +217,5 @@
>        } catch (e) {
>          Cu.reportError("FrameWorker: Error evaluating worker script for " + worker.name + ": " + e + "; " +
>              (e.lineNumber ? ("Line #" + e.lineNumber) : "") +
>              (e.stack ? ("\n" + e.stack) : ""));
> +        Services.obs.notifyObservers(null, "frameworkerError", null);

It might turn out to be useful (to tests at least) to always notify observers when a frameworker loads (or fails to load), and possibly include the origin of the worker?

@@ +329,5 @@
> +    if (failure && aStatus != Components.results.NS_BINDING_ABORTED) {
> +      aRequest.cancel(Components.results.NS_BINDING_ABORTED);
> +      this.setErrorMessage(aWebProgress);
> +    }
> +  },

What will this pick up that checking for onconnect() would not?

@@ +348,5 @@
> +  onSecurityChange: function SPL_onSecurityChange() {},
> +
> +  setErrorMessage: function(aWebProgress) {
> +    Services.obs.notifyObservers(null, "frameworkerError", "networkError");
> +  }

param unused?

@@ +408,5 @@
>        // No "port-create" here - client ports are created explicitly.
> +      case "port-error":
> +        // onconnect failed, we cannot connect the port, the worker has
> +        // become invalid
> +        Services.obs.notifyObservers(null, "frameworkerError", "onconnect failure");

Maybe "port-connection-error" (as it isn't also covering the "simple" case of an onmessage handler failing for only 1 single message, for example)
Depends on: 788920
Whiteboard: [Fx17]
(Assignee)

Comment 7

5 years ago
Comment on attachment 675422 [details] [diff] [review]
frameworker errors.patch

Review of attachment 675422 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, let me see what I can do with it. As Mark mentioned, I'm still not convinced that the network error listeners would catch something that the try { onconnect() } catch (e) { ... } wouldn't
Attachment #675422 - Flags: feedback?(felipc) → feedback+
Ok, I did a little more testing, we do not need to have the webprogresslistener so long as we don't care what kind of error causes us problems.  

Without webprogresslistener we know:

- body is empty
- Cu.evalInSandbox fails
- onconnect fails

With webprogresslistener we could know the http error codes which might indicate to us a way to react.  e.g. if we get a 404, should we simply disable?

I'd be fine with ignoring webprogresslistener for now, but we should think a little for 19/20 whether we want to enhance the error handling.
(In reply to Shane Caraveo (:mixedpuppy) from comment #8)
> I'd be fine with ignoring webprogresslistener for now, but we should think a
> little for 19/20 whether we want to enhance the error handling.

I think that makes perfect sense - if we do end up with a recovery strategy that depends on the specific http errors then that would seem a good time to capture them.
Created attachment 676403 [details] [diff] [review]
v2
Attachment #675422 - Attachment is obsolete: true
Attachment #676403 - Flags: feedback?(mhammond)
Attachment #676403 - Flags: feedback?(felipc)
Comment on attachment 676403 [details] [diff] [review]
v2

ugh.  wrong patch
Attachment #676403 - Flags: feedback?(mhammond)
Attachment #676403 - Flags: feedback?(felipc)
Attachment #676403 - Attachment is obsolete: true
Created attachment 676406 [details] [diff] [review]
v2
Attachment #676406 - Flags: feedback?(mhammond)
Attachment #676406 - Flags: feedback?(felipc)
Comment on attachment 676406 [details] [diff] [review]
v2

Review of attachment 676406 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!  Just one minor comment

::: toolkit/components/social/FrameWorker.jsm
@@ +191,5 @@
> +      if (!scriptText) {
> +        Services.obs.notifyObservers(null, "frameworkerError", worker.url);
> +        return;
> +      }
> +

I'm still inclined to think this is a premature optimization - that we could just check for onconnect after evaling the text - or even just skip both and allow the onconnect exception handler to fire in both cases - we should hit the problem rarely and having one clear error path sounds like a win.  That said though, this should work fine for that case, so I'll leave that to the judgement of others.
Attachment #676406 - Flags: feedback?(mhammond) → feedback+
Created attachment 676471 [details] [diff] [review]
v2 with tests
Attachment #676406 - Attachment is obsolete: true
Attachment #676406 - Flags: feedback?(felipc)
Attachment #676471 - Flags: feedback?(felipc)
(Assignee)

Updated

5 years ago
Attachment #676471 - Flags: feedback?(felipc) → feedback+
(Assignee)

Comment 15

5 years ago
Created attachment 676493 [details] [diff] [review]
v2 with tests + small changes

This is Shane's v2 patch with:
 - removed XPCOMUtils import
 - frameworkerError -> social:frameworker-error
 - changed "load" listener to DOMContentLoaded
(Assignee)

Updated

5 years ago
Attachment #676493 - Attachment is patch: true
(Assignee)

Comment 16

5 years ago
Created attachment 676495 [details] [diff] [review]
UI part

And this is the UI part that listens for the notification and displays the error in the sidebar. The reset basically does:
  Social.enabled = false
  Social.enabled = true

We could get much more fancy with the worker/UI reload but this is a very simple method that get us a long way into error recovering

The patch is a bit longer than it should because the sidebar handling got a bit complicated with the automatic unloading. If we set the sidebar to display the error page, then the user hides it, and it gets unloaded after 10sec.. we still need to properly show the error if the sidebar is unhidden (loaded) again
(Assignee)

Comment 17

5 years ago
Sent to try at: https://tbpl.mozilla.org/?tree=Try&rev=d212ca145101
(Assignee)

Comment 18

5 years ago
> -    if (failure) {
> +    if (failure && !Social.errorState == "frameworker-error") {

should be

> +    if (failure && Social.errorState != "frameworker-error") {
(Assignee)

Updated

5 years ago
Attachment #676495 - Flags: review?(jaws)
(Assignee)

Comment 19

5 years ago
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

Original patch was Shane's, I just updated it with a small name tweak that was necessary for the UI part.
Attachment #676493 - Flags: review+
(Assignee)

Comment 20

5 years ago
On the try run, the UI part seems to have caused a leak on the test browser_social_sidebar.js, https://tbpl.mozilla.org/?tree=Try&rev=d212ca145101
I've been looking at it for a while but I haven't yet figured out what is causing it, as I can't reproduce the problem on my machine

I'm imagining it's something similar to bug 803255 but it's a guess so far

I triggered a new try run to see if I can get more data: https://tbpl.mozilla.org/?tree=Try&rev=dd7e4bc6a9cb

Shane's patch alone does not trigger that leak (tried in this run: https://tbpl.mozilla.org/?tree=Try&rev=2f9de0b80f16), so I imagine it's either caused by setting the error page to the sidebar when it's invisible, or the errorState variable needs to get cleared somehow during the test
(Assignee)

Comment 21

5 years ago
Yeah, I've confirmed that this is bug 803255 striking again. And it only happens when
browser_social_sidebar.js and browser_frameworker.js runs together. I've made a small change to the UI patch that was not strictly necessary but help avoid the bug: it doesn't try to set the error UI URL if there is no provider, and browser_frameworker.js runs without a provider, and with that we avoid bug 803255.

posted to try:
https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c
(Assignee)

Comment 22

5 years ago
Created attachment 676580 [details] [diff] [review]
UI part v2

The change is wrapping the frameworker-error observer in `if (Social.provider)`. Also included the fix from comment 18.

I won't be around the morning to land this so if it gets the review on time for landing in central and beta during the day I appreciate if someone could land it
Attachment #676495 - Attachment is obsolete: true
Attachment #676495 - Flags: review?(jaws)
Attachment #676580 - Flags: review?(jaws)
(Assignee)

Updated

5 years ago
Attachment #676580 - Flags: review?(gavin.sharp)
(Assignee)

Comment 23

5 years ago
> posted to try:
> https://tbpl.mozilla.org/?tree=Try&rev=9d478868883c

fwiw both patches needs to be pushed together. Shane's patch doesn't show up in tbpl here as it was already present on try as part of a different head
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Review of attachment 676493 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/social/FrameWorker.jsm
@@ -208,1 @@
>          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);

the declaration of scriptText was removed in this patch. how is this going to work?
Attachment #676493 - Flags: review-
Comment on attachment 676580 [details] [diff] [review]
UI part v2

Review of attachment 676580 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/base/content/browser-social.js
@@ +1042,5 @@
> +
> +      case "frameworker-error":
> +        sbrowser.setAttribute("src", "about:socialerror?mode=workerFailure");
> +        break;
> +    }

nit: add a default branch which uses Cu.reportError for unsupported types. this can be done in a follow-up though.
Attachment #676580 - Flags: review?(jaws) → review+
(Assignee)

Comment 26

5 years ago
(In reply to Jared Wein [:jaws] from comment #24)
> Comment on attachment 676493 [details] [diff] [review]
> v2 with tests + small changes
> 
> Review of attachment 676493 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/components/social/FrameWorker.jsm
> @@ -208,1 @@
> >          Cu.evalInSandbox(scriptText, sandbox, "1.8", workerWindow.location.href, 1);
> 
> the declaration of scriptText was removed in this patch. how is this going
> to work?

 It wasn't removed, it was moved to earlier in the function
(Assignee)

Comment 27

5 years ago
With comment 26 said, landed on inbound:

https://hg.mozilla.org/integration/mozilla-inbound/rev/9104ce0e7396
https://hg.mozilla.org/integration/mozilla-inbound/rev/08f97e0fa9da
Comment on attachment 676493 [details] [diff] [review]
v2 with tests + small changes

Sorry, I didn't see that |scriptText| got moved higher up in the script.
Attachment #676493 - Flags: review-
(Assignee)

Updated

5 years ago
Attachment #676580 - Flags: review?(gavin.sharp)
(Assignee)

Comment 29

5 years ago
Created attachment 676778 [details] [diff] [review]
Rollup patch for aurora/beta

The two patches rolled up together for the branches. Green on try last night and going green on inbound

[Approval Request Comment]
Bug caused by (feature/regressing bug #): social api
User impact if declined: no error recovery from failure to load worker
Testing completed (on m-c, etc.): try and inbound
Risk to taking this patch (and alternatives if risky): 
String or UUID changes made by this patch: none
Attachment #676778 - Flags: review+
Attachment #676778 - Flags: approval-mozilla-beta?
Attachment #676778 - Flags: approval-mozilla-aurora?
Comment on attachment 676778 [details] [diff] [review]
Rollup patch for aurora/beta

We discussed this in triage yesterday - risky to be taking this much code churn on beta, but the alternative of not handling this error case seems worse. a=me for Firefox 17/18.
Attachment #676778 - Flags: approval-mozilla-beta?
Attachment #676778 - Flags: approval-mozilla-beta+
Attachment #676778 - Flags: approval-mozilla-aurora?
Attachment #676778 - Flags: approval-mozilla-aurora+
(Assignee)

Comment 31

5 years ago
https://hg.mozilla.org/releases/mozilla-beta/rev/0965ce010472
https://hg.mozilla.org/releases/mozilla-beta/rev/d180e9e0a61c
status-firefox17: --- → fixed
(Assignee)

Comment 32

5 years ago
https://hg.mozilla.org/releases/mozilla-aurora/rev/6518691991bb
https://hg.mozilla.org/releases/mozilla-aurora/rev/2b2d7457e130
status-firefox18: --- → fixed
(Assignee)

Updated

5 years ago
Flags: in-testsuite+
Assignee: nobody → felipc
https://hg.mozilla.org/mozilla-central/rev/9104ce0e7396
https://hg.mozilla.org/mozilla-central/rev/08f97e0fa9da
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 19
Please remove [qa-] whiteboard tag and add verifyme keyword if there's some QA testing needed here. Otherwise we will skip verification.
Whiteboard: [Fx17] → [Fx17][qa-]
You need to log in before you can comment on or make changes to this bug.