Bug 1110681 (Closed): Opened 10 years ago, Closed 9 years ago

HealthReporter AsyncShutdownTimeout due to blocking provider

Categories

(Firefox Health Report Graveyard :: Client: Desktop, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(firefox34 wontfix, firefox35 wontfix, firefox36 wontfix, firefox37 affected, firefox38 affected)

RESOLVED WONTFIX

People

(Reporter: gfritzsche, Assigned: Yoric)

References

Details

Attachments

(1 file, 3 obsolete files)

From bp-552aac00-cfc6-48b1-9631-1fd372141212, the AsyncShutdownTimeout data is:
{"phase":"Metrics Storage Backend","conditions":[{"name":"FHR: Flushing storage shutdown","state":{"shutdownInitiated":false,"initialized":false,"shutdownRequested":true,"initializeHadError":false,"providerManagerInProgress":true,"storageInProgress":false,"hasProviderManager":true,"hasStorage":true,"shutdownComplete":false},"filename":"resource://gre/modules/HealthReport.jsm","lineNumber":4335,"stack":["resource://gre/modules/HealthReport.jsm:AbstractHealthReporter.prototype<.init/<:4335",""]}]}

That shows that we got to the provider manager shutdown here:
http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l580
... but returned because the provider manager is still initializing - i.e. we're still waiting on the init here:
http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l430

Now, in the shutdown blocker here:
http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l398
... we won't retry _initiateShutdown() in this scenario and hence hang indefinitely.

I think we have to either
* have _initiateShutdown() wait for initialization to finish,
* or add another check for _shutdownRequested after initialization finishes, around here (a minimal sketch follows after the links below):
  http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l433
  ... similar to here:
  http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l433
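A minimal sketch of the second option (hedged: member names and placement are taken from the healthreporter.jsm flow linked above; this is essentially what the patch attached below does):

  // Inside the init task, once the provider manager has finished initializing:
  yield this._initializeProviderManager();
  yield this._onProviderManagerInitialized();
  if (this._shutdownRequested) {
    // Shutdown was requested while the provider manager was still
    // initializing and was skipped then; resume it now.
    this._initiateShutdown();
  }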
OS: Mac OS X → All
Hardware: x86 → All
Attached patch Fix potential hang (obsolete) — Splinter Review
This seems the simplest possible solution:
Retry shutdown after finishing initialization.

gps, what do you think?
Assignee: nobody → georg.fritzsche
Status: NEW → ASSIGNED
Attachment #8535635 - Flags: review?(gps)
Comment on attachment 8535635 [details] [diff] [review]
Fix potential hang

This is the AsyncShutdownTimeout top-crash cause per bug 1035290, comment 15.
Trying rnewman as gps is on PTO.
Attachment #8535635 - Flags: review?(gps) → review?(rnewman)
Comment on attachment 8535635 [details] [diff] [review]
Fix potential hang

Review of attachment 8535635 [details] [diff] [review]:
-----------------------------------------------------------------

Looks OK-ish. But this doesn't get r+ until there is a test. See test_healthreporter.js.

::: services/healthreport/healthreporter.jsm
@@ +434,5 @@
>  
>          yield this._initializeProviderManager();
>          yield this._onProviderManagerInitialized();
>          this._initializedDeferred.resolve();
> +        yield this.onInit();

This change shouldn't matter. The API here is a bit wonky. init() is returning a Task that formerly resolved to a promise (this._initializedDeferred) that resolves to nothing (see line above). That's kinda wonky.

Unless someone is actually waiting on this promise, this line can be removed.

@@ +443,5 @@
> +        // We may have tried to shutdown during provider manager
> +        // initialization and skipped it; check for it now.
> +        if (this._shutdownRequested) {
> +          this._initiateShutdown();
> +        }

I don't think this needs to be in a finally block. Instead, it can be in-lined before |this._initializedDeferred.resolve()|
Attachment #8535635 - Flags: review?(rnewman)
Actually, rereading the code to evaluate testability... I don't think I identified the issue correctly.

Given |"hasProviderManager":true| and the other state information:
{
  "shutdownInitiated":false,
  "initialized":false,
  "shutdownRequested":true,
  "initializeHadError":false,
  "providerManagerInProgress":true,
  "storageInProgress":false,
  "hasProviderManager":true,
  "hasStorage":true,
  "shutdownComplete":false
}

That means we still have to be in _initializeProviderManager() [0], as _onProviderManagerInitialized() sets |providerManagerInProgress=false| [1].

Which means that we probably hang waiting on a provider to start up.
I wonder if we can break anything provider startups depend on by hitting _initiateShutdown() too early (i.e. setting this._initialized=false and this._shutdownRequested=false).

gps, can you sanity check my reading above?
Any input/ideas?

[0] http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l430
[1] http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l507
Flags: needinfo?(gps)
I was still logged in to the wrong Bugzilla account while trying to merge them.
Resending the needinfo from the correct account.
Flags: needinfo?(gps)
(In reply to cygmatic from comment #4)
> That means we still have to be in _initializeProviderManager() [0], as
> _onProviderManagerInitialized() sets |providerManagerInProgress=false| [1].
> 
> Which means that we probably hang waiting on a provider to start up.

OK, I just checked this by making ExperimentsProvider.postInit() return a non-resolving promise; I ended up with the same AsyncShutdownTimeout state as reported here.
Adding diagnostics for this.
Attachment #8537853 - Flags: review?(dteller)
Attachment #8535635 - Attachment is obsolete: true
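For reference, a minimal sketch of that repro (hypothetical, for local debugging only; not part of the attached diagnostics patch):

  // Make one provider's postInit() return a promise that never settles.
  // Provider-manager initialization then never completes, and FHR's
  // "Flushing storage shutdown" blocker ends up in the same
  // AsyncShutdownTimeout state as in comment 0.
  ExperimentsProvider.prototype.postInit = function () {
    return new Promise(() => {});   // never resolves, never rejects
  };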
Promises from providers "hanging" is a known issue in FHR. See also bug 1030270 and bug 989788. This bug is just a variant that occurs during provider init/shutdown instead of collection. Same outcome.

Quite frankly, I'm surprised it took so long for this kind of issue to surface as a top crasher. Your forensics will likely indicate a failure and legitimate bug in some other component. That other component should be fixed. However, I still think FHR should practice defense in depth and implement timers for promise resolution when dealing with external systems. The chances that a client will encounter this class of error will increase as the number of providers increases. The problem will only get worse over time.
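For illustration, a minimal sketch of that defense-in-depth idea (an assumption, not FHR's actual code; it uses a plain Promise.race rather than the PromiseUtils helper mentioned in the review below, and the helper name and 5-second value are made up):

  // Race a promise from an external system against a timer, so one hanging
  // provider cannot block collection or shutdown forever.
  Components.utils.import("resource://gre/modules/Timer.jsm"); // setTimeout

  function withTimeout(promise, ms, label) {
    return Promise.race([
      promise,
      new Promise((resolve, reject) =>
        setTimeout(() => reject(new Error(label + " timed out after " + ms + " ms")), ms)),
    ]);
  }

  // Hypothetical usage while waiting on a provider:
  // yield withTimeout(provider.init(), 5000, "provider init: " + provider.name);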
Comment on attachment 8537853 [details] [diff] [review]
AsyncShutdownTimeout diagnostic for FHR provider init

Review of attachment 8537853 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, thanks.
Also, gps is right, we should probably add a timeout, see e.g. PromiseUtils.resolveOrTimeout.

::: services/metrics/providermanager.jsm
@@ +126,5 @@
>          Cu.import(uri, ns);
>  
>          let promise = this.registerProviderFromType(ns[entry]);
>          if (promise) {
> +          promiseMap.set(entry, promise);

Are we sure that `entry` is unique?
If not, we should rather store an array of {entry, promise}, shouldn't we?

@@ +136,5 @@
>          continue;
>        }
>      }
>  
>      return Task.spawn(function wait() {

`function*`, btw
Attachment #8537853 - Flags: review?(dteller) → review+
Side-note: we still want data on providers that don't terminate. Perhaps we should crash in DEBUG mode and timeout in non-DEBUG? Or can we upload such data meaningfully using FHR itself?
(In reply to David Rajchenbach-Teller [:Yoric] (hard to reach until January 10th - use "needinfo") from comment #10)
> Side-note: we still want data on providers that don't terminate. Perhaps we
> should crash in DEBUG mode and timeout in non-DEBUG? Or can we upload such
> data meaningfully using FHR itself?

Hm, this is a little interesting, as it basically means submitting data into FHR while it's shutting down.
Comment on attachment 8537853 [details] [diff] [review]
AsyncShutdownTimeout diagnostic for FHR provider init

Approval Request Comment
[Feature/regressing bug #]: Some FHR provider is timing out.
[User impact if declined]: No info on which provider is at fault.
[Describe test coverage new/current, TBPL]: Automated test-coverage.
[Risks and why]: Low-risk, just adding some diagnostics.
[String/UUID change made/needed]: None.
Attachment #8537853 - Flags: approval-mozilla-beta?
Attachment #8537853 - Flags: approval-mozilla-aurora?
Summary: HealthReporter AsyncShutdown blocker may wait indefinitely if run during HealthReporter initialization → HealthReporter AsyncShutdownTimeout due to blocking provider
(In reply to Gregory Szorc [:gps] from comment #8)
> Quite frankly, I'm surprised it took so long for this kind of issue to
> surface as a top crasher. Your forensics will likely indicate a failure and
> legitimate bug in some other component. That other component should be
> fixed. However, I still think FHR should practice defense in depth and
> implement timers for promise resolution when dealing with external systems.

OK, but then we also need to cross-check all FHR code so that it fails properly when accessed late in each respective scenario, which might open up another bucket of problems.
And we definitely need to figure out how to report the offending providers via FHR and Telemetry, which may be hard during/after shutdown of those systems.
Shutdown ordering between the two might then also be a concern.
Attachment #8537853 - Flags: approval-mozilla-beta?
Attachment #8537853 - Flags: approval-mozilla-beta+
Attachment #8537853 - Flags: approval-mozilla-aurora?
Attachment #8537853 - Flags: approval-mozilla-aurora+
This isn't fixed on 37 IIUC.
We now have data in "are we shutting down yet" that includes currentProviderInInit and currentProviderInShutdown.
There are already quite a few reports with "currentProviderInInit":"SearchesProvider".

Given the spike for these in 34, this points to bug 1029031 ("Record the currently selected search provider in FHR"):
http://hg.mozilla.org/mozilla-central/rev/b77106f50779

However, that bug was uplifted to 33 & 32 and apparently we didn't have a spike there.
I quickly sampled "are we shutting down yet" for this again today [0] (manually); most of the timeouts for 35+ still show "currentProviderInInit":"SearchesProvider".

Florian, it looks like you are a good person to ask about SearchService issues?

We apparently have the FHR SearchesProvider hanging around here:
https://hg.mozilla.org/mozilla-central/annotate/2a193b7f395c/services/healthreport/providers.jsm#l1626

... which would mean that we don't get called back from SearchService.init():
https://hg.mozilla.org/mozilla-central/annotate/2a193b7f395c/toolkit/components/search/nsSearchService.js#l3998

The first thing that stands out is that we don't seem to call back the observer when |this._initStarted| is true.
Am I missing something here? Any other failure points you can think of?

[0] http://yoric.github.io/are-we-shutting-down-yet-/?signature=~FHR%3A+Flushing+storage+shutdown#
Flags: needinfo?(florian)
(In reply to Georg Fritzsche [:gfritzsche] from comment #19)
> The first thing that stands out is that we don't seem to call back the
> observer when |this._initStarted| is true.
> Am i missing something here?

Yoric wrote this code, but I think you might be. If _initStarted is true, then we should fall directly into the if (observer) case, and this._initObservers.promise will either already be resolved, or be in the process of resolving itself.

From a quick skim I don't see any code paths that can lead to _initObservers.promise not being resolved, but maybe there are some...
Flags: needinfo?(florian)
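A tiny sketch of the behaviour described above (plain JS, not the SearchService code): handlers attached to an already-resolved promise still run, just on a later tick.

  let p = Promise.resolve("search service initialized");
  // ... later, e.g. an observer registering after init has already finished:
  p.then(value => console.log("observer still notified:", value)); // still fires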
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #20)
> Yoric wrote this code, but I think you might be. If _initStarted is true,
> then we should fall directly into the if (observer) case, and
> this._initObservers.promise will either already be resolved, or be in the
> process of resolving itself.

Ah, indeed, I missed that .then() handlers are still called when the promise is already resolved.

Yoric, any ideas here?
Flags: needinfo?(dteller)
I have looked at SearchService initialization. I believe that the asynchronous initialization never completes. This may be due to a bug in the code (the loading of search engines, in particular, is way too complicated to be easy to trust) or in a notification observer that could be spinning the event loop and causing havoc.

To find out more, I believe that the best solution is to make the asynchronous initialization an AsyncShutdown blocker, and use its `fetchState` to determine how far we have managed to go in the initialization.
Flags: needinfo?(dteller)
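A minimal sketch of that idea (hedged: the shutdown phase, the state field names, and the placement inside SearchService are assumptions, not the patch that was eventually attached):

  // Inside SearchService initialization: block shutdown on the async init
  // promise and expose progress markers, so an AsyncShutdownTimeout crash
  // report shows how far initialization got before it stalled.
  Components.utils.import("resource://gre/modules/AsyncShutdown.jsm");

  AsyncShutdown.profileBeforeChange.addBlocker(
    "SearchService: waiting for asynchronous initialization",
    () => this._initObservers.promise,          // resolves once init completes
    {
      fetchState: () => ({
        initStarted: this._initStarted,
        initRV: this._initRV,
        // ... any other crash-report-friendly progress markers
      }),
    }
  );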
Assignee: gfritzsche → nobody
This should give us more details about what's going wrong.
Attachment #8550274 - Flags: review?(felipc)
Comment on attachment 8550274 [details] [diff] [review]
AsyncShutdown diagnostics for SearchService

Adding Florian in case he wants to take a look or can get to this before me, as he has worked a lot on this code recently.
Attachment #8550274 - Flags: review?(florian)
Comment on attachment 8550274 [details] [diff] [review]
AsyncShutdown diagnostics for SearchService

Review of attachment 8550274 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/search/nsSearchService.js
@@ +240,5 @@
>   */
>  const SEARCH_LOG_PREFIX = "*** Search: ";
>  
>  /**
> + * The latest entries to the log.

Isn't there a word missing in this comment?

@@ +278,5 @@
>  /**
>   * Otherwise, don't log at all by default. This can be overridden at startup
>   * by the pref, see SearchService's _init method.
>   */
> +var LOG = function(aText){

nit: space between ) and {.

@@ +3106,5 @@
> +   * part of a crash report in case of AsyncShutdown timeout.
> +   */
> +  debugState: {
> +    get initStarted() {
> +      return SearchService.prototype._initStarted;

Isn't this (on the prototype, rather than on the SearchService instance) always null?
Could someone take over this patch? I won't have time to work on it in January.
Flags: needinfo?(gfritzsche)
Flags: needinfo?(florian)
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe

Pull down this commit:

hg pull review -r 4e368c82aab468e20bf0acd56349b6124a4e594e
Attachment #8550274 - Attachment is obsolete: true
Attachment #8550274 - Flags: review?(florian)
Attachment #8550274 - Flags: review?(felipc)
Attachment #8551512 - Flags: review?(felipc) → review?(florian)
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe
/r/2697 - Bug 1110681 - Instrumenting SearchService;r=florian

Pull down these commits:

hg pull review -r 6ff1375352e785841ade4aea530a72afa5ae1735
Applied feedback.
https://reviewboard.mozilla.org/r/2695/#review1787

::: toolkit/components/search/nsSearchService.js
(Diff revision 1)
> +      return this._initStarted;

It seems to me that "this" here points to the 'debugState' object, so I'm afraid _initStarted and _initRV will be undefined.
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe
/r/2697 - Bug 1110681 - Instrumenting SearchService;r=florian
/r/2699 - Bug 1110681 - Instrumenting SearchService;r=florian

Pull down these commits:

hg pull review -r b94aa40682472d51a42f4b6709597a98e0fe6cc7
https://reviewboard.mozilla.org/r/2695/#review1789

> It seems to me that "this" here points to the 'debugState' object, so I'm afraid _initStarted and _initRV will be undefined.

You are, of course, right. My bad.
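For completeness, a minimal sketch of the fix for that review comment (hypothetical shape, mirroring the later revision's use of a captured `self`):

  // Capture the service instance in a closure so the getters read live
  // state, instead of looking up _initStarted/_initRV on the prototype or
  // on the debugState object itself.
  function SearchService() {
    let self = this;
    this.debugState = {
      get initStarted() { return self._initStarted; },
      get initRV() { return self._initRV; },
    };
    // ... rest of the constructor
  }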
(In reply to David Rajchenbach-Teller [:Yoric] (away until February - use "needinfo") from comment #28)
> Could someone take over this patch? I won't have time to work on it in
> January.

Benjamin, do you know who could?
Flags: needinfo?(gfritzsche) → needinfo?(benjamin)
https://reviewboard.mozilla.org/r/2693/#review1791

::: toolkit/components/search/nsSearchService.js
(Diff revision 3)
> -var LOG = function(){};
> +var LOG = function (aText){

I noted in comment 27 that a space was missing between ) and {; you added one between "function" and "(" instead.

::: toolkit/components/search/nsSearchService.js
(Diff revision 3)
> +      return !!self._batchTaskl

nit: missing ; here.

::: toolkit/components/search/nsSearchService.js
(Diff revision 3)
> +  this.debugState = {

It seems this object and the gLatestLogEntries array become useless once _initObservers.promise has resolved (and the search service init is no longer blocking shutdown). Should we set them to null so that they are gc'ed?
Georg, if this is indeed important, then it's something that you and Alessio will need to do. But as I said earlier, I'm not convinced that spending time on this is worthwhile since we're planning on sunsetting all this FHR code in the 38 cycle anyway.
Flags: needinfo?(benjamin)
Comment on attachment 8551512 [details]
MozReview Request: bz://1110681/Yoric

Clearing the review flag (my review comments are in comment 38) and needinfo per comment 39.
Flags: needinfo?(florian)
Attachment #8551512 - Flags: review?(florian)
Assignee: nobody → dteller
Comment on attachment 8551512 [details]
MozReview Request: bz://1110681/Yoric

/r/2695 - Bug 1110681 - Instrumenting nsSearchService;r=florian

Pull down this commit:

hg pull review -r d1baa28e7614a44315ede5101472a789f6cb75b3
Attachment #8551512 - Flags: review?(florian)
Ah, right, my bad, I forgot to click "publish" in mozReview, so Florian had never received my updated review request. Here we go.
Florian?
Flags: needinfo?(florian)
Ah, a good point raised offline: the blocked bug has been WONTFIXed; canceling the review.
Flags: needinfo?(florian)
Attachment #8551512 - Attachment is obsolete: true
Attachment #8551512 - Flags: review?(florian)
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: Firefox Health Report → Firefox Health Report Graveyard