Closed
Bug 1110681
Opened 9 years ago
Closed 9 years ago
HealthReporter AsyncShutdownTimeout due to blocking provider
Categories
(Firefox Health Report Graveyard :: Client: Desktop, defect)
Firefox Health Report Graveyard
Client: Desktop
Tracking
(firefox34 wontfix, firefox35 wontfix, firefox36 wontfix, firefox37 affected, firefox38 affected)
People
(Reporter: gfritzsche, Assigned: Yoric)
References
Details
Attachments
(1 file, 3 obsolete files)
5.75 KB,
patch
|
Yoric
:
review+
Sylvestre
:
approval-mozilla-aurora+
Sylvestre
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
From bp-552aac00-cfc6-48b1-9631-1fd372141212, the AsyncShutdownTimeout data is: {"phase":"Metrics Storage Backend","conditions":[{"name":"FHR: Flushing storage shutdown","state":{"shutdownInitiated":false,"initialized":false,"shutdownRequested":true,"initializeHadError":false,"providerManagerInProgress":true,"storageInProgress":false,"hasProviderManager":true,"hasStorage":true,"shutdownComplete":false},"filename":"resource://gre/modules/HealthReport.jsm","lineNumber":4335,"stack":["resource://gre/modules/HealthReport.jsm:AbstractHealthReporter.prototype<.init/<:4335",""]}]} That shows that we got to the provider manager shutdown here: http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l580 ... but returned because the provider manager is still initializing - i.e. we're still waiting on the init here: http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l430 Now, in the shutdown blocker here: http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l398 ... we won't retry _initiateShutdown() in this scenario and hence hang indefinitely. I think we have to either * have the _initiateShutdown() wait on initialization to be finished * or add another check for _shutdownRequested after initialization finished around here: http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l433 ... similar to here: http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l433
Reporter | ||
Updated•9 years ago
|
OS: Mac OS X → All
Hardware: x86 → All
Reporter | ||
Comment 1•9 years ago
|
||
This seems the simplest possible solution: Retry shutdown after finishing initialization. gps, what do you think?
Reporter | ||
Comment 2•9 years ago
|
||
Comment on attachment 8535635 [details] [diff] [review] Fix potential hang This is the AsyncShutdownTimeout top-crash cause per bug 1035290, comment 15. Trying rnewman as gps is on PTO.
Attachment #8535635 -
Flags: review?(gps) → review?(rnewman)
Comment 3•9 years ago
|
||
Comment on attachment 8535635 [details] [diff] [review] Fix potential hang Review of attachment 8535635 [details] [diff] [review]: ----------------------------------------------------------------- Looks OK-ish. But this doesn't get r+ until there is a test. See test_healthreporter.js. ::: services/healthreport/healthreporter.jsm @@ +434,5 @@ > > yield this._initializeProviderManager(); > yield this._onProviderManagerInitialized(); > this._initializedDeferred.resolve(); > + yield this.onInit(); This change shouldn't matter. The API here is a bit wonky. init() is returning a Task that formerly resolved to a promise (this._initializedDeferred) that resolves to nothing (see line above). That's kinda wonky. Unless someone is actually waiting on this promise, this line can be removed. @@ +443,5 @@ > + // We may have tried to shutdown during provider manager > + // initialization and skipped it; check for it now. > + if (this._shutdownRequested) { > + this._initiateShutdown(); > + } I don't think this needs to be in a finally block. Instead, it can be in-lined before |this._initializedDeferred.resolve()|
Attachment #8535635 -
Flags: review?(rnewman)
Reporter | ||
Comment 4•9 years ago
|
||
Actually, rereading the code to evaluate testability... i don't think i identified the issue correctly. Given |"hasProviderManager":true| and the other state information: { "shutdownInitiated":false, "initialized":false, "shutdownRequested":true, "initializeHadError":false, "providerManagerInProgress":true, "storageInProgress":false, "hasProviderManager":true, "hasStorage":true, "shutdownComplete":false } That means we still have to be in _initializeProviderManager() [0], as _onProviderManagerInitialized() sets |providerManagerInProgress=true| [1]. Which means that we probably hang waiting on a provider to start up. I wonder if we can break anything provider startups depend on by hitting _initiateShutdown() too early (i.e. setting this._initialized=false and this._shutdownRequested=false). gps, can you sanity check my reading above? Any input/ideas? [0] http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l430 [1] http://hg.mozilla.org/mozilla-central/annotate/5288b15d22de/services/healthreport/healthreporter.jsm#l507
Flags: needinfo?(gps)
Reporter | ||
Comment 5•9 years ago
|
||
I was still logged in to the wrong bugzilla account while trying to merge them. Resending needinfo from correct account.
Flags: needinfo?(gps)
Reporter | ||
Comment 6•9 years ago
|
||
(In reply to cygmatic from comment #4) > That means we still have to be in _initializeProviderManager() [0], as > _onProviderManagerInitialized() sets |providerManagerInProgress=true| [1]. > > Which means that we probably hang waiting on a provider to start up. Ok, just checked this by making ExperimentsProvider.postInit() just return a non-resolving promise. I ended up with the same AsyncShutdownTimeout state as here.
Reporter | ||
Comment 7•9 years ago
|
||
Adding diagnostics for this.
Attachment #8537853 -
Flags: review?(dteller)
Reporter | ||
Updated•9 years ago
|
Attachment #8535635 -
Attachment is obsolete: true
Comment 8•9 years ago
|
||
Promises from providers "hanging" is a known issue in FHR. See also bug 1030270 and bug 989788. This bug is just a variant that occurs during provider init/shutdown instead of collection. Same outcome. Quite frankly, I'm surprised it took so long for this kind of issue to surface as a top crasher. Your forensics will likely indicate a failure and legitimate bug in some other component. That other component should be fixed. However, I still think FHR should practice defense in depth and implement timers for promise resolution when dealing with external systems. The chances that a client will encounter this class of error will increase as the number of providers increases. The problem will only get worse over time.
Assignee | ||
Comment 9•9 years ago
|
||
Comment on attachment 8537853 [details] [diff] [review] AsyncShutdownTimeout diagnostic for FHR provider init Review of attachment 8537853 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, thanks. Also, gps is right, we should probably add a timeout, see e.g. PromiseUtils.resolveOrTimeout. ::: services/metrics/providermanager.jsm @@ +126,5 @@ > Cu.import(uri, ns); > > let promise = this.registerProviderFromType(ns[entry]); > if (promise) { > + promiseMap.set(entry, promise); Are we sure that `entry` is unique? If not, we should rather store an array of {entry, promise}, shouldn't we? @@ +136,5 @@ > continue; > } > } > > return Task.spawn(function wait() { `function*`, btw
Attachment #8537853 -
Flags: review?(dteller) → review+
Assignee | ||
Comment 10•9 years ago
|
||
Side-note: we still want data on providers that don't terminate. Perhaps we should crash in DEBUG mode and timeout in non-DEBUG? Or can we upload such data meaningfully using FHR itself?
Reporter | ||
Comment 11•9 years ago
|
||
(In reply to David Rajchenbach-Teller [:Yoric] (hard to reach until January 10th - use "needinfo") from comment #10) > Side-note: we still want data on providers that don't terminate. Perhaps we > should crash in DEBUG mode and timeout in non-DEBUG? Or can we upload such > data meaningfully using FHR itself? Hm, this is a little interesting as this basically means to submit data into FHR while its shutting down.
Reporter | ||
Comment 12•9 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/02b168bd2232
Keywords: leave-open
Reporter | ||
Comment 13•9 years ago
|
||
Comment on attachment 8537853 [details] [diff] [review] AsyncShutdownTimeout diagnostic for FHR provider init Approval Request Comment [Feature/regressing bug #]: Some FHR provider is timing out. [User impact if declined]: No info on which provider is at fault. [Describe test coverage new/current, TBPL]: Automated test-coverage. [Risks and why]: Low-risk, just adding some diagnostics. [String/UUID change made/needed]: None.
Attachment #8537853 -
Flags: approval-mozilla-beta?
Attachment #8537853 -
Flags: approval-mozilla-aurora?
Reporter | ||
Updated•9 years ago
|
Summary: HealthReporter AsyncShutdown blocker may wait indefinitely if run during HealthReporter initialization → HealthReporter AsyncShutdownTimeout due to blocking provider
Reporter | ||
Comment 14•9 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #8) > Quite frankly, I'm surprised it took so long for this kind of issue to > surface as a top crasher. Your forensics will likely indicate a failure and > legitimate bug in some other component. That other component should be > fixed. However, I still think FHR should practice defense in depth and > implement timers for promise resolution when dealing with external systems. Ok, but then we also need to cross-check all FHR code to properly fail when accessed late in each respective scenarios, which might open up another bucket of problems. And we definitely need to figure out how to submit the offending providers via FHR and Telemetry, which may be hard during/after shutdown of these systems. Shutdown ordering between the two might then also be a concern.
Updated•9 years ago
|
Updated•9 years ago
|
Attachment #8537853 -
Flags: approval-mozilla-beta?
Attachment #8537853 -
Flags: approval-mozilla-beta+
Attachment #8537853 -
Flags: approval-mozilla-aurora?
Attachment #8537853 -
Flags: approval-mozilla-aurora+
Reporter | ||
Comment 16•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/888ab5f9e567 https://hg.mozilla.org/releases/mozilla-beta/rev/0c82d41ff213
Comment 17•9 years ago
|
||
This isn't fixed on 37 IIUC.
Reporter | ||
Updated•9 years ago
|
status-firefox34:
--- → wontfix
Reporter | ||
Comment 18•9 years ago
|
||
We do have some data in "are we shutting down yet" now that has currentProviderInInit and currentProviderInShutdown. There is already quite some reports with "currentProviderInInit":"SearchesProvider". Given the spike for these in 34, this points to bug 1029031 ("Record the currently selected search provider in FHR"): http://hg.mozilla.org/mozilla-central/rev/b77106f50779 However, that bug was uplifted to 33 & 32 and apparently we didn't have a spike there.
Reporter | ||
Comment 19•9 years ago
|
||
I quickly sampled "Are we shutting down yet" for this today again [0] manually - most of the timeouts for 35+ are still with "currentProviderInInit":"SearchesProvider". Florian, it looks like you are good to ask on SearchService issues? We apparently have the FHR SearchesProvider hanging around here: https://hg.mozilla.org/mozilla-central/annotate/2a193b7f395c/services/healthreport/providers.jsm#l1626 ... which would mean that we don't get called back from SearchService.init(): https://hg.mozilla.org/mozilla-central/annotate/2a193b7f395c/toolkit/components/search/nsSearchService.js#l3998 The first thing that stands out is that we don't seem to call back the observer when |this._initStarted| is true. Am i missing something here? Any other failure points you can think of? [0] http://yoric.github.io/are-we-shutting-down-yet-/?signature=~FHR%3A+Flushing+storage+shutdown#
Flags: needinfo?(florian)
Comment 20•9 years ago
|
||
(In reply to Georg Fritzsche [:gfritzsche] from comment #19) > The first thing that stands out is that we don't seem to call back the > observer when |this._initStarted| is true. > Am i missing something here? Yoric wrote this code, but I think you might be. If _initStarted is true, then we should fall directly into the if (observer) case, and this._initObservers.promise will either already be resolved, or be in the process of resolving itself. From a quick skim I don't see any code paths that can lead to _initObservers.promise not being resolved, but maybe there are some...
Flags: needinfo?(florian)
Reporter | ||
Comment 21•9 years ago
|
||
(In reply to :Gavin Sharp [email: gavin@gavinsharp.com] from comment #20) > Yoric wrote this code, but I think you might be. If _initStarted is true, > then we should fall directly into the if (observer) case, and > this._initObservers.promise will either already be resolved, or be in the > process of resolving itself. Ah, indeed, i missed that .then() handlers are still called when the promise is already resolved. Yoric, any ideas here?
Flags: needinfo?(dteller)
Assignee | ||
Comment 22•9 years ago
|
||
I have looked at SearchService initialization. I believe that the asynchronous initialization never completes. This may be due to a bug in the code (the loading of search engines, in particular, is way too complicated to be easy to trust) or in a notification observer that could be spinning the event loop and causing havoc. To find out more, I believe that the best solution is to make the asynchronous initialization an AsyncShutdown blocker, and use its `fetchState` to determine how far we have managed to go in the initialization.
Flags: needinfo?(dteller)
Reporter | ||
Updated•9 years ago
|
Assignee: gfritzsche → nobody
Assignee | ||
Comment 23•9 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=cdda26ec865b
Assignee | ||
Comment 24•9 years ago
|
||
This should give us more details about what's going wrong.
Attachment #8550274 -
Flags: review?(felipc)
Assignee | ||
Comment 25•9 years ago
|
||
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2bfb9a5e6579
Comment 26•9 years ago
|
||
Comment on attachment 8550274 [details] [diff] [review] AsyncShutdown diagnostics for SearchService Adding Florian in case he wants to take a look or can get to this before myself, as he has worked a lot on this recently
Attachment #8550274 -
Flags: review?(florian)
Comment 27•9 years ago
|
||
Comment on attachment 8550274 [details] [diff] [review] AsyncShutdown diagnostics for SearchService Review of attachment 8550274 [details] [diff] [review]: ----------------------------------------------------------------- ::: toolkit/components/search/nsSearchService.js @@ +240,5 @@ > */ > const SEARCH_LOG_PREFIX = "*** Search: "; > > /** > + * The latest entries to the log. Isn't there a word missing in this comment? @@ +278,5 @@ > /** > * Otherwise, don't log at all by default. This can be overridden at startup > * by the pref, see SearchService's _init method. > */ > +var LOG = function(aText){ nit: space between ) and {. @@ +3106,5 @@ > + * part of a crash report in case of AsyncShutdown timeout. > + */ > + debugState: { > + get initStarted() { > + return SearchService.prototype._initStarted; Isn't this (on the prototype, rather than on the SearchService instance) always null?
Assignee | ||
Comment 28•9 years ago
|
||
Could someone take over this patch? I won't have time to work on it in January.
Flags: needinfo?(gfritzsche)
Flags: needinfo?(florian)
Assignee | ||
Comment 29•9 years ago
|
||
Attachment #8551512 -
Flags: review?(felipc)
Assignee | ||
Comment 30•9 years ago
|
||
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe Pull down this commit: hg pull review -r 4e368c82aab468e20bf0acd56349b6124a4e594e
Assignee | ||
Updated•9 years ago
|
Attachment #8550274 -
Attachment is obsolete: true
Attachment #8550274 -
Flags: review?(florian)
Attachment #8550274 -
Flags: review?(felipc)
Assignee | ||
Updated•9 years ago
|
Attachment #8551512 -
Flags: review?(felipc) → review?(florian)
Assignee | ||
Comment 31•9 years ago
|
||
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe /r/2697 - Bug 1110681 - Instrumenting SearchService;r=florian Pull down these commits: hg pull review -r 6ff1375352e785841ade4aea530a72afa5ae1735
Assignee | ||
Comment 32•9 years ago
|
||
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe /r/2697 - Bug 1110681 - Instrumenting SearchService;r=florian Pull down these commits: hg pull review -r 6ff1375352e785841ade4aea530a72afa5ae1735
Assignee | ||
Comment 33•9 years ago
|
||
Applied feedback.
Comment 34•9 years ago
|
||
https://reviewboard.mozilla.org/r/2695/#review1787 ::: toolkit/components/search/nsSearchService.js (Diff revision 1) > + return this._initStarted; It seems to me that "this" here points to the 'debugState' object, so I'm afraid _initStarted and _initRV will be undefined.
Assignee | ||
Comment 35•9 years ago
|
||
/r/2695 - Bug 1110681 - Instrumenting SearchService;r=felipe /r/2697 - Bug 1110681 - Instrumenting SearchService;r=florian /r/2699 - Bug 1110681 - Instrumenting SearchService;r=florian Pull down these commits: hg pull review -r b94aa40682472d51a42f4b6709597a98e0fe6cc7
Assignee | ||
Comment 36•9 years ago
|
||
https://reviewboard.mozilla.org/r/2695/#review1789 > It seems to me that "this" here points to the 'debugState' object, so I'm afraid _initStarted and _initRV will be undefined. You are, of course, right. My bad.
Reporter | ||
Comment 37•9 years ago
|
||
(In reply to David Rajchenbach-Teller [:Yoric] (away until Febrary - use "needinfo") from comment #28) > Could someone take over this patch? I won't have time to work on it in > January. Benjamin, do you know who could?
Flags: needinfo?(gfritzsche) → needinfo?(benjamin)
Comment 38•9 years ago
|
||
https://reviewboard.mozilla.org/r/2693/#review1791 ::: toolkit/components/search/nsSearchService.js (Diff revision 3) > -var LOG = function(){}; > +var LOG = function (aText){ I noted in comment 27 that a space was missing between ) and {, you added one between "function" and "(". ::: toolkit/components/search/nsSearchService.js (Diff revision 3) > + return !!self._batchTaskl nit: missing ; here. ::: toolkit/components/search/nsSearchService.js (Diff revision 3) > + this.debugState = { It seems this object and the gLatestLogEntries array become useless once _initObservers.promise has resolved (and the search service init is no longer blocking shutdown). Should we set them to null so that they are gc'ed?
Comment 39•9 years ago
|
||
Georg, if this is indeed important, then it's something that you and Alessio will need to do. But as I said earlier, I'm not convinced that spending time on this is worthwhile since we're planning on sunsetting all this FHR code in the 38 cycle anyway.
Flags: needinfo?(benjamin)
Comment 40•9 years ago
|
||
Comment on attachment 8551512 [details] MozReview Request: bz://1110681/Yoric Clearing the review flag (my review comments are in comment 38) and needinfo per comment 39.
Flags: needinfo?(florian)
Attachment #8551512 -
Flags: review?(florian)
Assignee | ||
Comment 41•9 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1baa28e7614
Updated•9 years ago
|
Assignee | ||
Comment 42•9 years ago
|
||
Comment on attachment 8551512 [details] MozReview Request: bz://1110681/Yoric /r/2695 - Bug 1110681 - Instrumenting nsSearchService;r=florian Pull down this commit: hg pull review -r d1baa28e7614a44315ede5101472a789f6cb75b3
Attachment #8551512 -
Flags: review?(florian)
Assignee | ||
Comment 43•9 years ago
|
||
Ah, right, my bad, I forgot to click "publish" in mozReview, so Florian had never received my updated review request. Here we go.
Assignee | ||
Comment 45•9 years ago
|
||
Ah, good offline point: the blocked bug has been WONTFIXed, canceling review.
Flags: needinfo?(florian)
Assignee | ||
Updated•9 years ago
|
Attachment #8551512 -
Attachment is obsolete: true
Attachment #8551512 -
Flags: review?(florian)
Assignee | ||
Updated•9 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Comment 46•6 years ago
|
||
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Updated•6 years ago
|
Product: Firefox Health Report → Firefox Health Report Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•