Closed
Bug 1402267
Opened 7 years ago
Closed 7 years ago
Make sessionstore resilient to worker exceptions
Categories
(Firefox :: Session Restore, enhancement, P1)
Firefox
Session Restore
Tracking
()
RESOLVED
FIXED
Firefox 58
Tracking | Status | |
---|---|---|
firefox58 | --- | fixed |
People
(Reporter: mikedeboer, Assigned: mikedeboer)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
I noticed, after a crash of my Fx Beta during an upgrade from 55 to 56, that my sessionstore (backup) files were over a month old. So that was the best I could restore to. Whilst backtracking to figure out _why_ sessionstore might have stopped writing files to disk, I noticed that we never re-instantiate the SessionFile worker after catching an X amount of failures. The right thing to here would be to 1. Start counting the amount of failures of SessionFile write requests. 2. After an X amount of failures, - Terminate the current worker, for good measure - Spawn a new worker and initialize it properly. 3. Rinse, repeat.
Assignee | ||
Updated•7 years ago
|
status-firefox58:
affected → ---
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 3•7 years ago
|
||
mozreview-review |
Comment on attachment 8911114 [details] Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed. https://reviewboard.mozilla.org/r/182608/#review187872 ::: toolkit/components/promiseworker/PromiseWorker.jsm:363 (Diff revision 1) > > }.bind(this))(); > + }, > + > + /** > + * Terminate the worker, if it has been created at all. Could you document what happens to pending messages? Maybe we should fail them all, by the way. ::: toolkit/components/promiseworker/PromiseWorker.jsm:372 (Diff revision 1) > + if (this.__worker) { > + this.__worker.terminate(); > + } > + } catch (ex) { > + // Ignore exceptions. > + this.log("Error whilst terminating ChromeWorker: " + ex.message); I admire your declinations :) ::: toolkit/components/promiseworker/PromiseWorker.jsm:374 (Diff revision 1) > + } > + } catch (ex) { > + // Ignore exceptions. > + this.log("Error whilst terminating ChromeWorker: " + ex.message); > + } > + }, Why doesn't this remove `this.__worker`? ::: toolkit/components/promiseworker/PromiseWorker.jsm:380 (Diff revision 1) > + > + /** > + * Restart the worker, which essentially means to terminate the current worker > + * instance and set things up to be instantiated lazily again on the next `post()`. > + */ > + restart() { I don't think the method is useful for the moment. Let's just add `delete this.__worker` to `terminate()`.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 6•7 years ago
|
||
mozreview-review |
Comment on attachment 8911114 [details] Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed. https://reviewboard.mozilla.org/r/182608/#review187948 lgtm, thanks
Attachment #8911114 -
Flags: review?(dteller) → review+
Comment 7•7 years ago
|
||
mozreview-review |
Comment on attachment 8911114 [details] Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed. https://reviewboard.mozilla.org/r/182608/#review188136 ::: toolkit/components/promiseworker/PromiseWorker.jsm:378 (Diff revision 2) > + let error; > + while (!this._queue.isEmpty()) { Do we need this when `this.__worker==null`? Couldn't we just have a `if (!this.__worker) return` at the top here?
Comment 8•7 years ago
|
||
mozreview-review |
Comment on attachment 8911115 [details] Bug 1402267 - Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref. https://reviewboard.mozilla.org/r/182610/#review188138 ::: browser/components/sessionstore/SessionFile.jsm:353 (Diff revision 2) > + SessionWorker.terminate(); > + this._workerHealth.failures = 0; This is a great place to add some telemetry. We should find out how commonly this is needed. ::: browser/components/sessionstore/SessionFile.jsm:424 (Diff revision 2) > > if (isFinalWrite) { > Services.obs.notifyObservers(null, "sessionstore-final-state-write-complete"); > } > + > + this._checkWorkerHealth(); We probably shouldn't do this if `isFinalWrite==true`, we're shutting down anyway.
Attachment #8911115 -
Flags: review?(ttaubert) → review-
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 16•7 years ago
|
||
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, Requesting review from a data peer, but I believe :gfritsche will be out until Oct 3rd. Rebecca, are you taking over or is there someone else you can forward this request to? Thanks!
Attachment #8913224 -
Flags: review?(rweiss)
Comment 17•7 years ago
|
||
:mikedeboer, I don't do the Telemetry review (you can flag :Dexter or :chutten for that since :gfritzsche is out). I can do data review as a steward, which is separate. Reply back to this bug thread with the answers to the 8 questions in this doc and I'll address the review?. https://docs.google.com/document/d/1SSn5w8DfCSkHWJS8DNTd7ya82diWRxaDUFM5aL4UDDo/edit
Flags: needinfo?(mdeboer)
Assignee | ||
Comment 18•7 years ago
|
||
Thanks, Rebecca! I created a copy (just in case) and filled in the form here: https://docs.google.com/document/d/17oQhXg-EzSe4uc65cV4Jvj0HFDSYms_OrcZ6cehO93U/edit# Can you please take a look?
Flags: needinfo?(mdeboer) → needinfo?(rweiss)
Assignee | ||
Updated•7 years ago
|
Attachment #8913224 -
Flags: review?(rweiss) → review?(chutten)
Comment 19•7 years ago
|
||
mozreview-review |
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, https://reviewboard.mozilla.org/r/184626/#review190138 Your use and testing of Telemetry is grand, but the histogram would work better as a scalar. ::: toolkit/components/telemetry/Histograms.json:6739 (Diff revision 3) > "expires_in_version": "default", > "kind": "enumerated", > "n_values": 50, > "description": "Session restore: Number of tabs restored eagerly in the session that has just been restored." > }, > + "FX_SESSION_RESTORE_WORKER_RESTART_COUNT": { boolean histograms are for measuring counts of trues and falses. You're only using it to count trues, so it'd be more compact and easier to analyse if you wrote this as a uint scalar instead. See https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/scalars.html for more information (the tl;dr is, use Scalars.yaml instead of Histograms.json, use a category.name that's dotted and a scalar_name that's underscored, and then every other field is essentially the same) ::: toolkit/components/telemetry/Histograms.json:6740 (Diff revision 3) > "kind": "enumerated", > "n_values": 50, > "description": "Session restore: Number of tabs restored eagerly in the session that has just been restored." > }, > + "FX_SESSION_RESTORE_WORKER_RESTART_COUNT": { > + "record_in_processes": ["main", "content"], session restore workers operate in content child processes? I would've thought they were parent-process only.
Attachment #8913224 -
Flags: review?(chutten) → review-
Assignee | ||
Comment 20•7 years ago
|
||
I was hoping for something better to be available, thanks!
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 24•7 years ago
|
||
mozreview-review |
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, https://reviewboard.mozilla.org/r/184626/#review190152 ::: browser/components/sessionstore/test/unit/test_scalar_worker_restarts.js:21 (Diff revision 4) > + * to be registered in "toolkit/components/telemetry/Scalars.yaml". > + * This test ensures that the scalar is registered and empty. > + */ > +add_task(async function test_ensure_scalar_is_empty() { > + const scalars = Telemetry.snapshotScalars(Ci.nsITelemetry.DATASET_RELEASE_CHANNEL_OPTOUT, false).parent; > + Assert.ok(!scalars, "Sanity check; no scalars should be there yet."); This might be fragile, depending on who adds new scalars and when they're recorded to. Maybe Assert that !(ScalarId in scalars)? ::: toolkit/components/telemetry/Scalars.yaml:370 (Diff revision 4) > + bug_numbers: > + - 1402267 > + description: > > + A counter incremented every time the SessionFile worker is restarted due > + to too many failures. > + expires: "70" You bumped from 64 to 70. On purpose?
Attachment #8913224 -
Flags: review?(chutten) → review-
Assignee | ||
Comment 25•7 years ago
|
||
(In reply to Chris H-C :chutten from comment #24) > You bumped from 64 to 70. On purpose? Myeah, I was thinking 6mo... which version would that be? 70 was the highest number I could find in Scalars.yaml TBH...
Assignee | ||
Comment 26•7 years ago
|
||
So is 64 better? Fx65 would be closer to 6mo, no?
Flags: needinfo?(chutten)
Comment 27•7 years ago
|
||
Usually 5-6 versions from whatever's on Nightly will get you more than six month's reprieve. *checks release calendar* March 5, 2018 is listed as the merge of 60 to beta. 70 pushes the expiry extension question into... 2019? 2020? 64's more in line with what I'm used to seeing, especially for a new measurement. 65's over a year from now (Oct 15, 2018) Of course, rweiss and the Data Peers (great band name) are the ones who'll pass the final ruling on whether your expiry version's too generous.
Flags: needinfo?(chutten)
Assignee | ||
Comment 28•7 years ago
|
||
Cool, 64's mooore than fine! We/ I need a fixed, no too far away point in time where I'm 'forced' to evaluate the data.
Comment hidden (mozreview-request) |
Comment 30•7 years ago
|
||
mozreview-review |
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, https://reviewboard.mozilla.org/r/184626/#review190168
Attachment #8913224 -
Flags: review?(chutten) → review+
Assignee | ||
Comment 31•7 years ago
|
||
Rebecca, do you have time to check out https://docs.google.com/document/d/17oQhXg-EzSe4uc65cV4Jvj0HFDSYms_OrcZ6cehO93U/edit soonish? Thanks so much!
Comment 32•7 years ago
|
||
mozreview-review |
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, https://reviewboard.mozilla.org/r/184626/#review192846
Attachment #8913224 -
Flags: review?(ttaubert) → review+
Comment 33•7 years ago
|
||
mozreview-review |
Comment on attachment 8911115 [details] Bug 1402267 - Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref. https://reviewboard.mozilla.org/r/182610/#review192848
Attachment #8911115 -
Flags: review?(ttaubert) → review+
Assignee | ||
Comment 34•7 years ago
|
||
Since I have a gut feeling that this is pretty important for Sessionstore stability, is OK to land this without an explicit review from Rebecca, Chris, or are you able to delegate it to someone else for me? It's been two weeks :(
Flags: needinfo?(chutten)
Comment 36•7 years ago
|
||
It is not okay to land without data review, but we can ask for data review from a different data peer. I've redirected from :rweiss to :liuche
Flags: needinfo?(rweiss)
Flags: needinfo?(liuche)
Flags: needinfo?(chutten)
Assignee | ||
Comment 37•7 years ago
|
||
Thanks Chris!
Comment 38•7 years ago
|
||
mozreview-review |
Comment on attachment 8913224 [details] Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, https://reviewboard.mozilla.org/r/184626/#review195338 data-review only. This Scalar telemetry probe collects Type 1 data about SessionFile workers. ::: toolkit/components/telemetry/Scalars.yaml:369 (Diff revision 5) > + worker_restart_count: > + bug_numbers: > + - 1402267 > + description: > > + A counter incremented every time the SessionFile worker is restarted due > + to too many failures. Is there a definition for "too many failuers", or is that referenced somewhere?
Attachment #8913224 -
Flags: review+
Comment 39•7 years ago
|
||
As a side note, in the future, please flag me on each patch with telemetry probes that requires data-review (unless it's a bigger issue about "can we even collect data on this") rather than a general NI - I'll see and get to it more quickly :)
Flags: needinfo?(liuche)
Assignee | ||
Comment 40•7 years ago
|
||
Thanks!! The 'too many failures' references to the `kMaxWriteFailures` constant, which is set to '5' in the new 'browser.sessionstore.max_write_failures' pref. This means: after 5 reported failures, the SessionFile worker is restarted.
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 45•7 years ago
|
||
Pushed by mdeboer@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/94656e37589a Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed. r=Yoric https://hg.mozilla.org/integration/autoland/rev/57bb241801c0 Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref. r=ttaubert https://hg.mozilla.org/integration/autoland/rev/d8c1c8894971 Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, r=chutten,liuche,ttaubert
Comment 46•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/94656e37589a https://hg.mozilla.org/mozilla-central/rev/57bb241801c0 https://hg.mozilla.org/mozilla-central/rev/d8c1c8894971
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 58
You need to log in
before you can comment on or make changes to this bug.
Description
•