Closed Bug 1402267 Opened 7 years ago Closed 7 years ago

Make sessionstore resilient to worker exceptions

Categories

(Firefox :: Session Restore, enhancement, P1)

enhancement

Tracking

()

RESOLVED FIXED
Firefox 58
Tracking Status
firefox58 --- fixed

People

(Reporter: mikedeboer, Assigned: mikedeboer)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

I noticed, after a crash of my Fx Beta during an upgrade from 55 to 56, that my sessionstore (backup) files were over a month old. So that was the best I could restore to.

Whilst backtracking to figure out _why_ sessionstore might have stopped writing files to disk, I noticed that we never re-instantiate the SessionFile worker after catching an X amount of failures.

The right thing to here would be to
1. Start counting the amount of failures of SessionFile write requests.
2. After an X amount of failures,
   - Terminate the current worker, for good measure
   - Spawn a new worker and initialize it properly.
3. Rinse, repeat.
Comment on attachment 8911114 [details]
Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed.

https://reviewboard.mozilla.org/r/182608/#review187872

::: toolkit/components/promiseworker/PromiseWorker.jsm:363
(Diff revision 1)
>  
>      }.bind(this))();
> +  },
> +
> +  /**
> +   * Terminate the worker, if it has been created at all.

Could you document what happens to pending messages?

Maybe we should fail them all, by the way.

::: toolkit/components/promiseworker/PromiseWorker.jsm:372
(Diff revision 1)
> +      if (this.__worker) {
> +        this.__worker.terminate();
> +      }
> +    } catch (ex) {
> +      // Ignore exceptions.
> +      this.log("Error whilst terminating ChromeWorker: " + ex.message);

I admire your declinations :)

::: toolkit/components/promiseworker/PromiseWorker.jsm:374
(Diff revision 1)
> +      }
> +    } catch (ex) {
> +      // Ignore exceptions.
> +      this.log("Error whilst terminating ChromeWorker: " + ex.message);
> +    }
> +  },

Why doesn't this remove `this.__worker`?

::: toolkit/components/promiseworker/PromiseWorker.jsm:380
(Diff revision 1)
> +
> +  /**
> +   * Restart the worker, which essentially means to terminate the current worker
> +   * instance and set things up to be instantiated lazily again on the next `post()`.
> +   */
> +  restart() {

I don't think the method is useful for the moment. Let's just add `delete this.__worker` to `terminate()`.
Comment on attachment 8911114 [details]
Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed.

https://reviewboard.mozilla.org/r/182608/#review187948

lgtm, thanks
Attachment #8911114 - Flags: review?(dteller) → review+
Comment on attachment 8911114 [details]
Bug 1402267 - Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed.

https://reviewboard.mozilla.org/r/182608/#review188136

::: toolkit/components/promiseworker/PromiseWorker.jsm:378
(Diff revision 2)
> +    let error;
> +    while (!this._queue.isEmpty()) {

Do we need this when `this.__worker==null`? Couldn't we just have a `if (!this.__worker) return` at the top here?
Comment on attachment 8911115 [details]
Bug 1402267 - Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref.

https://reviewboard.mozilla.org/r/182610/#review188138

::: browser/components/sessionstore/SessionFile.jsm:353
(Diff revision 2)
> +      SessionWorker.terminate();
> +      this._workerHealth.failures = 0;

This is a great place to add some telemetry. We should find out how commonly this is needed.

::: browser/components/sessionstore/SessionFile.jsm:424
(Diff revision 2)
>  
>        if (isFinalWrite) {
>          Services.obs.notifyObservers(null, "sessionstore-final-state-write-complete");
>        }
> +
> +      this._checkWorkerHealth();

We probably shouldn't do this if `isFinalWrite==true`, we're shutting down anyway.
Attachment #8911115 - Flags: review?(ttaubert) → review-
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

Requesting review from a data peer, but I believe :gfritsche will be out until Oct 3rd. Rebecca, are you taking over or is there someone else you can forward this request to?

Thanks!
Attachment #8913224 - Flags: review?(rweiss)
:mikedeboer, I don't do the Telemetry review (you can flag :Dexter or :chutten for that since :gfritzsche is out).  I can do data review as a steward, which is separate.

Reply back to this bug thread with the answers to the 8 questions in this doc and I'll address the review?.

https://docs.google.com/document/d/1SSn5w8DfCSkHWJS8DNTd7ya82diWRxaDUFM5aL4UDDo/edit
Flags: needinfo?(mdeboer)
Thanks, Rebecca!

I created a copy (just in case) and filled in the form here:
https://docs.google.com/document/d/17oQhXg-EzSe4uc65cV4Jvj0HFDSYms_OrcZ6cehO93U/edit#

Can you please take a look?
Flags: needinfo?(mdeboer) → needinfo?(rweiss)
Attachment #8913224 - Flags: review?(rweiss) → review?(chutten)
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

https://reviewboard.mozilla.org/r/184626/#review190138

Your use and testing of Telemetry is grand, but the histogram would work better as a scalar.

::: toolkit/components/telemetry/Histograms.json:6739
(Diff revision 3)
>      "expires_in_version": "default",
>      "kind": "enumerated",
>      "n_values": 50,
>      "description": "Session restore: Number of tabs restored eagerly in the session that has just been restored."
>    },
> +  "FX_SESSION_RESTORE_WORKER_RESTART_COUNT": {

boolean histograms are for measuring counts of trues and falses. You're only using it to count trues, so it'd be more compact and easier to analyse if you wrote this as a uint scalar instead.

See https://firefox-source-docs.mozilla.org/toolkit/components/telemetry/telemetry/collection/scalars.html for more information

(the tl;dr is, use Scalars.yaml instead of Histograms.json, use a category.name that's dotted and a scalar_name that's underscored, and then every other field is essentially the same)

::: toolkit/components/telemetry/Histograms.json:6740
(Diff revision 3)
>      "kind": "enumerated",
>      "n_values": 50,
>      "description": "Session restore: Number of tabs restored eagerly in the session that has just been restored."
>    },
> +  "FX_SESSION_RESTORE_WORKER_RESTART_COUNT": {
> +    "record_in_processes": ["main", "content"],

session restore workers operate in content child processes? I would've thought they were parent-process only.
Attachment #8913224 - Flags: review?(chutten) → review-
I was hoping for something better to be available, thanks!
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

https://reviewboard.mozilla.org/r/184626/#review190152

::: browser/components/sessionstore/test/unit/test_scalar_worker_restarts.js:21
(Diff revision 4)
> + * to be registered in "toolkit/components/telemetry/Scalars.yaml".
> + * This test ensures that the scalar is registered and empty.
> + */
> +add_task(async function test_ensure_scalar_is_empty() {
> +  const scalars = Telemetry.snapshotScalars(Ci.nsITelemetry.DATASET_RELEASE_CHANNEL_OPTOUT, false).parent;
> +  Assert.ok(!scalars, "Sanity check; no scalars should be there yet.");

This might be fragile, depending on who adds new scalars and when they're recorded to. Maybe Assert that !(ScalarId in scalars)?

::: toolkit/components/telemetry/Scalars.yaml:370
(Diff revision 4)
> +    bug_numbers:
> +      - 1402267
> +    description: >
> +      A counter incremented every time the SessionFile worker is restarted due
> +      to too many failures.
> +    expires: "70"

You bumped from 64 to 70. On purpose?
Attachment #8913224 - Flags: review?(chutten) → review-
(In reply to Chris H-C :chutten from comment #24)
> You bumped from 64 to 70. On purpose?

Myeah, I was thinking 6mo... which version would that be? 70 was the highest number I could find in Scalars.yaml TBH...
So is 64 better? Fx65 would be closer to 6mo, no?
Flags: needinfo?(chutten)
Usually 5-6 versions from whatever's on Nightly will get you more than six month's reprieve. *checks release calendar* March 5, 2018 is listed as the merge of 60 to beta. 70 pushes the expiry extension question into... 2019? 2020?

64's more in line with what I'm used to seeing, especially for a new measurement. 65's over a year from now (Oct 15, 2018)

Of course, rweiss and the Data Peers (great band name) are the ones who'll pass the final ruling on whether your expiry version's too generous.
Flags: needinfo?(chutten)
Cool, 64's mooore than fine! We/ I need a fixed, no too far away point in time where I'm 'forced' to evaluate the data.
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

https://reviewboard.mozilla.org/r/184626/#review190168
Attachment #8913224 - Flags: review?(chutten) → review+
Rebecca, do you have time to check out https://docs.google.com/document/d/17oQhXg-EzSe4uc65cV4Jvj0HFDSYms_OrcZ6cehO93U/edit soonish? Thanks so much!
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

https://reviewboard.mozilla.org/r/184626/#review192846
Attachment #8913224 - Flags: review?(ttaubert) → review+
Comment on attachment 8911115 [details]
Bug 1402267 - Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref.

https://reviewboard.mozilla.org/r/182610/#review192848
Attachment #8911115 - Flags: review?(ttaubert) → review+
Since I have a gut feeling that this is pretty important for Sessionstore stability, is OK to land this without an explicit review from Rebecca, Chris, or are you able to delegate it to someone else for me?
It's been two weeks :(
Flags: needinfo?(chutten)
It is not okay to land without data review, but we can ask for data review from a different data peer. I've redirected from :rweiss to :liuche
Flags: needinfo?(rweiss)
Flags: needinfo?(liuche)
Flags: needinfo?(chutten)
Thanks Chris!
Comment on attachment 8913224 [details]
Bug 1402267 - Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche,

https://reviewboard.mozilla.org/r/184626/#review195338

data-review only. This Scalar telemetry probe collects Type 1 data about SessionFile workers.

::: toolkit/components/telemetry/Scalars.yaml:369
(Diff revision 5)
> +  worker_restart_count:
> +    bug_numbers:
> +      - 1402267
> +    description: >
> +      A counter incremented every time the SessionFile worker is restarted due
> +      to too many failures.

Is there a definition for "too many failuers", or is that referenced somewhere?
Attachment #8913224 - Flags: review+
As a side note, in the future, please flag me on each patch with telemetry probes that requires data-review (unless it's a bigger issue about "can we even collect data on this") rather than a general NI - I'll see and get to it more quickly :)
Flags: needinfo?(liuche)
Thanks!! The 'too many failures' references to the `kMaxWriteFailures` constant, which is set to '5' in the new 'browser.sessionstore.max_write_failures' pref.
This means: after 5 reported failures, the SessionFile worker is restarted.
Pushed by mdeboer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/94656e37589a
Allow PromiseWorkers to be 'restarted', which is terminate the current worker and instantiate a new one when needed. r=Yoric
https://hg.mozilla.org/integration/autoland/rev/57bb241801c0
Restart the SessionWorker each time there are failures reported as much as defined in the 'browser.sessionstore.max_write_failures' pref. r=ttaubert
https://hg.mozilla.org/integration/autoland/rev/d8c1c8894971
Add a scalar telemetry probe that tracks SessionFile worker restarts. data-r=liuche, r=chutten,liuche,ttaubert
Depends on: 1427007
See Also: → 1471194
You need to log in before you can comment on or make changes to this bug.