Closed Bug 959130 Opened 6 years ago Closed 6 years ago

don't use a ChromeWorker to read the session file at startup

Categories

(Firefox :: Session Restore, defect, major)

defect
Not set
major

Tracking

()

RESOLVED FIXED
Firefox 29
Tracking Status
firefox27 --- wontfix
firefox28 + fixed
firefox29 + fixed
firefox30 --- fixed
b2g-v1.3 --- fixed

People

(Reporter: Yoric, Assigned: smacleod)

References

(Depends on 1 open bug)

Details

(Keywords: perf, regression, Whiteboard: [talos_regression][qa-])

Attachments

(8 files, 4 obsolete files)

10.51 KB, application/force-download
Details
3.94 KB, patch
Details | Diff | Splinter Review
2.44 KB, patch
Details | Diff | Splinter Review
10.46 KB, application/force-download
Details
3.45 KB, patch
Details | Diff | Splinter Review
5.32 KB, patch
ttaubert
: review+
Details | Diff | Splinter Review
3.56 KB, patch
ttaubert
: review+
Details | Diff | Splinter Review
5.21 KB, patch
Details | Diff | Splinter Review
Graph http://screencast.com/t/cafAln40oXU, collected by joy, indicates a regression in the time between firstpaint and sessionrestored (this is measure d2 in the graph) in Firefox 25.

We need to investigate this. Note that this may be Session Restore, Add-on Manager or some other stuff that takes place between firstpaint and sessionrestored.
Summary: Investigate Firefox 25 regression on sessiontrestored → Investigate Firefox 25 regression on sessionrestored
My current assumption is that the sessionrestored notification fires a tad later due to moving I/O to a ChromeWorker. Reading and creating backup copies might be de-prioritized and thus leading to us firing notifications later.
(In reply to Tim Taubert [:ttaubert] from comment #2)
> My current assumption is that the sessionrestored notification fires a tad
> later due to moving I/O to a ChromeWorker. Reading and creating backup
> copies might be de-prioritized and thus leading to us firing notifications
> later.

Also, there is a cost associated to launching the worker.
Joy, before we can proceed, we need to know the duration of this regression.
Flags: needinfo?(sguha)
From 1% sample, considering all the release US profiles

For each profile compute, the mean of sessionRestored - firstpaint across sessions per week for a given version.
Versions are in the set {24, 25.0, 25.0.1, 26.0}

This is the geometric mean of d2 across all profiles for a version across weeks 

d2 = mean (sessionRestored - firstpaint in millisecond) 

       ver        d2 = mean (sessionRestored - firstpaint in millisecond) 
1:   24.0$  245.1465
2:   25.0$  897.2273
3: 25.0.1$  686.6769
4:   26.0$  747.9843

Hope this helps
Cheers
Saptarshi
Flags: needinfo?(sguha)
(In reply to "Saptarshi Guha[:joy]" from comment #6)
> From 1% sample, considering all the release US profiles
> 
> For each profile compute, the mean of sessionRestored - firstpaint across
> sessions per week for a given version.
> Versions are in the set {24, 25.0, 25.0.1, 26.0}
> 
> This is the geometric mean of d2 across all profiles for a version across
> weeks 
> 
> d2 = mean (sessionRestored - firstpaint in millisecond) 
> 
>        ver        d2 = mean (sessionRestored - firstpaint in millisecond) 
> 1:   24.0$  245.1465
> 2:   25.0$  897.2273
> 3: 25.0.1$  686.6769
> 4:   26.0$  747.9843
> 
> Hope this helps
> Cheers
> Saptarshi

If this is the geometric mean, can we get some measure of standard deviation or such as well? Some of us discussed earlier today that it might be that we caused an issue for a limited number of people and that's manipulating the average. Diagnosing that as opposed to a 'general' slowdown would help.
Flags: needinfo?(sguha)
(In reply to "Saptarshi Guha[:joy]" from comment #6)
> From 1% sample, considering all the release US profiles
> 
> For each profile compute, the mean of sessionRestored - firstpaint across
> sessions per week for a given version.
> Versions are in the set {24, 25.0, 25.0.1, 26.0}
> 
> This is the geometric mean of d2 across all profiles for a version across
> weeks 
> 
> d2 = mean (sessionRestored - firstpaint in millisecond) 
> 
>        ver        d2 = mean (sessionRestored - firstpaint in millisecond) 
> 1:   24.0$  245.1465
> 2:   25.0$  897.2273
> 3: 25.0.1$  686.6769
> 4:   26.0$  747.9843

Is that in milliseconds?

Looking at the patch, I believe that we have added the following steps:
- launching the worker;
- writeLoadStateOnceAfterStartup (in case of sync fallback).
Tim and I have have an alternative hypothesis we're looking into. Bug 898308 is also present in the range, which moves the session store initialization for the first window to a little later in the delayed startup process.

Before the change we would initialize in the _delayedStartup() function, and after we would wait until "delayed-startup-finished" was fired. This moves a significant ammount of code to before initialization.

Tim took some initial measurements and on his rMBP this accounts for ~34ms, and ~70ms on linux. On a slower machine this could be even larger.

There is also some Windows specific initialization that occurs in between, which we're trying to get measurements on. 


(In reply to "Saptarshi Guha[:joy]" from comment #6)
> From 1% sample, considering all the release US profiles
> 
> For each profile compute, the mean of sessionRestored - firstpaint across
> sessions per week for a given version.
> Versions are in the set {24, 25.0, 25.0.1, 26.0}
> 
> This is the geometric mean of d2 across all profiles for a version across
> weeks 
> 
> d2 = mean (sessionRestored - firstpaint in millisecond) 
> 
>        ver        d2 = mean (sessionRestored - firstpaint in millisecond) 
> 1:   24.0$  245.1465
> 2:   25.0$  897.2273
> 3: 25.0.1$  686.6769
> 4:   26.0$  747.9843
> 
> Hope this helps
> Cheers
> Saptarshi

joy, a breakdown by Operating system could be useful, would you be able to provide this?
(In reply to Steven MacLeod [:smacleod] from comment #9)
> Tim took some initial measurements and on his rMBP this accounts for ~34ms,
> and ~70ms on linux. On a slower machine this could be even larger.

The regression seems to be one order of magnitude larger.
I seem to remember measuring that launching a worker could explain 300ms on my machine.
Attached file foo.pdf
PDF 
D := sessionrestored - firstpaint 

page 1: Box Plot of D in seconds, x axis is a log scale

To read the following: points above the grey line (y==x) implies that quantiles of the variable on the Y axis > quantiles of variable on X axis

page 2: QQ Plot of log(D) for v25 vs log(D) for v24
page 3: QQ Plot of log(D) for v25.0.1 vs log(D) for v24
page 4: QQ Plot of log(D) for v26 vs log(D) for v24


Percentiles (seconds)

       p    v24.0$      v25.0$     v26.0$    v25.0.1$
 1: 0.00      0.000      0.000     0.000       0.000
 2: 0.01      0.019      0.059     0.054       0.062
 3: 0.05      0.030      0.115     0.095       0.112
 4: 0.10      0.041      0.163     0.128       0.153
 5: 0.20      0.063      0.261     0.185       0.229
 6: 0.30      0.090      0.384     0.251       0.317
 7: 0.40      0.129      0.553     0.335       0.431
 8: 0.50      0.186      0.792     0.454       0.587
 9: 0.60      0.280      1.156     0.635       0.815
10: 0.70      0.453      1.759     0.951       1.188
11: 0.80      0.842      2.884     1.636       1.904
12: 0.90      2.139      5.622     3.821       3.805
13: 0.95      4.544     10.016     8.259       6.952
14: 0.99     19.209     34.244    43.900      27.318
15: 1.00 918350.860 760061.238 80008.623 1245991.998
Flags: needinfo?(sguha)
We profiled on Paolo's machine and got the following profile:

https://people.mozilla.org/~bgirard/cleopatra/#report=01fc530bf31ad36ba35adbf295aa690989ccb8aa

Interesting is that we seem to wait a lot (~4s) on system calls in WaitForMessage(). This function has changed in Fx 25 (bug 896896):

https://hg.mozilla.org/mozilla-central/rev/4887f7d34df2
Correction: this function has been added in Fx 25.
If anyone wants to give it a try and compare startup times, here's a patch backing out bug 896896. I unfortunately don't have a Windows machine at hand until next week.
(In reply to Tim Taubert [:ttaubert] from comment #12)
> We profiled on Paolo's machine and got the following profile:

Did Paolo get the time to perform several rounds of profiling to check whether this was stable diagnosis?

> Interesting is that we seem to wait a lot (~4s) on system calls in
> WaitForMessage(). This function has changed in Fx 25 (bug 896896):
> 
> https://hg.mozilla.org/mozilla-central/rev/4887f7d34df2

Nicholas, any chance that you could shed some light on this?
Flags: needinfo?(ncameron)
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #15)
> (In reply to Tim Taubert [:ttaubert] from comment #12)
> > We profiled on Paolo's machine and got the following profile:
> 
> Did Paolo get the time to perform several rounds of profiling to check
> whether this was stable diagnosis?
> 
> > Interesting is that we seem to wait a lot (~4s) on system calls in
> > WaitForMessage(). This function has changed in Fx 25 (bug 896896):
> > 
> > https://hg.mozilla.org/mozilla-central/rev/4887f7d34df2
> 
> Nicholas, any chance that you could shed some light on this?

AFAIK, the implementation of WaitForMessage should be no less efficient than the previous use of WaitMessage. But these are some deep, dark windows internals things (perf-wise) so I might be wrong. We've been backing that change out of beta and release because it causes issues for some users. But as far as the docs go, it is the right approach. And it does prevent a real bug which appears when we use OMTC.

The time spent on system calls is probably just sleeping, waiting for a message. That should not be a problem, unless you think that is time that should be spent doing something.
Flags: needinfo?(ncameron)
(In reply to Nick Cameron [:nrc] from comment #16)
> The time spent on system calls is probably just sleeping, waiting for a
> message. That should not be a problem, unless you think that is time that
> should be spent doing something.

Having the main thread sleep a total of 4 seconds during startup seems a little surprising.
(In reply to David Rajchenbach Teller [:Yoric] (please use "needinfo?") from comment #17)
> (In reply to Nick Cameron [:nrc] from comment #16)
> > The time spent on system calls is probably just sleeping, waiting for a
> > message. That should not be a problem, unless you think that is time that
> > should be spent doing something.
> 
> Having the main thread sleep a total of 4 seconds during startup seems a
> little surprising.

Yeah, that seems bad. I am not sure if that is because of WaitForMultipleObjectsEx, or if there is some kind of *lock that would be there anyway which involves it. I am by no means an expert on this Windows API stuff.
We don't have the distribution of the regression across operating systems, but we suspect that it's Windows only. Testing under Windows seems to indicate a large regression (see comment 12) and small-to-trivial regressions on MacOS X and Linux that probably match bug 959598.

The attached patch lets us test without initializing or waiting for SessionWorker. Don't use this with your everyday profile.
Same  interpretation as Comment[11]. This is for Windows XP.


       p    x.24.0$   x.25.0$   x.26.0$  x.25.0.1$
 1: 0.00      0.000     0.000     0.000      0.000
 2: 0.01      0.029     0.103     0.081      0.098
 3: 0.05      0.054     0.197     0.149      0.184
 4: 0.10      0.078     0.296     0.211      0.264
 5: 0.20      0.127     0.492     0.334      0.418
 6: 0.30      0.197     0.719     0.476      0.593
 7: 0.40      0.300     1.006     0.652      0.806
 8: 0.50      0.451     1.406     0.892      1.090
 9: 0.60      0.675     1.987     1.256      1.497
10: 0.70      1.057     2.930     1.871      2.153
11: 0.80      1.856     4.573     3.144      3.374
12: 0.90      4.184     8.702     6.880      6.529
13: 0.95      8.440    15.623    14.526     11.933
14: 0.99     34.828    56.922    78.442     48.304
15: 1.00 918350.860 94209.719 62206.390 955376.719
Looking at those numbers it seems very likely that bug 896896 is the culprit. Probably in combination with SessionStore switching to its own I/O worker.

Fx 24 was good, Fx 25 was bad. Fx 25.0.1 was slightly better again because we backed out the WaitForMessage() change [1].

As Nick said this was backed out from Fx 26 as well but the numbers are still bad probably because we have all the Nightly, Beta, and Aurora usage data - which we don't have for 25.0.1.

[1] https://hg.mozilla.org/releases/mozilla-release/rev/33b48d5a8be3
(In reply to "Saptarshi Guha[:joy]" from comment #20)
> Created attachment 8361034 [details]
> Updated Numbers for WinXP

Joy, thank you for all the numbers. I'd be really interested if we see the same numbers on Windows 7. What about Linux or Mac? According to our current theory they shouldn't be at all or just slightly affected. I really hope it's not too much work getting that data :)
Flags: needinfo?(sguha)
(In reply to Tim Taubert [:ttaubert] from comment #22)
> (In reply to "Saptarshi Guha[:joy]" from comment #20)
> > Created attachment 8361034 [details]
> > Updated Numbers for WinXP
> 
> Joy, thank you for all the numbers. I'd be really interested if we see the
> same numbers on Windows 7. What about Linux or Mac? According to our current
> theory they shouldn't be at all or just slightly affected. I really hope
> it's not too much work getting that data :)
Not a problem. Will get them and update the bug accordingly
Flags: needinfo?(sguha)
After some more investigation it looks like WaitForMessage() isn't to blame. Bug 959598 accounts for a huge delay on startup, on my XP VM it takes about 900ms to initialize the worker and read the session file.

Another problem are the sessionstore initialization changes. For small to medium sessions where we finish reading long before delayed-startup-finished is fired we wait 70-100ms until restoring the session into the initial window only because of all the stuff being called in delayedStartup(). We can fix this by calling SessionStore.initSession() back from where it was.
Here's a patch that restores using OS.File instead of SessionWorker to read the session.
Tim, could you test the performance with that patch?
Flags: needinfo?(ttaubert)
Your patch unfortunately doesn't apply at all but I just wrote my own version :)

OS.File with dedicated Worker: ~800ms
OS.File:                     : ~800ms (your patch)
NetUtil.asyncFetch()         : ~ 70ms
Flags: needinfo?(ttaubert)
Results on my MBP are as follows:

It takes about ~220ms from calling |new ChromeWorker(worker.js)| to executing the first line of |worker.js|. Starting up workers seems really slow.

The other thing that isn't really great is that |importScripts("resource://gre/modules/osfile.jsm")| alone takes 60ms off the worker's "startup time" until it's ready to handle I/O requests.
(In reply to Tim Taubert [:ttaubert] from comment #28)
> It takes about ~220ms from calling |new ChromeWorker(worker.js)| to
> executing the first line of |worker.js|. Starting up workers seems really
> slow.

Just FTR, due to bug 960986 worker startup time is actually more like 600-700ms.
I have a patch to fix the initial session store read almost complete, which I'll post in the next day or two. It uses NetUtil.asyncFetch, but I'll also try with nsIDomFileReader and see what's faster. There will be a bit of a hack due to Bug 937220, but we should be able to remove it easily once that bug is fixed.
Depends on: 937220
(In reply to Steven MacLeod [:smacleod] from comment #30)
> It uses NetUtil.asyncFetch, but I'll also try with nsIDomFileReader

We should be careful with asyncFetch, since it can introduce main-thread I/O depending on the file stream being used. Feels like an all-native solution would be better still, I think we should prototype it and measure.
(In reply to :Gavin Sharp (email gavin@gavinsharp.com) from comment #31)
> (In reply to Steven MacLeod [:smacleod] from comment #30)
> > It uses NetUtil.asyncFetch, but I'll also try with nsIDomFileReader
> 
> We should be careful with asyncFetch, since it can introduce main-thread I/O
> depending on the file stream being used. Feels like an all-native solution
> would be better still, I think we should prototype it and measure.

I have a native OS.File.read implementation prototype in bug 961665.
Whiteboard: [triage]
This patch switches the SessionStore read over to using the DOMFileReader. To avoid crashes due to worker startup at shutdown we also start the worker with a noop during startup.
Assignee: nobody → smacleod
Attachment #8368075 - Flags: review?(ttaubert)
Attachment #8368075 - Flags: review?(dteller)
With both of the previous patches applied on my machine I see an improvement to the (Session Restored - First Paint) times.

There was a lot of noise in the measurements, but I pulled out what seemed to be a typical case for each. The timestamps appear in order (except for the INFO.* stamps which are printed after Session Restored):

LEGEND:

[CALLING READ] - Time immediately before SessionFile.read() call
[WORKER STARTING] - Time immediately before first message posted to worker (either read, or noop)
[AFTER NOOP CALL] - Time immediately after first message posted to worker (after the post() call)
[WORKER STARTED] - Time of first line of JS execution inside SessionWorker (right before osfile import)
[WORKER OSFILE IMPORTED] - Time immediately after osfile is imported in worker
[READ FINISHED] - Time of first line inside |_onSessionFileRead| in nsSessionStartup

Others are self explanatory 

# BEFORE CHANGES:

## COLD:

1391108642855 [CALLING READ]
1391108642858 [WORKER STARTING]
1391108642859 [AFTER NOOP CALL]
1391108643048 [WORKER STARTED]
1391108643110 [WORKER OSFILE IMPORTED]
1391108643113 [READ FINISHED]
1391108642546 [INFO.PROCESS]
1391108643771 [INFO.FIRSTPAINT]
1391108643853 [INFO.SESSIONRESTORED]
1225 [TIME TO FIRST PAINT]
82 [SESSION RESTORED - FIRST PAINT]
1307 [TIME TO SESSION RESTORED]

## HOT:

1391108651116 [CALLING READ]
1391108651118 [WORKER STARTING]
1391108651119 [AFTER NOOP CALL]
1391108651313 [WORKER STARTED]
1391108651946 [WORKER OSFILE IMPORTED]
1391108651952 [READ FINISHED]
1391108650861 [INFO.PROCESS]
1391108651784 [INFO.FIRSTPAINT]
1391108652008 [INFO.SESSIONRESTORED]
923 [TIME TO FIRST PAINT]
224 [SESSION RESTORED - FIRST PAINT]
1147 [TIME TO SESSION RESTORED]

1391108658025 [CALLING READ]
1391108658026 [WORKER STARTING]
1391108658027 [AFTER NOOP CALL]
1391108658253 [WORKER STARTED]
1391108658845 [WORKER OSFILE IMPORTED]
1391108658847 [READ FINISHED]
1391108657739 [INFO.PROCESS]
1391108658699 [INFO.FIRSTPAINT]
1391108658892 [INFO.SESSIONRESTORED]
960 [TIME TO FIRST PAINT]
193 [SESSION RESTORED - FIRST PAINT]
1153 [TIME TO SESSION RESTORED]


# AFTER CHANGES:

## COLD:

1391109078865 [CALLING READ]
1391109078868 [WORKER STARTING]
1391109078869 [AFTER NOOP CALL]
1391109079058 [READ FINISHED]
1391109079064 [WORKER STARTED]
1391109079128 [WORKER OSFILE IMPORTED]
1391109078564 [INFO.PROCESS]
1391109079776 [INFO.FIRSTPAINT]
1391109079827 [INFO.SESSIONRESTORED]
1212 [TIME TO FIRST PAINT]
51 [SESSION RESTORED - FIRST PAINT]
1263 [TIME TO SESSION RESTORED]

## HOT:

1391108903131 [CALLING READ]
1391108903132 [WORKER STARTING]
1391108903132 [AFTER NOOP CALL]
1391108903325 [READ FINISHED]
1391108903329 [WORKER STARTED]
1391108902873 [INFO.PROCESS]
1391108903828 [INFO.FIRSTPAINT]
1391108903905 [INFO.SESSIONRESTORED]
955 [TIME TO FIRST PAINT]
77 [SESSION RESTORED - FIRST PAINT]
1032 [TIME TO SESSION RESTORED]
1391108904100 [WORKER OSFILE IMPORTED]

1391108908894 [CALLING READ]
1391108908896 [WORKER STARTING]
1391108908896 [AFTER NOOP CALL]
1391108909093 [READ FINISHED]
1391108909096 [WORKER STARTED]
1391108908650 [INFO.PROCESS]
1391108909542 [INFO.FIRSTPAINT]
1391108909630 [INFO.SESSIONRESTORED]
892 [TIME TO FIRST PAINT]
88 [SESSION RESTORED - FIRST PAINT]
980 [TIME TO SESSION RESTORED]
1391108909803 [WORKER OSFILE IMPORTED]
Comment on attachment 8368076 [details] [diff] [review]
Patch 2 - Stop using OS.File for startup read in CrashMonitor.jsm

Review of attachment 8368076 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashmonitor/CrashMonitor.jsm
@@ +95,5 @@
>     *
>     * @return {Promise} A promise that resolves/rejects once loading is complete
>     */
>    loadPreviousCheckpoints: function () {
> +    let d = Promise.defer();

Nit: maybe call it "deferred"?

@@ +99,5 @@
> +    let d = Promise.defer();
> +    CrashMonitorInternal.previousCheckpoints = d.promise;
> +
> +    let reader = Cc["@mozilla.org/files/filereader;1"].
> +                 createInstance(Ci.nsIDOMFileReader);

Nit: please indent by two spaces.

@@ +108,5 @@
> +    }
> +
> +    reader.onload = function () {
> +      let notifications = null;
> +      let decoder = new TextDecoder();

We could combine the listeners by using reader.addEventListener("loadend", ...). When that is called we just check reader.error?

@@ +114,2 @@
>        try {
> +        let contents = decoder.decode(Uint8Array(reader.result));

Can this throw? If not we should move it out of the try stmt.

@@ +114,3 @@
>        try {
> +        let contents = decoder.decode(Uint8Array(reader.result));
> +        notifications = Object.freeze(JSON.parse(contents));

Same here for freeze(), we could do that below when calling resolve().

@@ +117,2 @@
>        } catch (ex) {
>          Cu.reportError("Error while loading crash monitor data: " + ex);

Nit: "... error while parsing ..."

@@ +126,5 @@
> +
> +    try {
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.

Nit: retrieve

@@ +127,5 @@
> +    try {
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.
> +      let file = File(dsFile);

Move that out of the try stmt?

@@ +128,5 @@
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.
> +      let file = File(dsFile);
> +      reader.readAsArrayBuffer(file);

How about using reader.readAsText(file, "utf-8")? We could get rid of the TextDecoder and Uint8Array usage then, no?

@@ +133,5 @@
> +    } catch (ex if ex.result == Cr.NS_ERROR_FILE_NOT_FOUND) {
> +      d.resolve(null);
> +    } catch (ex) {
> +      Cu.reportError("Error while loading crash monitor data: " + ex);
> +      d.resolve(null);

Nit: maybe:

} catch (ex) {
  if (ex.result == Cr.NS_ERROR_FILE_NOT_FOUND) {
    Cu.reportError("Error while loading crash monitor data: " + ex);
  }
  d.resolve(null);
}
Attachment #8368076 - Flags: feedback+
(In reply to Tim Taubert [:ttaubert] from comment #36)
> @@ +128,5 @@
> > +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> > +      // cause synchronous disk access as they check for existence and
> > +      // retreieve file meta-data before the file is read asynchronously.

(nit: retrieve)

That's rather unfortunate :( IIRC Yoric suggested you file a bug (I don't see one on file). Perhaps we can fix it quickly?
Comment on attachment 8368075 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown

Review of attachment 8368075 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +152,5 @@
> +   *        string The name of the session file.
> +   * @returns {promise}
> +   */
> +  _readSessionFile: function (filename) {
> +    let d = Promise.defer();

Nit: deferred

@@ +161,5 @@
> +    reader.onerror = function () {
> +      d.reject(reader.error);
> +    }
> +
> +    reader.onload = function () {

addEventListener("loadend") and check reader.error?

@@ +181,5 @@
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.
> +      let file = File(dsFile);
> +      reader.readAsArrayBuffer(file);

readAsText()?

@@ +184,5 @@
> +      let file = File(dsFile);
> +      reader.readAsArrayBuffer(file);
> +    } catch (ex) {
> +      d.reject(ex);
> +    }

Do we need to reject here? Wouldn't we just continue with the next file as the exception is caught in read()?
Attachment #8368075 - Flags: review?(ttaubert) → feedback+
Severity: normal → major
Comment on attachment 8368075 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown

Review of attachment 8368075 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +127,5 @@
>     */
>    _isClosed: false,
>  
>    read: function () {
> +    SessionWorker.post("noop");

Please add a comment explaining why you do this.
Also, couldn't we do this in app-startup in nsSessionStartup.js?

@@ +179,5 @@
> +
> +    try {
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.

Mention the bug number.
Also, "retrieve."

@@ +181,5 @@
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.
> +      let file = File(dsFile);
> +      reader.readAsArrayBuffer(file);

readAsText() would simplify the code, and might possibly be faster.

::: browser/components/sessionstore/src/SessionWorker.js
@@ +71,3 @@
>     */
> +  noop: function () {
> +    return {result: true};

Let's call it |init|.
Attachment #8368075 - Flags: review?(dteller) → feedback+
FTR, testing both patches on my machine yielded good but not great results. They took as long as reading using the ChromeWorker (200ms) but that was after all an almost constant time and didn't increase to 800ms with a filled scache anymore.

Trying NetUtil.asyncFetch() I saw 140ms read times. We had discarded that idea earlier because it seemed hard to get the encoding right but I think we found code we can steal from Metro. Steven will give that a try.
Comment on attachment 8368076 [details] [diff] [review]
Patch 2 - Stop using OS.File for startup read in CrashMonitor.jsm

Review of attachment 8368076 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashmonitor/CrashMonitor.jsm
@@ +121,5 @@
> +      d.resolve(notifications);
> +    }
> +
> +    let dsFile = Services.dirsvc.get("ProfD", Ci.nsIFile);
> +    dsFile.append("sessionCheckpoints.json");

I'd prefer if you created the nsIFile and initialized it from CrashMonitorInternal.path.

@@ +128,5 @@
> +      // Note that both the call to File(...) and readAsArrayBuffer(...) will
> +      // cause synchronous disk access as they check for existence and
> +      // retreieve file meta-data before the file is read asynchronously.
> +      let file = File(dsFile);
> +      reader.readAsArrayBuffer(file);

As in the other patch, readAsText would probably be better.

@@ +133,5 @@
> +    } catch (ex if ex.result == Cr.NS_ERROR_FILE_NOT_FOUND) {
> +      d.resolve(null);
> +    } catch (ex) {
> +      Cu.reportError("Error while loading crash monitor data: " + ex);
> +      d.resolve(null);

As Tim, but with a != instead of a ==.
Attachment #8368076 - Flags: review?(dteller) → feedback+
I left the noop worker message (now "init") inside sessionfile to keep things as simple as possible. Patch now uses NetUtil.
Attachment #8368075 - Attachment is obsolete: true
Attachment #8368946 - Flags: review?(ttaubert)
Patch now uses NetUtil.
Attachment #8368076 - Attachment is obsolete: true
Attachment #8368949 - Flags: review?(ttaubert)
Comment on attachment 8368946 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown v2

Review of attachment 8368946 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +129,5 @@
>  
>    read: function () {
> +    // We must initialize the worker during startup so it will be ready to
> +    // perform the final write. If shutdown happens soon after startup and
> +    // the worker has not started yet it will cause lockups.

Nit: I hope the remaining lockup will be fixed soon. The actual reason we need to do it here is possible dataloss.

@@ +133,5 @@
> +    // the worker has not started yet it will cause lockups.
> +    // See Bug 964531.
> +    SessionWorker.post("init");
> +
> +    return Task.spawn(function () {

Nit: function* () { [if that works with .bind(this), if not just ignore me]
Attachment #8368946 - Flags: review?(ttaubert) → review+
Comment on attachment 8368946 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown v2

Review of attachment 8368946 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/components/sessionstore/src/SessionFile.jsm
@@ +139,5 @@
> +      for (let filename of [this.path, this.backupPath]) {
> +        try {
> +          data = yield this._readSessionFile(filename);
> +          break;
> +        } catch (ex if ex.result == Cr.NS_ERROR_FILE_NOT_FOUND) {

Wait, did you try this? If you reject(status) below what's the exception that then arrives here? Does rejecting a promise in Task.jsm cause an exception? I never tried this. Would be great to maybe remove all the files in your profile dir and check that it works as expected.
Comment on attachment 8368949 [details] [diff] [review]
Patch 2 - Stop using OS.File for startup read in CrashMonitor.jsm v2

Review of attachment 8368949 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM

::: toolkit/components/crashmonitor/CrashMonitor.jsm
@@ +102,5 @@
> +    let file = FileUtils.File(CrashMonitorInternal.path);
> +    NetUtil.asyncFetch(file, function(inputStream, status) {
> +      if (!Components.isSuccessCode(status)) {
> +        if (status != Cr.NS_ERROR_FILE_NOT_FOUND) {
> +          Cu.reportError("Error while loading crash monitor data: " + ex);

"ex" isn't defined here.
Attachment #8368949 - Flags: review?(ttaubert) → review+
Attachment #8368949 - Attachment is obsolete: true
Attachment #8369392 - Flags: review?(ttaubert)
Comment on attachment 8369390 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown v3

Review of attachment 8369390 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM!
Attachment #8369390 - Flags: review?(ttaubert) → review+
Comment on attachment 8369392 [details] [diff] [review]
Patch 2 - Stop using OS.File for startup read in CrashMonitor.jsm v3

Review of attachment 8369392 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/components/crashmonitor/CrashMonitor.jsm
@@ +119,2 @@
>        } catch (ex) {
>          Cu.reportError("Error while loading crash monitor data: " + ex);

Nit: Error while parsing...
Attachment #8369392 - Flags: review?(ttaubert) → review+
No longer blocks: fxdesktopbacklog
Whiteboard: [triage]
I assume this won't make Fx 29 but let's see.
Comment on attachment 8369390 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown v3

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 959130 (This bug)
User impact if declined: Startup perf regression documented in this bug
Testing completed (on m-c, etc.): Tested locally, pushed to try, landing [1]
Risk to taking this patch (and alternatives if risky): Shouldn't be very risky, just a small change to read the file differently outside of the worker.
String or IDL/UUID changes made by this patch: None


[1] https://hg.mozilla.org/integration/fx-team/rev/13d1ff429b44
Attachment #8369390 - Flags: approval-mozilla-aurora?
Since it hasn't landed to central yet, we might have to uplift this to Aurora/Beta after the merge - tracking so we make sure it does.
It looks like there has been a 27% talos regression on "Fx-Team-Non-PGO - Tp5 No Network Row Major MozAfterPaint (Non-Main Startup File IO Bytes)".

https://groups.google.com/forum/#!topic/mozilla.dev.tree-management/StAkmt81_ec
http://mzl.la/1fqYBZ9

Tp5 is a page loader test but the "Startup File IO Bytes" hints to it measuring I/O going through file channels at startup. We previously omitted those channels by using OS.File so it seems reasonable that there is of course more I/O reported now.

I think it is safe to ignore but wanted to write it down in case someone thinks otherwise.

Joel, can you confirm that this test regressed only because of more bytes going through the file channels at startup? Other Tp5 tests would have regressed as well if this would really impact page load time, correct?
Flags: needinfo?(jmaher)
Flags: needinfo?(jmaher)
Whiteboard: [talos_regression]
this test is xperf specfically- we measure bytes on startup and during loading pages and report the total bytes from those accessed files.  When these numbers change it is usually related to loading DLL's that we didn't load previously (sometimes things are delay loaded) or related to some new feature that adds/requires libraries which were not there before.
If this is a xperf test, this means that OS.File was not hiding byte usage.
I strongly doubt it's related to this bug.
At the moment, none of the Talos tests has a sessionstore.js, so I can't imagine us causing 125kb+ of I/O during startup. Barring any mistake, all the code of asyncFetch resides in libxul and libc/win32.dll or equivalent, so us causing a dll load sounds unlikely, too.
Ah, I didn't think about those startup profiles not having a sessionstore.js file. It's probably completely unrelated then and safe to ignore.
Attachment #8369390 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
No longer depends on: 960952, 960986
Summary: Investigate Firefox 25 regression on sessionrestored → don't use a ChromeWorker to read the session file at startup
Depends on: 967766
Depends on: 970043
No longer depends on: 967766
Can we get an uplift nomination for Beta here?
Flags: needinfo?(smacleod)
Comment on attachment 8369390 [details] [diff] [review]
Patch 1 - Stop using the SessionWorker for read at startup to avoid slowdown v3

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 959130 (This bug)
User impact if declined: Startup perf regression documented in this bug
Testing completed (on m-c, etc.): On m-c and aurora currently.
Risk to taking this patch (and alternatives if risky): Introduces a small talos regression (Bug 970043) which we believe is worth the much larger perf win introduced. Otherwise pretty low risk.
String or IDL/UUID changes made by this patch: None
Attachment #8369390 - Flags: approval-mozilla-beta?
Flags: needinfo?(smacleod)
Attachment #8369390 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Steven, can you please provide a patch that applies on Beta?
Flags: needinfo?(smacleod)
Patch for beta uplift is now attached.
Deprioritizing for QA testing before we release. If there's any chance this could cause regressions in Session Restore functionality then please flag for testing by removing [qa-] and adding the verifyme keyword.
Whiteboard: [talos_regression] → [talos_regression][qa-]
You need to log in before you can comment on or make changes to this bug.