Open Bug 1353586 Opened 7 years ago Updated 2 years ago

[meta] _saveStateAsync shouldn't pile up on existing jank

Categories

(Firefox :: Session Restore, enhancement, P3)

enhancement

Tracking

()

People

(Reporter: florian, Unassigned, NeedInfo)

References

(Depends on 1 open bug, Blocks 2 open bugs)

Details

(Keywords: meta, perf, stale-bug, Whiteboard: [fxperf])

When profiling, I frequently see _saveStateAsync from SessionSaver.jsm in blocks of time when the browser is unresponsive. This is the result of using setTimeout, that sometimes makes the callback execute at the most inconvenient time. We should use some equivalent of window.requestIdleCallback instead.
Blocks: 1354723
Flags: qe-verify-
Priority: -- → P2
Whiteboard: [photon] → [photon-performance]
Hi :florian, I think it's nice to improve the saving work in SessionSaver.jsm, which could become worse if the file size of backup files(recovery.js) is bloat enough and saving interval is adjusted to a smaller one (less possibility, though). 

Would you mind sharing one or two profiling result here, if possible?

Thanks a lot :)
We discussed this with Florian and we need to make sure that we use profiles from builds with the fixes from bug 1305950, which it wasn't clear if they were available at the time this bug was filed.
(In reply to Panos Astithas [:past] (please needinfo?) from comment #2)
> We discussed this with Florian and we need to make sure that we use profiles
> from builds with the fixes from bug 1305950, which it wasn't clear if they
> were available at the time this bug was filed.

Would it be possible to easily get a profile using m-c over the past few days?
Thanks!
Assignee: nobody → wiwang
Status: NEW → ASSIGNED
Priority: P2 → P1
Here is something I saw today in a profile from a current nightly: https://perfht.ml/2qKqBGV
Here is another profile where _saveStateAsync ran while the browser was already janky while trying to display the "Are you sure you want to close N tabs?" dialog: https://perfht.ml/2qOA7so
No longer blocks: photon-performance-triage
(In reply to Florian Quèze [:florian] [:flo] from comment #5)
> Here is another profile where _saveStateAsync ran while the browser was
> already janky while trying to display the "Are you sure you want to close N
> tabs?" dialog: https://perfht.ml/2qOA7so

Thanks for your distinct profile, Florian! 
Now I'm also observing this phenomenon in my several profiles.

While this bug is still blocked by bug 1353206(requestIdleCallback API implementation for non-DOM JS execution contexts),
I think there are some topics we could discuss in advance:

(1) Deadline timeout in the requestIdleCallback API: 
    Since we can't postpone the session saving forever, a deadline timeout is needed. It could be hard to set a proper timeout, which could bypass most jank and doesn't overly increase the possibility of losing session data. It will become more complicated given that some(or few?) users would adjust their `browser.sessionstore.interval`.

(2) Besides requestIdleCallback API, is there any other way to avoid jank?
    There might be other better way, or if the API implementation is not done yet.

(3) Less data collection time:
    This is likely off-topic, however, if there is any low-hanging fruit which can reduce collection time(I don't think so), we should do it.

Besides, to be clear, size of session restore files should not be related since the disk writing is off-main-thread by OS.File.

Yoric, would it be possible to you to share a bit of thought about the timeout(1) ? Many thanks for your help!
Flags: needinfo?(dteller)
(In reply to Will Wang [:WillWang] from comment #6)

> (2) Besides requestIdleCallback API, is there any other way to avoid jank?

There are 3 things to consider:

- can we make this happen at a time when it doesn't block the user? (that's the reason for using requestIdleCallback)

- can we reduce the time this takes? (we want to be able to paint a new frame every 16ms, so any code blocking the main thread for most of that time or more is likely to have a visible impact). One idea to explore here may be to collect the information only for tabs that the user has interacted with or that have had load events. I think the current code goes through every browser of every window each time.

- if we can't make this very fast, could we split this work into chunks, so that none of them cause us to skip a frame?
See Also: → 1365970
Whiteboard: [photon-performance] → [reserve-photon-performance]
> Yoric, would it be possible to you to share a bit of thought about the timeout(1) ? Many thanks for your help!

Sorry for the delay, Bugzilla seems to have stopped reminding about these things.

I'll start by saying that I agree with comment 7 :)

Now, I wonder in which case we could end up hitting the deadline timeout. I take it this happens when the user is actively using the computer for a long duration in a row and we want to save during this. I don't think that we can do any miracle here, so yes, we'll need some kind of timeout.

Ideally, we should be able to find out if there is anything interesting to save. It's somewhat orthogonal to the other problems, but if we had an indication of what kind of data has changed, we could probably find a way to categorize between important data (e.g. stuff that takes place in an active tab and involves forms or cookies) from less important data. We could decide that our timeout is short (~1 minute) if important data has been saved, longer (~1h) otherwise. 

I don't know the current state of Session Restore. How hard do you think this would be?

Also, something a bit orthogonal. To determine the maximal ideal duration for Session Restore, is there a way to find out that we have *something* to write? If so, we could feed this information to crash stats and find out how often we actually lose data during a crash.
Flags: needinfo?(dteller) → needinfo?(wiwang)
Whiteboard: [reserve-photon-performance] → [reserve-photon-performance][qf]
Whiteboard: [reserve-photon-performance][qf] → [reserve-photon-performance][qf:p1]
Will, do we have progress on this bug? Thanks.
(In reply to Bobby Chien [:bchien] from comment #9)
> Will, do we have progress on this bug? Thanks.

Yes, thanks to Yoric and Florian, you provide many useful considerations!
For those considerations, these days I'm analyzing over several aspects, which means this bug is in fact a meta bug if we would like to do something more effective, instead of only simply converting the setTimeout() to requestIdleCallback().

Therefore, for instance,
1. I'm trying to reduce the data collection time.(for instance, Bug 1366213)
2. I'm also analyzing the telemetry data to find out the most suitable time which was needed within each frame, to avoid janking the next frame if we put a much bigger task into a small idle slot.
3. I'm also trying to make the granularity of collection process more subtle, which is likely changing the architecture.
4. ... There are some other ongoing discussions for this bug, I'll keep you posted.
Flags: needinfo?(wiwang)
Hi Will, Mike suggested me that I can help you on this bug.

So there are a few questions I want to ask you:
1. What are your progress on this bug?
2. What do we expect after this bug is solved?

It would be great if I work with you on this bug.
Flags: needinfo?(wiwang)
(In reply to Bao Quan [:beekill] from comment #11)
> Hi Will, Mike suggested me that I can help you on this bug.
> 
> So there are a few questions I want to ask you:
> 1. What are your progress on this bug?
> 2. What do we expect after this bug is solved?
> 
> It would be great if I work with you on this bug.

Nice to see your work here, Bao!

It becomes more clear that there are only a few things we can do when IdleDeadline.timeRemaining() is not exposed...
Please refer to the bug 1365970 comment 20 I just made yesterday.
That is, we will likely "overrun the time allotted" as [1] said.
But we still can use idle dispatch as the first step at this moment.

You can also refer to the valuable comment 7 and 8;
In sum, I believe the key part is to (1) split up the data collection and (2) reduce data collection time, if we also want to deal with them in this bug.


[1]
Cooperative Scheduling of Background Tasks API - Web APIs | MDN 
https://developer.mozilla.org/en-US/docs/Web/API/Background_Tasks_API
Flags: needinfo?(wiwang)
(In reply to Will Wang [:WillWang] from comment #12)
> It becomes more clear that there are only a few things we can do when
> IdleDeadline.timeRemaining() is not exposed...
> Please refer to the bug 1365970 comment 20 I just made yesterday.
> That is, we will likely "overrun the time allotted" as [1] said.
> But we still can use idle dispatch as the first step at this moment.

Can we use hiddenDOMWindow.requestIdleCallback(...)? We can access that with nsIAppShellService [1].

> You can also refer to the valuable comment 7 and 8;
> In sum, I believe the key part is to (1) split up the data collection and
> (2) reduce data collection time, if we also want to deal with them in this
> bug.

During a meet up with Mike, he suggested that the data collection can be broken into ~20-tab groups. That way, we can scale up with larger number of tabs and windows without janking the main thread.

(In reply to David Teller [:Yoric] (hard to reach until July 17th - please use "needinfo") from comment #8)
> Ideally, we should be able to find out if there is anything interesting to
> save. It's somewhat orthogonal to the other problems, but if we had an
> indication of what kind of data has changed, we could probably find a way to
> categorize between important data (e.g. stuff that takes place in an active
> tab and involves forms or cookies) from less important data. We could decide
> that our timeout is short (~1 minute) if important data has been saved,
> longer (~1h) otherwise.

Currently, in content-sessionStore, we have these kind of listeners [2]. I think some of them are not as important as the others. Can we create something like PriorityMessageQueue instead of MessageQueue [3]? We can have 1-second timeout for important events and a longer timeout for less important events. Hopefully, this will reduce the time needed to collect data.

[1]: nsIAppShellService https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Reference/Interface/nsIAppShellService
[2]: http://searchfox.org/mozilla-central/source/browser/components/sessionstore/content/content-sessionStore.js#847-855
[3]: MessageQueue http://searchfox.org/mozilla-central/source/browser/components/sessionstore/content/content-sessionStore.js#701
Flags: needinfo?(wiwang)
(In reply to Bao Quan [:beekill] from comment #13)
> Can we use hiddenDOMWindow.requestIdleCallback(...)? We can access that with
> nsIAppShellService [1].
Nice! I think it's great to try using this way to see whether we can pass the existing tests or not;
so that we can utilize `IdleDeadline` to avoid the possible jank.


> During a meet up with Mike, he suggested that the data collection can be
> broken into ~20-tab groups. That way, we can scale up with larger number of
> tabs and windows without janking the main thread.
I believe Mike's suggestion is definitely the right direction, especially for the case like this recent post[1];
I'm not sure how the number ~20-tab was decided, it might be worth referencing the telemetry data.
Also, we have to carefully deal with the async nature during data collection and make this pass the tests.


> Currently, in content-sessionStore, we have these kind of listeners [2]. I
> think some of them are not as important as the others. Can we create
> something like PriorityMessageQueue instead of MessageQueue [3]? We can have
> 1-second timeout for important events and a longer timeout for less
> important events. Hopefully, this will reduce the time needed to collect
> data.
It's a good idea, and before that, how about evaluating the impact degree caused by the data collection in the content process? 
To this bug, I think the impact will be the oversized(if any) update from the content process.

Besides, as for comment 11, would you have any plan to take this bug these days? I would share anything I know with you :)


[1] The New Firefox and Ridiculous Numbers of Tabs
https://metafluff.com/2017/07/21/i-am-a-tab-hoarder/
Flags: needinfo?(wiwang) → needinfo?(nnn_bikiu0707)
(In reply to Will Wang [:WillWang] from comment #14)
> Besides, as for comment 11, would you have any plan to take this bug these
> days? I would share anything I know with you :)

Thank you for your response, Will!

I'll work on this bug after I finish the bug 1034036. But in the mean time, I can work on the things you suggested in comment 14 in advance:
1. Experimenting with hiddenDOMWindow.
2. Digging through telemetry data to find the best number of tabs per group.
3. Evaluating the impact of data collection in content process.

It would be cool to know steps to produce profile results like comment 4 and comment 5.
Flags: needinfo?(nnn_bikiu0707) → needinfo?(wiwang)
(In reply to Bao Quan [:beekill] from comment #15)
> It would be cool to know steps to produce profile results like comment 4 and
> comment 5.

I think you can try making some sort of busy loop in the parent process as we usually did in the content process;
if that's not ideal for you, you might want to consult :flo for the experience of jank in comment 4/5 :)
Flags: needinfo?(wiwang)
I looked at the telemetry data, here are what I found:

+ 5 percentile of users have the collecting time of ~1.15 seconds [1]. I think we should make the _saveStateAsync to execute when there is an idle time of 2 seconds. I tested this on my machine and all the tests passed.

+ Firefox takes 3.31ms (median) to collect all windows data [2] and the number of concurrent tabs [3] is ~4 tabs (median). So, the time it takes to collect data for 1 tab is ~0.8ms. If we want to collect windows data in a 10-millisecond idle time, we should break tabs into groups of 12.

I also thought about implementing the PriorityMessageQueue. I think that it's going to longer the time we're going to collect data in write to session file. We're already collecting after (default) 15 seconds when users change something. The priority queue may longer that time to 20 seconds or more, however, I don't think that is going to benefit us much, compares to changes that we have to make to make this work.

[1]: https://mzl.la/2ufqyBd
[2]: https://mzl.la/2ufSlBF
[3]: https://mzl.la/2ufSBAD
Depends on: 1388664
Filed the bug 1388664 to move _saveStateAsync into idle dispatch.

:flo, can you show me steps to re-produce the profiles in comment 4 and comment 5. I made a busy loop in parent process like Will recommended and the profile result looked good. However, I think it would be best to know what the steps are so that we can verify the patches work.
Requesting needinfo for comment 18 above.
Flags: needinfo?(florian)
(In reply to Bao Quan [:beekill] from comment #18)
> Filed the bug 1388664 to move _saveStateAsync into idle dispatch.
> 
> :flo, can you show me steps to re-produce the profiles in comment 4 and
> comment 5. I made a busy loop in parent process like Will recommended and
> the profile result looked good. However, I think it would be best to know
> what the steps are so that we can verify the patches work.

I don't have steps to reproduce unfortunately. Bugs blocking the photon-perf-jank meta bug are things that happen without any specific user interaction triggering them, typically because it's code that's starting off a timer. While working on the photon performance project, I spend a lot of time looking at profile, and whenever I see something interesting, I give a link to the profile in the relevant bug, but I'll never have reliable steps to reproduce.
Flags: needinfo?(florian)
This is an assigned P1 bug without activity in two weeks. 

If you intend to continue working on this bug for the current release/iteration/sprint, remove the 'stale-bug' keyword.

Otherwise we'll reset the priority of the bug back to '--' on Monday, August 28th.
Keywords: stale-bug
Per discussion with :beekill, he is going to file another bug for breaking the data collection into chunks;
and this bug will focus on the idle dispatch.
Keywords: stale-bug
Depends on: 1395066
I don't think this bug will be fixed by 57. I am changing it from qf:p1 to qf:p2 for post 57 work.
Whiteboard: [reserve-photon-performance][qf:p1] → [reserve-photon-performance][qf:p2]
Keywords: perf
Hey :WillWang, are you still looking at this?
Flags: needinfo?(wiwang)
Whiteboard: [reserve-photon-performance][qf:p2] → [reserve-photon-performance][qf:p1]
Hey Mike :)

Since Bug 1388664 had already done the idle dispatch part, which is the original idea(comment 0) of this bug,
I think this bug become sort of meta bug, including ideas from comment 6/7/8;

So I was thinking... we probably can either treat this bug as a meta bug, or resolve this bug now. What do you think?
Flags: needinfo?(wiwang) → needinfo?(mconley)
I think this is a fine choice for a meta bug. Thanks!
Flags: needinfo?(mconley)
Whiteboard: [reserve-photon-performance][qf:p1] → [reserve-photon-performance][qf:meta]
Keywords: meta
Whiteboard: [reserve-photon-performance][qf:meta] → [qf:meta]
Whiteboard: [qf:meta] → [qf:meta] [fxperf]
Moving to p3 because no activity for at least 24 weeks.
Priority: P1 → P3
Performance Impact: --- → ?
Whiteboard: [qf:meta] [fxperf] → [fxperf]
Summary: _saveStateAsync shouldn't pile up on existing jank → [meta] _saveStateAsync shouldn't pile up on existing jank

The bug assignee didn't login in Bugzilla in the last 7 months.
:dao, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: wiwang → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(dao+bmo)
Performance Impact: ? → ---
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.