Closed Bug 1373814 Opened 7 years ago Closed 7 years ago

Track INPUT_EVENT_RESPONSE_COALESCED_MS during startup into a different probe

Categories

(Core :: Widget, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
mozilla57
Performance Impact medium
Tracking Status
firefox57 --- fixed

People

(Reporter: Harald, Assigned: wcpan)

References

(Blocks 1 open bug)

Details

(Whiteboard: [tpi:+])

Attachments

(1 file)

tl;dr: We want to exclude input latency generated during browser startup from INPUT_EVENT_RESPONSE_COALESCED_MS.

We have a hypothesis that most high input latency is collected when the browser starts up, skewing our release criteria that should be reflective perceived performance. It is important have the interface interactive as fast as possible (usually measured as Time to Interactive, TTI), but before TTI milestones the input latency per event doesn't matter and should not be measured.
Ben, does this correctly reflect what we talked about?
Blocks: QRC_FX57
Flags: needinfo?(benjamin)
Yes. I'll see if I can pick this up.
Assignee: nobody → benjamin
Status: NEW → ASSIGNED
Component: General → Widget
Flags: needinfo?(benjamin)
I'm having trouble with this primarily because there is no straightforward and reliable definition of when the browser has started up. Here are some options:

places-init-complete
  + this is what is used by talos-xperf
  - this is not recorded in telemetry
  + I'm pretty sure this will work in tbird
  - I'm suspicious that if there are serious errors in places this will never fire

sessionstore-windows-restored
  + this is what's used by tsessionstore
  + this is measured in telemetry
  - this definitely doesn't exist in tbird
  - there are known cases where this won't fire in Firefox

XUL window firstPaint
  + this is used by talos Tspaint, our primary startup metric
  + this is measured in telemetry
  + this exists in tbird
  = there are cases where this won't fire, but those cases are mainly quick startup/shutdown loops where we aren't showing a window anyway

receipt of an idle-dispatched event when we start spinning the event loop
  + very reliable
  - may fire well before either of the two above signals
  - may fire before we've painted a window if that paint depends on async I/O which cause the main thread to be idle waiting on I/O
  + works in tbird

I'm leaning toward XUL firstPaint, but that will require some plumbing.
We used to[1] have some built-in startup-specific code for "interesting" histograms which we tore out. For what it's worth, it was using sessionstore-windows-restored (in parent. In child it was using  "app-startup")

May be of more interest than use, though.

[1]: https://hg.mozilla.org/mozilla-central/rev/d19cffeae1fe
All these points seem very early in the process, where I did nit even imagine the browser collecting input latency metrics.

Is there an event for when all tabs have been restored? We define startup's hero element as the new tab page showing a complete search bar. This is also the point when we expect users to interact with the browser, which also means the starting point for when input latency matters.
The hero element for startups with session restore could probably be the tab bar, but we could have a closer look here.

If the above, using hero element as proxy for expected-TTI is too complicated, could we combine several signals to reduce the known issues.
:bsmedberg, are you still working on this or can recommend somebidy?

Nominating for [qf] as this has potentially significant impact on Quantum Release Criteria.
Flags: needinfo?(benjamin)
Whiteboard: [qf]
Because of the startup issue, this has become a fair bit more complicated. I'm still hoping to tackle it (and I don't know of a better person to pick it up right now).
Flags: needinfo?(benjamin)
Whiteboard: [qf] → [qf:p2]
Priority: -- → P2
Whiteboard: [qf:p2] → [qf:p2][tpi:+]
I think splitting this will help us clarify whether INPUT_EVENT_RESPONSE_COALESCED_MS is skewed by startup. Maybe WeiCheng can help to take a look.
Flags: needinfo?(wpan)
Here are some collected result in session store:

The p75 of FX_SESSION_RESTORE_STARTUP_ONLOAD_INITIAL_WINDOW_MS (which I think is very close to sessionstore-windows-restored) is about 200ms.

The *p25* of FX_SESSION_RESTORE_AUTO_RESTORE_DURATION_UNTIL_EAGER_TABS_RESTORED_MS is about 2.5s.

The *p25* of FX_SESSION_RESTORE_NUMBER_OF_TABS_RESTORED is about 2.

Almost all FX_SESSION_RESTORE_NUMBER_OF_WINDOWS_RESTORED is 1.

This probably means we are spending about one second for each restored tab?
Flags: needinfo?(wpan)
I misread what FX_SESSION_RESTORE_STARTUP_ONLOAD_INITIAL_WINDOW_MS means, it is not related to sessionstore-windows-restored.
[re-assigning to get this unstuck, please correct if I got this wrong]

MattN pointed me also to sessionstore-windows-restored: https://developer.mozilla.org/en-US/docs/Observer_Notifications#Application_startup

Per docs, this marks when the window and tab chrome is loaded. Makes a good TTI as the browser should be responsive while the pinned/active tabs are loaded.

We also found the observer that is used by talos:

sessionstore-finished-restoring-initial-tabs: http://searchfox.org/mozilla-central/source/browser/components/sessionstore/StartupPerformance.jsm#30
Used by Talos for startup performance tests: http://searchfox.org/mozilla-central/source/testing/talos/talos/startup_test/sessionrestore/addon/SessionRestoreTalosTest.js#50
Assignee: benjamin → wpan
I think we probably want sessionstore-finished-restoring-initial-tabs, instead of sessionstore-windows-restored.

But sessionstore-finished-restoring-initial-tabs only happens when:
1. we have to restore a crashed session
or
2. users choose to restore previously opened session

For users who dont enable session restoring, we need to use sessionstore-windows-restored instead.

So I'm going to add two probes:
1. record input response which happens before sessionstore-finished-restoring-initial-tabs
2. record input response which happens before sessionstore-windows-restored

Then we can compare the data to know what is the next action item.
The sessionrestore-* notifications only exist in parent, so it's quite painful to sync this state to content processes (in PresShell).

Since there are many sources which could cause the input delay (e.g. GC/CC, slow script, slow startup, slow shutdown ... etc.),  we could probably just analyze the data from BHR, and see if the count matches.
I think this is slightly reasonable than let PresShell to know what is hanging during recording input. Plus we don't need to wait for new data from users.
> Since there are many sources which could cause the input delay (e.g. GC/CC, slow script, slow startup, slow shutdown ... etc.),  we could probably just analyze the data from BHR, and see if the count matches.

This sounds generally useful for diagnostics and mconley already applied a similar idea for tab switch issues, where hangs are specifically collected for tab switches. Would we teach BHR to know if the browser is starting up or shutting down?

The main issue with this is that BHR only captures long hangs, not the short thresholds that input latency telemetry provides.

If I understand correctly, in the case of startup, events are queued up in the parent and reach the content delayed when its event queue is ready. Could events somehow flagged for when they came in; so content knows that these are startup events.

I might have a very wrong picture of how the event queue works, but a simpler but less accurate solution could be that the content event queue could by default track the first incoming events (right after the content process got ready) as "startup" events and collect them in a different probe.
:wcpan for feedback on the previous comment
Flags: needinfo?(wpan)
sessionstore-finished-restoring-initial-tabs does not seem to represent the state of tabs:
http://searchfox.org/mozilla-central/rev/6482c8a5fa5c7446e82ef187d1a1faff49e3379e/browser/components/sessionstore/StartupPerformance.jsm#141
So I added another observer topic for this.

I guess we can do something like bug 1384238 for BHR, but maybe we can do that as a follow-up.
Flags: needinfo?(wpan)
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review173412

::: layout/base/PresShell.cpp:8303
(Diff revision 1)
>      if (!sLastInputProcessed || sLastInputProcessed < aEvent->mTimeStamp) {
>        if (sLastInputProcessed) {
>          // This input event was created after we handled the last one.
>          // Accumulate the previous events' coalesced duration.
>          double lastMillis = (sLastInputProcessed - sLastInputCreated).ToMilliseconds();
>          Telemetry::Accumulate(Telemetry::INPUT_EVENT_RESPONSE_COALESCED_MS,

Hmm, but we still keep collecting these initial evens to the old probes too.  I don't think we should do that. Makes rather hard to look at the data.

::: layout/base/PresShell.cpp:9746
(Diff revision 1)
>    if (!nsCRT::strcmp(aTopic, NS_WIDGET_WAKE_OBSERVER_TOPIC)) {
>      mLastOSWake = TimeStamp::Now();
> +    return NS_OK;
> +  }
> +
> +  if (!nsCRT::strcmp(aTopic, "sessionstore-tab-restored") ||

Looks a bit over-engineered to add two notifications just for this.
Could we just have something like
sessionstore-completed or some such?
Attachment #8896932 - Flags: review?(bugs) → review-
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review174446

::: layout/base/PresShell.h:920
(Diff revision 2)
>    mozilla::TimeStamp        mLastOSWake;
>  
>    static mozilla::TimeStamp sLastInputCreated;
>    static mozilla::TimeStamp sLastInputProcessed;
> +
> +  static bool               sSessionInitialized;

Could we call this something less generic. Perhaps sSessionStoreInitialized

::: layout/base/PresShell.cpp:8305
(Diff revision 2)
>          // Accumulate the previous events' coalesced duration.
>          double lastMillis = (sLastInputProcessed - sLastInputCreated).ToMilliseconds();
>          Telemetry::Accumulate(Telemetry::INPUT_EVENT_RESPONSE_COALESCED_MS,
>                                lastMillis);
> +
> +        if (MOZ_UNLIKELY(!sSessionInitialized)) {

You didn't answer to the other review comment.
Why do we keep collecting INPUT_EVENT_RESPONSE_COALESCED_MS data even when sSessionInitialized is false?
Attachment #8896932 - Flags: review?(bugs) → review-
(In reply to Olli Pettay [:smaug] from comment #20)
> Comment on attachment 8896932 [details]
> Bug 1373814 - Add new probe to measure start-up input latency.
>
> You didn't answer to the other review comment.
> Why do we keep collecting INPUT_EVENT_RESPONSE_COALESCED_MS data even when
> sSessionInitialized is false?

Ah yes, excuse me for missed that one.

I'm not sure if we need to change the meaning for INPUT_EVENT_RESPONSE_COALESCED_MS, because we are not 100% sure this is the problem.
:bsmedberg, do you have any thought on this (and the patch)? Thanks.
Flags: needinfo?(benjamin)
Forward ni to :chutten.
Flags: needinfo?(benjamin) → needinfo?(chutten)
INPUT_EVENT_RESPONSE_COALESCED_MS was designed from the beginning to have merit as a measure. It shouldn't be changed without good reason. Also, retiring/changing INPUT_EVENT_RESPONSE_COALESCED_MS might require some chatting with quantum since I believe it is part of the release criteria.

If our analysis needs would be better served by having one startup measure and one post-startup measure, we'll need two new measures: a STARTUP_ one and a POST_STARTUP_ one.
Flags: needinfo?(chutten)
> INPUT_EVENT_RESPONSE_COALESCED_MS was designed from the beginning to have merit as a measure.

I'd argue that the COALESCED_ variant was created to fix the issue of overcounting when multiple events being delayed by one hang. We did not spend too much time on proving its merits with experimental designs.

> If our analysis needs would be better served by having one startup measure and one post-startup measure, we'll need two new measures: a STARTUP_ one and a POST_STARTUP_ one.

This is a good point. Changing the old metric would throw off our historical data. If the metric turns out to be more useful we can still add it to mission control later.

I like the naming, so lets do the split for INPUT_EVENT_RESPONSE_STARTUP_MS and INPUT_EVENT_RESPONSE_POST_STARTUP_MS.
So, we keep the old probe and add two new ones, for now? Sounds ok to me.
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

f? for data review.
Attachment #8896932 - Flags: feedback?(francois)
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review175484

Sorry, I had missed something when I looked at this last time.

::: dom/base/nsDOMNavigationTiming.cpp:198
(Diff revision 4)
>        Telemetry::AccumulateTimeDelta(Telemetry::TIME_TO_DOM_INTERACTIVE_MS,
>                                       mNavigationStartTimeStamp);
> +
> +      nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();
> +      if (os) {
> +        os->NotifyObservers(nullptr, "sessionstore-completed", nullptr);

What, this doesn't look right.
I think I missed this part when looking at the patch last time.

What are we trying to achieve? First page loaded in child process or session restored, whichever happens first? But only on content process perhaps?

::: dom/base/nsDOMNavigationTiming.cpp:321
(Diff revision 4)
>  nsDOMNavigationTiming::IsTopLevelContentDocument() const
>  {
>    if (!mDocShell) {
>      return false;
>    }
> +  // It can be a background page of an extension for parent process, which we

Hmm, this looks wrong when thinking about the name of the method.
If you need this kind of change, it should be then something like
IsTopLEvelContentDocumentInChildProcess
Attachment #8896932 - Flags: review?(bugs) → review-
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review175534

datareview+ but please add your email.

::: toolkit/components/telemetry/Histograms.json:6567
(Diff revision 4)
>      "releaseChannelCollection": "opt-out",
>      "description": "Time (ms) from the Input event being created to the end of it being handled, but with overlapping events coalesced."
>    },
> +  "INPUT_EVENT_RESPONSE_STARTUP_MS": {
> +    "record_in_processes": ["main", "content"],
> +    "alert_emails": ["perf-telemetry-alerts@mozilla.com"],

We now ask people to add the address of a person to the list so please add your own email here.

::: toolkit/components/telemetry/Histograms.json:6578
(Diff revision 4)
> +    "releaseChannelCollection": "opt-out",
> +    "description": "Time (ms) from the Input event being created to the end of it being handled, but with overlapping events coalesced, which happens before sessionstore finished eager tabs restoring."
> +  },
> +  "INPUT_EVENT_RESPONSE_POST_STARTUP_MS": {
> +    "record_in_processes": ["main", "content"],
> +    "alert_emails": ["perf-telemetry-alerts@mozilla.com"],

Same thing here.
Attachment #8896932 - Flags: review+
Attachment #8896932 - Flags: feedback?(francois)
(In reply to Olli Pettay [:smaug] from comment #29)
> Comment on attachment 8896932 [details]
> 
> What, this doesn't look right.
> I think I missed this part when looking at the patch last time.
> 
> What are we trying to achieve? First page loaded in child process or session
> restored, whichever happens first? But only on content process perhaps?

I'd like to use "sessionstore-finished-restoring-initial-tabs" for content process, but the notification only fired if we have something to restore (either by perf or crashed session). So "content-document-interactive" is the closest topic which will fire in both cases (restore previous session or show certain pages directly). But since we only need the notification from top-level document, "content-document-interactive" is not that accurate.

> ::: dom/base/nsDOMNavigationTiming.cpp:321
> (Diff revision 4)
> >  nsDOMNavigationTiming::IsTopLevelContentDocument() const
> >  {
> >    if (!mDocShell) {
> >      return false;
> >    }
> > +  // It can be a background page of an extension for parent process, which we
> 
> Hmm, this looks wrong when thinking about the name of the method.
> If you need this kind of change, it should be then something like
> IsTopLEvelContentDocumentInChildProcess

When I think this twice, I found it may change the meaning for other probes. This check probably needs to move out from this function.
Flags: needinfo?(bugs)
Sorry about delay.

I'm still having problem to understand the change to ::NotifyDOMInteractive.
Doesn't that get called often before SessionStore calls notifyObservers?
Flags: needinfo?(bugs)
The SessionStore thing is confusing for content process, so I renamed the topic and the static variable.
This is tricky to review, not because of the code, but because of trying to understand what we're trying to capture here, and whether that is what we actually want and need.
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review179142

::: browser/components/sessionstore/SessionStore.jsm:975
(Diff revision 5)
>          SessionStoreInternal._resetLocalTabRestoringState(tab);
>          SessionStoreInternal.restoreNextTab();
>  
>          this._sendTabRestoredNotification(tab, data.isRemotenessUpdate);
> +
> +        Services.obs.notifyObservers(null, "sessionstore-completed", null);

This is wrong. sessionstore isn't completed here, since restoreNextTab() may have started to restore another tab.

::: dom/base/nsDocument.cpp:13867
(Diff revision 5)
>        if (!mDOMInteractiveSet) {
>          Telemetry::AccumulateTimeDelta(Telemetry::TIME_TO_DOM_INTERACTIVE_MS,
>                                         startTime);
>          mDOMInteractiveSet = true;
> +
> +        nsCOMPtr<nsIObserverService> os = mozilla::services::GetObserverService();

I think I'd prefer if this code just set some global variable. Perhaps just add a static setter method to nsPresShell and call that here.

::: browser/components/sessionstore/SessionStore.jsm:975
(Diff revision 6)
>          SessionStoreInternal._resetLocalTabRestoringState(tab);
>          SessionStoreInternal.restoreNextTab();
>  
>          this._sendTabRestoredNotification(tab, data.isRemotenessUpdate);
> +
> +        Services.obs.notifyObservers(null, "sessionstore-first-or-no-tab-restored");

This gets fired several times, so perhaps it should be after  all
sessionstore-one-or-no-tab-restored.
I know, sounds silly, but at least a bit closer to what actually should happen.
Rename everywhere.


Is it guaranteed that the notification will fire even if user has disabled all the history storing from preferences?

::: layout/base/PresShell.cpp:1022
(Diff revision 6)
>  #ifdef MOZ_XUL
>        os->AddObserver(this, "chrome-flush-skin-caches", false);
>  #endif
>        os->AddObserver(this, "memory-pressure", false);
>        os->AddObserver(this, NS_WIDGET_WAKE_OBSERVER_TOPIC, false);
> +      if (XRE_IsParentProcess()) {

We don't need to add the observer if sProcessInteractable is already true

::: layout/base/PresShell.cpp:8310
(Diff revision 6)
> +                break;
> +              default:
> +                break;
> +            }
> +          }
> +          Telemetry::Accumulate(Telemetry::INPUT_EVENT_RESPONSE_STARTUP_MS,

So if ready state is interactive or complete, you end up still reporting startup. 
Could you refactor this a bit.

Something like

if (!sProcessInteractable) {
  // do the content process check and possibly set
  // sProcessInteractable
}

if (MOZ_LIKELY(sProcessInteractable)) {
          Telemetry::Accumulate(Telemetry::INPUT_EVENT_RESPONSE_POST_STARTUP_MS,
} else {
  Telemetry::Accumulate(Telemetry::INPUT_EVENT_RESPONSE_STARTUP_MS,...
}

::: layout/base/PresShell.cpp:9755
(Diff revision 6)
>  
> +  // For parent process, user may expect the UI is interactable after the first
> +  // tab (previously opened page or home page) has restored.
> +  if (XRE_IsParentProcess() &&
> +      !nsCRT::strcmp(aTopic, "sessionstore-first-or-no-tab-restored")) {
> +    sProcessInteractable = true;

Perhaps remove the observer here, and the XRE_IsParentProcess check is useless here.
Make it just an assertion
Attachment #8896932 - Flags: review?(bugs) → review+
Comment on attachment 8896932 [details]
Bug 1373814 - Add new probe to measure start-up input latency. data-r=francois

https://reviewboard.mozilla.org/r/168214/#review179142

> This is wrong. sessionstore isn't completed here, since restoreNextTab() may have started to restore another tab.

This line should not exist in the latest patch. Maybe I somehow confused the MozReview's inter-diff.

> I think I'd prefer if this code just set some global variable. Perhaps just add a static setter method to nsPresShell and call that here.

Ditto.

> This gets fired several times, so perhaps it should be after  all
> sessionstore-one-or-no-tab-restored.
> I know, sounds silly, but at least a bit closer to what actually should happen.
> Rename everywhere.
> 
> 
> Is it guaranteed that the notification will fire even if user has disabled all the history storing from preferences?

Yes, in that case the notification will fire along with sessionstore-windows-restored.
Pushed by wpan@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3da4af5046a9
Add new probe to measure start-up input latency. data-r=francois r=francois,smaug
(In reply to Wei-Cheng Pan [:wcpan] [:wcp] [:legnaleurc] from comment #38)
> Comment on attachment 8896932 [details]
> Bug 1373814 - Add new probe to measure start-up input latency.
> data-r=francois
> 
> https://reviewboard.mozilla.org/r/168214/#review179142
> 
> > This is wrong. sessionstore isn't completed here, since restoreNextTab() may have started to restore another tab.
> 
> This line should not exist in the latest patch. Maybe I somehow confused the
> MozReview's inter-diff.

Yes, I think MozReview did something weird. This was for the previous version of the patch.
https://hg.mozilla.org/mozilla-central/rev/3da4af5046a9
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
Performance Impact: --- → P2
Whiteboard: [qf:p2][tpi:+] → [tpi:+]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: