Closed Bug 1372942 Opened 2 years ago Closed 2 years ago

tp5: Wait for the idle-callback before moving to the next page.

Categories

(Testing :: Talos, enhancement)

enhancement
Not set

Tracking

(firefox57 fixed)

RESOLVED FIXED
mozilla57
Tracking Status
firefox57 --- fixed

People

(Reporter: nbp, Assigned: rwood)

References

Details

(Whiteboard: [PI:September])

Attachments

(1 file)

Bug 900784 is adding a JavaScript Start-up Bytecode Cache (JSBC).  The bytecode cache works by recording and saving the bytecode used during one of the first visits, when the idle callback is called, and provide these values to the follow-up visits of the same page.  The recording of the bytecode has some overhead.

Currently, the pageloader harness wait for 250ms after the MozAfterPaint event to move to the next page[1].  In most case these pages are going to not have any code running by the end of the timeout.

However, the JSBC is waiting for the idle-callback to be triggered in order to schedule the saving of the recorded bytecode (Bug 1372207) to the necko cache.  If some code is still running by the time the timeout triggers, we might not have saved any bytecode, and we might trigger the recording the next time we visit the same page.

If we could trigger a timeout after the callback given to window.requestIdleCallback() is executed, then this should be enough to ensure that the bytecode is saved.

[1] http://searchfox.org/mozilla-central/source/testing/talos/talos/pageloader/chrome/pageloader.js#609,611,616,621
would this affect the numbers of the test, or just to ensure that we are measuring the right thing?

Right now we wait for 2 event: onload + mozafterpaint.  Now we would be waiting for onload+mozafterpaint+idle-callback before calling a page completed.

Would the measurement still be after mozafterpaint, or after idle-callback?

We are also looking at changing the mozafterpaint to be the first non empty mozafterpaint in bug 1371341, I would assume that wouldn't change anything here.

:nbp, can you confirm I am thinking about this correctly?
Flags: needinfo?(nicolas.b.pierron)
(In reply to Joel Maher ( :jmaher) from comment #1)
> would this affect the numbers of the test, or just to ensure that we are
> measuring the right thing?

Just ensure we are measuring the decoded scripts, by avoiding a worse case scenario which is unexpected in common browser experience.

> Right now we wait for 2 event: onload + mozafterpaint.  Now we would be
> waiting for onload+mozafterpaint+idle-callback before calling a page
> completed.

The idle callback should be registered at the onLoad event, and the callback should schedule the next page load with a setTimeout as well.

> Would the measurement still be after mozafterpaint, or after idle-callback?

I do not expect to change any of the current measurements.  I guess we could record/report the load-to-idle time, but this is not a requirement for this issue.

> We are also looking at changing the mozafterpaint to be the first non empty
> mozafterpaint in bug 1371341, I would assume that wouldn't change anything
> here.

Indeed, this should not matter.
Flags: needinfo?(nicolas.b.pierron)
Whiteboard: [PI:June] → [PI:July]
(In reply to Nicolas B. Pierron [:nbp] from comment #2)

> 
> The idle callback should be registered at the onLoad event, and the callback
> should schedule the next page load with a setTimeout as well.
> 

Might be a silly question, but is there any chance that the idle callback could happen before mozAfterPaint? In which case the set timeout would be initiated to move to the next page potentially before mozAfterPaint happens and the test measurement is recorded (?) Is it better to register the idle callback after mozAfterPaint instead of right after onload? Thanks
Flags: needinfo?(nicolas.b.pierron)
(In reply to Robert Wood [:rwood] from comment #3)
> Might be a silly question, but is there any chance that the idle callback
> could happen before mozAfterPaint?

I do not know exactly, but based on telemetry data I would not expect that to happen frequently. (TIME_TO_NON_BLANK_PAINT_MS vs TIME_TO_LOAD_EVENT_END_MS)

> […] Is it better to register the idle
> callback after mozAfterPaint instead of right after onload?

Currently the JSBC uses the load *end* event to schedule the saving to happen at the next idle callback.
Registering an idle callback sooner might schedule this callback sooner and not get a change to save the bytecode.

Otherwise, would it be a viable option to have a preference which emits one event when the the saving of the bytecode is either not necessary / completed?
Flags: needinfo?(nicolas.b.pierron)
I am confused by the last statement about a preference which emits an event- can you elaborate on this?  would this be an option to listen for an event which would be more finite than the idleCallback()?  That sounds better.
Whiteboard: [PI:July] → [PI:August]
(In reply to Nicolas B. Pierron [:nbp] from comment #4)
> (In reply to Robert Wood [:rwood] from comment #3)
> > Might be a silly question, but is there any chance that the idle callback
> > could happen before mozAfterPaint?
> 
> I do not know exactly, but based on telemetry data I would not expect that
> to happen frequently. (TIME_TO_NON_BLANK_PAINT_MS vs
> TIME_TO_LOAD_EVENT_END_MS)

Hi Nicolas, if it's not consistent when the idle callback occurs, i.e. if sometimes it is received before mozAfterPaint, then we can't reliably use it here, as that will cause intermittent talos failures. I fully support improving talos tests however I just want to be very sure we don't make the data inconsistent or intermittent. Therefore I'm going to close out this bug, unless you see another way of implementing this?
Flags: needinfo?(nicolas.b.pierron)
(In reply to Robert Wood [:rwood] from comment #6)
> Hi Nicolas, if it's not consistent when the idle callback occurs, i.e. if
> sometimes it is received before mozAfterPaint, then we can't reliably use it
> here, as that will cause intermittent talos failures. I fully support
> improving talos tests however I just want to be very sure we don't make the
> data inconsistent or intermittent. Therefore I'm going to close out this
> bug, unless you see another way of implementing this?

I am not suggesting to use the idle callback as a way to measure the time taken by the benchmark, I am suggesting to use the idle callback as a trigger that we should wait for, in order to let the page save any data it has to save, such that next iterations can benefit from this cached data.

Waiting for the test case to finish, is basically waiting for both events to have been received.  Each event could set a flag, and call the same next function which check both flags before continuing.

I am not sure to understand the limitation or concern with the mozAfterPaint timing.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Nicolas B. Pierron [:nbp] from comment #7)
...
I was thinking that the idle callback may be received before mozafterpaint and force the next iteration before the timing measurement has been taken. However yes good point we can still grab the measurement after mozafterpaint but don't move on to the next iteration until both mozafterpaint and the idle callback have been received.
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Comment on attachment 8902447 [details]
Bug 1372942 - tp5: Wait for the idle-callback before moving to the next page;

https://reviewboard.mozilla.org/r/174036/#review179264

I am excited to see progress here, but r- for a couple things:
1) how does this work with time to first non blank paint?  this seems to only work with mozafterpaint?
2) are we measuring on the parent window or the content window?  does it matter?
Attachment #8902447 - Flags: review?(jmaher) → review-
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #11)
> 2) are we measuring on the parent window or the content window?  does it
> matter?

For information, the JSBC depends on the idle-callback of the content process, and to be precise of the document in which the scripts are.
so we should ensure idleCallback is on the content process, what we have looks to be the parent process, but I could be wrong.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #13)
> so we should ensure idleCallback is on the content process, what we have
> looks to be the parent process, but I could be wrong.

Ah, yes correct, ok thanks for the feedback guys appreciated!
Comment on attachment 8902447 [details]
Bug 1372942 - tp5: Wait for the idle-callback before moving to the next page;

https://reviewboard.mozilla.org/r/174036/#review180126

this looks right!  thanks for changing this :)
Attachment #8902447 - Flags: review?(jmaher) → review+
Update: This patch causes a timeout in tp50_scroll
Whiteboard: [PI:August] → [PI:September]
Comment on attachment 8902447 [details]
Bug 1372942 - tp5: Wait for the idle-callback before moving to the next page;

Small update, fix for tp5o_scroll - no need to use idle-callback with this test as it measures scrolling perf not pageload etc.
Attachment #8902447 - Flags: review+ → review?(jmaher)
Comment on attachment 8902447 [details]
Bug 1372942 - tp5: Wait for the idle-callback before moving to the next page;

https://reviewboard.mozilla.org/r/174036/#review182412

excellent
Attachment #8902447 - Flags: review?(jmaher) → review+
Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6af4edb9979e
tp5: Wait for the idle-callback before moving to the next page; r=jmaher
https://hg.mozilla.org/mozilla-central/rev/6af4edb9979e
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla57
and there are some talos improvements here: 
== Change summary for alert #9321 (as of September 08 2017 13:14 UTC) ==

Improvements:

 30%  tp6_youtube summary windows7-32 opt 1_thread e10s     423.21 -> 297.42
 29%  tp6_youtube summary windows7-32 pgo 1_thread e10s     335.90 -> 238.29
 28%  tp6_youtube summary windows10-64 pgo 1_thread e10s    350.46 -> 252.62
 27%  tp6_youtube summary windows7-32 opt e10s              389.79 -> 283.67
 27%  tp6_youtube summary windows10-64 pgo e10s             320.35 -> 233.42
 26%  tp6_youtube summary windows7-32 pgo e10s              309.33 -> 228.67
 22%  tp6_youtube summary windows10-64 opt e10s             353.00 -> 276.92

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=9321

not sure if this sounds right, but the retriggers seem to highlight this revision
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #29)
> and there are some talos improvements here: 
...
> 
> not sure if this sounds right, but the retriggers seem to highlight this
> revision

Right, :ryanvm mentioned that on IRC also. I'm a bit confused as to how, since this patch really just changes *when* we can proceed to the next test iteration. Perhaps I've overlooked something...
Comparison of try push with this patch removed / wait for idle-callback turned off vs central where wait for idle-callback is on [1]. Yep now shows a 34% regression in tp6_youtube when this patch / wait for idle-callback is removed:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=8226131a8d09548b69c1cc8ebdcb5303ecae8ac2&framework=1&filter=tp6_youtube&showOnlyImportant=0&selectedTimeRange=172800

So adding wait for idle-callback definitely causes an improvement in tp6_youtube which I really don't understand. The patch doesn't change when the test measurements are taken, it only checks that idle call-back was received before it can move to the next test/pageload iteration (after the measurements were already taken). Strange how the improvement is only shown in tp6_youtube and none of the other tp6 pages.

:jmaher as you noted maybe registering the idle call-back is causing some change in browser behaviour in tp6_youtube that is improving the performance.

:nbp, do you think that's a possibility? Any insight here? I'm just curious about how this improvement occurred.
Flags: needinfo?(nicolas.b.pierron)
(In reply to Robert Wood [:rwood] from comment #31)
> :nbp, do you think that's a possibility? Any insight here? I'm just curious
> about how this improvement occurred.

This cannot be the JSBC, because it is not enabled at the moment.  I guess there is probably another caching mechanism which is also hooked on the idle callback, as I did for the JSBC.

From the top of my head, I can think of WebAssembly, which is using some persistent storage, and might not be able to save unless we wait a bit longer.  In which case following work-load would load from the cache instead of compiling WebAssembly code.  So, follow-up runs might run faster because there is not compiler thread in parallel.

:luke, do you think this might explains the improvements from comment 29?

:bkelly, any idea if we have other optimizations which are waiting on the idle callback for caching information?
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(luke)
Flags: needinfo?(bkelly)
We try to perform periodic maintenance tasks like GC/CC during idle periods.  Creating idle time between tests reduces the chance these tasks run during the test.

I'm not sure if this is the right thing to do for these tests or not.
Flags: needinfo?(bkelly)
There's nothing I'm aware of.
Flags: needinfo?(luke)
You need to log in before you can comment on or make changes to this bug.