Checkerboard report API: need a way to get report before the event has ended

RESOLVED FIXED in Firefox 52

Status

()

defect
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: avih, Assigned: kats)

Tracking

({feature})

49 Branch
mozilla52
Points:
---
Dependency tree / graph
Bug Flags:
qe-verify -

Firefox Tracking Flags

(firefox52 fixed)

Details

(Whiteboard: [gfx-noted])

Attachments

(6 attachments, 1 obsolete attachment)

Currently the checkerboard report API reports events which have already finished, but does not report anything for events which are still ongoing at the time of request.

For the talos checkerboard test (bug 1189901), it's preferable that we don't have to wait till all the events have finished before we request a report - both because theoretically CB events can last forever, but also because the test wants to "record" the snapshot while the page is still scrolling (and possibly at the middle of a CB event), and before if slows down and the average severity may get biased.

For this test, it's OK if this is exposed as a new API, and also the test does not care about further logging of this specific event in case the request happens at the middle of one (though further events which start after the current event has finished should be logged and reported normally).
Assignee: nobody → bugmail.mozilla
@kats, what's your priority on this?

The APZ scroll test is basically blocked on this, or at least until we rule it out as the reason for the rudimentary checkerboard reports during the test runs.

Maybe you can write some temporary patch for this and we'll do some try pushes with it and the APZ test, and see if the numbers become more meaningful before a more complete patch lands?
Flags: needinfo?(bugmail.mozilla)
I'd like to get this done sooner rather than later, although realistically I won't get to it until the week after London. Leaving needinfo on me for now, I'll see if I can cook up something quickly.
Posted patch WIP (obsolete) — Splinter Review
Here you go. This adds a new function |flushActiveReports()| on the checkerboard service JS object, which you can call at the end of the test. It should synchronously add a report to the list of checkerboard reports that contains the severity of the ongoing checkerboard event. It doesn't stop that event from continuing to record. Let me know if this works for you.

Note that there is a potential race here because you'll have to call flushActiveReports() followed by getReports(). If you do this within the same spin of the event loop it's fine, but if the event loop runs in between then it's possible for more reports to get inserted, which would most likely be duplicate reports. If needed we can prevent this race by making flushActiveReports directly return the reports rather than requiring a second API call to getReports().
Flags: needinfo?(bugmail.mozilla)
Attachment #8761312 - Flags: feedback?(avihpit)
Thanks, sounds good enough and perfectly in line with the requirements. I'll do a try push hopefully tomorrow.
Keywords: feature
Whiteboard: [gfx-noted]
(In reply to Avi Halachmi (:avih) from comment #5)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1047ee23ca1

Interesting. git auto 3-way-merged changes during my rebase, and did so incorrectly (at testing/talos/talos/talos-powers/content/TalosPowersContent.js , |focus()| ended up in TalosPowersParent instead of TalosPowersContent where it originally was).

Luckily, the video test which is the only consumer of |focus()| did not land yet, so the try push shouldn't have talos failures from this.

I've updated the rebased patch locally.
(In reply to Avi Halachmi (:avih) from comment #5)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1047ee23ca1

kats, while it built on windows locally before I try-pushed, I think the try server refuses to build for Linux/OS X (windows still pending?).

Seems like static analysis issue with:
> ... gfx/layers/apz/src/APZCTreeManager.cpp:93:3: error: bad implicit conversion constructor for 'CheckerboardFlushObserver'
Flags: needinfo?(bugmail.mozilla)
Posted patch WIP v2Splinter Review
Now with |explicit| keyword
Attachment #8761312 - Attachment is obsolete: true
Attachment #8761312 - Flags: feedback?(avihpit)
Flags: needinfo?(bugmail.mozilla)
Attachment #8761571 - Flags: feedback?(avihpit)
Thanks, I'll check it a bit later. Turns out that it did build and ran the tests (on win/linux/osx), but the error appeared before the build was completed and so I thought it also aborted the build. It did not.

I'm still trying to understand how the numbers look like. Perfherder does not seem cooperative at this stage. Will update once I know more.
I ran this using your instructions, and found that because the report is created during the flush call, the timestamp on the report is later than endts. This causes the report to not get accumulated in the JS code.

Another thing you may wish to do is reload the page between running the sync scroll and CSSOM scroll, because the sync scrolling causes the displayport to expand and makes it easier for the CSSOM scroll to not checkerboard. You can see this in the minimap if you run with apz.minimap.enabled=true
Attachment #8761769 - Attachment description: Debug logging patch → Debug logging patch (forces checkerboarding)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> I ran this using your instructions, and found that because the report is
> created during the flush call, the timestamp on the report is later than
> endts. This causes the report to not get accumulated in the JS code.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #12)
> Created attachment 8761770 [details] [diff] [review]
> Move endts to include flushed reports

Interesting. I assumed the timestamp of the event is when it starts, rather than when it ends/flushed.

So your plan is to keep this behavior also when you'll finalize this new API? (not that it matters much - taking 'endts' after the measurement is OK with me).


> Another thing you may wish to do is reload the page between running the sync
> scroll and CSSOM scroll, because the sync scrolling causes the displayport
> to expand and makes it easier for the CSSOM scroll to not checkerboard. You
> can see this in the minimap if you run with apz.minimap.enabled=true

Very true point, and I've given it a lot of thought and testing when I wrote/enhanced the test.

TL;DR: I think we're still good even as it is, though, like you, I think reload in between would have been better.

Without getting into much details, the infrastructure which launches this test ("page loader") doesn't really allow the test page to reload itself, and it already includes some sensitive code to run the scroll test instead of just measuring the load time of the page (even before we added the APZ scrolls).

I will try thinking about it again and maybe find some work around, but being familiar with the subject and its sensitivities, I'm not very hopeful.

To the point. While it's true that scrolling synchronously first extends the displayport by some, and for this part the CSSOM scroll will almost surely not have any checkerboarding, we can still make the following statements IMO:

1. Because the CSSOM scroll is based on duration rather than scroll speed, for relatively short pages (all the tscrollx pages and some of the tp5o_scroll pages) it's not going to matter much anyway, as it's quite unlikely to have checkerboard even after reload  - slow scroll/short distance.

2. For relatively long pages, the synchronous part of the test doesn't get very far (in tp5o_scroll it's limited to 1000px which is roughly one page), so while it does extend the display port, long pages are typically quite a bit longer than 1000px, and so if their content is such that checkerboard happens, it's likely to happen IMO also after the display port has been extended a bit.

Also, again, fixed duration for the whole page length and long page (tp5o_Scroll) --> relatively fast scroll which is likely to trigger checkerboard regardless.

Let me know if you the above is valid. If you think it's not, then I'll have to consider some other approach for this test, which will make the CSSOM scroll be measured completely on its own.

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> Created attachment 8761769 [details] [diff] [review]
> Debug logging patch (forces checkerboarding)

Do you want me to use it in some way? or is this here just for reference?
Flags: needinfo?(bugmail.mozilla)
(In reply to Avi Halachmi (:avih) from comment #13)
> Interesting. I assumed the timestamp of the event is when it starts, rather
> than when it ends/flushed.
> 
> So your plan is to keep this behavior also when you'll finalize this new
> API? (not that it matters much - taking 'endts' after the measurement is OK
> with me).
> 

That was my intention, yeah. It's easier to pick up the timestamp when the report is generated, less stuff to track and pass around.

> 1. Because the CSSOM scroll is based on duration rather than scroll speed,
> for relatively short pages (all the tscrollx pages and some of the
> tp5o_scroll pages) it's not going to matter much anyway, as it's quite
> unlikely to have checkerboard even after reload  - slow scroll/short
> distance.

Agreed.

> 2. For relatively long pages, the synchronous part of the test doesn't get
> very far (in tp5o_scroll it's limited to 1000px which is roughly one page),
> so while it does extend the display port, long pages are typically quite a
> bit longer than 1000px, and so if their content is such that checkerboard
> happens, it's likely to happen IMO also after the display port has been
> extended a bit.

Also true.

> Also, again, fixed duration for the whole page length and long page
> (tp5o_Scroll) --> relatively fast scroll which is likely to trigger
> checkerboard regardless.
> 
> Let me know if you the above is valid. If you think it's not, then I'll have
> to consider some other approach for this test, which will make the CSSOM
> scroll be measured completely on its own.

I think it's valid.

> (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> > Created attachment 8761769 [details] [diff] [review]
> > Debug logging patch (forces checkerboarding)
> 
> Do you want me to use it in some way? or is this here just for reference?

Just for reference. There was one line in there that keeps the displayport locked to the top of the page and forces checkerboarding which I thought you might find useful if you wanted to exercise that scenario more reliably.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #14)
> I think it's valid.

Another approach is maybe to not scroll back up before we start the CSSOM scroll. For some pages surely it won't have anything left to scroll, but for those pages CB would have been unlikely anyway.

For longer pages, however, it would probably be equivalent to them being slightly shorter, but otherwise without most of the impact of already extended DP.

I'll also do another experiment with the synchronous part disabled to evaluate the difference between CSSOM on its own to CSSOM after the some sync scroll, and hopefully it would show that the difference is negligible or otherwise not worth the effort.

It the diff does turn out meaningful, however, then it could be worth the extra effort.


> > (In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #11)
> Just for reference. There was one line in there that keeps the displayport
> locked to the top of the page and forces checkerboarding which I thought you
> might find useful if you wanted to exercise that scenario more reliably.

Yeah, I noticed it, thanks. I'll keep that in mind if I need to force CB.
So I did a try push with the patches at comment 8 and comment 12 applied (but without the debug patch at comment 11): https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ca5be6d9e09&selectedJob=23611820  (the parent of a1bfd374f998 is where all the stuff is, but invisible here since it was part of an earlier try push too).

The try push also includes some minor subtest-naming changes to workaround some talos issues - move the "checkerboard"/"CSSOM" to the start of the name rather than put it at the end of the name.

- There are some Gtest failures which don't happen on the base m-c revision. According to jmaher which examined the logs, he thinks it's a leak, and it seems to me related to APZ code, so possibly from the patch at comment 8:

<jmaher> 17:53:12     INFO -  /builds/slave/try-lx-00000000000000000000000/build/src/gfx/layers/apz/test/gtest/TestPanning.cpp:17: ERROR: this mock object (used in test APZCPanningTester.PanWithTouchActionPanY) should be deleted but never is. Its address is @0xb1c1fca0.
<jmaher> 17:53:12     INFO -  ERROR: 28 leaked mock objects found at program exit.
<jmaher> so a leak


As for the actual results, this is an overall talos results comparison of this try push to itself (just to see the values, we're not comparing it to anything): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0ca5be6d9e09&newProject=try&newRevision=0ca5be6d9e09&framework=1&showOnlyImportant=0

We're interested at the "subtests" links which appears while hovering a platform at the "tp5o_scroll opt e10s" section.

- OS X: 5 pages with non zero checkerboard reports [1].
- Win7: 2 pages with non zero checkerboard reports [2].
- Linux64: all pages show zero checkerboard reports.

Is it expected? or are we expecting quite a bit more pages with non zero reports? what about linux?

Should we consider it a good test as is and land it? I'm not feeling great about it with so many zero CB reports TBH...


[1] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=0ca5be6d9e09&newProject=try&newRevision=0ca5be6d9e09&originalSignature=45adf5be5a5cab8f50e7150a541da1d064a7b037&newSignature=45adf5be5a5cab8f50e7150a541da1d064a7b037&framework=1

[2] https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=try&originalRevision=0ca5be6d9e09&newProject=try&newRevision=0ca5be6d9e09&originalSignature=d92f84d73ea6af8969ba0c70c466b432c02e6086&newSignature=d92f84d73ea6af8969ba0c70c466b432c02e6086&framework=1
Flags: needinfo?(bugmail)
Comment on attachment 8761571 [details] [diff] [review]
WIP v2

Review of attachment 8761571 [details] [diff] [review]:
-----------------------------------------------------------------

The patch seems to work, but see comment 16 for other concerns.
Attachment #8761571 - Flags: feedback?(avihpit) → feedback+
(In reply to Avi Halachmi (:avih) from comment #16)
> - OS X: 5 pages with non zero checkerboard reports [1].
> - Win7: 2 pages with non zero checkerboard reports [2].
> - Linux64: all pages show zero checkerboard reports.
> 
> Is it expected? or are we expecting quite a bit more pages with non zero
> reports? what about linux?

It's more or less expected from what I remember of running the test locally. Most of the pages had no checkerboarding during the scroll. Different platforms will have different characteristics. I ran it locally on OS X, so I'm not sure what it's like on Linux - it could be that we paint these particular pages fast enough that there's no checkerboarding at all there, and so these results are not surprising to me.

> Should we consider it a good test as is and land it? I'm not feeling great
> about it with so many zero CB reports TBH...

I think we should have *some* sort of test to detect regressions in checkerboarding. If you don't want to add this as is, that's fine, but I feel like maybe a new test with just the more stressful pages from the tp5 suite might be worth adding. Thoughts?
Flags: needinfo?(bugmail) → needinfo?(avihpit)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #18)
> I think we should have *some* sort of test to detect regressions in checkerboarding.

Completely agreed.


> I feel like maybe a new test with just the more stressful pages from the tp5
> suite might be worth adding. Thoughts?

I like the idea, but I don't see how, since we only have 5 such pages for OS X, 2 for Windows, and none for Linux... (if we trust the results at comment 16).

Maybe we should add synthetic pages to the tscroll set instead? but I wouldn't know what kinds of pages you'd like to have there (that's how the original tscroll set was born - to synthesize known hard cases at the time).
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #19)
> I like the idea, but I don't see how, since we only have 5 such pages for OS
> X, 2 for Windows, and none for Linux... (if we trust the results at comment
> 16).

Those 5 pages (and maybe some of other long ones from the tp5 set) would be a decent start. But you're right, maybe synthetic ones would be better. I'm pretty sure mstange and jrmuizel already have examples of pages that take a long time to paint and will trigger checkerboarding.
After a discussion on IRC, we agreed that I will try to find some pages that paint slowly and that make good candidates for this test.
Version: Trunk → 49 Branch
kats and I agreed on IRC to land bug 1189901 without the checkerboard tests for now.

As backup, I'm attaching the entire patch series with the CB tests which I have now (which includes the patches for 1189901 and the latest WIP patch which kats posted earlier here).
Comment on attachment 8800389 [details]
Bug 1275314 - Allow flushing in-progress checkerboard reports in the GPU process as well.

https://reviewboard.mozilla.org/r/85302/#review84140

::: gfx/ipc/GPUProcessManager.cpp:746
(Diff revision 1)
>  }
>  
> +bool
> +GPUProcessManager::NotifyGpuObservers(const char* aTopic)
> +{
> +  if (mGPUChild) {

nit: I prefer early return here, i.e. if (!mGPUChild) { return false...
Attachment #8800389 - Flags: review?(dvander) → review+
Comment on attachment 8800388 [details]
Bug 1275314 - Add an API to allow flushing out in-progress checkerboard reports.

https://reviewboard.mozilla.org/r/85300/#review84602

::: gfx/layers/apz/src/APZCTreeManager.cpp:132
(Diff revision 1)
> +  MOZ_ASSERT(NS_IsMainThread());
> +  MOZ_ASSERT(mTreeManager.get());
> +
> +  MutexAutoLock lock(mTreeManager->mTreeLock);
> +  if (mTreeManager->mRootNode) {
> +    ForEachNode<ReverseIterator>(mTreeManager->mRootNode.get(),

I'm glad to see the tree traversal algorithms being used in new code :)

::: gfx/layers/apz/src/APZCTreeManager.cpp:168
(Diff revision 1)
>        mRetainedTouchIdentifier(-1),
>        mApzcTreeLog("apzctree")
>  {
>    AsyncPanZoomController::InitializeGlobalState();
>    mApzcTreeLog.ConditionOnPrefFunction(gfxPrefs::APZPrintTree);
> +  mFlushObserver = new CheckerboardFlushObserver(this);

Do this in the initializer.
Attachment #8800388 - Flags: review?(botond) → review+
Comment on attachment 8800388 [details]
Bug 1275314 - Add an API to allow flushing out in-progress checkerboard reports.

r? to ehsan since I need a DOM peer to sign off on the .webidl change. Note that this .webidl file is not exposed to web content.
Attachment #8800388 - Flags: review?(ehsan)
Comment on attachment 8800388 [details]
Bug 1275314 - Add an API to allow flushing out in-progress checkerboard reports.

https://reviewboard.mozilla.org/r/85300/#review84628
Attachment #8800388 - Flags: review?(ehsan) → review+
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/77054c32f5df
Add an API to allow flushing out in-progress checkerboard reports. r=botond,ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/73a60f9b3c2f
Allow flushing in-progress checkerboard reports in the GPU process as well. r=dvander
Look like ClearTree() isn't being called on the tree manager at [1] before the test ends, so it ends up leaking the observer. Clearing the tree in TearDown() fixes it for me locally, try push at [2].

[1] http://searchfox.org/mozilla-central/rev/d96317a351af8aa78ab9847e7feed964bbaac7d7/gfx/layers/apz/test/gtest/APZCBasicTester.h#30
[2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=36da31478ccfe414bf795d5d0513815ade1b45c7
Flags: needinfo?(bugmail)
Pushed by kgupta@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0151fc9f97a2
Add an API to allow flushing out in-progress checkerboard reports. r=botond,ehsan
https://hg.mozilla.org/integration/mozilla-inbound/rev/6fba47ac1541
Allow flushing in-progress checkerboard reports in the GPU process as well. r=dvander
kats, is this something manual QA should be tracking?
Flags: qe-verify?
Flags: needinfo?(bugmail)
No, this is an internal API that we will be using for talos tests for checkerboarding. Thanks for checking!
Flags: needinfo?(bugmail)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #35)
> No, this is an internal API that we will be using for talos tests for
> checkerboarding. Thanks for checking!

Thank you for following up on this!
Flags: qe-verify? → qe-verify-
Depends on: 1362889
You need to log in before you can comment on or make changes to this bug.