Closed Bug 1226826 Opened 4 years ago Closed 4 years ago

Record detailed checkerboarding data for analysis

Categories

(Core :: Panning and Zooming, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla46
Tracking Status
firefox45 --- affected
firefox46 --- fixed

People

(Reporter: kats, Assigned: kats)

References

(Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

This is a follow-up from bug 1221694, wherein I landed a very basic telemetry probe for checkerboarding. However that doesn't give us a lot of actionable information - in fact that information that it gives us may not be useful at all.

I was thinking about what it is exactly we want to measure and how to make that actionable, and this bug is to track that works. The specific thing I'm interested in here is a "checkerboarding event" - that is, the sets of contiguous composite frames where a layer was checkerboarded. We can detect this relatively easily in the APZ code. For each checkerboarding event we should record the total duration of the event, the peak checkerboard (the most number of pixels that were checkerboarded in any given frame) and possible a few other similar metrics, such as total pixel-milliseconds checkerboarded during the event or composition bounds.

From that information, we can determine the frequency and severity of checkerboarding events, and track regressions in user-visible checkerboarding. In addition to that, though, we should record some vital stats on each checkerboarding event. This would include things like panning velocity, recorded paint times, and so on. This set of information would tell us *why* the checkerboarding regressed, or at least provide us with clues as to where we should focus our efforts. If were able to grab a rendertrace log of the event that would be pretty good to start, as that includes all of this information.

However, I'm not sure we can pack all of this into the standard telemetry histograms, we'd probably have to write some custom telemetry json to encapsulate it. To start with it would be good to get the machinery written and have it dump to stdout or create an about:checkerboarding where you can look at the data for your last few checkerboarding events, and save the data. We can enable this feature on nightly by default so that people who run into bad checkerboarding can send us data on what was happening.
Blocks: 1170536
Assignee: nobody → bugmail.mozilla
Attached patch WIP (obsolete) — Splinter Review
What I have so far. The recording code buffers up to 5 "historical" values for each type of rect into a circular buffer, dumping them all along with the checkerboard frames when checkerboarding is encountered. This by itself is an improvement over the existing rendertrace code in that it can be enabled by a pref at runtime, and doesn't require tedious filtering of the log and whatnot. It doesn't really provide any new functionality yet, I'll work on that next.
Comment on attachment 8705728 [details]
MozReview Request: Bug 1226826 - Add a CheckerboardEvent class to record stuff when checkerboarding happens. r?botond

https://reviewboard.mozilla.org/r/30177/#review26885

This is a first pass, no r+ yet. I'll have another look at this after looking at the other patches.

::: gfx/layers/apz/src/AsyncPanZoomController.cpp:3098
(Diff revision 1)
> +      mCheckerboardEvent = nullptr;

If one of the next BUFFER_SIZE frames checkerboard, we'll have fewer than BUFFER_SIZE "just before" property values for it. I guess that's OK?

::: gfx/layers/apz/src/CheckerboardEvent.h:21
(Diff revision 1)
> + * steps to reduce the occcurrence of checkerboarding. Furthermore, it records

too many c's in "occcurrence"

::: gfx/layers/apz/src/CheckerboardEvent.h:26
(Diff revision 1)
> +  public:

style nit: access specifiers aren't indented, class members are only indented by one level

::: gfx/layers/apz/src/CheckerboardEvent.h:129
(Diff revision 1)
> +        uint32_t mIndex;

// The index of the oldest value in the buffer.
// This is the next index that will be written to.

::: gfx/layers/apz/src/CheckerboardEvent.h:185
(Diff revision 1)
> +    std::stringstream mRendertraceInfo;

This can be an std::ostringstream, as we only write to it (and then get a copy of the entire buffer).

::: gfx/layers/apz/src/CheckerboardEvent.cpp:77
(Diff revision 1)
> +  if (mRendertraceInfo.tellg() >= LOG_LENGTH_LIMIT) {

As we're writing to the buffer, we want tellp(), the write ('put') position (tellg() is the read ('get') position). 

(If you change mRendertraceInfo to be an std::ostringstream, tellp() will be your only option.)

::: gfx/layers/apz/src/CheckerboardEvent.cpp:82
(Diff revision 1)
> +  mRendertraceInfo << "RENDERTRACE "

Please add a link somewhere to the tool that expects this format.

::: gfx/layers/apz/src/CheckerboardEvent.cpp:96
(Diff revision 1)
> +  bool ret = false;

s/ret/eventEnding

::: gfx/layers/apz/src/CheckerboardEvent.cpp:132
(Diff revision 1)
> +  for (auto it = history.begin(); it != history.end(); it++) {

for (const PropertyValue& p : history) {
  LogInfo(p.mProperty, p.mTimeStamp, p.mRect, lock);
}

::: gfx/layers/apz/src/CheckerboardEvent.cpp:145
(Diff revision 1)
> +  if (mRendertraceInfo.tellg() >= LOG_LENGTH_LIMIT) {

tellp()

::: gfx/layers/apz/src/CheckerboardEvent.cpp:156
(Diff revision 1)
> +  if (mTimeStamp < aOther.mTimeStamp) {

This operator will say that 

  {mProperty = Page, mTimeStamp = 20}
< {mProperty = UserVisible, mTimeStamp = 10}

You want an

  else if (mTimeStamp > aOther.mTimeStamp) {
    return false;
  }
  
clause.
Attachment #8705728 - Flags: review?(botond)
Comment on attachment 8705728 [details]
MozReview Request: Bug 1226826 - Add a CheckerboardEvent class to record stuff when checkerboarding happens. r?botond

https://reviewboard.mozilla.org/r/30177/#review26929

::: gfx/layers/apz/src/AsyncPanZoomController.h:1110
(Diff revision 1)
> +  UniquePtr<CheckerboardEvent> mCheckerboardEvent;

Please say something about when this is non-null. Something like: 

  - initially created on the first composite of this APZC
  - nulled out at the end of a checkerboarding event, then recreated on the next composite
Attachment #8705728 - Flags: review+
Comment on attachment 8705729 [details]
MozReview Request: Bug 1226826 - Record velocity at point of displayport request in checkerboard event. r?botond

https://reviewboard.mozilla.org/r/30179/#review26923

An alternative design to consider is, instead of making property values "rects with optional extra string data tacked on", make them a variant type, with possible component types being a rect, a point, a TimeStamp, and others added as needed. I feel this would be cleaner, as we can then have a separate enumerator for each item, the description would remain separate from the value (in the sDescriptions array, as opposed to being part of the tacked-on string), etc.

Note that there is a mozilla::Variant class in MFBT that makes working with variants easy.

::: gfx/layers/apz/src/CheckerboardEvent.h:64
(Diff revision 1)
> +                                   std::string* aExtraInfo = nullptr);

Please make this

  const std::string& aExtraInfo = std::string()
  
and notice how this simplifies code on either side.

Decent standard library implementations don't incur a heap allocation for an empty string.

(Note also that std::string() and std::string("") are equivalent string objects, except the second form requires the compiler to create a character array containing a single null character, and the std::string implementation to iterate it and realize it's empty.)
Attachment #8705729 - Flags: review?(botond) → review+
Attachment #8705731 - Flags: review?(botond)
Comment on attachment 8705731 [details]
MozReview Request: Bug 1226826 - Record if painted displayport updates were due to repaints from the relevant layer tree or not. r?botond

https://reviewboard.mozilla.org/r/30183/#review26927

::: gfx/layers/apz/src/AsyncPanZoomController.cpp:3155
(Diff revision 1)
> +        // transaction.

Not sure I follow this comment. Wouldn't a layer transaction from a different process have its own distinct paint request time?
Comment on attachment 8705730 [details]
MozReview Request: Bug 1226826 - Record paint times in the checkerboard event. r?botond

https://reviewboard.mozilla.org/r/30181/#review26933

::: gfx/layers/FrameMetrics.h:735
(Diff revision 1)
> +  mozilla::TimeStamp mPaintRequestTime;

This isn't really a per-{scroll frame} piece of information. Instead of storing it in FrameMetrics, can we send it as an argument to GeckoContentController::RequestContentRepaint on the way to content, and as an argument to PLayerTransaction::Update on the way back? I realize this involves more boilerplate, but perhaps we can mitigate that by introducing some structures that group such per-paint data (e.g. for PLayerTransaction::Update, the sequence number could go in there).

::: gfx/layers/apz/util/APZCCallbackHelper.h:175
(Diff revision 1)
> +  static TimeStamp sLastPaintRequestTime;

Not sure how I feel about this having static storage duration. Are we sure there's always just one layer tree per process? Can we instead store this in some object, like in ClientLayerManager, or on the RCD's PresShell, or on the RCD-RSF?

::: layout/base/nsLayoutUtils.cpp:3404
(Diff revision 1)
> +  APZCCallbackHelper::ClearLastPaintRequestTime();

This is a bit problematic because some page structures cause nested PaintFrame calls to occur; in such a case, on the way out of a nested call we can clear the last paint request time and not have it in place when processing scroll frames we care about in the enclosing call.

I recently added a gPaintCountStack variable (currently used only when MOZ_DUMP_PAINTING is defined) that can be used to determine whether we're in an inner PaintFrame call; feel free to reuse that, or something similar.
Attachment #8705730 - Flags: review?(botond)
(In reply to Botond Ballo [:botond] from comment #7)

All of these comments addressed locally. Thanks for the C++ tips, I still feel like a C++ n00b at times :p

(In reply to Botond Ballo [:botond] from comment #8)
> Please say something about when this is non-null. Something like: 
> 
>   - initially created on the first composite of this APZC
>   - nulled out at the end of a checkerboarding event, then recreated on the
> next composite

Done

(In reply to Botond Ballo [:botond] from comment #9)
> An alternative design to consider is, instead of making property values
> "rects with optional extra string data tacked on", make them a variant type,
> with possible component types being a rect, a point, a TimeStamp, and others
> added as needed. I feel this would be cleaner, as we can then have a
> separate enumerator for each item, the description would remain separate
> from the value (in the sDescriptions array, as opposed to being part of the
> tacked-on string), etc.

In general I agree this would be more structured and feels cleaner. However it would also change the format of the log and therefore require parser-side updates. I'm not opposed to doing this, but I'd rather defer it until after bug 1238042 is done, because then the parser code will be in-tree and we can atomically update both sides. It's not a huge change to switch to variants, but it's nontrivial so I'd prefer to do it as a cleanup the next time we need to add another piece of information to the log.

> Please make this
> 
>   const std::string& aExtraInfo = std::string()
>   
> and notice how this simplifies code on either side.

Done - much cleaner, thanks!

(In reply to Botond Ballo [:botond] from comment #10)
> ::: gfx/layers/apz/src/AsyncPanZoomController.cpp:3155
> (Diff revision 1)
> > +        // transaction.
> 
> Not sure I follow this comment. Wouldn't a layer transaction from a
> different process have its own distinct paint request time?

So the comment refers to this sort of scenario:

- Process A sends a layer tree update with paint request time C for the metrics in layers for A
- Process B sends a layer tree update with paint request time D for the metrics in layers for B

When the first layer tree update arrives, all APZCs get NotifyLayersUpdated called. The APZCs corresponding to layers in A will see the paint request time C. When the second layer tree update arrives, all APZCs *again* get NotifyLayersUpdated called. The compositor will have updated the layer info from B, but it reuses the stuff for A from first layer tree update. So APZCs corresponding to layers in A will get a second call to NotifyLayersUpdated, and it will again see paint request time C. This time, we want to ignore C because we already processed it the first time around. Does that make more sense?

(In reply to Botond Ballo [:botond] from comment #11)
> > +  mozilla::TimeStamp mPaintRequestTime;
> 
> This isn't really a per-{scroll frame} piece of information. Instead of
> storing it in FrameMetrics, can we send it as an argument to
> GeckoContentController::RequestContentRepaint on the way to content, and as

RequestContentRepaint is itself a per-scrollframe request, because each APZC will issue these calls independently. So if the paint request time doesn't belong in the FrameMetrics for that reason then it doesn't really belong in the RequestContentRepaint call either :) However, I would argue that we do want to track it on a per-scrollframe basis, because the reason we care about it is for displayport calculation, and that is per-scrollframe. Consider a layer tree with two APZCs A and B. Let's say A makes a call to RequestContentRepaint and then 5ms later B makes a call to RequestContentRepaint. The paint happens, the layers update happens, and then A and B get NotifyLayersUpdated calls. From A's point of view the total round-trip time *was* 5ms longer than B's paint time, and I think we should be taking that into account.

> Not sure how I feel about this having static storage duration. Are we sure
> there's always just one layer tree per process? Can we instead store this in
> some object, like in ClientLayerManager, or on the RCD's PresShell, or on
> the RCD-RSF?

Agreed. In fact based on what I just said above, I should be storing this in a per-scrollframe place. I might store it as a property on the element instead, the same way we store the scroll id.

> ::: layout/base/nsLayoutUtils.cpp:3404
> (Diff revision 1)
> > +  APZCCallbackHelper::ClearLastPaintRequestTime();
> 
> This is a bit problematic because some page structures cause nested
> PaintFrame calls to occur; in such a case, on the way out of a nested call
> we can clear the last paint request time and not have it in place when
> processing scroll frames we care about in the enclosing call.
> 
> I recently added a gPaintCountStack variable (currently used only when
> MOZ_DUMP_PAINTING is defined) that can be used to determine whether we're in
> an inner PaintFrame call; feel free to reuse that, or something similar.

If I store the time on the element then I can probably just clear it when I populate the FrameMetrics and that would sidestep this problem.
> (In reply to Botond Ballo [:botond] from comment #9)
> > An alternative design to consider is, instead of making property values
> > "rects with optional extra string data tacked on", make them a variant type,
> > with possible component types being a rect, a point, a TimeStamp, and others
> > added as needed. I feel this would be cleaner, as we can then have a
> > separate enumerator for each item, the description would remain separate
> > from the value (in the sDescriptions array, as opposed to being part of the
> > tacked-on string), etc.
> 
> In general I agree this would be more structured and feels cleaner. However
> it would also change the format of the log and therefore require parser-side
> updates. I'm not opposed to doing this, but I'd rather defer it until after
> bug 1238042 is done, because then the parser code will be in-tree and we can
> atomically update both sides. It's not a huge change to switch to variants,
> but it's nontrivial so I'd prefer to do it as a cleanup the next time we
> need to add another piece of information to the log.

Sounds good.

> So the comment refers to this sort of scenario:
> 
> - Process A sends a layer tree update with paint request time C for the
> metrics in layers for A
> - Process B sends a layer tree update with paint request time D for the
> metrics in layers for B
> 
> When the first layer tree update arrives, all APZCs get NotifyLayersUpdated
> called. The APZCs corresponding to layers in A will see the paint request
> time C. When the second layer tree update arrives, all APZCs *again* get
> NotifyLayersUpdated called. The compositor will have updated the layer info
> from B, but it reuses the stuff for A from first layer tree update. So APZCs
> corresponding to layers in A will get a second call to NotifyLayersUpdated,
> and it will again see paint request time C. This time, we want to ignore C
> because we already processed it the first time around. Does that make more
> sense?

Yup - thanks for clarifying!

> (In reply to Botond Ballo [:botond] from comment #11)
> > > +  mozilla::TimeStamp mPaintRequestTime;
> > 
> > This isn't really a per-{scroll frame} piece of information. Instead of
> > storing it in FrameMetrics, can we send it as an argument to
> > GeckoContentController::RequestContentRepaint on the way to content, and as
> 
> RequestContentRepaint is itself a per-scrollframe request, because each APZC
> will issue these calls independently. So if the paint request time doesn't
> belong in the FrameMetrics for that reason then it doesn't really belong in
> the RequestContentRepaint call either :) However, I would argue that we do
> want to track it on a per-scrollframe basis, because the reason we care
> about it is for displayport calculation, and that is per-scrollframe.
> Consider a layer tree with two APZCs A and B. Let's say A makes a call to
> RequestContentRepaint and then 5ms later B makes a call to
> RequestContentRepaint. The paint happens, the layers update happens, and
> then A and B get NotifyLayersUpdated calls. From A's point of view the total
> round-trip time *was* 5ms longer than B's paint time, and I think we should
> be taking that into account.

That sounds reasonable if the patch is updated, as you propose, to track the last paint request time on a per-scrollframe basis on the content side as well.

> > ::: layout/base/nsLayoutUtils.cpp:3404
> > (Diff revision 1)
> > > +  APZCCallbackHelper::ClearLastPaintRequestTime();
> > 
> > This is a bit problematic because some page structures cause nested
> > PaintFrame calls to occur; in such a case, on the way out of a nested call
> > we can clear the last paint request time and not have it in place when
> > processing scroll frames we care about in the enclosing call.
> > 
> > I recently added a gPaintCountStack variable (currently used only when
> > MOZ_DUMP_PAINTING is defined) that can be used to determine whether we're in
> > an inner PaintFrame call; feel free to reuse that, or something similar.
> 
> If I store the time on the element then I can probably just clear it when I
> populate the FrameMetrics and that would sidestep this problem.

Yup.
Comment on attachment 8705728 [details]
MozReview Request: Bug 1226826 - Add a CheckerboardEvent class to record stuff when checkerboarding happens. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30177/diff/1-2/
Comment on attachment 8705729 [details]
MozReview Request: Bug 1226826 - Record velocity at point of displayport request in checkerboard event. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30179/diff/1-2/
Comment on attachment 8705730 [details]
MozReview Request: Bug 1226826 - Record paint times in the checkerboard event. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30181/diff/1-2/
Attachment #8705730 - Flags: review?(botond)
Comment on attachment 8705731 [details]
MozReview Request: Bug 1226826 - Record if painted displayport updates were due to repaints from the relevant layer tree or not. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30183/diff/1-2/
Attachment #8705731 - Flags: review?(botond)
Comment on attachment 8705728 [details]
MozReview Request: Bug 1226826 - Add a CheckerboardEvent class to record stuff when checkerboarding happens. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30177/diff/2-3/
Comment on attachment 8705729 [details]
MozReview Request: Bug 1226826 - Record velocity at point of displayport request in checkerboard event. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30179/diff/2-3/
Comment on attachment 8705730 [details]
MozReview Request: Bug 1226826 - Record paint times in the checkerboard event. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30181/diff/2-3/
Comment on attachment 8705731 [details]
MozReview Request: Bug 1226826 - Record if painted displayport updates were due to repaints from the relevant layer tree or not. r?botond

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/30183/diff/2-3/
Attachment #8705730 - Flags: review?(botond) → review+
Comment on attachment 8705730 [details]
MozReview Request: Bug 1226826 - Record paint times in the checkerboard event. r?botond

https://reviewboard.mozilla.org/r/30181/#review27577

::: dom/base/nsGkAtomList.h:2172
(Diff revision 3)
> +GK_ATOM(PaintRequestTime, "_paintrequesttime")

nit: Not sure to what extent any of this is "IPC stuff", but can we move "apzCallbackTransform" into this block as well, so that the properties used by APZ are together?

::: gfx/layers/FrameMetrics.h:736
(Diff revision 3)
> +  mozilla::TimeStamp mPaintRequestTime;

We're inside namespace mozilla, so the qualification should be unnecessary.

::: layout/base/nsLayoutUtils.cpp:8520
(Diff revision 3)
> +  if (void* paintRequestTime = aContent->GetProperty(nsGkAtoms::PaintRequestTime)) {

Move this into the 'if (aContent)' block below.
Attachment #8705731 - Flags: review?(botond) → review+
Comment on attachment 8705731 [details]
MozReview Request: Bug 1226826 - Record if painted displayport updates were due to repaints from the relevant layer tree or not. r?botond

https://reviewboard.mozilla.org/r/30183/#review27579
(In reply to Botond Ballo [:botond] from comment #22)
> > +GK_ATOM(PaintRequestTime, "_paintrequesttime")
> 
> nit: Not sure to what extent any of this is "IPC stuff", but can we move
> "apzCallbackTransform" into this block as well, so that the properties used
> by APZ are together?

Oh good point. I ended up moving the new PaintRequestTime property to be with apzCallbackTransform, since that one is correctly in the "Content property names" section, which is where PaintRequestTime belongs as well. I left the IPC stuff section as-is.
See Also: → 1239549
Thanks for the backout, and sorry for the bustage. :(

I think the problem was that aSampleTime comes from vsync, whereas mLastSampleTime is initialized from TimeStamp::Now(), and so if on the first frame the vsync timestamp is earlier than "now" it causes the assertion to fail. I'll fix that up and do a try push to confirm.
Flags: needinfo?(bugmail.mozilla)
I thought I posted this yesterday, whoops!

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #28)
> I think the problem was that aSampleTime comes from vsync, whereas
> mLastSampleTime is initialized from TimeStamp::Now(), and so if on the first
> frame the vsync timestamp is earlier than "now" it causes the assertion to
> fail. I'll fix that up and do a try push to confirm.

This was part of the problem. The other part was that the sample time was actually moving backwards in some cases, I filed bug 1239861 about that. However after chasing down these issues I think it makes more sense to just use TimeStamp::Now() values rather than the sample time values. The reason is that the sample time values go through some amount of vsync-related "processing" and may not line up directly with TimeStamp::Now() (hence the problems above). The sample time values are also subject to various thread-switch latencies as they propagate over from the vsync source. For the purposes of checkerboard recording, though, we are more interested in the user's perception of checkerboarding, so we should be using values from the TimeStamp::Now() timeline which is more representative of what they are actually seeing.

It's a small change to the patch - instead of passing in aSampleTime to the RecordFrameInfo call, I just use a local variable initialized to TimeStamp::Now() inside the function.
Note that spamming stuff to the terminal in opt builds is generally a really bad idea.  Filed bug 1241148 on that.
You need to log in before you can comment on or make changes to this bug.