Closed Bug 1713360 Opened 4 years ago Closed 4 years ago

Fission Checkerboarding issues for low-cpu and low-mem users

Categories

(Core :: Panning and Zooming, defect, P3)

Tracking

RESOLVED FIXED
Fission Milestone M8

Tracking Status
firefox92 --- affected

People

(Reporter: neha, Assigned: tnikkel)

References

Details

(Whiteboard: fission-soft-blocker)

Attachments

(1 file)

We are now seeing worse checkerboarding severity with Fission in our Beta experiment, even though this metric was actually much better with Fission in the Nightly experiment. See https://protosaur.dev/partybal/bug_1706428_pref_fission_m7_beta_experiment_with_memory_filter_beta_89_91.html#checkerboard_severity461 and look at the (relative uplift) graphs for comparison; you can see it is higher for Fission for all of the memory groups (2-4 GB, 4-6 GB, and high-memory).
https://protosaur.dev/fission-experiment-monitoring-dashboard/dashboard/dashboard.html is the Nightly experiment Fission dashboard, which shows Fission improving this metric.
The Beta results are unexpected and need investigation.

Flags: needinfo?(botond)
Fission Milestone: --- → M7a

Is our Beta population (or the segment of it involved in this experiment) skewed heavily towards certain locales? I seem to recall that a couple of Asian countries (India? maybe others as well?) had a disproportionately-large number of beta users, due to historical distribution quirks; that might imply a relatively large number of older/lower-end machines with less CPU power, and/or running older OS versions. If that's still the case, it could perhaps result in quite different performance characteristics showing up in the different channels.

Following on from jfkthame: if our Beta population has significantly less powerful GPUs on average than Nightly, AND with Fission we are for some reason asking to rasterize more content, that could explain what we see. If we rasterize more content AND the GPU is powerful enough that this does not delay any frames, then we would expect to see less checkerboarding, simply because more content is rasterized and ready to scroll to. However, if the GPU is underpowered, asking for more content to be rasterized causes us to miss even more frames, and that could increase checkerboarding.

See Also: → 1713547

(In reply to Jonathan Kew (:jfkthame) from comment #1)

Is our Beta population (or the segment of it involved in this experiment) skewed heavily towards certain locales? I seem to recall that a couple of Asian countries (India? maybe others as well?) had a disproportionately-large number of beta users, due to historical distribution quirks; that might imply a relatively large number of older/lower-end machines with less CPU power, and/or running older OS versions. If that's still the case, it could perhaps result in quite different performance characteristics showing up in the different channels.

IIUC, over 50% of Beta users are in Southeast Asia and over 50% of Beta users (the same users?) are running Windows 7, compared to only 20% of Release channel users running Windows 7. So we will see slower computers and probably more Software WebRender usage on Beta. About 92% of Windows Nightly users have Accelerated WebRender, but only about 70% of Windows Beta users do.

I'm not sure how it could be involved, but there are two prefs related to this area that have different behaviour for Fission: apz.prefer_jank_minimal_displayports and apz.wr.activate_all_scroll_frames_when_fission. I don't think they would cause this, but if they were misbehaving, or there was some other interaction that we didn't understand, they could be involved.

  • What is considered a big regression for CHECKERBOARDING_SEVERITY? From our Fission experiment in Beta 89, checkerboarding is more likely for all Fission users, but users who currently see very little checkerboarding with e10s might see a 1000% regression with Fission (increasing from ~20 to ~2000 "severity points"), while users who currently see a lot of checkerboarding with e10s might see "only" a 100% regression with Fission (increasing from ~10,000 to 25,000 "severity points"). Are those actually big differences?

  • Do we have local benchmarks for CHECKERBOARDING_SEVERITY that we can profile with Fission?

While investigating this, we discovered two potentially relevant bugs:

  • The checkerboarding telemetry code sometimes wrongly considers a scrollable element to be checkerboarding when it's actually not (tracked in bug 1713547). This happens when the element does not have a displayport set, for whatever reason. In such cases, the actual behaviour is that the viewport is painted instead, but the telemetry code treats this situation as if nothing was painted.
  • OOP iframes sometimes do not get displayports, thereby making them particularly affected by the previous bug. It looks like this happens when the iframe's viewport is overflow:hidden, and only so long as the iframe is not interacted with. This is being fixed in bug 1709460.

These bugs suggest that the reported telemetry with Fission enabled may paint a worse picture than what is actually the case. I think a good first step is to fix these and see how they affect the telemetry numbers.

Flags: needinfo?(botond)
See Also: → 1709460

(In reply to Timothy Nikkel (:tnikkel) from comment #4)

I'm not sure how it could be involved, but there are two prefs related to this area that have different behaviour for Fission: apz.prefer_jank_minimal_displayports and apz.wr.activate_all_scroll_frames_when_fission. I don't think they would cause this, but if they were misbehaving, or there was some other interaction that we didn't understand, they could be involved.

Summarizing our discussion about these prefs:

  • apz.wr.activate_all_scroll_frames_when_fission is required for correctness (bug 1675547). I don't think we can reasonably turn this off for experimentation purposes.
  • apz.prefer_jank_minimal_displayports could be turned off for experimentation purposes (though, based on what this pref does, I would expect turning it off to only increase checkerboarding). It could be something to consider trying in order to rule out bugs related to it.

(In reply to Chris Peterson [:cpeterson] from comment #5)

  • Do we have local benchmarks for CHECKERBOARDING_SEVERITY that we can profile with Fission?

To my knowledge, we currently do not have such benchmarks. We could possibly look into developing some, perhaps using existing performance tests as a starting point.

  • What is considered a big regression for CHECKERBOARDING_SEVERITY? From our Fission experiment in Beta 89, checkerboarding is more likely for all Fission users, but users who currently see very little checkerboarding with e10s might see a 1000% regression with Fission (increasing from ~20 to ~2000 "severity points"), while users who currently see a lot of checkerboarding with e10s might see "only" a 100% regression with Fission (increasing from ~10,000 to 25,000 "severity points"). Are those actually big differences?

I don't really have a good answer for this yet, but I'll summarize my understanding so far:

  • The checkerboarding telemetry code considers a "checkerboarding event" to be a contiguous sequence of composited frames where each frame has a non-zero number of pixels of checkerboarding.
  • At the end of every checkerboarding event, three metrics about the event (duration, peak, and severity) are accumulated into telemetry histograms. The severity is computed as the sum, over each composited frame that's part of the event, of the number of pixels of checkerboarding for that frame multiplied by the duration of the frame in milliseconds; we then take the square root of this sum for reporting purposes to make sure it fits into a 32-bit value. (A small sketch of this computation follows the list below.)
  • So, e.g. an event with 2000 severity points represents 4 million pixel-milliseconds of checkerboarding. That could be e.g. a 1000x100 pixel area checkerboarding for 40 ms (~3 frames).
  • As far as I can tell, each checkerboarding event is reported independently (that is, data about events within a session are not aggregated in any way).

The last bullet point above (events being reported independently) suggests that we need more than just an aggregate statistic like the median severity to interpret changes in checkerboarding metrics. We also need to know how the total number of events changed: for example, we could have a significant reduction in the overall number of checkerboarding events coupled with an increase in the median severity of the remaining events, and that could be an overall win (because we've gotten rid of lower-intensity checkerboarding events) even though the median severity has increased.

This leads me to ask a couple of questions:

  • Is there some sort of guide or documentation to interpreting the graphs at the protosaur.dev link in comment 0? Some of those graphs (like the first and second ones) make sense to me, but some of these others (e.g. the "density estimate" one) I'm confused about.
  • (Assuming the answer isn't already there and I just don't know where to look): can we get information about the total number of checkerboarding severity events with and without fission?
Flags: needinfo?(cpeterson)

FWIW, I tried to see how our CHECKERBOARD_SEVERITY looks with/without Fission on my local debug build (i.e. a nightly build); the results might be interesting.

What I did:

  1. Launch the build with/without Fission (./mach run --setpref "fission.autostart=true" or --setpref "fission.autostart=false")
  2. Open about:telemetry, type "CHECKERBOARD_SEVERITY" in the search field, and check that there is no record
  3. Open a new tab and load https://hsivonen.fi/fission-host.html
  4. Go back to the about:telemetry tab and reload it

I did this three times each for the Fission and non-Fission cases. The results:

Fission

CHECKERBOARD_SEVERITY
4 samples, average = 29,922.3, sum = 119,689

 7151 |  0  0%
10946 |#########################  3  75%
60084 |########  1  25%
91968 |  0  0%

CHECKERBOARD_SEVERITY
2 samples, average = 45,045.5, sum = 90,091

 7151 |  0  0%
10946 |#########################  1  50%
60084 |#########################  1  50%
91968 |  0  0%

CHECKERBOARD_SEVERITY
2 samples, average = 40,805, sum = 81,610

 4672 |  0  0%
 7151 |#########################  1  50%
60084 |#########################  1  50%
91968 |  0  0%

Non Fission

CHECKERBOARD_SEVERITY
1 sample, average = 110,661, sum = 110,661

 60084 |  0  0%
 91968 |#########################  1  100%
140771 |  0  0%

CHECKERBOARD_SEVERITY
1 sample, average = 83,387, sum = 83,387

39254 |  0  0%
60084 |#########################  1  100%
91968 |  0  0%

CHECKERBOARD_SEVERITY
1 sample, average = 79,883, sum = 79,883

39254 |  0  0%
60084 |#########################  1  100%
91968 |  0  0%

With this simple comparison: with Fission the number of samples increases, but the severity of those additional samples is lower. I am pretty sure this is due to bug 1713547, so I presume bug 1713547 skews the Nightly experiment result. I reached the opposite hypothesis this morning (I was thinking bug 1713547 generally increases severity numbers), but I was wrong.

There's another interesting thing in the results: even though I didn't touch anything in the opened tab, a single severity value was recorded in the non-Fission case. It looks like another case where we have no displayport, and it happens both with and without Fission.

Just a quick additional note: with the patch for bug 1713547, there is no severity record in either the Fission or non-Fission case. That means the patch covers the displayport-isn't-set severity recording issue for both Fission and non-Fission. Bug 1709460, on the other hand, covers only the Fission case; with just the patches for bug 1709460, a single record still appears in both the Fission and non-Fission cases.

(In reply to Botond Ballo [:botond] from comment #8)

  • Is there some sort of guide or documentation to interpreting the graphs at the protosaur.dev link in comment 0? Some of those graphs (like the first and second ones) make sense to me, but some of these others (e.g. the "density estimate" one) I'm confused about.

I asked the data scientist (Corey Dow-Hygelund) who created Fission's protosaur.dev dashboard. Unfortunately, there is no documentation for protosaur.dev, but I am told the density estimate and eCDF ("empirical Cumulative Distribution Function") are both effectively normalized histograms. The density estimate and eCDF for checkerboarding_severity show that fission-disabled had more low-value checkerboarding events and fission-enabled had more high-value checkerboarding events.

  • (Assuming the answer isn't already there and I just don't know where to look): can we get information about the total number of checkerboarding severity events with and without fission?

Unfortunately, we don't have that data for our experiment right now, but Corey will add that for our next Fission Beta experiment (the number of checkerboarding_severity records in the histogram, normalized by active hours).

Your question revealed a problem with our analysis of the checkerboarding_severity probe: checkerboarding is only reported when it happens, so Fission users could (hypothetically) have much less checkerboarding overall, while the users that do checkerboard report higher values. Your suggestion to look at the number of checkerboarding events should avoid that problem.

Flags: needinfo?(cpeterson) → needinfo?(botond)

Okay, so if the density chart is essentially a histogram then fission disabled has a lot more checkerboard events, with a slightly smaller median. We don't have the same view of the checkerboarding data between Nightly and Beta, so these pages might be presenting underlying data that is actually the same but, because of different presentation choices, leads to different conclusions; i.e. we might be comparing apples and oranges. We need to look at the underlying data from Beta and Nightly with an identical presentation to see if there is even a difference. Can we do this?

Severity: -- → S3
Priority: -- → P3

(In reply to Timothy Nikkel (:tnikkel) from comment #12)

Okay, so if the density chart is essentially a histogram then fission disabled has a lot more checkerboard events, with a slightly smaller median.

Yes, that is my understanding.

We don't have the same view of the checkerboarding data between Nightly and Beta, so these pages might be presenting underlying data that is actually the same but, because of different presentation choices, leads to different conclusions; i.e. we might be comparing apples and oranges. We need to look at the underlying data from Beta and Nightly with an identical presentation to see if there is even a difference. Can we do this?

Unfortunately, we don't have the same data view for Nightly (due to some implementation differences in how we are testing Fission on Nightly and Beta). Creating the same data view would require extra work from the Data Science team, who is pretty swamped.

However, the Nightly dashboard [1] only shows the mean for all users, and we can compare it to Beta's mean checkerboarding_severity (for all users [2] or by different CPU and memory segments). Like Nightly, Beta's mean checkerboarding for fission-enabled is better than fission-disabled for all users. Looking at Beta's user segments, Beta's mean checkerboarding is better for high_cpu users [3] and users with more than 6 GB of memory, but worse for low_cpu users [4] and users with less than 6 GB. So I don't think "Beta is worse than Nightly". It's just that we have a way to segment Beta's users by high/low CPU and memory, and when we do, we see that users with low CPU or low memory experience worse checkerboarding.

[1] https://protosaur.dev/fission-experiment-monitoring-dashboard/dashboard/dashboard.html

[2] https://protosaur.dev/partybal/bug_1706428_pref_fission_m7_beta_experiment_with_memory_filter_beta_89_91.html#checkerboard_severity

[3] https://protosaur.dev/partybal/bug_1706428_pref_fission_m7_beta_experiment_with_memory_filter_beta_89_91.html#checkerboard_severity97

[4] https://protosaur.dev/partybal/bug_1706428_pref_fission_m7_beta_experiment_with_memory_filter_beta_89_91.html#checkerboard_severity279

Summary: Fission Checkerboarding issues → Fission Checkerboarding issues for low-cpu and low-mem users

When we compare CHECKERBOARDING_SEVERITY between fission-enabled and fission-disabled we are already comparing apples to oranges, because of the pref apz.wr.activate_all_scroll_frames_when_fission. With Fission enabled, every scroll frame that could ever async-scroll gets a displayport and hence an APZC. Without Fission, only scroll frames that are actually scrolled or are special in some way (e.g. the root scroll frame in the root content document) get this treatment. So we simply have a bunch of extra APZCs around with Fission. As well, when a user goes to scroll a scroll frame for the first time, we have different behaviour: without Fission we set a displayport and create an APZC for the first time; with Fission we change the existing displayport from a "minimal" one to a normally sized one. (We have this different behaviour with Fission for correctness: we need it to do hit testing correctly. It was not a perf-related decision.)

So in order to compare CHECKERBOARDING_SEVERITY numbers between Fission and non-Fission we need the detailed distribution data; otherwise we can't say anything, because we expect CHECKERBOARDING_SEVERITY to differ between Fission enabled and disabled for the above reasons.

We've had bugs in the past (specifically bug 1693636) where it looked like we either improved or regressed CHECKERBOARDING_SEVERITY, but in reality we had either added or removed a lot of CHECKERBOARDING_SEVERITY events of low magnitude. So the apparent effect from looking at the mean was in reality exactly the opposite (the mean looked like a regression, but users actually saw fewer checkerboarding events).

Looking at the protosaur links from comment 13, specifically the density charts (which, if they are histograms, are the only charts we can get useful data out of): in all cases we see a lot more low-magnitude events with fission-disabled. I think in every case the fission-enabled density chart is superior.

If we can't get any more data (or the same data presented in a fashion that is easier to properly interpret) I don't think there is any work to do here.

We still want to land bug 1713547 and bug 1709460. As well, it would be good to look into changing this metric (or adding a new metric and de-emphasizing CHECKERBOARDING_SEVERITY) so that the easiest/most obvious/default way of looking at it (and what our telemetry sends alerts for) is not susceptible to the problems mentioned above.

So in order to compare CHECKERBOARDING_SEVERITY numbers between Fission and non-Fission we need the detailed distribution data; otherwise we can't say anything, because we expect CHECKERBOARDING_SEVERITY to differ between Fission enabled and disabled for the above reasons.

Corey (Fission data science) will report the CHECKERBOARDING_SEVERITY event count (normalized by users' active hours) for our in-progress Fission Beta experiment. Like the bugs you described, this will give us a better idea of how often checkerboarding actually happens, not just the severity when it does happen.

If we can't get any more data (or the same data presented in a fashion that is easier to properly interpret) I don't think there is any work to do here.

We still want to land bug 1713547 and bug 1709460. As well, it would be good to look into changing this metric (or adding a new metric and de-emphasizing CHECKERBOARDING_SEVERITY) so that the easiest/most obvious/default way of looking at it (and what our telemetry sends alerts for) is not susceptible to the problems mentioned above.

I will move this bug to a later Fission milestone so we will remember to review CHECKERBOARDING_SEVERITY after we get CHECKERBOARDING_SEVERITY event counts and those bug fixes ride to Beta.

Fission Milestone: M7a → M8

(In reply to Chris Peterson [:cpeterson] from comment #15)

Corey (Fission data science) will report the CHECKERBOARDING_SEVERITY event count (normalized by users' active hours) for our in-progress Fission Beta experiment. Like the bugs you described, this will give us a better idea of how often checkerboarding actually happens, not just the severity when it does happen.

(In the interest of saving the data science person's time) If the density charts on the linked page are indeed basically histograms, then they provide more data than just getting event counts would, and they already confirm that we are seeing more events in the fission-disabled case.

I am not precisely sure what is meant by these plots being "basically histograms". To clarify: the values in these charts are the mean of each client's summed histogram across the experiment window. All of the client means form a distribution. The density plot is this distribution; the means, quantiles, and eCDF are all calculated from it.

The plots show that the quantiles of the client means are lower for Fission. However, they don't show that the absolute number of events that actually trigger the probe is lower. The idea is that Fission could be triggering more events, it's just that a significant portion of them have lower severity. That would pull the quantiles down even if Fission was performing worse.

From what I understand, the trigger for this measurement is different between Fission and non-Fission. Therefore, comparing the number of events between the two branches has little utility or interpretability.

Is this understanding correct?

(In reply to Corey Dow-Hygelund [:ccd] from comment #17)

I am not precisely sure what is meant by these plots being "basically histograms". To clarify: the values in these charts are the mean of each client's summed histogram across the experiment window. All of the client means form a distribution. The density plot is this distribution; the means, quantiles, and eCDF are all calculated from it.

"Distribution" meaning that, if we take a point on a curve in the density plot, the x value represents the mean checkerboard_severity and the y value represents how many clients in the population have that mean? If the density plot is a distribution, wouldn't we expect the areas under the two curves to be equal? They do not look equal to me.

The plots show that the quantiles of the client means are lower for Fission. However, they don't show that the absolute number of events that actually trigger the probe is lower. The idea is that Fission could be triggering more events, it's just that a significant portion of them have lower severity. That would pull the quantiles down even if Fission was performing worse.

I think it's the other way around (the mean is higher for fission), but yes.

From what I understand, the trigger for this measurement is different between Fission and non-Fission. Therefore, comparing the number of events between the two branches has little utility or interpretability.

There is a different trigger, but I think we can still interpret the data we get from it meaningfully. Two histograms, one with Fission and one without, containing all checkerboard_severity events with nothing done to them per-client, would be the most helpful here: then we could see how the distributions differ and whether that makes sense given our understanding of this metric and the differences in how Fission and non-Fission behave with regard to it. The mean and total event counts would also help us understand whether our theory is correct, just in less detail.

And I do think we want to compare this data so that we are making informed choices, so that we can say "yes, we understand why it's different with Fission and we think that is okay" or "we understand why it's different, but we think we still need to make some changes", even if comparing the actual means to each other isn't mathematically meaningful.

We didn't understand how checkerboard_severity worked until we dug into it for this bug (and so we weren't aware of these issues), so having a value that is easier to compare, interpret, and track is definitely on our "want list" now that we know.

See Also: → 1715961
Assignee: nobody → botond

In bug 1709460 and bug 1713547, Hiro landed some patches to improve the accounting of this, and he included some comparisons on Nightly before/after his patches. The comparison changed in the way we expected, which increases our confidence that we understand this metric and why it seems to be worse on Beta. If another Beta report is coming for 91 Beta, we could look at those numbers to confirm.

Component: Graphics → Panning and Zooming

This bug is a soft blocker for Fission M8. We'd like to fix it before our M8 Release experiment, but we won't delay the experiment waiting for it.

Whiteboard: fission-soft-blocker

The remaining task here is to have a look at telemetry data from beta 91, and double-check that it's in line with our expectations.

I will be on PTO for the next couple of weeks, but Timothy has kindly indicated that, with Hiro's help, he will take care of this final step.

Flags: needinfo?(botond)

Thanks. I think the Beta 91 experiment's first week of telemetry data will be available later this week. I'll share it with Timothy and Hiro then.

Moving this to Tim as assignee while Botond is on extended PTO. I moved the other audit issue off your plate, Tim. Thanks for the help on this.

Assignee: botond → tnikkel

Note that even when we get the new results from the experiment, it could very well still show a difference between Fission and non-Fission, because we are still (somewhat) comparing apples and oranges. The histograms that Hiro produced from Nightly after landing his two patches show what we expected (and I now understand better how they are computed), so for the results of the Beta experiment to not be in line with our expectations, they would have to differ quite a bit from both the previous Beta experiment and what we see in the Nightly histograms.

Now that I have the new results, they show us:

  1. Fission severity data look worse on lower percentiles
  2. Fission severity data look better on higher percentiles

Low CPU or low memory doesn't matter at all, right?

A principle of checkerboard severity is that with Fission there are more opportunities to cause checkerboarding, but the severity values are much lower than the values without Fission. So the distributions for Fission/non-Fission should look something like the images I am attaching. Note that these two histograms do not represent the Fission Beta results; they are from a totally different dataset, but I am using them to make things easier to understand.

Without Fission, the histogram should look something like a Gaussian distribution with a single peak, like the right histogram in the image.
With Fission, the distribution has another peak, just like the left histogram in the image, though the peak should not be as pronounced as in the image.

This explains facts 1) and 2) above.

I believe the result should be the same on Nightly; just looking at the mean value would be confusing.

I am closing this bug, since we (especially I) now think the distributions on low-mem/low-CPU Beta and on Nightly are basically the same.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
