Closed Bug 1634406 Opened 1 year ago Closed 1 year ago

Implement similarity metric in visual metrics task

Categories

(Testing :: Performance, task, P2)

Version 3
task

Tracking

(firefox78 fixed)

RESOLVED FIXED
mozilla78
Tracking Status
firefox78 --- fixed

People

(Reporter: sparky, Assigned: sparky)

References

Details

Attachments

(1 file)

We recently started running tests on live sites, but it's currently not possible to interpret the data because two major variables are colliding: (1) a shifting code base, and (2) shifting test pages. So when the live-site data changes, we can't tell which of the two caused it, and that prevents live sites from providing any value.

Yesterday, I came up with a metric that captures whether content changes have occurred - in other words, how similar a test's recordings are to those from the previous run: a similarity metric.

The metric is calculated as follows:

1. Find the last live site test that ran and get the videos.
2. For each of the 15x15 video pairings (effectively 225 trials) build up a cross-correlation matrix:
    1. Calculate the two histograms of the entire videos (using 255 bins).
    2. Run a correlation between these two histograms.
3. Average the cross-correlation matrix produced in (2) to obtain the similarity score.
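The steps above could be sketched roughly like this in NumPy (the function names and the use of Pearson correlation via `np.corrcoef` are my own illustration of the description, not the patch's actual code - the real implementation is linked further down):

```python
import numpy as np

def video_histogram(video, bins=255):
    # video: ndarray of shape (frames, height, width) of grayscale pixel
    # values; the histogram is taken over the entire video at once.
    hist, _ = np.histogram(video, bins=bins, range=(0, 255))
    return hist

def similarity_score(old_videos, new_videos, bins=255):
    """Average cross-correlation of whole-video histograms.

    old_videos/new_videos: lists of video arrays, one per browsertime
    run (e.g. 15 each, giving the 15x15 / 225-trial matrix).
    """
    corr = np.zeros((len(old_videos), len(new_videos)))
    for i, old in enumerate(old_videos):
        for j, new in enumerate(new_videos):
            h_old = video_histogram(old, bins)
            h_new = video_histogram(new, bins)
            # Correlation between the two histograms.
            corr[i, j] = np.corrcoef(h_old, h_new)[0, 1]
    # The similarity score is the mean of the cross-correlation matrix.
    return float(np.mean(corr))
```

Comparing a video against itself produces a score of 1, and unrelated pixel distributions drive the score toward 0, matching the interpretation described below.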

The similarity score ranges from 0 to 1, with both 0 and 1 indicating problems in the recording. The 0 is self-explanatory, but a correlation of 1 means the recordings are too perfect, and that we are likely just recording a blank page.

Values strictly between 0 and 1 mean that the page load is working. Values approaching 1 mean the page has the same content, whereas values approaching 0 mean the page no longer has the same content as in the last test run. Thinking about this as a time series, a single dip in an otherwise high similarity score indicates when a page changed content (the score would return to its usual value in the next run).

That said, this similarity metric has much broader applicability. Here's a list of all the applications I see for it:

  1. Consistently low scores indicate tests which are highly variable and not good candidates for testing.
  2. You might have guessed by now that it captures variability of a test in a single metric.
    1. Makes it an ideal candidate to help with choosing tests to sheriff.
    2. Works for recorded and live sites - we could even compare live to recorded sites to determine the quality of the recording.
  3. No need to run a ton of tests to look at the data anymore; just run this and it will tell you how good or bad the test is for performance testing (as mentioned above, this has an effective trial count of 225 across two test runs). Large savings in terms of CI costs.
  4. We can now determine when live sites change, and when our product regresses on those live sites.
    1. Using the scores from other browsers: if we see a drop in all of them, the content changed; if it's only our product, then it's a regression.
    2. Determining quality of the page for live site testing, and continuous monitoring of the quality is simple.
    3. It's now possible to expand what live sites we test because of this.
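The cross-browser triage rule in (4.1) could be sketched as follows. This is a hypothetical helper - the browser names, the threshold value, and the return labels are all illustrative assumptions, not anything from the patch:

```python
def classify_drop(scores_by_browser, threshold=0.5):
    """Triage a similarity-score drop on a live site.

    scores_by_browser: {browser_name: similarity_score} for the same
    live site. threshold is an illustrative cutoff for "dropped".
    """
    dropped = {b: s < threshold for b, s in scores_by_browser.items()}
    if all(dropped.values()):
        # Every browser dropped at once: the page content changed.
        return "content-change"
    others_dropped = any(v for b, v in dropped.items() if b != "firefox")
    if dropped.get("firefox") and not others_dropped:
        # Only our product dropped: likely a regression on our side.
        return "firefox-regression"
    return "inconclusive"
```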

Now, one thing you might be wondering about is network variability. The neat thing about this technique is that if network variability or network changes occur during the test and affect page load performance, we will catch it: the number of frames we capture changes (along with how fast or slow the content in each frame changes), which changes the histogram, which changes the metric. If the network changes don't affect performance much, we won't see a change in the metric.

I wrote a quick patch to show this score in action: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ddc7e148661363471aefda57eb2762befce39aa0&selectedTaskRun=NMTPyHhKRse5n2Nr9ICCpA-0

I've linked it to the expedia test, which has a very low score compared to the last test run - this indicates that it's a highly variable test and not an ideal candidate for performance testing. If you look at the google-accounts test, you'll see it has a high score because there's very little to the page, so it's very consistent. Looking at ESPN, you'll see it also has a high score: while it takes a long time to load, it loads consistently.

The actual meat of how the score is calculated is here: https://hg.mozilla.org/try/rev/557d010171a836c892bd8cfb610a9d72c2b066b9#l2.237

I'm going to be reworking this patch into something I can land later this week or next week so we could start exploring it in more depth in production because it looks like it will be very useful for us.

That said, I'd like to open this up for discussion to see what everyone else thinks about it. Feel free to CC others here as well.

This patch adds a new similarity metric that will allow us to determine when content changes occur in live site tests. It's also enabled for recorded sites so we can compare the quality of (and difference between) a recording and the live site.

Depends on D73277

Assignee: nobody → gmierz2
Status: NEW → ASSIGNED

I ran this patch in multiple try runs to check its stability, and it looks very good (the runs use slightly different patches, but they all do the same thing). I've also enabled it for recorded sites so we can compare them to the live sites.

Run 1: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=de72804bb88ef57153e3014231b25be03d2fd647
Run 2: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=d02a81ca2cb3eada8342b495e9b954a2ba2ff023
Run 3: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=4ead0c6703b04ac393ec452e760c206f7df39032

Hi Greg,

This is interesting.

(In reply to Greg Mierzwinski [:sparky] from comment #0)

... So when changes happen in live sites data, we currently can't tell what might have caused it which prevents live sites from providing any value.

Just to nitpick here :)
I know you're talking about catching regression and performance improvements, but I believe that the value of live site testing is in being able to compare separate binaries (or preferences) against each other in a realistic environment. (Running the tests at the same time, or as close together as possible).

1. Find the last live site test that ran and get the videos.
2. For each of the 15x15 video pairings (effectively 225 trials) build up a cross-correlation matrix:
    1. Calculate the two histograms of the entire videos (using 255 bins).
    2. Run a correlation between these two histograms.
3. Average the cross-correlation matrix produced in (2) to obtain the similarity score.

I wonder if histogram comparisons would miss important changes to the pages?
e.g. the same way that PerceptualSpeedIndex was added to use structural similarity index instead of histograms?
I suppose you could compare the videos based on either of these (histogram, SSIM)

(In reply to Andrew Creskey [:acreskey] [he/him] from comment #3)

I know you're talking about catching regression and performance improvements, but I believe that the value of live site testing is in being able to compare separate binaries (or preferences) against each other in a realistic environment. (Running the tests at the same time, or as close together as possible).

Right, but what would it mean for the differences between the binaries to change? At the moment, we can't tell what it means - it's either a change in content that caused it, or a performance difference. With this metric, we'll be able to tell.

We should expand the uses for live sites now that we have them - there's a lot more we can do with it.

1. Find the last live site test that ran and get the videos.
2. For each of the 15x15 video pairings (effectively 225 trials) build up a cross-correlation matrix:
    1. Calculate the two histograms of the entire videos (using 255 bins).
    2. Run a correlation between these two histograms.
3. Average the cross-correlation matrix produced in (2) to obtain the similarity score.

I wonder if histogram comparisons would miss important changes to the pages?
e.g. the same way that PerceptualSpeedIndex was added to use structural similarity index instead of histograms?
I suppose you could compare the videos based on either of these (histogram, SSIM)

SSIM is defined for 2D images, or 3D volumes of the same size. We aren't dealing with 2D images here; we have 3D volumes with differing shapes. I could resample the images, but SSIM is much more sensitive to pixel differences (especially the ordering of the video frames), so the metric would also have higher variability.

The benefit of histograms is that it doesn't matter how the frames are painted: as long as we see approximately the same content - in terms of pixel-value counts - we get a high similarity. It won't tell us where in the page load pipeline the issue can be found, only that there either is an issue or isn't. I don't think we should look at the ordering of frame paints here, since that would introduce a lot of variability and make this metric less useful.
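To illustrate the order-invariance point: shuffling the frame order leaves the whole-video histogram unchanged, so the histogram correlation is unaffected by when content gets painted. A toy NumPy sketch (not the patch's code):

```python
import numpy as np

rng = np.random.default_rng(42)
video = rng.integers(0, 256, size=(30, 8, 8))  # 30 frames of 8x8 grayscale

hist_original, _ = np.histogram(video, bins=255, range=(0, 255))

# Shuffle the frame order: the per-frame content is identical, only the
# ordering differs, so the whole-video histogram is exactly the same.
shuffled = video[rng.permutation(len(video))]
hist_shuffled, _ = np.histogram(shuffled, bins=255, range=(0, 255))

assert np.array_equal(hist_original, hist_shuffled)
```

SSIM computed frame-by-frame on the same pair of videos would not have this property, which is why it would pick up paint-order noise.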

The only issue I see with this approach is when content changes produce the same distribution pattern, but that would be extremely rare. If we upload the worst video pairing as artifacts, it's easy for us to check whether this has occurred (if needed).
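Finding that worst pairing from the cross-correlation matrix is a one-liner. A sketch using a made-up 3x3 matrix in place of the real 15x15 one:

```python
import numpy as np

# Hypothetical cross-correlation matrix (rows: old-run videos,
# columns: new-run videos); the real one would be 15x15.
corr = np.array([[0.90, 0.80, 0.95],
                 [0.70, 0.20, 0.85],
                 [0.90, 0.60, 0.90]])

# Indices of the least-similar pairing, i.e. the two videos that would
# be most worth uploading as artifacts for manual inspection.
worst_old, worst_new = np.unravel_index(np.argmin(corr), corr.shape)
```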

Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/942a7f8ede44
Implement similarity score in visual metrics tasks. r=tarek
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla78
Blocks: 1636682