We recently started running tests on live sites, but it's currently not possible to interpret the data because two major variables are colliding: (1) a shifting code base, and (2) shifting test pages. When changes show up in the live site data, we can't tell which of the two caused them, and that prevents live site testing from providing any value.
Yesterday, I came up with a metric that lets us capture whether content changes have occurred, i.e. determine how similar the recordings in a test are to those from the previous test run - in other words, a similarity metric.
The metric is calculated as follows (a rough code sketch follows the list):
1. Find the last live site test that ran and get the videos.
2. For each of the 15x15 video pairings (effectively 225 trials), build up a cross-correlation matrix:
   1. Calculate a histogram of each of the two entire videos (using 255 bins).
   2. Run a correlation between these two histograms.
3. Average the cross-correlation matrix produced in (2) to obtain the similarity score.
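To make that concrete, here's a minimal sketch of the calculation in Python. This is illustrative rather than the actual patch: it assumes OpenCV and numpy are available, and the function names are my own.

```python
import cv2
import numpy as np

def video_histogram(path, bins=255):
    """Build a single grayscale histogram over every frame of a video."""
    cap = cv2.VideoCapture(path)
    hist = np.zeros(bins, dtype=np.float64)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame_hist, _ = np.histogram(gray, bins=bins, range=(0, 255))
        hist += frame_hist
    cap.release()
    return hist.astype(np.float32)  # compareHist expects float32

def similarity_score(old_videos, new_videos):
    """Average the cross-correlation matrix over all video pairings."""
    old_hists = [video_histogram(v) for v in old_videos]
    new_hists = [video_histogram(v) for v in new_videos]
    corr = np.zeros((len(old_hists), len(new_hists)))
    for i, h1 in enumerate(old_hists):
        for j, h2 in enumerate(new_hists):
            # Pearson correlation between the two whole-video histograms
            corr[i, j] = cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL)
    return corr.mean()

# With 15 videos from the previous run and 15 from the current one,
# this produces the 15x15 matrix (225 pairings) described above.
```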
The similarity score ranges from 0 to 1, with both extremes indicating problems in the recording. A score of 0 is self-explanatory, but a correlation of 1 means the videos match too perfectly, and we are likely just recording a blank page.
Values strictly between 0 and 1 mean that the page load is working. Values approaching 1 mean the page has the same content, whereas values approaching 0 mean the page no longer has the same content as in the last test run. Thinking about this as a time series, a single dip in an otherwise high similarity score would indicate the run where a page changed content (the score would return to its usual value in the next run).
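As a toy illustration of that time-series idea (the helper and the threshold here are hypothetical, not part of the patch):

```python
def flag_content_changes(scores, threshold=0.5):
    """Flag runs where the score dips below a threshold and then
    recovers on the next run, i.e. a one-off content change."""
    flagged = []
    for i in range(1, len(scores) - 1):
        if scores[i] < threshold <= scores[i - 1] and scores[i + 1] >= threshold:
            flagged.append(i)
    return flagged

# e.g. flag_content_changes([0.9, 0.91, 0.3, 0.88, 0.9]) -> [2]
```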
That said, this similarity metric has much wider applicability; here's a list of all the applications I see for it:
- Consistently low scores indicate tests which are highly variable and not good candidates for testing.
  - You might have guessed by now that it captures the variability of a test in a single metric.
  - This makes it an ideal candidate to help with choosing tests to sheriff.
- Works for recorded and live sites - we could even compare live to recorded sites to determine the quality of the recording.
- No need to run a ton of tests to look at the data anymore; just run this and it will tell you how good or bad the test is for performance testing (as mentioned above, this has an effective trial count of 225 across two test runs). Large savings in terms of CI costs.
- We can now determine when live sites change, and when our product regresses on those live sites.
  - Using the score from other browsers: if we see a drop in all of them, the content changed; if it's only our product, then it's a regression (see the sketch after this list).
- Determining the quality of a page for live site testing, and continuously monitoring that quality, is simple.
  - It's now possible to expand which live sites we test because of this.
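Here's a small sketch of that cross-browser decision rule; the names are entirely hypothetical, it just encodes the reasoning above:

```python
def classify_score_drop(our_drop, other_browsers_drop):
    """Classify a similarity-score drop using other browsers as a control.

    our_drop: did our product's score drop on this run?
    other_browsers_drop: list of bools, one per other browser's score.
    """
    if our_drop and all(other_browsers_drop):
        # Every browser sees different content: the page itself changed.
        return "content change"
    if our_drop:
        # Only our product renders differently: likely a regression.
        return "possible regression"
    return "no change"

# e.g. classify_score_drop(True, [True, True])   -> "content change"
#      classify_score_drop(True, [False, False]) -> "possible regression"
```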
Now, one thing you might be wondering about is network variability. The neat thing about this technique is that if network variability or network changes occur during the test, and those changes affect page load performance, we will catch it: the number of frames we capture changes (as does how fast or slow the content in each frame changes), which changes the histogram, which changes the metric. If the network changes don't affect performance much, we won't see a change in the metric.
I wrote a quick patch to show this score in action: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=ddc7e148661363471aefda57eb2762befce39aa0&selectedTaskRun=NMTPyHhKRse5n2Nr9ICCpA-0
I've linked it to the expedia test, which has a very low score compared to the last test run - this indicates that it's a highly variable test and not an ideal candidate for performance testing. If you look at the google-accounts test, you'll see that it has a high score because there's nothing to it, so it's very consistent. Looking at ESPN, you'll see it also has a high score because, while it takes a long time to load, it loads consistently.
The actual meat of how the score is calculated is here: https://hg.mozilla.org/try/rev/557d010171a836c892bd8cfb610a9d72c2b066b9#l2.237
I'm going to rework this patch into something I can land later this week or next week so we can start exploring it in more depth in production, because it looks like it will be very useful for us.
That said, I'd like to open this up for discussion to see what everyone else thinks about it. Feel free to CC others here as well.