Discuss implementing Frustration-Index (FI) metric
Categories
(Testing :: Performance, enhancement, P3)
Tracking
(Not tracked)
People
(Reporter: sparky, Unassigned)
Details
So I recently stumbled upon this article (which in turn discusses another article) about something called a Frustration-Index (FI).
It's basically a way of pooling multiple metrics into a single value from 0-100% that attempts to gauge how frustrated a user might be with the page. Mathematically, it's the magnitude of an N-dimensional vector, where each vector component is the difference between the timings of two metrics. That also makes it easy to fold more information into the index: just add dimensions (you only need to decide where the new dimension goes).
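For concreteness, here's a minimal sketch of how I read that definition; the metric names, the pairings, and the lack of normalization to 0-100% are my own assumptions for illustration, not necessarily what the article uses:

    import math

    # Hypothetical timings (ms) for one pageload; any of the metrics we
    # already collect could be plugged in here.
    timings = {"ttfb": 200, "fcp": 900, "visual_complete": 1500, "tti": 2600}

    # Each pair defines one dimension of the vector: the gap between two metrics.
    # Adding another pair simply adds another dimension.
    pairs = [("ttfb", "fcp"), ("fcp", "visual_complete"), ("visual_complete", "tti")]

    def frustration_index(timings, pairs):
        """Magnitude of the vector of metric-to-metric gaps (unnormalized)."""
        components = [timings[b] - timings[a] for a, b in pairs]
        return math.sqrt(sum(c * c for c in components))

    print(frustration_index(timings, pairs))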
I got curious to see if we can measure this, and I found that we can. Here's a try run with the implementation in run-visual-metrics.py (so I can make use of our visual-metrics in the calculation): https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=edd5f411b2ee83d14df1172c0f3b38222151dc05&selectedJob=298170598
It's revealing some interesting stuff as well. Here's a graph for cold pageload tests on amazon: https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=try,2349497,1,13&timerange=1209600&zoom=1587173081062,1587173085250,8.26863974464758,10.529011614541188
And here's a graph for warm pageloads: https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=try,2349496,1,13&timerange=1209600&zoom=1587173023366,1587173148572,5.479826462240979,17.780343897143858
What you'll notice from them is that (1) warm pageloads have a higher FI in comparison to cold, and (2) cold pageload FI has a much lower variability in comparison to warm.
I find this result both interesting and concerning. It could either mean that cold pageloads are nicer than warm ones (which would be surprising) or that the frustration index is a poor indicator of performance.
I'd like to hear what everyone thinks about adding this metric to our visual-metrics and monitoring it for a few months to see what it looks like in production.
I think it might end up being useful because it captures many things in one value, which could be a good indicator that something is wrong and that someone should dig deeper into more granular data.
Reporter
Comment 1•5 years ago
I played with this idea a little more yesterday, and I realized that it's really similar to the Root Mean Square Error (RMSE), which was easy to implement: https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=a4cbd8b190ed3c9a262a2764a7935c7304274303&selectedJob=298205522
The difference is that FI tracks the differences between successive data points; in other words, its expected value for each point is simply the next data point's value. With the RMSE implementation that I did, the expected value is more realistic: it's found by interpolating on the line built from the points (0,0) and (total_points, max(loadeventend, ttfi)) - the max y point could be something else too. This is essentially an inverse score of how well the data fits that straight line: the worse the fit, the higher the score, and vice versa. To me, FI is okay, but it's statistically biased towards the largest difference (because it doesn't take an average), whereas RMSE takes all of the differences into account.
I wondered if a linear fit estimation with RMSE was odd to use here, but I think it makes a lot of sense. These timings should be monotonically increasing in a linear or logarithmic fashion; if they aren't, then there is a problem, and RMSE tracks exactly that. With FI, it's entirely possible for the entire value to be driven by a single large difference, so you don't get the full picture of how a user might feel - RMSE would incorporate the differences from multiple points in that case (so instead of a score that is "not bad" you would get a score that is "really bad", and rightfully so).
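To make the comparison concrete, here's a rough sketch of that RMSE variant; the metric values and the exact spacing of the points along the line are assumptions on my part:

    import math

    def rmse_fit_score(timings):
        """RMSE of the observed timings against the straight line through
        (0, 0) and (total_points, max timing), i.e. how badly the data fits
        a steady linear progression. The worse the fit, the higher the score."""
        values = sorted(timings)
        n = len(values)
        y_max = values[-1]  # e.g. max(loadeventend, ttfi); could be something else
        # Expected value for point i, interpolated along the line.
        expected = [y_max * (i + 1) / n for i in range(n)]
        return math.sqrt(sum((o - e) ** 2 for o, e in zip(values, expected)) / n)

    # Hypothetical metric timings (ms) in the order they fire.
    print(rmse_fit_score([200, 900, 1500, 2600]))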
Comment 2•5 years ago
Interesting.
I didn't read all of the FI links although I've seen it brought up elsewhere too.
I would be curious as to what UX folks think of this metric.
Also, note that SpeedIndex, PerceptualSpeedIndex, and ContentfulSpeedIndex each attempt to measure the overall quality of the visual experience (i.e. not just when a single event took place).
One cause of our warm page load variability is that when a given network resource is no longer served from the cache, there is a very significant spike in load time. (A bit over a year ago I made prototypes that used a "generous cache policy", and it really flattened the results, but it breaks spec.)
Reporter
Comment 3•5 years ago
I think the *SpeedIndex variations might be better to use.
I spent some time thinking about the cold vs. warm frustration index differences, which don't make any sense: FI is simply a poor predictor of frustration.
It uses differences to produce the metric, and that's where the issue lies, because differences don't take the actual values of the various metrics into account. We know that a warm pageload is always at least as fast as a cold pageload, so a warm pageload should have a lower or equal FI if we are actually trying to model user frustration. An FI that is higher during warm pageloads is enough proof that this metric does not model user frustration.
To improve this and make it more realistic, each metric used in the calculation would need to be weighted by something that takes the actual value into account. Alternatively, I think that using percent-change differences (in either RMSE or FI) would solve this problem without any weightings.
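As a rough sketch of the percent-change idea (the metric values here are hypothetical, and whether this actually fixes the cold vs. warm inversion would need to be verified on try):

    import math

    def fi_percent_change(ordered_timings):
        """FI-style magnitude, but each component is the percent change between
        successive metric timings rather than the absolute gap, so the score
        reflects relative slowdowns instead of raw milliseconds."""
        comps = [
            (cur - prev) / prev * 100.0
            for prev, cur in zip(ordered_timings, ordered_timings[1:])
            if prev > 0
        ]
        return math.sqrt(sum(c * c for c in comps))

    # Hypothetical cold vs. warm timings (ms) for the same set of metrics.
    cold = [300, 1200, 1800, 3000]
    warm = [100, 450, 600, 1000]
    print(fi_percent_change(cold), fi_percent_change(warm))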
EDIT: IMO, a real "frustration" score won't be this simple. To do it properly, we would need to take EEG measurements and build a model of the frustration responses using all of our metrics (not just 4). Note that I'm getting closer and closer to being able to do this modelling: we currently have no way to synchronize EEG with the browser, but I've nearly completed an integration of Lab Streaming Layer with Firefox to make that possible (I need this integration for some other projects as well).
Comment 4•5 years ago
Greg, you might find this PerfMatters presentation interesting:
Andrew Scheuermann of AirBnB :: One Number, Multiple Metrics
https://www.youtube.com/watch?v=e215_uiU3LQ&list=PLSmH2HL6l9pwQmSgpKFtWiISOXua3zq8I&index=3&t=0s
Reporter
Comment 5•5 years ago
Oh, that's really cool! Thanks for the link. It seems like the PPS (from AirBnB and Lighthouse) probably models frustration better than this frustration-index.
Updated•5 years ago
Description