Bug 1602808 Comment 60 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

As promised here's the backout patch. I put it next to the `formatValue-cache` patch.

Here's the comparison using Pike's talos compare - https://pike.github.io/talos-compare/?revision=41cf2e99d8b0&revision=46577a1c65c8&revision=9254c4b7e01b (you have to remove the tests you don't need to see tabswitch for example).

Here's the perfherder view:

m-c vs backout: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=41cf2e99d8b098a638278403d986820f386923d9&newProject=try&newRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&framework=1

backout vs. formatValue-cache: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&newProject=try&newRevision=9254c4b7e01b5d76fd75281dbc7b08a28fee5cf7&framework=1

I understand the motivation for the backout, but, I'd like to take a moment here to talk about the culture of making decisions based on imperfect talos data:

Talos has been notoriously noisy and imperfect. It's a useful tool for us, but it's very far away from a tool that would reliably show data that we can blindly translate into "this impacts user experience". There are many factors contributing to that, but the level of noise, hidden factors (like differences in machines that the test happens to be performed in in the cloud) and non-normal distributions being measured as if they were normal are the main ones.

The reason I brought it up here is that we spent 4 months, with non-insignificant amount of time invested at least by me, but likely by others in this thread as well, chasing a statistically significant regression of ~4-6% (linux shippable)/~2-3% (windows shippable) in a test with ~2.5% stdev. This test *seems* to be bi or multimodal which puts the significance results in question.

***
To be clear - I don't doubt we are regression. I am not sure *how much* we really are regressing and if it's worth reverting a migration from a deprecated technology to Fluent for that reason.
***

***
With all that in mind. I still think it was worth chasing this bug. The test may be not perfect, the validity of the results may be in question, and the value of fixing this regression may be less than it appears to me from the reaction, but it's good to chase it and try to fix it.
***

What I struggle with is how absolute the notion of "regression has to be completely removed" is being treated in the light of all those imperfections of the test.
The caching patch I provided reduces the regression to almost completely 0% with a few exceptions on non-shippable builds spread across different tests and platforms.

We are left in a position to revert the current patch and work on an "improved" patch to chase the remaining  0.2ms on linux? 0.1ms on windows on tabswitch and fluctuating ?

With the level of stdev and hidden factors, I'm fairly confident that I can re-run all trys a couple times and find a combination where `formatValue+cache` doesn't produce any significance.
In fact the latest `backout vs. formatValue+cache` almost looks like it.

====

For that reason, while I'm offering the backout patch, I'm advocating for landing the `formatValue+cache` patch and considering this regression fixed enough. It's a zero-sum game and in the end, reducing technical debt is also a value that frees me to work on Fluent bindings performance.

NI's:

 - Gijs/Dao - can you look at the last talos results and either confirm that you want me to back out the patch and file a follow up to invest more time to improve the migration or that you agree with me on the `formatValue+cache` patch?
 - Pike - in case Gijs+Dao decide on the backout, can you confirm that we're ok with the backout?
 - mconley - can you look at the talos results and my take on the results and provide your take on how seriously you'd recommend us treating regressions of that magnitude (both the original regression and the reduced regression with the formatValue+cache patch)?
As promised here's the backout patch. I put it next to the `formatValue-cache` patch.

Here's the comparison using Pike's talos compare - https://pike.github.io/talos-compare/?revision=41cf2e99d8b0&revision=46577a1c65c8&revision=9254c4b7e01b (you have to remove the tests you don't need to see tabswitch for example).

Here's the perfherder view:

m-c vs backout: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=41cf2e99d8b098a638278403d986820f386923d9&newProject=try&newRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&framework=1

backout vs. formatValue-cache: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&newProject=try&newRevision=9254c4b7e01b5d76fd75281dbc7b08a28fee5cf7&framework=1

I understand the motivation for the backout, but, I'd like to take a moment here to talk about the culture of making decisions based on imperfect talos data:

Talos has been notoriously noisy and imperfect. It's a useful tool for us, but it's very far away from a tool that would reliably show data that we can blindly translate into "this impacts user experience". There are many factors contributing to that, but the level of noise, hidden factors (like differences in machines that the test happens to be performed in in the cloud) and non-normal distributions being measured as if they were normal are the main ones.

The reason I brought it up here is that we spent 4 months, with non-insignificant amount of time invested at least by me, but likely by others in this thread as well, chasing a statistically significant regression of ~4-6% (linux shippable)/~2-3% (windows shippable) in a test with ~2.5% stdev. This test *seems* to be bi or multimodal which puts the significance results in question.

***
To be clear - I don't doubt that there is a regression. I am not sure *how much* we really are regressing and if it's worth reverting a migration from a deprecated technology to Fluent for that reason.
***

***
With all that in mind. I still think it was worth chasing this bug. The test may be not perfect, the validity of the results may be in question, and the value of fixing this regression may be less than it appears to me from the reaction, but it's good to chase it and try to fix it.
***

What I struggle with is how absolute the notion of "regression has to be completely removed" is being treated in the light of all those imperfections of the test.
The caching patch I provided reduces the regression to almost completely 0% with a few exceptions on non-shippable builds spread across different tests and platforms.

We are left in a position to revert the current patch and work on an "improved" patch to chase the remaining  0.2ms on linux? 0.1ms on windows on tabswitch and fluctuating ?

With the level of stdev and hidden factors, I'm fairly confident that I can re-run all trys a couple times and find a combination where `formatValue+cache` doesn't produce any significance.
In fact the latest `backout vs. formatValue+cache` almost looks like it.

====

For that reason, while I'm offering the backout patch, I'm advocating for landing the `formatValue+cache` patch and considering this regression fixed enough. It's a zero-sum game and in the end, reducing technical debt is also a value that frees me to work on Fluent bindings performance.

NI's:

 - Gijs/Dao - can you look at the last talos results and either confirm that you want me to back out the patch and file a follow up to invest more time to improve the migration or that you agree with me on the `formatValue+cache` patch?
 - Pike - in case Gijs+Dao decide on the backout, can you confirm that we're ok with the backout?
 - mconley - can you look at the talos results and my take on the results and provide your take on how seriously you'd recommend us treating regressions of that magnitude (both the original regression and the reduced regression with the formatValue+cache patch)?
As promised here's the backout patch. I put it next to the `formatValue-cache` patch.

Here's the comparison using Pike's talos compare - https://pike.github.io/talos-compare/?revision=41cf2e99d8b0&revision=46577a1c65c8&revision=9254c4b7e01b (you have to remove the tests you don't need to see tabswitch for example).

Here's the perfherder view:

m-c vs backout: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=41cf2e99d8b098a638278403d986820f386923d9&newProject=try&newRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&framework=1

backout vs. formatValue-cache: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=46577a1c65c867a5ce8e5e0397c1cf5d5e8c2180&newProject=try&newRevision=9254c4b7e01b5d76fd75281dbc7b08a28fee5cf7&framework=1

I understand the motivation for the backout, but, I'd like to take a moment here to talk about the culture of making decisions based on imperfect talos data:

Talos has been notoriously noisy and imperfect. It's a useful tool for us, but it's very far away from a tool that would reliably show data that we can blindly translate into "this impacts user experience". There are many factors contributing to that, but the level of noise, hidden factors (like differences in machines that the test happens to be performed in in the cloud) and non-normal distributions being measured as if they were normal are the main ones.

The reason I brought it up here is that we spent 4 months, with non-insignificant amount of time invested at least by me, but likely by others in this thread as well, chasing a statistically significant regression of ~4-6% (linux shippable)/~2-3% (windows shippable) in a test with ~2.5% stdev. This test *seems* to be bi or multimodal which puts the significance results in question.

***
To be clear - I don't doubt that there is a regression. I am not sure *how much* we really are regressing and if it's worth reverting a migration from a deprecated technology to Fluent for that reason.
***

***
With all that in mind. I still think it was worth chasing this bug. The test may be not perfect, the validity of the results may be in question, and the value of fixing this regression may be less than it appears to me from the reaction, but it's good to chase it and try to fix it.
***

What I struggle with is how absolute the notion of "regression has to be completely removed" is being treated in the light of all those imperfections of the test.
The caching patch I provided reduces the regression to almost completely 0% with a few exceptions on non-shippable builds spread across different tests and platforms.

We are left in a position to revert the current patch and work on an "improved" patch to chase the remaining  0.2ms on linux, 0.1ms on windows on tabswitch and fluctuating.

With the level of stdev and hidden factors, I'm fairly confident that I can re-run all trys a couple times and find a combination where `formatValue+cache` doesn't produce any significance.
In fact the latest `backout vs. formatValue+cache` almost looks like it.

I believe that our talos harness is not misleading us to show false positives, but has a tendency to allow us to exaggerate minor regressions (say, 0.5-1.5%) into significant ones (2.0%-3.5%) chase red herrings - not in a sense of non-existing regressions, but minor ones as if they were major.
In other words, if we're ok paying 1.0% for migrating this to Fluent and moving on to work on things that will improve the Fluent integration with Gecko (see bug 1613705 comment 2 for potential wins), then I think this patch is there.


====

For that reason, while I'm offering the backout patch, I'm advocating for landing the `formatValue+cache` patch and considering this regression fixed enough. It's a zero-sum game and in the end, reducing technical debt is also a value that frees me to work on Fluent bindings performance.

NI's:

 - Gijs/Dao - can you look at the last talos results and either confirm that you want me to back out the patch and file a follow up to invest more time to improve the migration or that you agree with me on the `formatValue+cache` patch?
 - Pike - in case Gijs+Dao decide on the backout, can you confirm that we're ok with the backout?
 - mconley - can you look at the talos results and my take on the results and provide your take on how seriously you'd recommend us treating regressions of that magnitude (both the original regression and the reduced regression with the formatValue+cache patch)?

Back to Bug 1602808 Comment 60