Bug 1602808 Comment 61 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #60)

Note to needinfo-ees: I'm snipping for brevity; please see Zibi's full comment #60.

> As promised here's the backout patch. I put it next to the `formatValue-cache` patch.
> 
> Here's the comparison using Pike's talos compare - https://pike.github.io/talos-compare/?revision=41cf2e99d8b0&revision=46577a1c65c8&revision=9254c4b7e01b (you have to remove the tests you don't need to see tabswitch for example).

FWIW, I don't really understand how to read this. There is no x-axis labeling, and the numbers on either end have no units and also seem to make no sense (e.g. 0 on the left of most of the cpstartup runs - but not all of them, and exactly the same number (557) at the end?). Also, the scale on the x-axis is such that you cannot really see any detail for tests with small values (like tabswitch).

> The reason I brought it up here is that we spent 4 months, with non-insignificant amount of time invested at least by me, but likely by others in this thread as well, chasing a statistically significant regression of ~4-6% (linux shippable)/~2-3% (windows shippable) in a test with ~2.5% stdev. This test *seems* to be bi or multimodal which puts the significance results in question.

I don't see this bi- or multi-modality in any of the links; either way, I suggest you file a separate bug on this issue in Testing :: Talos, with details on what you perceive to be the problem with the test, and suggestions as to how to fix it. If there's a perf cliff (e.g. based on whether we finish some operation before or after a refresh driver tick, or we end up doing a jank-y thing either before or after the test's completion point) resulting in multi-modality, and we could either transform the test results to account for that, or investigate and fix that perf cliff, that seems valuable, orthogonal to this bug.

> To be clear - I don't doubt that there is a regression. I am not sure *how much* we really are regressing and if it's worth reverting a migration from a deprecated technology to Fluent for that reason.

We're not permanently reverting. We're reverting so we can re-migrate in a way that doesn't impact performance, or at least so we can understand what exactly the impact on performance is. You allude to talos being an imperfect mirror of user experience - which is true! - but at this point we do not seem to understand what the cause of the talos regression is, so we do not understand the perf impact, so it is also impossible to say what the actual impact on UX is. We've now spent nearly 4 months on this and still don't understand, so the safest course of action is to back out and figure that part out (or come up with a way of landing this switch that has no perf impact) without potentially negatively impacting performance for users. Really, we should probably have done this sooner - but we kept thinking "if only we do this other thing, maybe that'll fix the performance issues", and have been unsuccessful. I still have some ideas on how to improve things here, but we've spent so much time doing this that I don't want to pursue them without reverting first: Time to cut our losses.

> What I struggle with is how absolute the notion of "regression has to be completely removed" is being treated in the light of all those imperfections of the test.
> The caching patch I provided reduces the regression to almost completely 0% with a few exceptions on non-shippable builds spread across different tests and platforms.

perfherder still shows 1.6 and 2.7% regressions on tabswitch on shippable linux and windows builds, and importantly, we still don't *understand* what the reason for this is.

> I believe that our talos harness is not misleading us to show false positives, but has a tendency to allow us to exaggerate minor regressions (say, 0.5-1.5%) into significant ones (2.0%-3.5%) chase red herrings - not in a sense of non-existing regressions, but minor ones as if they were major.
> In other words, if we're ok paying 1.0% for migrating this to Fluent and moving on to work on things that will improve the Fluent integration with Gecko (see bug 1613705 comment 2 for potential wins), then I think this patch is there.

I don't think we're OK with paying 2% (I don't really buy the "exaggerate minor regressions" thing - talos is a machine, it has no intentions or "tendencies". The regression is around 2% in this push, it was around 4% in the previous push, so if anything I think this push is undercounting it), when we should be able to do the migration with 0 performance cost.

> For that reason, while I'm offering the backout patch, I'm advocating for landing the `formatValue+cache` patch and considering this regression fixed enough. It's a zero-sum game and in the end, reducing technical debt is also a value that frees me to work on Fluent bindings performance.

Well, the DTD implementation was very straightforward - there were a bunch of attributes, and whenever the title changes there were some straightforward operations based on those attributes and the content title, including some string concatenation, to produce the final title. We could do the same in fluent, but we haven't tried that.

> NI's:

You didn't leave any needinfos; I'll set them.

>  - Gijs/Dao - can you look at the last talos results and either confirm that you want me to back out the patch and file a follow up to invest more time to improve the migration or that you agree with me on the `formatValue+cache` patch?

I still think we should back out the patch and someone should file a follow-up to investigate the migration. You wrote elsewhere in this comment that you have other fluent-related work, and suggested in comment #56 that maybe we should ask the perf team. The actual fluent work here is very minimal (only a handful of strings), and the issue we're fixing is perf-related, so that seems like a reasonable way to proceed. I expect Mike or I could work on the follow-up.

>  - Pike - in case Gijs+Dao decide on the backout, can you confirm that we're ok with the backout?
>  - mconley - can you look at the talos results and my take on the results and provide your take on how seriously you'd recommend us treating regressions of that magnitude (both the original regression and the reduced regression with the formatValue+cache patch)?
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #60)

Note to needinfo-ees: I'm snipping for brevity; please see Zibi's full comment #60.

> As promised here's the backout patch. I put it next to the `formatValue-cache` patch.
> 
> Here's the comparison using Pike's talos compare - https://pike.github.io/talos-compare/?revision=41cf2e99d8b0&revision=46577a1c65c8&revision=9254c4b7e01b (you have to remove the tests you don't need to see tabswitch for example).

FWIW, I don't really understand how to read this. There is no x-axis labeling, and the numbers on either end have no units and also seem to make no sense (e.g. 0 on the left of most of the cpstartup runs - but not all of them, and exactly the same number (557) at the end?). Also, the scale on the x-axis is such that you cannot really see any detail for tests with small values (like tabswitch).

> The reason I brought it up here is that we spent 4 months, with non-insignificant amount of time invested at least by me, but likely by others in this thread as well, chasing a statistically significant regression of ~4-6% (linux shippable)/~2-3% (windows shippable) in a test with ~2.5% stdev. This test *seems* to be bi or multimodal which puts the significance results in question.

I don't see this bi- or multi-modality in any of the links; either way, I suggest you file a separate bug on this issue in Testing :: Talos, with details on what you perceive to be the problem with the test, and suggestions as to how to fix it. If there's a perf cliff (e.g. based on whether we finish some operation before or after a refresh driver tick, or we end up doing a jank-y thing either before or after the test's completion point) resulting in multi-modality, and we could either transform the test results to account for that, or investigate and fix that perf cliff, that seems valuable, orthogonal to this bug.

> To be clear - I don't doubt that there is a regression. I am not sure *how much* we really are regressing and if it's worth reverting a migration from a deprecated technology to Fluent for that reason.

We're not permanently reverting. We're reverting so we can re-migrate in a way that doesn't impact performance, or at least so we can understand what exactly the impact on performance is. You allude to talos being an imperfect mirror of user experience - which is true! - but at this point we do not seem to understand what the cause of the talos regression is, so we do not understand the perf impact, so it is also impossible to say what the actual impact on UX is. We've now spent nearly 4 months on this and still don't understand, so the safest course of action is to back out and figure that part out (or come up with a way of landing this switch that has no perf impact) without potentially negatively impacting performance for users. Really, we should probably have done this sooner - but we kept thinking "if only we do this other thing, maybe that'll fix the performance issues", and have been unsuccessful. I still have some ideas on how to improve things here, but we've spent so much time doing this that I don't want to pursue them without reverting first: Time to cut our losses.

> What I struggle with is how absolute the notion of "regression has to be completely removed" is being treated in the light of all those imperfections of the test.
> The caching patch I provided reduces the regression to almost completely 0% with a few exceptions on non-shippable builds spread across different tests and platforms.

perfherder still shows 1.7 and 2.89% regressions on tabswitch on shippable linux and windows builds, and importantly, we still don't *understand* what the reason for this is.

> I believe that our talos harness is not misleading us to show false positives, but has a tendency to allow us to exaggerate minor regressions (say, 0.5-1.5%) into significant ones (2.0%-3.5%) chase red herrings - not in a sense of non-existing regressions, but minor ones as if they were major.
> In other words, if we're ok paying 1.0% for migrating this to Fluent and moving on to work on things that will improve the Fluent integration with Gecko (see bug 1613705 comment 2 for potential wins), then I think this patch is there.

I don't think we're OK with paying 2% (I don't really buy the "exaggerate minor regressions" thing - talos is a machine, it has no intentions or "tendencies". The regression is around 2% in this push, it was around 4% in the previous push, so if anything I think this push is undercounting it), when we should be able to do the migration with 0 performance cost.

> For that reason, while I'm offering the backout patch, I'm advocating for landing the `formatValue+cache` patch and considering this regression fixed enough. It's a zero-sum game and in the end, reducing technical debt is also a value that frees me to work on Fluent bindings performance.

Well, the DTD implementation was very straightforward - there were a bunch of attributes, and whenever the title changes there were some straightforward operations based on those attributes and the content title, including some string concatenation, to produce the final title. We could do the same in fluent, but we haven't tried that.

> NI's:

You didn't leave any needinfos; I'll set them.

>  - Gijs/Dao - can you look at the last talos results and either confirm that you want me to back out the patch and file a follow up to invest more time to improve the migration or that you agree with me on the `formatValue+cache` patch?

I still think we should back out the patch and someone should file a follow-up to investigate the migration. You wrote elsewhere in this comment that you have other fluent-related work, and suggested in comment #56 that maybe we should ask the perf team. The actual fluent work here is very minimal (only a handful of strings), and the issue we're fixing is perf-related, so that seems like a reasonable way to proceed. I expect Mike or I could work on the follow-up.

>  - Pike - in case Gijs+Dao decide on the backout, can you confirm that we're ok with the backout?
>  - mconley - can you look at the talos results and my take on the results and provide your take on how seriously you'd recommend us treating regressions of that magnitude (both the original regression and the reduced regression with the formatValue+cache patch)?

Back to Bug 1602808 Comment 61