Bug 1741022 Comment 3 Edit History

(In reply to :Gijs (he/him) from comment #2)
> - chunking the work in the startup-critical methods into named arrow functions
> - running those sequentially within `try...catch`
> - in the catch statement, send out a telemetry event with the name of the arrow function (so we can see where failures are in the wild), and in automation, crash the process so the test will fail.

This is what we are doing in other methods (idle, shutdown), and we should probably start from that. I don't know if crashing is necessary; I did it on shutdown because at that point the test harness is already gone, while on startup we could maybe just trigger a TEST-UNEXPECTED-FAIL handler. I also don't know whether crashing has big disadvantages for the test harness (might it be more expensive to handle due to stack walking?). A rough sketch of the pattern is below.
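Something like this (just a sketch, the task names are invented and not actual BrowserGlue init steps; `Cu.isInAutomation` is assumed to be reachable from this scope):

```js
// Each chunk of startup-critical work becomes a named arrow function, so
// task.name identifies the failing chunk.
const initHypotheticalComponentA = () => {
  // ... startup-critical work for component A ...
};
const initHypotheticalComponentB = () => {
  // ... startup-critical work for component B ...
};

for (let task of [initHypotheticalComponentA, initHypotheticalComponentB]) {
  try {
    task();
  } catch (ex) {
    console.error(`Startup task ${task.name} failed:`, ex);
    if (Cu.isInAutomation) {
      // In automation, make the harness notice instead of silently
      // continuing with a half-initialized browser; rethrowing here (or
      // crashing, as we do on shutdown) are both options.
      throw ex;
    }
  }
}
```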
The telemetry point is interesting; we could use a keyed scalar to report the name. We should probably also do this for the other places where we already use the same approach, and unify their behavior. A component could be written to manage this and be reused, e.g. along the lines of the helper sketched below.
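For illustration only (the helper and the scalar name are made up, and the scalar would still have to be registered before `keyedScalarAdd` records anything):

```js
// Hypothetical shared runner, usable from the startup, idle and shutdown
// paths. Failures are counted in a keyed scalar, keyed by the task's name.
function runTasksReportingFailures(scalarName, tasks) {
  for (let task of tasks) {
    try {
      task();
    } catch (ex) {
      console.error(ex);
      // e.g. scalarName = "browser.init.task_failures" (invented name).
      Services.telemetry.keyedScalarAdd(scalarName, task.name, 1);
      if (Cu.isInAutomation) {
        throw ex;
      }
    }
  }
}
```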

> The main downside I can think of is that if the process is in a bad state and continues trying to do more stuff, it could do more bad things and worsen the state of things... but we're already in a pretty bad place if this starts happening, so I'm not sure that should stop us.

Yes, there's no way to tell whether Component A failing will cause Component B to fail in a worse way, or cause long-term harm. For example, migrationUI will just not bump the migration version if one migration throws; it will do so once the migration is fixed, but in the meanwhile the new code that made us add the migration may end up running against the old state.
I don't think there's an easy solution to that.
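(Simplified sketch of that migration failure mode, with invented version numbers and a hypothetical migration function: if the migration throws, the pref bump at the end never runs, so every later startup keeps seeing the pre-migration state until the migration is fixed.)

```js
const UI_VERSION = 42;
let currentUIVersion = Services.prefs.getIntPref("browser.migration.version", 0);
if (currentUIVersion < UI_VERSION) {
  if (currentUIVersion < 42) {
    migrateSomethingHypothetical(); // may throw
  }
  // Only reached if all migrations above succeeded.
  Services.prefs.setIntPref("browser.migration.version", UI_VERSION);
}
```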
