Figure out how to deal with modules used from C++ when moving to `moz-src`
Categories
(Firefox :: General, task)
Tracking
()
People
(Reporter: Gijs, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
We recently had the unfortunate privilege of finding out that some installs of Firefox get left in a state where the version of code in omni.ja
does not match the version in xul.dll
. See bug 1941972 and deps.
This affects the switch to moz-src
insofar as any switches to different URLs also have to be made in C++ code. This would open up the possibility that the version of xul.dll tries to use the moz-src
URL but omni.ja does not have it, or vice versa, that the xul.dll
version is still trying to access the module at the old resource
URL and the omni.ja
has moved it to its new moz-src
location.
At time of writing this would affect at least the modules imported in the "Core code" section of this searchfox query:
resource:///modules/OpenTabsProvider.sys.mjs
resource:///modules/BrowserUsageTelemetry.sys.mjs
(see also bug 1950892)resource://gre/modules/GeckoViewServiceWorker.sys.mjs
resource://gre/modules/Region.sys.mjs
(though that's payment requests, which I think might be turned off everywhere?)resource://gre/modules/txEXSLTRegExFunctions.sys.mjs
resource://gre/modules/ExtensionProcessScript.sys.mjs
resource://gre/modules/WebNavigation.sys.mjs
resource:///modules/BrowserWindowTracker.sys.mjs
resource://gre/modules/AppConstants.sys.mjs
(though that is preprocessed, x-ref bug 1951642)
It isn't clear to me if this is also affects any JS module registered as a component that then may get invoked from C++. I fear that it may, which considerably widens the scope of affected modules. If it were just the set above, we could conceivably ship some of them in two locations in omni.ja to mitigate (and/or potentially make the C++ code check both locations). If this affected all components, that no longer seems like a feasible mitigation, so it would be helpful to know if this is the case.
I also don't know where we are in assessing the size of this problem (i.e. number of affected clients) and/or mitigating it otherwise. Chris, do you know the answers to any of this?
Reporter | ||
Comment 1•22 days ago
|
||
Needinfo from comment 0 seems to have gotten lost...
Comment 2•21 days ago
|
||
This is the page where I'm documenting the issue, its effects on users, and what we're doing to fix the issue:
https://docs.google.com/document/d/15M-QYfJIUm_3KXY_BhhjUlotT00YFQnrZZRmfL9QbMQ/edit?tab=t.0
This query shows the count of times where the updater noticed that the version of a file on disk was not what was expected for the update being applied: https://sql.telemetry.mozilla.org/queries/105090/source?p_day_count=14
Reporter | ||
Comment 3•16 days ago
|
||
(In reply to Chris DuPuis from comment #2)
This is the page where I'm documenting the issue, its effects on users, and what we're doing to fix the issue:
https://docs.google.com/document/d/15M-QYfJIUm_3KXY_BhhjUlotT00YFQnrZZRmfL9QbMQ/edit?tab=t.0
Thanks. I would like to understand to what degree this might affect JS modules registered as components:
It isn't clear to me if this is also affects any JS module registered as a component that then may get invoked from C++. I fear that it may, which considerably widens the scope of affected modules.
So this would be the hits from StaticComponents
in the searchfox query from comment 0. Can you comment on this or redirect to someone who could?
This query shows the count of times where the updater noticed that the version of a file on disk was not what was expected for the update being applied: https://sql.telemetry.mozilla.org/queries/105090/source?p_day_count=14
This is suffering from the "lonely number" problem - what is the denominator here? Without that context I don't know how to interpret these numbers - 10k errors a day feels like a lot but I don't know if that's out of 1 million or 10 million or 100 million attempts (being somewhat facetious... but the other number is clearly important!). As Nick noted in the doc, having numbers per client ID would also be helpful (are a small set of clients repeatedly failing to update or are all of them failing once) but I don't know if that is straightforward or not.
Comment 4•8 days ago
|
||
:Gijs (he/him) This might help with understanding the problem.
The "CRC_ERROR" telemetry only happens when the updater is able to detect that an installation already has one or more mismatched files, but its detection ability is very limited. It can't detect cases where the browser simply won't launch anymore (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1938035). It also can't detect cases where there is a mismatched file that is not part of the partial MAR being applied. So this means that any number is undercounting the real frequency.
In the cases where the CRC_ERROR is detected, we fall back to doing a full MAR update, which overwrites all installation files with the correct version. This is expected to clear up the issue. So, the few tens of thousands of instances seen per day SHOULD represent unique users whose installations have just been repaired. My interpretation is that the denominator should be the population of users, not the number of attempts.
Comment 5•8 days ago
|
||
As for the impact: this is not really a C++ vs. JS issue.
The problem occurs any time we introduce a change in one file (such as C:\Program Files\Mozilla Firefox\omni.ja) that relies on a change in another file (such as C:\Program Files\Mozilla Firefox\browser\omni.ja). If one file is updated and the other is not, the feature breaks.
Reporter | ||
Comment 6•8 days ago
|
||
(In reply to Chris DuPuis from comment #4)
:Gijs (he/him) This might help with understanding the problem.
The "CRC_ERROR" telemetry only happens when the updater is able to detect that an installation already has one or more mismatched files, but its detection ability is very limited. It can't detect cases where the browser simply won't launch anymore (e.g. https://bugzilla.mozilla.org/show_bug.cgi?id=1938035). It also can't detect cases where there is a mismatched file that is not part of the partial MAR being applied. So this means that any number is undercounting the real frequency.
In the cases where the CRC_ERROR is detected, we fall back to doing a full MAR update, which overwrites all installation files with the correct version. This is expected to clear up the issue. So, the few tens of thousands of instances seen per day SHOULD represent unique users whose installations have just been repaired. My interpretation is that the denominator should be the population of users, not the number of attempts.
Thanks for the added context. Unfortunately although it helps understand what the telemetry represents it makes me more confused about the size of the underlying problem. Specifically, your comments make it sound like we would expect that 100% of the numbers from comment 2 are actually fixed by the updater so users never notice a problem, because the updater detects and fixes the problem (by applying the full MAR) before it becomes an actual, user-perceived problem.
But in bug 1941134 we found cases where users were experiencing issues (ie crashes) due to these types of mismatches. There it turned out that users were ending up in a state where the browser was started (so presumably an update had finished applying etc.) and there was still a mismatch - in fact, sometimes, a mismatch of several versions, which suggests that the problem persisted through several updates.
Do we have other telemetry or any other information that indicates how frequent that problem is, other than the frequency of crashes that we encountered in that bug? And/or do we have evidence to suggest that all the cases from that bug are ones where somehow applying the full MAR fails, ie that the numbers are likely a subset of the telemetry numbers from comment 2?
My question about a second number is also still not really answered: how many successful CRC checks happen each day, so I can compare with the number of errors and have some idea of what percentage of updates is failing? I have rough notions on the number of users we have but I don't know how to extrapolate that to number of updates (not all users will start their machine each day, the updater may or may not have connectivity to the internet at a given point, etc. etc.). It would be best to compare apples to apples.
(In reply to Chris DuPuis from comment #5)
As for the impact: this is not really a C++ vs. JS issue.
The problem occurs any time we introduce a change in one file (such as C:\Program Files\Mozilla Firefox\omni.ja) that relies on a change in another file (such as C:\Program Files\Mozilla Firefox\browser\omni.ja). If one file is updated and the other is not, the feature breaks.
Well, sure, but "the feature breaks" is less severe than "the browser refuses to start" because the C++ code is crashing.
The example you're giving here is actually one that would be somewhat improved by the switch to moz-src
as it moves all the relevant files into 1 omni.ja ... though on the way to get there, it is conceivable that module loads would fail, yes.
Comment 7•8 days ago
|
||
(In reply to :Gijs (he/him) from comment #6)
Thanks for the added context. Unfortunately although it helps understand what the telemetry represents it makes me more confused about the size of the underlying problem. Specifically, your comments make it sound like we would expect that 100% of the numbers from comment 2 are actually fixed by the updater so users never notice a problem, because the updater detects and fixes the problem (by applying the full MAR) before it becomes an actual, user-perceived problem.
Here's a concrete example:
- Partial MAR update from version 198.0.3 to 199.0
In this update, there is a change to browser/omni.ja, but when applying the update, one particular client installation fails to update browser/omni.ja.
Now you have an installation where most files are at version 199.0, but browser/omni.ja is at version 198.0.3.
However, as long as the differences between the 198.0.3 and 199.0 version are "compatible enough", the browser will be able to run. The user may notice problems, particularly if they expect some feature that was delivered in 199.0 to work.
Whether the old and new versions of a file are "compatible enough" is the difference between Firefox refusing to start or just some features looking odd.
- Partial MAR update from version 199.0 to 199.0.1
There is no change to browser/omni.ja in this patch, so the update process does not detect that the file is at the wrong version. The client installation is left in a mismatched state.
- Partial MAR update from version 199.0.1 to 200.0
There is a change to browser/omni.ja in this path. The update process notices that the installed version is incorrect, and downloads and applies a Full MAR update for version 200.0.
This fixes the client installation.
Comment 8•8 days ago
|
||
In the example above, if there was a code change in version 199.0.1 to any other file that depended on code that was introduced/changed in browser/omni.ja in version 199.0, then that code change would be broken.
This is what we saw in the incident.
Comment 9•7 days ago
|
||
The Bugbug bot thinks this bug should belong to the 'Toolkit::UI Widgets' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.
Updated•7 days ago
|
Description
•