XPCOM/XPConnect should do a "canary" load of a JS module (Was: fast startup crash in ServiceWorkerRegistrar::ProfileStarted)
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: sfink, Assigned: jstutte)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash-thunderbird, Whiteboard: [startupcrash] [qa-not-actionable])
Crash Data
Attachments
(4 files, 1 obsolete file)
Updated•6 years ago
|
Updated•6 years ago
|
Updated•6 years ago
|
Comment 2•6 years ago
|
||
Comment 3•6 years ago
|
||
Comment 4•6 years ago
|
||
Updated•6 years ago
|
Updated•6 years ago
|
Comment 5•6 years ago
|
||
Reporter | ||
Comment 6•6 years ago
|
||
Reporter | ||
Comment 7•6 years ago
|
||
Reporter | ||
Comment 8•6 years ago
|
||
Comment 9•6 years ago
|
||
Reporter | ||
Comment 10•6 years ago
|
||
Assignee | ||
Comment 11•5 years ago
|
||
(In reply to Steve Fink [:sfink] [:s:] from comment #10)
Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)
I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?
Comment 12•5 years ago
•
|
||
(In reply to Jens Stutte [:jstutte] from comment #11)
(In reply to Steve Fink [:sfink] [:s:] from comment #10)
Would it be worth filing a bug to report JS exceptions during startup
instead of swallowing them and crashing here? (I forget the exact path, but
I know there was an exception thrown that led to this.)I would second that, given that we cannot do much here in the Service Worker component, it seems. :asuth, what do you think?
I think this amounts to a request to create an infallible do_getService
variant called do_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.
Needinfo-ing :bholley the XPConnect owner because that's where the bulk of the XPConnect scenario lives.
Another possibility lower down would be that the lower level JAR protocol stuff should be performing some type of integrity check on omni.ja that validates some mixture of:
- not truncated
- passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
- verifies there's an expected build identifier signature baked in to various places in the file (possibly via sentinel zip/jar entries?)
Needinfo-ing :dragana as owner of Necko where Core::Networking: JAR lives.
:bholley/:dragana note that you may want to take a look at bug 1403348 which is where most of the diagnostic effort related to this problem has previously lived. (And where in https://bugzilla.mozilla.org/show_bug.cgi?id=1403348#c79 there are suggestions the problem isn't just an application update problem because it happens on Android, which potentially also means it's also not rogue anti-virus software.)
Comment 13•5 years ago
|
||
Michal, can you take a look?
Comment 14•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
I think this amounts to a request to create an infallible
do_getService
variant calleddo_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.
I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?
Comment 15•5 years ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
- passes some level of checksum/crc32/hash check, possibly lazily. (do we do things to trick the operating system to perform readahead that could be used for hashing purposes?)
Corrupted files should be detected when checking CRC at https://searchfox.org/mozilla-central/rev/e076e40ab1290f4e5e67ebd21dc8af753fc05be6/modules/libjar/nsJARInputStream.cpp#333
Comment 16•5 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #14)
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #12)
I think this amounts to a request to create an infallible
do_getService
variant calleddo_getServiceThatDefinitelyExists
which internalizes the error reporting logic we crammed into ServiceWorkerRegistrar because it's the first native code that tries to load a JS service and crashes instead of silently breaking.I think it would be reasonable to add a do_GetServiceInfallible, or somesuch, rather than sprinkling the all-is-lost logic in whatever callsites happen to run early in startup. Nathan, do you agree?
I'm not super-excited about this, just because of the potential for people putting it along paths that eventually get called during shutdown. Or people will start replacing code like:
nsCOMPtr<nsIObserverService> obsService = ...;
if (obsService) { ... }
with:
nsCOMPtr<nsIObserverService> obsService = do_GetServiceInfallible(...);
...stuff...
and then get bitten when that code runs in unexpected places.
If we wanted to make it a super-scary name, I guess it might be OK?
Assignee | ||
Comment 17•5 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #16)
If we wanted to make it a super-scary name, I guess it might be OK?
I would surely second that, and I find the current naming convention doXInfallible also used elsewhere always very confusing.
Comment 18•5 years ago
|
||
Hm, ok.
Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.
Perhaps right at [1], which occurs after XPConnect has been initialized [2], but before the doStartup call [3] that takes us into service workers?
[1] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4412
[2] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4404
[3] https://searchfox.org/mozilla-central/rev/cfd1cc461f1efe0d66c2fdc17c024a203d5a2fd8/toolkit/xre/nsAppRunner.cpp#4418
Comment 20•5 years ago
|
||
(In reply to Bobby Holley (:bholley) from comment #18)
Perhaps what we really want here is for XPConnect to do a "canary" service-load of the async shutdown service early in startup, before we start notifying random observers to start initializing Gecko components. That would at least prevent the problem from moving around if the observers start firing in a different order.
Since I think this proposed fix is in XPCOM, moving the bug there.
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 21•4 years ago
|
||
Should we remove the stalled keyword here since there's a possible way forward described in comment 18?
Comment 22•4 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #21)
Should we remove the stalled keyword here since there's a possible way forward described in comment 18?
Sure.
Assignee | ||
Comment 23•4 years ago
|
||
Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?
Comment 24•4 years ago
|
||
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
For example bp-4be5fed4-4b0c-4e2e-93a3-f13ab0201222
Comment 25•4 years ago
|
||
For about 98% of these crashes in the last 6 months, the nsresult error value for these crashes appears to be NS_ERROR_FILE_NOT_FOUND. The Windows error values that can return that are ERROR_FILE_NOT_FOUND, ERROR_PATH_NOT_FOUND, ERROR_INVALID_DRIVE, ERROR_NOT_READY.
Assignee | ||
Comment 26•4 years ago
|
||
This reminds me a bit of bug 1686041 - we probably would want to know, which of these OS error codes really cause this, in particular ERROR_NOT_READY would be something we cannot do much about, right?
Updated•3 years ago
|
Comment 27•3 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #24)
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
It is now lower, at #30
Comment 28•3 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #27)
(In reply to Wayne Mery (:wsmwk) from comment #24)
This signature is ranked #10 for Thunderbird 78. And so is a serious problem.
It is now lower, at #30
but higher for version 91. Currently #8 for Thunderbird 91.8.1, and looks like it has been high for all of version 91.
bp-9eadd4a9-3c42-43db-920d-8ee5f0220424 Mac
bp-b05e316c-36a7-4eb1-8598-cc5950220424 Windows
Updated•2 years ago
|
Comment 29•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #23)
Hi Nika, assuming that Nathan will not move this forward any more and being a quite frequent startup crash - can you get to this?
Nika, is still possible?
#15 crash for Firefox 103
#19 crash for Thunderibrd 102.0.3
Comment 30•2 years ago
|
||
There's not a ton actionable we can do here, other than potentially change the signature (like the bug title suggests) by moving the crash earlier during startup. This bug appears to be caused by part of the Firefox/Thunderbird install (the omnijar) not being present on the user's computer, perhaps due to a botched install or similar.
This is somewhat aligned with some of the comments on the bug which say things like "impossible d'ouvrir" (can't open), suggesting this is a persistent broken install issue.
It appears that the only practical way forward here is to somehow communicate to the user that their Firefox/Thunderbird install appears to be broken, and ask them to re-install, without crashing somewhere while trying to show said dialog. We risk losing information about how frequently this is happening if we do that though, as we won't be showing the crash report dialog, and telemetry is fairly dependent on the omnijar to work at all.
Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.
Comment 31•2 years ago
|
||
To double-check what we think the error is (omni.ja
missing) I made a copy of my install & deleted the omni.ja
file from it, getting this crash: https://crash-stats.mozilla.org/report/index/67dd89dc-d583-4b0e-bb62-b13730220802. It appears very similar to the other crashes here, so I'm inclined to believe that it is the problem.
Comment 32•2 years ago
•
|
||
(In reply to Nika Layzell [:nika] (ni? for response) from comment #30)
Redirecting a ni? to :bytesized in case this issue is something which the updater could help us automatically detect and/or fix a broken install in a reliable way.
I doubt that there is much that I can help with here, especially without more information about what is going wrong. The background updater probably can't fix a build that is crashing on startup because the background updater runs using the Firefox binary. So if the omnijar is missing, I don't see how it could start either.
If we are considering this to be a potential problem caused by the updater, we do already attempt to recover from bad updates. With staging enabled (the default), we should stage the whole new install into a subdirectory and move it into place when Firefox launches. If patching or extracting a file fails, staging should fail and the update will not be installed. If staging is not enabled we should roll back the changes if the update fails.
I have seen reports where it appears that this process fails. I'm not entirely sure why. I can't find the bug that I'm thinking of, but I remember the update log looking basically like this:
updating normally...
Some file is not patched successfully (permission denied error)
We start rolling back
Some of the backup files mysteriously don't exist so we are not able to roll back all files successfully.
I can only think of two possible reasons why something like this would happen: Some kind of antivirus weirdness, or two Firefox invocations managed to race so closely that they both started an updater and the two updaters interfered with each other. I've filed Bug 1783939 to do better at preventing the latter possibility. I'm not sure what we can do about the former.
I don't know how to explain the fact that the original error was a permission denied error. My best guess is that antivirus was involved. Even if the file was in use, we should still be able to move it out of the way.
Comment 33•2 years ago
|
||
There is some good news. Around April 1, Firefox crashes went into steep decline and by mid-May dropped by more than 30%. It looks to me like the majority of that reduction came from within the Windows user population.
Not possible to say whether Thunderbird will have a similar decline, for users going from esr 91 to esr 102.
There is a not insignficant Mac population
Comment 34•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criteria:
- Top 20 desktop browser crashes on release (startup)
- Top 20 desktop browser crashes on beta (startup)
For more information, please visit auto_nag documentation.
Updated•2 years ago
|
Comment 35•2 years ago
|
||
This bug would not actually fix the startup crashes which are occuring here (due to a missing omni.ja or corrupted omni.ja), and would instead just change the crash signature to something more consistent, so dropping to S3 for now.
There is unlikely to be a way to automatically repair broken installations during startup right now.
Comment 36•2 years ago
|
||
It's also worth noting that bug 1760855 migrating the async shutdown service be implemented in C++ for C++ services would also address/moot this bug.
Assignee | ||
Comment 37•8 months ago
•
|
||
I do not think the [@ nsTString<T>::get]
signature belongs here, at least from looking at some of the few instances.
Summarizing the status:
Comment 35, comment 30 and comment 31 seem to indicate that we would want to have an early check if omni.ja
exists and can be accessed such that we can inform the user about a broken install in case. The volume makes me think that it would be worth to have some comprehensive feedback for the affected users instead of crashing.
FWIW wrt comment 25 in the meantime we mapped ERROR_NOT_READY
to NS_ERROR_FILE_DEVICE_TEMPORARY_FAILURE
in D130905, so only ERROR_FILE_NOT_FOUND
, ERROR_PATH_NOT_FOUND
and ERROR_INVALID_DRIVE
remain here, which seem to be all permanent enough conditions to fall under the "omni.ja is missing" case.
Assignee | ||
Comment 38•8 months ago
|
||
Assignee | ||
Comment 39•8 months ago
|
||
Depends on D205026
Updated•8 months ago
|
Updated•8 months ago
|
Assignee | ||
Comment 40•8 months ago
|
||
FWIW wrt comment 25 in the meantime we mapped
ERROR_NOT_READY
toNS_ERROR_FILE_DEVICE_TEMPORARY_FAILURE
in D130905, so onlyERROR_FILE_NOT_FOUND
,ERROR_PATH_NOT_FOUND
andERROR_INVALID_DRIVE
remain here, which seem to be all permanent enough conditions to fall under the "omni.ja is missing" case.
Note that NS_ERROR_FILE_NOT_FOUND
is also generated by nsJAR
code for example here and by some other places in our code which are not going through ConvertWinError
.
Updated•8 months ago
|
Updated•8 months ago
|
Updated•8 months ago
|
Updated•7 months ago
|
Updated•7 months ago
|
Updated•7 months ago
|
Comment 41•6 months ago
|
||
Based on the topcrash criteria, the crash signatures linked to this bug are not in the topcrash signatures anymore.
For more information, please visit BugBot documentation.
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Updated•6 months ago
|
Updated•4 months ago
|
Comment 42•4 months ago
|
||
Assignee | ||
Updated•4 months ago
|
Comment 43•4 months ago
|
||
bugherder |
Assignee | ||
Comment 44•4 months ago
|
||
Well, fixed is a huge word. But it might make sense once the new crashes appear to open two distinct bugs, one for NS_ERROR_OMNIJAR_CORRUPT
and one for NS_ERROR_OMNIJAR_OR_DIR_MISSING
(if both appear). And we can hope that more people will take action if they see the popup and fix their installation or computer such that we will see less instances, of course.
Updated•4 months ago
|
Assignee | ||
Comment 45•3 months ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #36)
It's also worth noting that bug 1760855 migrating the async shutdown service be implemented in C++ for C++ services would also address/moot this bug.
Just for the records: Doing so would have just post-poned the problem to the next thing that wants to use the missing or corrupted omni.ja. Indeed the patch opted now to do just a canary load of the AppConstants.sys.mjs
ES module, so we can move the async shutdown service to C++ without problems, if we ever want to.
Comment 46•6 days ago
|
||
I don't see much reason to take this on ESR128, but I'd be open to an approval request if someone sees value in doing so.
Assignee | ||
Comment 47•6 days ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #46)
I don't see much reason to take this on ESR128, but I'd be open to an approval request if someone sees value in doing so.
Agreed, also because of the tree of dependencies of the regression this has caused on Android.
Description
•