Crash in [@ mozilla::ErrorLoadingSheet]
Categories
(Core :: XPCOM, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr68 | --- | unaffected |
firefox-esr78 | --- | unaffected |
firefox79 | --- | unaffected |
firefox80 | blocking | verified |
firefox81 | blocking | verified |
firefox82 | --- | fixed |
People
(Reporter: jcristau, Assigned: alexical)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(7 files)
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-release+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details | Review |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review | |
47 bytes,
text/x-phabricator-request
|
Details | Review |
This bug is for crash report bp-0c00bee1-3fce-4e97-a43b-1a8a80200730.
Top 10 frames of crashing thread:
0 xul.dll mozilla::ErrorLoadingSheet layout/style/GlobalStyleSheetCache.cpp:529
1 xul.dll mozilla::GlobalStyleSheetCache::LoadSheet layout/style/GlobalStyleSheetCache.cpp:564
2 xul.dll mozilla::GlobalStyleSheetCache::LoadSheetURL layout/style/GlobalStyleSheetCache.cpp:509
3 xul.dll mozilla::GlobalStyleSheetCache::XULSheet layout/style/UserAgentStyleSheetList.h:37
4 xul.dll mozilla::GlobalStyleSheetCache::GlobalStyleSheetCache layout/style/GlobalStyleSheetCache.cpp:239
5 xul.dll static mozilla::GlobalStyleSheetCache::Singleton layout/style/GlobalStyleSheetCache.cpp:459
6 xul.dll mozilla::dom::Document::FillStyleSetUserAndUASheets dom/base/Document.cpp:2797
7 xul.dll mozilla::dom::Document::CreatePresShell dom/base/Document.cpp:6192
8 xul.dll nsDocumentViewer::InitPresentationStuff layout/base/nsDocumentViewer.cpp:712
9 xul.dll nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:918
Seeing a number of crashes in 80.0b1 with crash reason "LoadSheetSync failed with error 80520012 loading built-in stylesheet 'chrome://global/content/xul.css'"
Reporter | ||
Updated•4 years ago
|
Comment 1•4 years ago
|
||
That is NS_ERROR_FILE_NOT_FOUND
. That crash needs to happen very early at startup... Maybe broken update? Other than that not sure...
Comment 2•4 years ago
|
||
Hmm. Doesn't reproduce for me, either from an update or a new install of 80b1. I suppose this could be a corrupt omnijar, but I'd think that would get noticed even earlier than this (a CRC would fail in the ZIP reader, right?). And there's no correlation with one architecture, so a single bad update file seems impossible. I only know of one recent change to how we generate update files, but that was bug 1643211 and it landed for 79, and I can't think of anything we've done that affects how applying patches works in a very long time. I have no idea whether anything relevant around the CSS loader or GlobalStyleSheetCache has changed.
I think I'd have to see a copy of an affected installation to be able to say anything else useful.
Updated•4 years ago
|
Comment 4•4 years ago
|
||
sorry for the guesswork here, but could this be related to the changes from bug 1627075? (since it's happening on startup and changes the behaviour around omni.ja, which holds all these stylesheets)
Reporter | ||
Updated•4 years ago
|
Assignee | ||
Comment 5•4 years ago
|
||
It's definitely plausible. I'll poke at this today.
Reporter | ||
Comment 6•4 years ago
|
||
Can/should we somehow undo the work from bug 1627075 in 80? We're about to build the last beta, and these startup crashes are worrying.
Assignee | ||
Comment 7•4 years ago
|
||
I think we can pretty cleanly turn off the part which caches the zip central for the omnijars in the startupcache. If I had to bet where something might be breaking which was causing this, that is where it would be. (Given the error message being NS_ERROR_FILE_NOT_FOUND
). I'll put up a patch for that shortly.
Assignee | ||
Comment 8•4 years ago
|
||
This is a speculative fix for a crash we're seeing due to xul.css ostensibly
not existing. The theory is that xul.css does in fact exist and the cached
zip central for the omnijar is simply corrupt in some way. If it is corrupt in
this way, then there is a bigger issue, and we need to investigate deeper.
However, the benefit of this approach is that it is a very small and contained
patch which should be simple to uplift.
Comment 10•4 years ago
|
||
bugherder |
Updated•4 years ago
|
Reporter | ||
Comment 11•4 years ago
|
||
Unfortunately it looks like we got crashes with this signature in 81 after the patch landed, can you take a look at those?
Reporter | ||
Comment 12•4 years ago
|
||
(as per comment 6, I'm still worried about shipping 80 with these startup crashes)
Assignee | ||
Comment 13•4 years ago
|
||
To be clear, I am investigating. I think we have two options at this point.
-
Roll back all of the changes from bug 1627075 as well as all of the follow-up changes. This is somewhat risky because of the amount of code that needs to be rolled back, but it has a reasonable guarantee of working if done without error, on the assumption that the startup cache is responsible for these failures.
-
Roll back the part which caches
chrome.manifest
s in the startup cache. Given that we are seeingNS_ERROR_FILE_NOT_FOUND
this seems like the next most likely culprit. However since we don't know the underlying cause of why the cachedchrome.manifest
contents would be corrupted, there is no guarantee that other things are not corrupted and causing problems.
I am trying to get more information to propose a more informed decision WRT these options.
Comment 14•4 years ago
|
||
I'm going to move this to Core::XPCOM given the potential regressing bug, and the fact that nobody on layout is currently investigating. I don't want to mislead for future inquiries (I'm getting pinged out-of-band about it). Sounds like it's only sitting in layout because it was xul.css that was not found on startup.
Doug: Feel free to move it back if that's not the case.
Reporter | ||
Comment 15•4 years ago
|
||
(In reply to Doug Thayer [:dthayer] (he/him) from comment #13)
To be clear, I am investigating. I think we have two options at this point.
Roll back all of the changes from bug 1627075 as well as all of the follow-up changes. This is somewhat risky because of the amount of code that needs to be rolled back, but it has a reasonable guarantee of working if done without error, on the assumption that the startup cache is responsible for these failures.
Roll back the part which caches
chrome.manifest
s in the startup cache. Given that we are seeingNS_ERROR_FILE_NOT_FOUND
this seems like the next most likely culprit. However since we don't know the underlying cause of why the cachedchrome.manifest
contents would be corrupted, there is no guarantee that other things are not corrupted and causing problems.I am trying to get more information to propose a more informed decision WRT these options.
Option 1 sounds most likely to get us back to a known state for 80. Any chance you can get a patch prepped for this today so we can respin a release candidate ASAP and verify the crashes go away?
Assignee | ||
Comment 16•4 years ago
|
||
(In reply to Julien Cristau [:jcristau] from comment #15)
Option 1 sounds most likely to get us back to a known state for 80. Any chance you can get a patch prepped for this today so we can respin a release candidate ASAP and verify the crashes go away?
I should be able to get a patch up for that today, yes.
Assignee | ||
Updated•4 years ago
|
Assignee | ||
Comment 17•4 years ago
|
||
This patch is the result of hg backout on (nearly) every patch in bug 1627075
as well as its related bugs. Additionally two conflicts needed to be resolved,
one of which came from Nika Layzell's work to "Switch native remoteType values
to nsCString" (Bug 1650163), and Simon Giesecke's work to "Replace MakeSpan uses
by constructor calls" (Bug 1653335), both of which were trivially resolvable.
Comment 18•4 years ago
|
||
Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.
Assignee | ||
Comment 19•4 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)
Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.
Where is my head today? Right. Give me a moment.
Assignee | ||
Comment 20•4 years ago
•
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)
Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.
Just realized release has https://hg.mozilla.org/releases/mozilla-release/rev/7f437cdb68e0, and beta seems to not. So to be clear, I should then be applying this onto mozilla-release, and not mozilla-beta? (Since that linked patch would need to be rolled back as well)
Comment 21•4 years ago
|
||
Beta is closed until Fx81 merges to it on Monday, so you can safely ignore it. Please target mozilla-release.
Assignee | ||
Comment 22•4 years ago
•
|
||
Okay, the patch in the most recent review is now based on mozilla-release. Assuming it builds (which it should, but I haven't completed a build yet,) it should be good. There was only one conflict, which was trivial.
Updated•4 years ago
|
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Comment 23•4 years ago
•
|
||
Comment on attachment 9170746 [details]
Bug 1656261 - Back out all recent StartupCache changes r?froydnj
Beta/Release Uplift Approval Request
- User impact if declined: Speculatively, startup crashes in ErrorLoadingSheet.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): It's just a bit large, and is rolling back patches which were initially applied over a month ago. The rollback was fairly clean, however, so I would be surprised but not completely shocked if there were issues. Given that it's just a change to a cache, though (i.e., it intends to be transparent), it seems rather unlikely that anything since this landed has come to depend on it. EDIT: the alternative, other than declining to do anything, is listed in comment 13, but I would wager that that alternative is a little bit more risky.
- String changes made/needed: None
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Comment 24•4 years ago
|
||
Comment on attachment 9170746 [details]
Bug 1656261 - Back out all recent StartupCache changes r?froydnj
Reverts us to a hopefully known-good state to resolve a number of regressions caused by bug 1627075. Approved for 80.0rc2.
Comment 25•4 years ago
|
||
bugherder uplift |
Comment 26•4 years ago
|
||
No crashes from 80.0rc2, so it appears that the backout was effective. We'll still need to figure out what to do for 81+, however.
Comment 27•4 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #26)
No crashes from 80.0rc2, so it appears that the backout was effective. We'll still need to figure out what to do for 81+, however.
Can we try alternative 2 from comment 13 for 81?
Assignee | ||
Comment 28•4 years ago
|
||
I feel rather confident that comment 13 will work, at least for this crash. I am able to produce this crash by manually corrupting the StartupCache entry for the chrome.manifest
file in question. However it's unclear whether it would just open up a new crash down the line, given we do not understand how we are seeing bad content in the chrome.manifest
file.
As far as I can tell, there are three plausible ways we could be getting a bad chrome.manifest
:
- We're somehow giving the
StartupCache
a badchrome.manifest
buffer to begin with inPutBuffer
. - We're somehow writing bad things into the buffer we pass off to the
StartupCache
viaPutBuffer
after the fact. - We're somehow caching deflate-compressed omni.ja contents inside the
StartupCache
, and then handing that to the caller without inflating.
What's unlikely is that the startup cache's file is being corrupted on disk, as we look at the checksum of the contents when decompressing. It's also unlikely that the startup cache is just erroring in some way, as that would just cause us to fall back to reading directly out of the omni.ja file.
Assignee | ||
Comment 29•4 years ago
|
||
So far I have not found a way for this issue to cause bug 1656261, but
it's definitely plausible that there's something there. If we hit a
EXCEPTION_IN_PAGE_ERROR in GetBuffer, then we will exit the
DecompressEntry method with the entry's mData being set. In any
subsequent calls to GetBuffer for that key, we will return the
uninitialized memory. If we can hit this path with chrome.manifest,
then we would end up reading garbage from it, and could conclude that
xul.css is missing.
Whether this is the culprit or not, this is a problem that needs
addressing.
Updated•4 years ago
|
Comment 30•4 years ago
|
||
Hey Doug, can we back this out of 81 please? Also it would be good if we could put this behind a Nightly pref until it's ready to ship.
Assignee | ||
Comment 31•4 years ago
|
||
I am on PTO for the next week as we're moving out of state, but I should be able to throw up a backout patch this evening.
Comment 32•4 years ago
|
||
Comment 33•4 years ago
|
||
bugherder |
Reporter | ||
Updated•4 years ago
|
Assignee | ||
Comment 34•4 years ago
|
||
Ack - sorry, forgot to set the leave-open keyword. It's entirely possible that the most recent patch will not resolve this, so I'm putting it back to open. Again, still planning on just putting up a backout patch for this for beta - did not have time for it last night.
Assignee | ||
Updated•4 years ago
|
Comment 35•4 years ago
|
||
The patch landed in nightly and beta is affected.
:dthayer, is this bug important enough to require an uplift?
If not please set status_beta
to wontfix
.
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 36•4 years ago
|
||
This is basically the same thing as the last patch you saw for
release, just with more patches to back out and more unrelated
patches to work around.
Assignee | ||
Comment 37•4 years ago
|
||
I am going to continue looking at this for the next week. If I cannot find a clear smoking gun within that time, I will fully back out all of the recent startup cache changes on Nightly until I can sort out what's going wrong. If this timeline doesn't work for anyone, please let me know.
Assignee | ||
Comment 38•4 years ago
|
||
Comment on attachment 9173655 [details]
Bug 1656261 - Back out recent startupcache changes on beta r?froydnj
Beta/Release Uplift Approval Request
- User impact if declined: Startup crashes in ErrorLoadingSheet.
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): We already uplifted a similar patch to 80 release.
- String changes made/needed:
Comment 39•4 years ago
|
||
Comment on attachment 9173655 [details]
Bug 1656261 - Back out recent startupcache changes on beta r?froydnj
Approved for 81.0b6, thanks for putting this together.
Comment 40•4 years ago
|
||
bugherder uplift |
Assignee | ||
Comment 41•4 years ago
|
||
The nsZipCursor::ReadOrCopy API is a bit strange, and probably should be
revisited. Notably, if the compression on an item is STORED, and we are doing
a Copy, the returned value will always be null, and the caller is expected to
just get what they want from the buffer they passed into the nsZipCursor's
constructor. This functions all right, though it's quite unintuitive. However,
if we fail the CRC, the only observable consequence right now is that we
return a nullptr, which with a STORED Copy will happen anyway. Thus, we have
no means of knowing if we failed the CRC. This change simply zeroes out the
aBytesRead argument if we fail the CRC so we don't read garbage.
NOTE: this is linked to bug 1646261 because it was discovered while that bug
was being investigated. However, it's not a plausible underlying cause of
that issue.
Assignee | ||
Comment 42•4 years ago
|
||
Given all of the MOZ_TRYs in WriteToDisk, it's entirely possible that
we occasionally exit it before writing everything out. WriteToDisk
additionally adds guards to try to ensure that we do not call it twice,
and that we do not try to read things out of the file via the memmapped
buffer after it has written to it. The latter should not generally be
a problem, since we write to a temp file and move that over at the end.
However, writing twice could be a problem as we modify our values'
mOffsets to keep track of their offsets within the resulting file.
Accordingly, this change simply moves the cleanup and guards into an
RAII helper to ensure they survive all of the early returns.
Depends on D89251
Reporter | ||
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Comment 43•4 years ago
|
||
Comment hidden (offtopic) |
Comment 45•4 years ago
|
||
bugherder |
Assignee | ||
Comment 46•4 years ago
|
||
This backs out all work from bug 1627075 as well as all of its
descendents. There were a few conflicts when backing this out but
overall it was pretty clean, so I would say it's a fairly mild
level of risk. Historically Nathan Froyd has reviewed these patches,
but he is no longer at Mozilla, and no one else is particularly
familiar with the code, so I am passing this off to RyanVM who has
at least been familiar with the history of the bug.
Comment 47•4 years ago
|
||
Comment 48•4 years ago
|
||
bugherder |
Reporter | ||
Comment 49•4 years ago
|
||
It seems to me we should call this bug fixed, at this point. There's a few remaining crashes with this signature but they're presumably not related to the spikes we saw in 80 and 81.
Description
•