Closed Bug 1656261 Opened 4 years ago Closed 4 years ago

Crash in [@ mozilla::ErrorLoadingSheet]

Categories

(Core :: XPCOM, defect, P1)

Unspecified
Windows
defect

Tracking

()

RESOLVED FIXED
82 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox79 --- unaffected
firefox80 blocking verified
firefox81 blocking verified
firefox82 --- fixed

People

(Reporter: jcristau, Assigned: alexical)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(7 files)

This bug is for crash report bp-0c00bee1-3fce-4e97-a43b-1a8a80200730.

Top 10 frames of crashing thread:

0 xul.dll mozilla::ErrorLoadingSheet layout/style/GlobalStyleSheetCache.cpp:529
1 xul.dll mozilla::GlobalStyleSheetCache::LoadSheet layout/style/GlobalStyleSheetCache.cpp:564
2 xul.dll mozilla::GlobalStyleSheetCache::LoadSheetURL layout/style/GlobalStyleSheetCache.cpp:509
3 xul.dll mozilla::GlobalStyleSheetCache::XULSheet layout/style/UserAgentStyleSheetList.h:37
4 xul.dll mozilla::GlobalStyleSheetCache::GlobalStyleSheetCache layout/style/GlobalStyleSheetCache.cpp:239
5 xul.dll static mozilla::GlobalStyleSheetCache::Singleton layout/style/GlobalStyleSheetCache.cpp:459
6 xul.dll mozilla::dom::Document::FillStyleSetUserAndUASheets dom/base/Document.cpp:2797
7 xul.dll mozilla::dom::Document::CreatePresShell dom/base/Document.cpp:6192
8 xul.dll nsDocumentViewer::InitPresentationStuff layout/base/nsDocumentViewer.cpp:712
9 xul.dll nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:918

Seeing a number of crashes in 80.0b1 with crash reason "LoadSheetSync failed with error 80520012 loading built-in stylesheet 'chrome://global/content/xul.css'"

That is NS_ERROR_FILE_NOT_FOUND. That crash needs to happen very early at startup... Maybe broken update? Other than that not sure...

Flags: needinfo?(mhowell)

Hmm. Doesn't reproduce for me, either from an update or a new install of 80b1. I suppose this could be a corrupt omnijar, but I'd think that would get noticed even earlier than this (a CRC would fail in the ZIP reader, right?). And there's no correlation with one architecture, so a single bad update file seems impossible. I only know of one recent change to how we generate update files, but that was bug 1643211 and it landed for 79, and I can't think of anything we've done that affects how applying patches works in a very long time. I have no idea whether anything relevant around the CSS loader or GlobalStyleSheetCache has changed.

I think I'd have to see a copy of an affected installation to be able to say anything else useful.

Flags: needinfo?(mhowell)
Severity: -- → S2

heycam: Any ideas here?

Flags: needinfo?(cam)

sorry for the guesswork here, but could this be related to the changes from bug 1627075? (since it's happening on startup and changes the behaviour around omni.ja, which holds all these stylesheets)

Flags: needinfo?(dothayer)

It's definitely plausible. I'll poke at this today.

Flags: needinfo?(dothayer)

Can/should we somehow undo the work from bug 1627075 in 80? We're about to build the last beta, and these startup crashes are worrying.

Flags: needinfo?(dothayer)

I think we can pretty cleanly turn off the part which caches the zip central for the omnijars in the startupcache. If I had to bet where something might be breaking which was causing this, that is where it would be. (Given the error message being NS_ERROR_FILE_NOT_FOUND). I'll put up a patch for that shortly.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Flags: needinfo?(dothayer)

This is a speculative fix for a crash we're seeing due to xul.css ostensibly
not existing. The theory is that xul.css does in fact exist and the cached
zip central for the omnijar is simply corrupt in some way. If it is corrupt in
this way, then there is a bigger issue, and we need to investigate deeper.

However, the benefit of this approach is that it is a very small and contained
patch which should be simple to uplift.

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6eeb525ee65a
Disable StartupCaching of Omnijar zip central r=froydnj
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch

Unfortunately it looks like we got crashes with this signature in 81 after the patch landed, can you take a look at those?

Flags: needinfo?(dothayer)

(as per comment 6, I'm still worried about shipping 80 with these startup crashes)

To be clear, I am investigating. I think we have two options at this point.

  1. Roll back all of the changes from bug 1627075 as well as all of the follow-up changes. This is somewhat risky because of the amount of code that needs to be rolled back, but it has a reasonable guarantee of working if done without error, on the assumption that the startup cache is responsible for these failures.

  2. Roll back the part which caches chrome.manifests in the startup cache. Given that we are seeing NS_ERROR_FILE_NOT_FOUND this seems like the next most likely culprit. However since we don't know the underlying cause of why the cached chrome.manifest contents would be corrupted, there is no guarantee that other things are not corrupted and causing problems.

I am trying to get more information to propose a more informed decision WRT these options.

Flags: needinfo?(dothayer)

I'm going to move this to Core::XPCOM given the potential regressing bug, and the fact that nobody on layout is currently investigating. I don't want to mislead for future inquiries (I'm getting pinged out-of-band about it). Sounds like it's only sitting in layout because it was xul.css that was not found on startup.

Doug: Feel free to move it back if that's not the case.

Component: CSS Parsing and Computation → XPCOM

(In reply to Doug Thayer [:dthayer] (he/him) from comment #13)

To be clear, I am investigating. I think we have two options at this point.

  1. Roll back all of the changes from bug 1627075 as well as all of the follow-up changes. This is somewhat risky because of the amount of code that needs to be rolled back, but it has a reasonable guarantee of working if done without error, on the assumption that the startup cache is responsible for these failures.

  2. Roll back the part which caches chrome.manifests in the startup cache. Given that we are seeing NS_ERROR_FILE_NOT_FOUND this seems like the next most likely culprit. However since we don't know the underlying cause of why the cached chrome.manifest contents would be corrupted, there is no guarantee that other things are not corrupted and causing problems.

I am trying to get more information to propose a more informed decision WRT these options.

Option 1 sounds most likely to get us back to a known state for 80. Any chance you can get a patch prepped for this today so we can respin a release candidate ASAP and verify the crashes go away?

Flags: needinfo?(dothayer)

(In reply to Julien Cristau [:jcristau] from comment #15)

Option 1 sounds most likely to get us back to a known state for 80. Any chance you can get a patch prepped for this today so we can respin a release candidate ASAP and verify the crashes go away?

I should be able to get a patch up for that today, yes.

Flags: needinfo?(dothayer)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

This patch is the result of hg backout on (nearly) every patch in bug 1627075
as well as its related bugs. Additionally two conflicts needed to be resolved,
one of which came from Nika Layzell's work to "Switch native remoteType values
to nsCString" (Bug 1650163), and Simon Giesecke's work to "Replace MakeSpan uses
by constructor calls" (Bug 1653335), both of which were trivially resolvable.

Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.

Flags: needinfo?(dothayer)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)

Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.

Where is my head today? Right. Give me a moment.

Flags: needinfo?(dothayer)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #18)

Looks like this patch was created against m-c? It doesn't apply cleanly to Beta/Release where Fx80 resides.

Just realized release has https://hg.mozilla.org/releases/mozilla-release/rev/7f437cdb68e0, and beta seems to not. So to be clear, I should then be applying this onto mozilla-release, and not mozilla-beta? (Since that linked patch would need to be rolled back as well)

Flags: needinfo?(ryanvm)

Beta is closed until Fx81 merges to it on Monday, so you can safely ignore it. Please target mozilla-release.

Flags: needinfo?(ryanvm) → needinfo?(dothayer)

Okay, the patch in the most recent review is now based on mozilla-release. Assuming it builds (which it should, but I haven't completed a build yet,) it should be good. There was only one conflict, which was trivial.

Flags: needinfo?(dothayer)
Severity: S2 → S1
Status: REOPENED → ASSIGNED
Target Milestone: 81 Branch → ---
Keywords: regression
Regressed by: 1627075
Has Regression Range: --- → yes
Priority: -- → P1

Comment on attachment 9170746 [details]
Bug 1656261 - Back out all recent StartupCache changes r?froydnj

Beta/Release Uplift Approval Request

  • User impact if declined: Speculatively, startup crashes in ErrorLoadingSheet.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): It's just a bit large, and is rolling back patches which were initially applied over a month ago. The rollback was fairly clean, however, so I would be surprised but not completely shocked if there were issues. Given that it's just a change to a cache, though (i.e., it intends to be transparent), it seems rather unlikely that anything since this landed has come to depend on it. EDIT: the alternative, other than declining to do anything, is listed in comment 13, but I would wager that that alternative is a little bit more risky.
  • String changes made/needed: None
Attachment #9170746 - Flags: approval-mozilla-release?
Attachment #9169632 - Flags: approval-mozilla-release?
Flags: needinfo?(cam)
Attachment #9169632 - Flags: approval-mozilla-release?

Comment on attachment 9170746 [details]
Bug 1656261 - Back out all recent StartupCache changes r?froydnj

Reverts us to a hopefully known-good state to resolve a number of regressions caused by bug 1627075. Approved for 80.0rc2.

Attachment #9170746 - Flags: approval-mozilla-release? → approval-mozilla-release+

No crashes from 80.0rc2, so it appears that the backout was effective. We'll still need to figure out what to do for 81+, however.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #26)

No crashes from 80.0rc2, so it appears that the backout was effective. We'll still need to figure out what to do for 81+, however.

Can we try alternative 2 from comment 13 for 81?

I feel rather confident that comment 13 will work, at least for this crash. I am able to produce this crash by manually corrupting the StartupCache entry for the chrome.manifest file in question. However it's unclear whether it would just open up a new crash down the line, given we do not understand how we are seeing bad content in the chrome.manifest file.

As far as I can tell, there are three plausible ways we could be getting a bad chrome.manifest:

  1. We're somehow giving the StartupCache a bad chrome.manifest buffer to begin with in PutBuffer.
  2. We're somehow writing bad things into the buffer we pass off to the StartupCache via PutBuffer after the fact.
  3. We're somehow caching deflate-compressed omni.ja contents inside the StartupCache, and then handing that to the caller without inflating.

What's unlikely is that the startup cache's file is being corrupted on disk, as we look at the checksum of the contents when decompressing. It's also unlikely that the startup cache is just erroring in some way, as that would just cause us to fall back to reading directly out of the omni.ja file.

So far I have not found a way for this issue to cause bug 1656261, but
it's definitely plausible that there's something there. If we hit a
EXCEPTION_IN_PAGE_ERROR in GetBuffer, then we will exit the
DecompressEntry method with the entry's mData being set. In any
subsequent calls to GetBuffer for that key, we will return the
uninitialized memory. If we can hit this path with chrome.manifest,
then we would end up reading garbage from it, and could conclude that
xul.css is missing.

Whether this is the culprit or not, this is a problem that needs
addressing.

See Also: → 1661002

Hey Doug, can we back this out of 81 please? Also it would be good if we could put this behind a Nightly pref until it's ready to ship.

Flags: needinfo?(dothayer)

I am on PTO for the next week as we're moving out of state, but I should be able to throw up a backout patch this evening.

Flags: needinfo?(dothayer)
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e6d04f2dc9bc
Don't set scache mData until decompression done r=froydnj
Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 82 Branch

Ack - sorry, forgot to set the leave-open keyword. It's entirely possible that the most recent patch will not resolve this, so I'm putting it back to open. Again, still planning on just putting up a backout patch for this for beta - did not have time for it last night.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → ASSIGNED

The patch landed in nightly and beta is affected.
:dthayer, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(dothayer)

This is basically the same thing as the last patch you saw for
release, just with more patches to back out and more unrelated
patches to work around.

I am going to continue looking at this for the next week. If I cannot find a clear smoking gun within that time, I will fully back out all of the recent startup cache changes on Nightly until I can sort out what's going wrong. If this timeline doesn't work for anyone, please let me know.

Flags: needinfo?(dothayer)

Comment on attachment 9173655 [details]
Bug 1656261 - Back out recent startupcache changes on beta r?froydnj

Beta/Release Uplift Approval Request

  • User impact if declined: Startup crashes in ErrorLoadingSheet.
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We already uplifted a similar patch to 80 release.
  • String changes made/needed:
Attachment #9173655 - Flags: approval-mozilla-beta?

Comment on attachment 9173655 [details]
Bug 1656261 - Back out recent startupcache changes on beta r?froydnj

Approved for 81.0b6, thanks for putting this together.

Attachment #9173655 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

The nsZipCursor::ReadOrCopy API is a bit strange, and probably should be
revisited. Notably, if the compression on an item is STORED, and we are doing
a Copy, the returned value will always be null, and the caller is expected to
just get what they want from the buffer they passed into the nsZipCursor's
constructor. This functions all right, though it's quite unintuitive. However,
if we fail the CRC, the only observable consequence right now is that we
return a nullptr, which with a STORED Copy will happen anyway. Thus, we have
no means of knowing if we failed the CRC. This change simply zeroes out the
aBytesRead argument if we fail the CRC so we don't read garbage.

NOTE: this is linked to bug 1646261 because it was discovered while that bug
was being investigated. However, it's not a plausible underlying cause of
that issue.

Given all of the MOZ_TRYs in WriteToDisk, it's entirely possible that
we occasionally exit it before writing everything out. WriteToDisk
additionally adds guards to try to ensure that we do not call it twice,
and that we do not try to read things out of the file via the memmapped
buffer after it has written to it. The latter should not generally be
a problem, since we write to a temp file and move that over at the end.
However, writing twice could be a problem as we modify our values'
mOffsets to keep track of their offsets within the resulting file.

Accordingly, this change simply moves the cleanup and guards into an
RAII helper to ensure they survive all of the early returns.

Depends on D89251

Keywords: leave-open
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/d68723cf7556
Ensure we clean up after WriteToDisk r=froydnj

This backs out all work from bug 1627075 as well as all of its
descendents. There were a few conflicts when backing this out but
overall it was pretty clean, so I would say it's a fairly mild
level of risk. Historically Nathan Froyd has reviewed these patches,
but he is no longer at Mozilla, and no one else is particularly
familiar with the code, so I am passing this off to RyanVM who has
at least been familiar with the history of the bug.

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/6450088b6b73
Back out all recent StartupCache work r=RyanVM
Regressions: 1665860

It seems to me we should call this bug fixed, at this point. There's a few remaining crashes with this signature but they're presumably not related to the spikes we saw in 80 and 81.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago4 years ago
Keywords: leave-open
Resolution: --- → FIXED
Blocks: 1664229
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: