Open Bug 1535064 Opened 9 months ago Updated 6 months ago

Crash in [@ ErrorLoadingSheet]

Categories

(Toolkit :: Application Update, defect, P2, critical)

66 Branch
defect

Tracking

()

Tracking Status
firefox-esr60 --- affected
firefox65 --- unaffected
firefox66 - wontfix
firefox67 --- wontfix
firefox67.0.1 --- wontfix
firefox68 --- wontfix
firefox69 --- fix-optional

People

(Reporter: lizzard, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash, regression)

Crash Data

This bug is for crash report bp-00fa9d97-dc14-429c-be73-6b1c80190313.

We're seeing some crashes for the first time in the last 6 months in this signature with the 66 RC1 build. See also bug 315278.

Top 10 frames of crashing thread:

0 xul.dll static void MOZ_CrashOOL mfbt/Assertions.h:314
1 xul.dll static void ErrorLoadingSheet layout/style/nsLayoutStylesheetCache.cpp:277
2 xul.dll nsLayoutStylesheetCache::LoadSheet layout/style/nsLayoutStylesheetCache.cpp:304
3 xul.dll nsLayoutStylesheetCache::nsLayoutStylesheetCache layout/style/UserAgentStyleSheetList.h:34
4 xul.dll nsLayoutStylesheetCache::Singleton layout/style/nsLayoutStylesheetCache.cpp:187
5 xul.dll nsDocumentViewer::CreateStyleSet layout/base/nsDocumentViewer.cpp:2292
6 xul.dll nsresult nsDocumentViewer::InitPresentationStuff layout/base/nsDocumentViewer.cpp:732
7 xul.dll nsresult nsDocumentViewer::InitInternal layout/base/nsDocumentViewer.cpp:980
8 xul.dll nsDocumentViewer::Init layout/base/nsDocumentViewer.cpp:714
9 xul.dll nsresult nsDocShell::Embed docshell/base/nsDocShell.cpp:6256

Robert, can you take a look so we can check if this is some new problem? It didn't happen for the 65 RC so I'm concerned we may have regressed something.

Flags: needinfo?(robert.strong.bugs)

I've been trying to track down what is happening today. My findings so far

  1. Nothing changed in the app update client after 66.0b10 so an app update client change is extremely unlikely.
  2. The omni.ja that contains the css file (svg.css) is updated after firefox.exe and before xul.dll both of which have the correct version. If this were due to an update failure (e.g. low disk space, updater crash, etc.) that failed to rollback to the original state then the file versions would be different.

I did notice that the compatibility.ini file is located in the Windows roaming profile and the startup cache is located in the Windows local profile. I don't know the startup cache code well enough to know for certain but it seems that if a client with a roaming profile updated and then used that roaming profile on another system that had already been updated the profile's compatibility.ini would report it as compatible but the users startup cache in the Windows local profile would be for the old version and not be purged.

I'm looking into what could be added to the crash reports that might help narrow this down.

Flags: needinfo?(robert.strong.bugs)

I filed bug 1535181 for the possible startup cache issue.

Thanks, I'll keep an eye on it after release.

Now that I look back at the crash signature it isn't new in 66 at all. I must have been searching with some other limit without realizing.

It does look like it increased significantly in 66.0b11 so something may have changed after 66.0b10 to cause this increase. I went through all of the changesets but there were no app update changes and nothing else stood out.

This looks very similar to bug 1194856.

:heycam, is the diagnostic results from the spinoffs of bug 1225004 available anywhere? If not, could the diagnostics be added back? Thanks!

Flags: needinfo?(cam)

Another weird data point. Bug 1194856 comment #68 states that the error occurred after uninstalling / reinstalling.

heycam, after reading through bug 1194856 again I don't think adding back the diagnostics will help at this point so clearing the needinfo.

Flags: needinfo?(cam)
Priority: -- → P3

Changing the priority to p1 as the bug is tracked by a release manager for the current release.
See How Do You Triage for more information

Priority: P3 → P1

This is showing up as the #2 top crash in 66 release on the first day of the release.
Jimm, is there anything else that could be done here? Do you still think this should be P3?

Flags: needinfo?(jmathies)

Changing platform to all as I see crashes on all 3 platforms in 66 release.

OS: Windows 10 → All
Hardware: x86 → All

This bug appears to be essentially the same as Bug 1194856 and Bug 1194856 comment #68 shows this happens with the installer as well.
Also, FennecAndroid doesn't use Application Update.

Given the above two points I'm not sure this bug is properly categorized or that Application Update can do anything here.

We did a little debugging in our standup. In 65 we had debug data that helped. The consensus is this:

  1. this impacts a small number of users
  2. the crcs on certain files are wrong, but the file sizes appear correct. This implies some sort of corruption of the files on disk.
  3. bug 1303764, which is the previous incarnation of this bug (filed in 2016) had a user report that the main cause appeared to be a full disk.
  4. This isn't application update specific as this happens in cases where updater isn't involved .

The way forward here is to add a check to install/update that validates crc values after applying an update to a file which is something we currently don't do. To accomplish this we'd need crc data embedded in mar files so the work here extends beyond client.

We're going to find some time to work on this (bug 1535687) in Q2. There are no easy fixes here so I'd suggest this not block any release.

Flags: needinfo?(jmathies)
Priority: P1 → P2

Also, bug 1194856 (filed 2015) appears to be a previous incarnation of this bug which also affected people uninstalling and reinstalling. It is also the bug that added the crc values in the crash reports which has since been backed out.

Note that the crc checks of files after they have been updated won't help the FennecAndroid case since Application Update isn't used by FennecAndroid.

Signs point to omni.ja truncation.

We looked at an instance of @ ErrorLoadingSheet last week, it includes these annotations:

Error loading sheet: resource://gre/res/svg.css
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in
Real location: jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/res/svg.css
GRE directory: C:\Program Files\Mozilla Firefox
Interesting files in the GRE directory:
  C:\Program Files\Mozilla Firefox\browser\chrome.manifest (0 bytes, crc32 = 0x00000000)
  C:\Program Files\Mozilla Firefox\browser\omni.ja (41437720 bytes, crc32 = 0x7059911f)
  C:\Program Files\Mozilla Firefox\chrome.manifest (0 bytes, crc32 = 0x00000000)
  C:\Program Files\Mozilla Firefox\omni.ja (17429973 bytes, crc32 = 0x326fbb3c)
Contents of chrome.manifest:
[[[]]]
GRE omnijar URI string: jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/
Interesting files in the GRE omnijar:
  chrome/chrome.manifest (2784 bytes, crc32 = 0x0a1409b4)
  res/svg.css (2159 bytes, crc32 = 0xeae4bb67)

The released 65.0.2 Windows amd64 en-US omni.ja should have that size, but the CRC should be 45c7dd43; browser\omni.ja is also the right size but the CRC should be a7d7a15d. The later CRCs (from the files in the GRE omnijar) are correct, but these come from the Zip central directory and thus don't reflect the actual data on disk.

It turns out that the reported CRCs can be explained if the files are truncated (specifically, filled with zeroes past a certain point, this can be checked with crctrunc): omni.ja from 0x1000000, browser\omni.ja from 0xb00000. Truncating the files in this way produces the same crash in my testing on 64 bit Windows 10.

I'm not sure what to blame this on. It would be interesting to see if the Android crashes look the same, as then there would be a totally different install/update process and filesystem responsible. One thing we could do to catch this earlier would be to check for the Zip "end of central directory" record at the end of the file.

I haven't yet had the opportunity to check more of these reports; this crash could still be a catch-all for any omnijar corruption. We may want to restore the annotations removed with https://phabricator.services.mozilla.com/D14050 , with some sort of support for Android (as Android crashes like edaf2b37-2f6c-4667-8bda-3e3ba0190325 don't report the CRC for res/omni.ja).

There are also OS X and Linux crashes. I'll try to find ones that includes the crc annotations.

Sent a couple of OS X and Linux crashes to agashlin via email

I looked at the handful of crashes rstrong referred, there's a wide variety.

Using omnijars from the future (partial updates?):

From the past:

Etc:

Edit2: Reverted first edit, I was reading the wrong report.

With how FennecAndroid had a spike with 66.0.5 the same time as Firefox desktop and FennecAndroid not using app update it makes me think there is something other than app update causing this even though the FennecAndroid numbers are low.

This is fairly low volume on 67 release so far, for both Firefox desktop and mobile.

Still fairly low volume so I'm marking this fix-optional for 69 to remove it from triage.

You need to log in before you can comment on or make changes to this bug.