Startup crash in ErrorLoadingBuiltinSheet

RESOLVED FIXED in Firefox 42

Status

defect
--
critical
RESOLVED FIXED
4 years ago
8 months ago

People

(Reporter: dmajor, Assigned: heycam)

Tracking

(Blocks 1 bug, {crash})

42 Branch
mozilla66
x86
Windows NT

Firefox Tracking Flags

(firefox41+ wontfix, firefox42+ fixed, firefox43 fixed, firefox47 wontfix, firefox48 wontfix, firefox49 wontfix, firefox-esr38- wontfix, firefox-esr45 wontfix, firefox50 wontfix, firefox-esr60 wontfix, firefox64 wontfix, firefox65 wontfix, firefox66 fixed)

Details

(Whiteboard: [startupcrash][tbird crash], crash signature)

Attachments

(3 attachments)

This bug was filed from the Socorro interface and is report bp-93728d41-b062-40b7-94bd-c2f3a2150813.
=============================================================

This seems to have spiked in 41b1.

Frame 	Module 	Signature 	Source
0 	mozglue.dll 	mozalloc_abort(char const* const) 	memory/mozalloc/mozalloc_abort.cpp
1 	xul.dll 	NS_DebugBreak 	xpcom/base/nsDebugImpl.cpp
2 	xul.dll 	ErrorLoadingBuiltinSheet 	layout/style/nsLayoutStylesheetCache.cpp
3 	xul.dll 	nsLayoutStylesheetCache::LoadSheet(nsIURI*, nsRefPtr<mozilla::CSSStyleSheet>&, bool) 	layout/style/nsLayoutStylesheetCache.cpp
Summary: Starup crash in ErrorLoadingBuiltinSheet → Startup crash in ErrorLoadingBuiltinSheet
Roc, this may be the return of the mystery crashes from bug 1063052.
Flags: needinfo?(roc)
[Tracking Requested - why for this release]: startup crashes
The report linked in comment 0 has:

xpcom_runtime_abort([4824] ###!!! ABORT: LoadSheetSync failed with error 80520012 loading built-in stylesheet 'resource://gre-resources/noscript.css': file c:/builds/moz2_slave/rel-m-beta-w32_bld-00000000000/build/layout/style/nsLayoutStylesheetCache.cpp, line 462)


Are they all noscript.css?  Or are other extensions involved?  (Could it be related to extensions being disabled due to extension signing requirements?)
Er, never mind my previous question; noscript.css is the sheet we use to show script elements when script is disabled, not related to the noscript extension.  (Not sure why I thought it was...)
The error is NS_ERROR_FILE_NOT_FOUND. So that's not much to go on...

I'm not sure how to proceed. Maybe dump to the crash report the contents of the directory we expect to find the file in? And report a separate error if the directory itself doesn't exist?

I don't really know much about the style system, packaging, or updating, I'm just flailing in a hopefully helpful way here :-).
Flags: needinfo?(roc)
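For reference, the hex value in the abort message can be decoded by hand. A minimal sketch in Python, assuming the nsresult bit layout from xpcom/base/nsError.h (failure flag in the top bit, module number offset by 0x45 in the upper half, error code in the low 16 bits). The function names and the small name table are made up for illustration, and the table only holds the two names this bug itself assigns to the values seen in the abort messages:

```python
# Assumed nsresult layout (xpcom/base/nsError.h):
#   bit 31      failure flag
#   bits 16..28 module number + NS_ERROR_MODULE_BASE_OFFSET
#   bits 0..15  error code within the module
NS_ERROR_MODULE_BASE_OFFSET = 0x45

# Only the names given in this bug; deliberately not a complete table.
KNOWN_NAMES = {
    0x80520012: "NS_ERROR_FILE_NOT_FOUND",
    0x8052000B: "NS_ERROR_FILE_CORRUPTED",
}


def decode_nsresult(value):
    """Split a 32-bit nsresult into its failure flag, module and code."""
    return {
        "failed": bool(value & 0x80000000),
        "module": ((value >> 16) - NS_ERROR_MODULE_BASE_OFFSET) & 0x1FFF,
        "code": value & 0xFFFF,
        "name": KNOWN_NAMES.get(value, "unknown"),
    }


def decode_abort_code(hex_text):
    """Decode the bare hex string printed after 'failed with error'."""
    return decode_nsresult(int(hex_text, 16))
```

For 80520012 this yields module 13 and code 18, consistent with the NS_ERROR_FILE_NOT_FOUND reading above.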
Tracked for 41 and 42, crashes are bad.
David, Robert: Would you be able to help find an owner for this bug? It is not a top crasher but I do see several crashes (~200) on Beta41. Is there a likelihood of getting a fix soon'ish? Otherwise I will have to wontfix this bug.
Flags: needinfo?(roc)
Flags: needinfo?(dbaron)
Do we know of any installer/updater changes that could have led to this?
Flags: needinfo?(robert.strong.bugs)
Nothing changed in the updater or installer that I know of that would have caused this. Perhaps it is due to people unpacking the omni.ja and our code using fallbacks even when the build is an omni.ja build.
Flags: needinfo?(robert.strong.bugs)
Jet, can you find someone for this? I suggest trying https://bugzilla.mozilla.org/show_bug.cgi?id=1194856#c5...
Flags: needinfo?(roc) → needinfo?(bugs)
To heycam for a look. We may be in a state where this change from bug 1169514 is interacting with stale toolkit.jar:

https://bugzilla.mozilla.org/attachment.cgi?id=8613390&action=diff
Flags: needinfo?(dbaron)
Flags: needinfo?(cam)
Flags: needinfo?(bugs)
Jet, we will gtb Beta9 on Thursday, any chance we can get a fix in before that?
Flags: needinfo?(bugs)
I'll investigate by doing something like comment 5, but in the meantime, how about we make the nsLayoutStylesheetCache::No{Script,Frames}Sheet() methods load their sheets from a data: URL with the relevant sheet contents.  The two sheets added in bug 1169514, noscript.css and noframes.css, are pretty small.
Blocks: 1169514
Flags: needinfo?(cam)
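The data: URL idea can be sketched as follows. This is Python rather than the C++ in nsLayoutStylesheetCache, the helper names are hypothetical, and the CSS text is a placeholder, not the real contents of noscript.css. The point is only that the sheet text travels inside the URL itself, so reading it back does not touch omni.ja or the disk:

```python
from urllib.parse import quote
from urllib.request import urlopen

# Placeholder rule for illustration only; NOT the real noscript.css.
PLACEHOLDER_SHEET = "noscript { display: none !important; }"


def css_data_url(css_text):
    """Embed a small style sheet in a percent-encoded data: URL."""
    return "data:text/css," + quote(css_text)


def read_data_url(url):
    """Read the sheet back out of the URL; no file access is involved."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


url = css_data_url(PLACEHOLDER_SHEET)
```

This only stays practical because both sheets are small, as noted above; a large sheet would make for an unwieldy URL.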
(In reply to Cameron McCormack (:heycam) from comment #13)
> I'll investigate by doing something like comment 5, but in the meantime, how
> about we make the nsLayoutStylesheetCache::No{Script,Frames}Sheet() methods
> load their sheets from a data: URL with the relevant sheet contents.  The
> two sheets added in bug 1169514, noscript.css and noframes.css, are pretty
> small.

SGTM. Seems low-risk enough to uplift to the next Beta.
Flags: needinfo?(bugs)
FWIW, noscript.css (and noframes.css) is lazily loaded, and it gets loaded in response to a pref being set, in nsDocumentViewer::CreateStyleSet.  So presumably this is why it is appearing as a startup crash -- the first document that is created if the relevant pref is set will try to load noscript.css.  There are a bunch of eagerly loaded sheets (those loaded in the nsLayoutStylesheetCache constructor) that will load before noscript.css, so I don't think the problem is that we're missing all the UA sheets.
So we can collect some information through crash reports, I'm going to leave Aurora/Nightly loading noscript.css/noframes.css directly, while on Release/Beta equivalent data: URLs are used.  I'll add the crash report annotations later.
Attachment #8658523 - Flags: review?(dbaron)
(In reply to Cameron McCormack (:heycam) from comment #15)
> So presumably this is why it is appearing as a startup crash

Note that "startup crash" just means it mostly crashes within the first 60 seconds of browser uptime.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #18)
> (In reply to Cameron McCormack (:heycam) from comment #15)
> > So presumably this is why it is appearing as a startup crash
> 
> Note that "startup crash" just means it mostly crashes within the first 60
> seconds of browser uptime.

That is the meaning in general, but in this particular case I would call it a "true" startup crash, of the nature that makes the browser completely unusable. There is an nsAppStartup frame on the stack, and a large proportion of reports have uptime of 0 or 1 seconds.
For some reason this crash is especially popular with ESR. Perhaps related to the omni.ja theory from comment 9?

Version facet for the last 7 days:
Rank 	Version 	Count 	%
1 	38.2.1esr 	617 	52.38 %
2 	38.0.1esr 	178 	15.11 %
3 	41.0b6 	84 	7.13 %
4 	40.0.3 	64 	5.43 %
5 	38.2.0 	60 	5.09 %
6 	41.0b1 	26 	2.21 %
Comment on attachment 8658523 [details] [diff] [review]
Load noscript.css and noframes.css from data: URLs in release builds.

I'm not crazy about RELEASE_BUILD #ifdefs, since they add risk to aurora->beta migrations -- maybe it's better to just change this unconditionally?  Or was there a reason for making it conditional?
Attachment #8658523 - Flags: review?(dbaron) → review+
I was thinking of not changing how Aurora/Nightly work so that we can still add some crash report annotations to figure out what the underlying problem actually is.  (Alternatively I could figure out how Telemetry/FHR really work and report info about the style sheet load failure through that, and avoid crashing on Aurora/Nightly too.)
(In reply to David Baron [:dbaron] ⌚UTC-7 from comment #21)
> I'm not crazy about RELEASE_BUILD #ifdefs, since they add risk to
> aurora->beta migrations -- maybe it's better to just change this
> unconditionally?  Or was there a reason for making it conditional?

And I'm going to land this on beta/aurora/inbound, so we shouldn't have to worry about a future uplift breaking (unless someone changes the code within the #ifdef RELEASE_BUILD on inbound, of course).  It will be temporary anyway until we get the information we need from the crash reports.
Comment on attachment 8658523 [details] [diff] [review]
Load noscript.css and noframes.css from data: URLs in release builds.

Approval Request Comment
[Feature/regressing bug #]: bug 1169514
[User impact if declined]: some people on beta/release will continue to experience startup crashes
[Describe test coverage new/current, TreeHerder]: I've manually tested that both arms of the #ifdef RELEASE_BUILD work as expected (i.e. that the sheets load), and done a try run on beta and m-c.
[Risks and why]: low, the change is small
[String/UUID change made/needed]: N/A
Attachment #8658523 - Flags: approval-mozilla-beta?
Attachment #8658523 - Flags: approval-mozilla-aurora?
Comment on attachment 8658523 [details] [diff] [review]
Load noscript.css and noframes.css from data: URLs in release builds.

Patch looks simple enough. Based on DMajor's assessment, this is safe to uplift to Beta41 and Aurora42.
Attachment #8658523 - Flags: approval-mozilla-beta?
Attachment #8658523 - Flags: approval-mozilla-beta+
Attachment #8658523 - Flags: approval-mozilla-aurora?
Attachment #8658523 - Flags: approval-mozilla-aurora+
Keeping this bug open for the crash log annotation.
Keywords: leave-open
Blocks: 1208818
This crash is still happening after the patch, because other stylesheets are also affected:

002fed68  "[5064] ###!!! ABORT: LoadSheetSy"
002fed88  "nc failed with error 8052000b lo"
002feda8  "ading built-in stylesheet 'resou"
002fedc8  "rce://gre-resources/html.css': f"

0021e870  "[3852] ###!!! ABORT: LoadSheetSy"
0021e890  "nc failed with error 8052000b lo"
0021e8b0  "ading built-in stylesheet 'resou"
0021e8d0  "rce://gre-resources/forms.css': "

0015e70c  "[1136] ###!!! ABORT: LoadSheetSy"
0015e72c  "nc failed with error 8052000b lo"
0015e74c  "ading built-in stylesheet 'resou"
0015e76c  "rce://gre-resources/counterstyle"
0015e78c  "s.css': file c:/builds/moz2_slav"

003d7f18  "[2016] ###!!! ABORT: LoadSheetSy"
003d7f38  "nc failed with error 8052000b lo"
003d7f58  "ading built-in stylesheet 'resou"
003d7f78  "rce://gre/res/contenteditable.cs"
003d7f98  "s': file c:/builds/moz2_slave/re"
Flags: needinfo?(cam)
(In reply to David Major [:dmajor] from comment #31)
> This crash is still happening after the patch, because other stylesheets are
> also affected:
> 
> 002fed68  "[5064] ###!!! ABORT: LoadSheetSy"
> 002fed88  "nc failed with error 8052000b lo"
> 002feda8  "ading built-in stylesheet 'resou"
> 002fedc8  "rce://gre-resources/html.css': f"

This is NS_ERROR_FILE_CORRUPTED, now, incidentally.  The fact that we're getting this on different style sheet URLs is disconcerting, too.  NS_ERROR_FILE_CORRUPTED is returned from various bits of zip/jar parsing code.

I might try attaching to the crash report (a) the list of files in the Firefox installation directory, and (b) the list of files found in the omni.ja, if we don't fail parsing it.  Benjamin, do you think there would be privacy concerns doing this?
Flags: needinfo?(benjamin)
Good question. omni.ja would be fine, except that it's going to be a lot of data and probably therefore shouldn't be in every report. I worry about the bare filelist a little bit since various enterprise customizations live there. But if we treat it as private data on the server I think it should be fine. Clearly diagnostic data of the sort we need to know to fix crashes.
Flags: needinfo?(benjamin)
Is registerAppMemory the right way to add private data to the crash report?  (I've only used appendAppNotesToCrashReport previously.)  What practical limits are there to the size of the data I can provide?
Flags: needinfo?(benjamin)
RegisterAppMemory is for the case where you have a page of memory with special data and you want to save it in the minidump file. You have to allocate it using a page-allocation function (VirtualAlloc) and keep it alive for the life of the program.

Annotations (AnnotateCrashReport) typically shouldn't be larger than a few K but are much easier to deal with.
Flags: needinfo?(benjamin)
I submitted https://crash-stats.mozilla.com/report/index/02e4198f-e9a0-410f-ac36-15cae2151009 which should have the annotation from this patch.  How can I get access to it to verify?
Attachment #8672185 - Flags: review?(benjamin)
An earlier version of the patch listed all files in the omni.ja and under the GRE directory, but that caused the annotation to be hundreds of kilobytes, so I scoped it down to the files that are likely to be relevant.
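To illustrate the scoping-down, here is a rough Python sketch and not the patch itself: it walks a directory, keeps only files whose names look relevant, and emits one "path (N bytes, crc32 = 0x...)" line per file, capped at a few kilobytes in line with the annotation size guidance above. The set of interesting names, the cap, and all function names are assumptions made for the example:

```python
import os
import zlib

# Assumption for the example: only these file names are worth reporting.
INTERESTING_NAMES = {"omni.ja", "chrome.manifest"}


def crc32_of_file(path, chunk_size=1 << 16):
    """Compute the CRC-32 of a file without reading it all at once."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def describe_interesting_files(root, max_chars=4096):
    """List matching files under root with size and CRC-32, length-capped."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            if name not in INTERESTING_NAMES:
                continue
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            crc = crc32_of_file(path)
            lines.append("  %s (%d bytes, crc32 = 0x%08x)" % (path, size, crc))
    return "\n".join(lines)[:max_chars]
```

Filtering by name rather than listing everything is what keeps the output small enough to fit in an annotation.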
(In reply to Cameron McCormack (:heycam) from comment #37)
> How can I get access to it to verify?

KaiRo verified for me.
Attachment #8672185 - Flags: review?(benjamin) → review+
Comment on attachment 8672185 [details] [diff] [review]
Add crash report annotations just before crashing in ErrorLoadingBuiltinSheet.

We should uplift this so we can get the crash annotations more quickly.

Approval Request Comment
[Feature/regressing bug #]: bug 1169514
[User impact if declined]: less chance of resolving this bug
[Describe test coverage new/current, TreeHerder]: manual testing; just landed on inbound
[Risks and why]: low; the only new code being run is just before we are about to crash
[String/UUID change made/needed]: N/A
Attachment #8672185 - Flags: approval-mozilla-beta?
Attachment #8672185 - Flags: approval-mozilla-aurora?
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak | ErrorLoadingBuiltinSheet] → [@ mozalloc_abort(char const* const) | NS_DebugBreak | ErrorLoadingBuiltinSheet] [@ mozalloc_abort | NS_DebugBreak | ErrorLoadingBuiltinSheet]
Next time, please don't hesitate to open a new bug for follow up uplifts, this is easier for relman & sheriffs.
Comment on attachment 8672185 [details] [diff] [review]
Add crash report annotations just before crashing in ErrorLoadingBuiltinSheet.

Anything which can help fixing startup crashes.
Taking it. Should be in 42 beta 7.
Attachment #8672185 - Flags: approval-mozilla-beta?
Attachment #8672185 - Flags: approval-mozilla-beta+
Attachment #8672185 - Flags: approval-mozilla-aurora?
Attachment #8672185 - Flags: approval-mozilla-aurora+
FF41 was EOL'd last week, so it's a wontfix.
I've taken a look at a handful of crash reports.  They didn't have any stray chrome.manifest files and the logging was able to open the omni.ja files to find the .css file that we're looking for.  NS_ERROR_FILE_CORRUPTED is still being returned.  The next step is to find out where we're returning NS_ERROR_FILE_CORRUPTED.  Per comment 43 I'll look at some further crash report annotation in a dependent bug.
Depends on: 1225004
David, would you be able to look into this while I'm away?

So far I don't think we're dealing with an unzipped omni.ja -- the crash reports don't have any significant chrome.manifest files lying about.  And the reported size of the omni.ja file is correct.

So next I'm trying to determine whether (a) something has modified some of the contents of the omni.ja, corrupting it, by reporting a crc32 value in the crash reports, and (b) which specific |return NS_ERROR_FILE_CORRUPTED| we're encountering while trying to get the .css file from the omni.ja.  That should be reported by bug 1225004, which has just landed, but hasn't gathered any data yet.  Hopefully in a few days it will've.  (See the "SheetLoadFailure" value in the rawdump.json.)
Flags: needinfo?(dbaron)
There are crashes from beta 5, but that doesn't have bug 1225004.  No crashes from beta 6 yet, but it only shipped a few hours ago.
And the first beta 6 crash arrives:  bp-e5efd8c8-605f-4b0b-ba94-26a0f2151125

Error loading sheet: resource://gre/res/contenteditable.css
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in
Real location: jar:file:///D:/Browser/Mozilla%20Firefox/omni.ja!/res/contenteditable.css
GRE directory: D:\Browser\Mozilla Firefox
Interesting files in the GRE directory:
  D:\Browser\Mozilla Firefox\browser\chrome.manifest (40 bytes, crc32 = 0x2eed26f5)
  D:\Browser\Mozilla Firefox\browser\extensions\{972ce4c6-7e08-4474-a285-3208198ce6fd}\chrome.manifest (20763 bytes, crc32 = 0x0ebc75e2)
  D:\Browser\Mozilla Firefox\browser\omni.ja (12318874 bytes, crc32 = 0x91f104a6)
  D:\Browser\Mozilla Firefox\omni.ja (12454928 bytes, crc32 = 0x8ff853d7)
  D:\Browser\Mozilla Firefox\webapprt\omni.ja (84309 bytes, crc32 = 0xd88e7bf3)
GRE omnijar URI string: jar:file:///D:/Browser/Mozilla%20Firefox/omni.ja!/
Interesting files in the GRE omnijar:
  res/contenteditable.css (10028 bytes, crc32 = 0xd6f3fed2)
  chrome/chrome.manifest (1839 bytes, crc32 = 0x18f1a3c9)
But now a sequence of 5 reports in rapid succession with a different reason:

Error loading sheet: resource://gre-resources/forms.css
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: error while inflating
Real location: jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/chrome/toolkit/res/forms.css
GRE directory: C:\Program Files\Mozilla Firefox
Interesting files in the GRE directory:
  C:\Program Files\Mozilla Firefox\browser\chrome.manifest (40 bytes, crc32 = 0x2eed26f5)
  C:\Program Files\Mozilla Firefox\browser\extensions\{972ce4c6-7e08-4474-a285-3208198ce6fd}\chrome.manifest (20763 bytes, crc32 = 0x0ebc75e2)
  C:\Program Files\Mozilla Firefox\browser\omni.ja (12318874 bytes, crc32 = 0x91f104a6)
  C:\Program Files\Mozilla Firefox\omni.ja (12454928 bytes, crc32 = 0xe8dd2bf5)
  C:\Program Files\Mozilla Firefox\webapprt\omni.ja (84309 bytes, crc32 = 0xd88e7bf3)
GRE omnijar URI string: jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/
Interesting files in the GRE omnijar:
  chrome/toolkit/res/forms.css (29179 bytes, crc32 = 0xc2ef97f7)
  chrome/chrome.manifest (1839 bytes, crc32 = 0x18f1a3c9)
Link for crashes just in beta 6 is:
https://crash-stats.mozilla.com/signature/?product=Firefox&version=43.0b6&signature=mozalloc_abort+|+NS_DebugBreak+|+ErrorLoadingBuiltinSheet&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&page=1#reports

I notice that comment 52 and comment 53 have different crc32's on the GRE omni.ja, although the files are the same size.  (Mental note:  I want to check that the mmap of that file isn't writable!)

There have been a bunch of additional beta 6 crashes overnight.

There was a cluster of 4 around the same time:
bp-c7e6f07d-cbe6-4730-b9b4-a260f2151125
bp-156bea64-13b1-44b7-ad1c-6e6e52151125
bp-84804aa7-3a47-4e47-b3b7-d34a72151125
bp-b90ac4d4-9515-4f02-818f-77b2f2151125
that have:
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in
...
  C:\Program Files\Mozilla Firefox\omni.ja (12454928 bytes, crc32 = error 0x80004005)
...
GRE omnijar URI string: jar:file:///C:/Program%20Files/Mozilla%20Firefox/omni.ja!/
Interesting files in the GRE omnijar:
  res/contenteditable.css (10028 bytes, crc32 = 0xd6f3fed2)
  chrome/chrome.manifest (1839 bytes, crc32 = 0x18f1a3c9)

This is interesting because of:
 * error computing CRC for omnijar
 * able to CRC contenteditable.css (consistent with comment 52), yet still report NS_ERROR_FILE_CORRUPTION when trying to open it.


Then there were 3 more well-spaced-out crashes:

bp-0b511ebe-e17e-4ab9-9dd2-f91bc2151125
Error loading sheet: resource://gre/res/contenteditable.css
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in
...
  C:\Program Files\Mozilla Firefox\omni.ja (3940352 bytes, crc32 = 0xa064befe)
...
Interesting files in the GRE omnijar:
  res/contenteditable.css (10028 bytes, crc32 = 0xd6f3fed2)
  chrome/chrome.manifest (1839 bytes, crc32 = 0x18f1a3c9)

Here the GRE omnijar is the wrong size (way too small).  But again we can actually read the contenteditable.css and report its CRC correctly.


And, finally, a cluster of 2 probably from the same user (but with 40 minutes separation):
bp-4860b9b2-333a-4bb6-bc2f-c69eb2151125
bp-719c1d56-3638-43e4-bf2d-37ea72151125

Error loading sheet: resource://gre/res/contenteditable.css
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in
...
  C:\Program Files (x86)\Mozilla Firefox\omni.ja (12470762 bytes, crc32 = 0xd28d9ee3)
...
Interesting files in the GRE omnijar:
  res/contenteditable.css (10028 bytes, crc32 = 0xd6f3fed2)
  chrome/chrome.manifest (1704 bytes, crc32 = 0x6c956600)

Here the omni.ja is slightly larger than that from comment 52 and comment 53.



I should check what the correct size/CRC is for the omni.ja.
(In reply to David Baron [:dbaron] ⌚UTC-8 from comment #54)
> I should check what the correct size/CRC is for the omni.ja.

... which I believe requires the locale, which I don't see how to get from the crash reports.


At the very least, for en-US from:
https://ftp-ssl.mozilla.org/pub/firefox/candidates/43.0b6-candidates/build1/win32/en-US/firefox-43.0b6.zip
the correct data for the GRE omni.ja should be:
  size 12454928
  crc32 0xcf997171
and the correct data for the browser omni.ja should be:
  size 12318874
  crc32 0x91f104a6

So both comment 52 and comment 53 have a valid en-US browser omni.ja and a corrupted en-US GRE omni.ja (correct size, incorrect crc32).


I need to recheck more data on the more recent crash reports.
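The comparison against the reference values can be mechanized. A minimal Python sketch, assuming the annotation line format shown in the crash reports above; the regular expression, function name, and category labels are invented for the example, and the reference size and crc32 are the en-US 43.0b6 values quoted in this comment, so other locales need their own reference:

```python
import re

# Matches lines such as:
#   C:\...\omni.ja (12454928 bytes, crc32 = 0x8ff853d7)
#   C:\...\omni.ja (12454928 bytes, crc32 = error 0x80004005)
LINE_RE = re.compile(
    r"^\s*(?P<path>.+) \((?P<size>\d+) bytes, "
    r"crc32 = (?P<err>error )?0x(?P<hex>[0-9a-fA-F]+)\)\s*$"
)


def classify(line, expected_size, expected_crc):
    """Compare one annotation line against a known-good size and CRC-32."""
    m = LINE_RE.match(line)
    if m is None:
        return "unparsed"
    if int(m.group("size")) != expected_size:
        return "wrong size"
    if m.group("err"):
        return "read error while computing crc32"
    if int(m.group("hex"), 16) != expected_crc:
        return "correct size, wrong crc32"
    return "matches reference"


# en-US GRE omni.ja reference for 43.0b6, as quoted above.
GRE_SIZE, GRE_CRC = 12454928, 0xCF997171
```

Applied to the GRE omni.ja lines from comment 52 and comment 53, this reports "correct size, wrong crc32" in both cases.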
(In reply to David Baron [:dbaron] ⌚UTC-8 from comment #54)
> I notice that comment 52 and comment 53 have different crc32's on the GRE
> omni.ja, although the files are the same size.  (Mental note:  I want to
> check that the mmap of that file isn't writable!)

I looked at the first nsZipHandle::Init method (called from nsZipArchive::OpenArchive) and the underlying NSPR code (making some assumptions about which NSPR code calls what, and assuming I end up in nsprpub/pr/src/md/windows/ntmisc.c in _MD_CreateFileMap and _MD_MemMap), and that all seems OK.

The other nsZipHandle::Init methods are invoked from nsJAR.cpp.  It seems like they wouldn't be invoked in this case.
(In reply to David Baron [:dbaron] ⌚UTC-8 from comment #55)
> ... which I believe requires the locale, which I don't see how to get from
> the crash reports.

The "Metadata" tab of a crash report contains a useragent_locale field. Note that this is the locale selected for the UI to be displayed in and in some cases does not match the locale the build was shipped with (when language packs are installed, it's likely that it doesn't match).
OK, let's go through all the crashes from beta 6 again:

bp-e5efd8c8-605f-4b0b-ba94-26a0f2151125
comment 52
valid en-US browser omni.ja
corrupted GRE omni.ja with correct size

bp-af4bdd74-0908-4683-a25f-ec72d2151125 
bp-2d208d3f-3c23-4877-8bd7-d7ec72151125 
bp-c56b51cf-76b1-4ae9-801d-d2f7a2151125 
bp-d5d84ec6-1b65-49ac-b11e-e1f6f2151125 
bp-a1e6037c-5fc2-411e-920c-cd9bf2151125 
5 crashes in succession, presumably from one user
comment 53
valid en-US browser omni.ja
corrupted GRE omni.ja with correct size
(different crc32 from previous user)
I think these are the only crashes that have:
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: error while inflating
whereas all the others have:
NS_ERROR_FILE_CORRUPTION reason: nsJARInputStream: !mZs.next_in

bp-c7e6f07d-cbe6-4730-b9b4-a260f2151125 
bp-156bea64-13b1-44b7-ad1c-6e6e52151125 
bp-84804aa7-3a47-4e47-b3b7-d34a72151125 
bp-b90ac4d4-9515-4f02-818f-77b2f2151125 
4 crashes in succession, presumably from one user
(part of comment 54)
valid en-US browser omni.ja
useragent_locale is en_US
error computing the CRC of the GRE omnijar, which means PR_Read returned failure (this is the only set where that happened)
GRE omnijar is correct size


bp-0b511ebe-e17e-4ab9-9dd2-f91bc2151125 
(part of comment 54)
valid en-US browser omni.ja
GRE omnijar is barely a third of the correct size (3,940,352 versus 12,454,928)
This is the only crash where the GRE omnijar is an incorrect size; in all others it is the correct size but has incorrect CRC or error from PR_Read while computing the CRC


bp-4860b9b2-333a-4bb6-bc2f-c69eb2151125 
bp-719c1d56-3638-43e4-bf2d-37ea72151125
2 crashes with 40 minutes separation, presumably from one user
(part of comment 54)
locale is ro
GRE omnijar is correct size for ro, but incorrect CRC
browser omnijar is correct size and CRC32 for ro


Note that the size/CRC data for the items *in* the omnijar comes
from the zip header (nsZipItem objects set up in
nsZipArchive::BuildFileList), so those being correct doesn't mean
the data on disk are also correct.
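The same effect is easy to reproduce outside Gecko. A small Python sketch using the standard zipfile module (an analogy, not the nsZipArchive code): the size and CRC-32 reported for a member come from the archive's directory records, so they stay "correct" after the member's bytes are damaged, and the damage only shows up when the data is actually read. The member is stored uncompressed here so that the failure mode is deterministic; the helper names and the payload are made up for the example:

```python
import io
import zipfile


def build_archive(name, payload):
    """Create an in-memory zip with one stored (uncompressed) member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr(name, payload)
    return buf.getvalue()


def flip_payload_byte(archive, payload):
    """Corrupt the first byte of the member data, leaving headers intact."""
    offset = archive.index(payload)
    damaged = bytearray(archive)
    damaged[offset] ^= 0xFF
    return bytes(damaged)


def header_says(archive, name):
    """Size and CRC-32 as recorded in the archive's directory."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        info = zf.getinfo(name)
        return info.file_size, info.CRC


def data_is_readable(archive, name):
    """True only if the member's bytes pass the CRC check when read."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        try:
            zf.read(name)
            return True
        except zipfile.BadZipFile:
            return False


PAYLOAD = b"placeholder sheet contents"
good = build_archive("res/example.css", PAYLOAD)
bad = flip_payload_byte(good, PAYLOAD)
```

Here header_says returns identical values for good and bad, while data_is_readable succeeds only for good, which mirrors crash reports that list a correct size and crc32 for contenteditable.css and still fail to load it.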
Any bright ideas as to what could be causing corrupt omni.ja files?

Does it seem more likely to be disk corruption or something that we're doing, either in the installer or later?

(How would we end up failing if the browser omni.ja were corrupt rather than the GRE one?  Would we end up with a crash report, or just failing to start?  Should we make failing to read from any omnijar an NS_RUNTIMEABORT so that we get the crash reports?)
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(benjamin)
Separate question, does the updater check that what it wrote is correct?
I've been trying to think of what could be causing this and haven't come up with anything close to a significant lead yet. A long time ago we had cases where the DLL versions mismatched in as many as 8% of crash reports in Firefox 3.5, which I was able to reduce to around 0.02% of crash reports in Firefox 4. I've also followed up periodically to verify that these numbers haven't changed.

https://bugzilla.mozilla.org/show_bug.cgi?id=635161#c25
and
https://bugzilla.mozilla.org/show_bug.cgi?id=635161#c26

For a complete update the file is extracted as is and I don't think this is the case where it fails. The file we write to when patching is a new empty file with the size allocated to the correct size to prevent fragmentation. So it might be that it is opened by another Mozilla type application (firefox.exe is locked to prevent it from opening) though it could be a non Mozilla type application (e.g. virus scanner) interfering with the update. If the patch fails then we try to roll back to the original state, and it appears that something might be preventing the update from completing as well as from rolling back.

Is there a clear release where this started happening?
Flags: needinfo?(robert.strong.bugs)
(In reply to Mike Hommey [:glandium] from comment #60)
> Separate question, does the updater check that what it wrote is correct?
For patching (this appears to apply to patching and not laying down an entire file), it checks that the original file has the correct crc and writes out the bits to the sections as specified in the patching instructions. In this instance it appears that the entire patch for the file is not getting applied since the file is the correct size. See comment #61 for more info.
I have no particular insight to offer here. I wonder if we can get a copy of the bad omnijar from some people experiencing this?
Flags: needinfo?(benjamin)
I'd really like to see the update.log, last-update.log, and backup-update.log files for an affected system.
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #64)
> I'd really like to see the update.log, last-update.log, and
> backup-update.log files for an affected system.

Would it be acceptable (privacy-wise) and doable to store those as annotations on a crash report?
Privacy-wise, this is rather sensitive (contains usernames and other data), but no more sensitive than what's typically in a minidump.

Technically, submitting this with crash-stats would be difficult but not impossible. But I'd prefer if we could contact individual users who are submitting emails and ask them to send their update logs by hand. Do we have any emails?
Note that this particular crash is somewhat easier to do that sort of thing for than the typical crash, since it's an NS_RUNTIMEABORT, and we can add crash annotations immediately before calling NS_RUNTIMEABORT (which is how we have the data we have so far).
FYI I had the problem yesterday. I am not sure about the version number as 45final was installed but 46b2 was about to be installed.

Alas I do not know exactly what happened as I left my PC unattended, it rebooted, and after the reboot Firefox crashed every startup. I was unable to restart it (even in safe mode or changing profile).

After uninstalling and re-installing, the crashes were uploaded and I was able to find the bug number.
Please tell me if you think logs are still there and valid after all this. I will upload them.

Crashes:

bp-4df5f7d1-222d-4365-aa44-21d3d2160321
	21/03/2016	13:49
bp-37d6cec0-8dd5-46bd-a504-e82d12160321
	21/03/2016	12:41
bp-b84df3c2-986f-4da6-9ef8-cbcff2160321
	21/03/2016	12:41
bp-6690a99f-57f9-4d84-97f4-4e3fb2160321
	21/03/2016	12:39
bp-2dcd0f7b-cfcb-40bb-b525-386332160321
	21/03/2016	12:38
bp-384f6a60-e067-499e-916f-9b2782160321
	21/03/2016	12:38
bp-fe3fa7bd-50ce-4554-b083-8f80b2160321
	21/03/2016	12:37
bp-24271025-f314-443a-a6b4-addec2160321
	21/03/2016	12:37
bp-5dd32018-0fae-4fae-ab6b-8d73a2160321
	21/03/2016	11:26
bp-386e97e7-9b98-45f9-aac9-52f4c2160321
	21/03/2016	11:23
bp-49b2846c-1f77-450f-bbcc-11f762160321
	21/03/2016	11:23
bp-193c4513-b1e9-4286-b639-9ba2e2160321
	21/03/2016	11:23
bp-4e4ba08e-41af-4d80-8027-c1b7f2160321
	21/03/2016	11:23
This changed frequency, #2 crash for Thunderbird 45. In version 38 not even top 150.
examples where user has 10+ crashes bp-8ff76817-ed84-49e0-a3ae-009932160422 bp-a993610b-a73f-4136-a327-09bba2160422
Whiteboard: [tbird crash]
This should already be in esr45; removing the esr38 tracking request.
user in https://support.mozilla.org/en-US/questions/1127088  Windows 10
bp-0a59f41c-a6f6-4c02-b0ef-c5e322160614  "Thunderbird krasched when I tried to update. Now it can't launch, get a crash report when I try to start. Have installed newest version [Thunderbird] 45.1.1 above the old but it didn't solve the problem"
Thunderbird crashes may not all be alike - some users have reported creating a new profile solved their problem.
Crash volume for signature 'mozalloc_abort | NS_DebugBreak | ErrorLoadingBuiltinSheet':
 - nightly (version 50): 7 crashes from 2016-06-06.
 - aurora  (version 49): 10 crashes from 2016-06-07.
 - beta    (version 48): 120 crashes from 2016-06-06.
 - release (version 47): 362 crashes from 2016-05-31.
 - esr     (version 45): 36 crashes from 2016-04-07.

Crash volume on the last weeks:
             Week N-1   Week N-2   Week N-3   Week N-4   Week N-5   Week N-6   Week N-7
 - nightly          5          0          0          0          0          1          1
 - aurora           0          0         10          0          0          0          0
 - beta             5          9          4         17         21         34         29
 - release          4          8          1          3         53        178        112
 - esr              0          0          0          0          0          3          0

Affected platforms: Windows, Linux
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #70)
> This changed frequency, #2 crash for Thunderbird 45. In version 38 not even
> top 150.
> examples where user has 10+ crashes bp-8ff76817-ed84-49e0-a3ae-009932160422
> bp-a993610b-a73f-4136-a327-09bba2160422

Was still a topcrash for TB 45.1.1.  But gone in TB 45.2.0 (eg. esr)
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #75)
> (In reply to Wayne Mery (:wsmwk, NI for questions) from comment #70)
> > This changed frequency, #2 crash for Thunderbird 45. In version 38 not even
> > top 150.
> > examples where user has 10+ crashes bp-8ff76817-ed84-49e0-a3ae-009932160422
> > bp-a993610b-a73f-4136-a327-09bba2160422
> 
> Was still a topcrash for TB 45.1.1.  But gone in TB 45.2.0 (eg. esr)

Ah, so for Thunderbird it's changed to this and friends (and not a topcrash) https://crash-stats.mozilla.com/signature/?product=Thunderbird&signature=Abort%20%7C%20LoadSheetSync%20failed%20with%20error%20804b0012%20loading%20built-in%20stylesheet%20%27data%3Ate...%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20ErrorLoadingBuiltinSheet&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_sort=-date&page=1#reports

Any workaround?
Whiteboard: [tbird crash] → [startupcrash][tbird crash]
FWIW there are some crashes in ErrorLoadingBuiltinSheet in bug 1300720 which turn out to be due to network.protocol-handler.external.file=true and capability.policy.xxx.checkloaduri.enabled="allAccess".  I am not sure whether they are related to the crashes here where we're failing on CRC checks etc.
The leave-open keyword is there and there is no activity for 6 months.
:heycam, maybe it's time to close this bug?
Flags: needinfo?(cam)
I don't see any crashes post Firefox 51, so probably it can be closed, but I should back out the diagnostics code added here first.
Flags: needinfo?(cam)
Flags: needinfo?(cam)
This backs out the main patch landed earlier in bug 1194856 and the
patch from bug 1225004.
Pushed by cmccormack@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5093779d1638
Remove UA style sheet load crash report annotations r=gsvelto,dbaron
https://hg.mozilla.org/mozilla-central/rev/5093779d1638
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66