Closed Bug 1262731 Opened 4 years ago Closed 3 years ago

startup crash in mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2, because JS_Init() fails

Categories

(Core :: JavaScript Engine, defect, blocker)

Unspecified
Windows NT
defect
Not set
blocker

Tracking

()

VERIFIED FIXED
mozilla50
Tracking Status
firefox46 - ---
firefox47 - ---
firefox48 + fixed
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: njn, Assigned: njn)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [startupcrash])

Crash Data

Attachments

(5 files, 6 obsolete files)

5.59 KB, patch
sfink
: review+
njn
: checkin+
Details | Diff | Splinter Review
2.44 KB, text/plain
Details
1.51 KB, text/plain
Details
2.22 KB, patch
Details | Diff | Splinter Review
2.57 KB, patch
ted
: review+
njn
: checkin+
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is 
report bp-709fc1b1-2c91-44c4-9844-db1772160406.
=============================================================

I see 156 crashes with this signature, all since March 31. All the reports I looked at showed us hitting a NS_RUNTIMEABORT because JS_Init() fails.

It's a mix of Firefox and Thunderbird, and even one SeaMonkey. It's mostly in 48.0a1, which suggests it's gotten a lot more frequent recently.

I see 7 different ways that JS_Init() could fail early and return false. Maybe we should change its API (or introduce a compansion JS_Init2() function) to return a reason for the failure, and then the NS_RUNTIMEABORT message could be chosen accordingly? That would give us a better idea what's going wrong.
It's a start-up crash, of course. It occurred 7 times in Nightly 20160406030221, all to the same user. So it's the kind of crash that could cause someone to switch browsers quickly.
There are now 35 crashes over the last three days, making this 3% of all crashes and the #2 topcrash.
I'll try to write a diagnostic patch to give us more info.
Assignee: nobody → n.nethercote
I looked at all the functions called by JS_Init(). They more or less all have a
single reason they can go wrong, so I think one string per fallible function
should be good enough to make progress on this bug.
Attachment #8739252 - Flags: review?(sphink)
Comment on attachment 8739252 [details] [diff] [review]
Add JS_InitWithFailureDiagnostic()

Review of attachment 8739252 [details] [diff] [review]:
-----------------------------------------------------------------

r=me for diagnostic usage, though I probably would've gone for a MOZ_RELEASE_ALWAYS_TRUE or something, to avoid needing to add all these strings. But since you've gone and done it, fine by me.
Attachment #8739252 - Flags: review?(sphink) → review+
must be a regression, no?

interesting that there are more Thunderbird crashes than FF.
I'm crashing on every TB start bp-331cc40c-0cad-40ce-a89b-959972160408
Severity: critical → blocker
Keywords: dogfood
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #7)
> must be a regression, no?
> 
> interesting that there are more Thunderbird crashes than FF.
> I'm crashing on every TB start bp-331cc40c-0cad-40ce-a89b-959972160408

I forgot about bug 1262561
https://hg.mozilla.org/mozilla-central/rev/8ea65a181e4b
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla48
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
Target Milestone: mozilla48 → ---
(In reply to Wayne Mery (:wsmwk, use Needinfo for questions) from comment #8)
> > 
> > interesting that there are more Thunderbird crashes than FF.
> > I'm crashing on every TB start bp-331cc40c-0cad-40ce-a89b-959972160408
> 
> I forgot about bug 1262561

We have crash reports now with the diagnostic in, and all the ones I've seen have:

> ABORT: JS_InitWithFailureDiagnostic: u_init() failed

Which is a failure in ICU initialization, which is consistent with bug 1262561.

Furthermore, the last crash like this seen in Firefox was in Nightly 20160408030212 (April 8). There are no crashes like this in builds from April 9 and 10. So this is good! But it's unclear to me how it got fixed...

Trying to untangle things here:

- Bug 1262561 is about this NS_InitXPCOM2() abort in Thunderbird. It is marked as fixed.

- It depended on the following two bugs:

  * Bug 1262567, which is the Thunderbird port of bug 1186060: "Use Visual Studio 2015 Update 1 for release builds"
  * Bug 1262636, which is the Thunderbird port of bug 1239083: "Stop using ICU's autoconf build system"

Those two Firefox bugs landed more than 2 days ago, so I don't see why the April 9 Nightly would be the one where the crash disappeared. Maybe something else got fixed then -- ted, any ideas?
Blocks: vs2015, 1239083
Status: REOPENED → RESOLVED
Closed: 4 years ago4 years ago
Flags: needinfo?(ted)
Resolution: --- → FIXED
BTW, I'm inclined to keep the JS_InitWithFailureDiagnostic() patch in the tree, given that it will be useful if something else goes wrong during JS initialization in the future.
I haven't touched anything related to ICU since landing bug 1239083 last week, so I don't have any real clues here.
Flags: needinfo?(ted)
This signature started popping up in Firefox 46.0b10 but it was not present in 46.0b9
[Tracking Requested - why for this release]:

(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #13)
> This signature started popping up in Firefox 46.0b10 but it was not present
> in 46.0b9

The obvious question is whether 46.0b10 has the vs2015 and ICU build changes from the blocking bugs.

KaiRo, what's the general procedure for seeing what changes have gone into a particular release, whether it be Nightly, Aurora, Beta or Release? Thank you.
Flags: needinfo?(kairo)
(In reply to Nicholas Nethercote [:njn] from comment #14)
> KaiRo, what's the general procedure for seeing what changes have gone into a
> particular release, whether it be Nightly, Aurora, Beta or Release? Thank
> you.

Looking at the changelog/pushlog. Releases and betas you can look by tags, for nightly/aurora you need to find the changesets in the builds or on archive.m.o

For the range I mentioned, the pushlog is https://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=FIREFOX_46_0b9_RELEASE&tochange=FIREFOX_46_0b10_RELEASE
Flags: needinfo?(kairo)
That said, it looks like it's gone again in b11 and no other signature rose that same amount. Either it spread into multiple smaller signatures, or it was fixed again, or it only happens with certain compiler optimizations and we hit those with that one build.
Sounds like this was from Dragana's temporary diagnostic patch in bug Bug 1257216. That landed in beta 10 and backed out in beta 11. Nicholas does that help?
Flags: needinfo?(n.nethercote)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #17)
> Sounds like this was from Dragana's temporary diagnostic patch in bug Bug
> 1257216. That landed in beta 10 and backed out in beta 11. Nicholas does
> that help?

Do you mean the "Increase toolkit.asyncshutdown.timeout.crash for WinXP" patch? That affected shutdown, but this bug is about XPCOM and JS initialization at start-up, so I don't think they are related.
Flags: needinfo?(n.nethercote)
Ah, ok. Thanks for the explanation, I jumped to conclusions!   
It also looks like in beta 10 we had 565 crashes all for one installation. Maybe this was someone doing automated tests (fuzzing?) Since we don't see the signature for beta 11 or 46.0 RC, and this is one install's crash spike, I don't think we need to track this for 46.
Just seen this on Ubuntu wily 15.04 / Firefox Aurora Mozilla Firefox 48.0a2 48.0~a2~hg20160509r318685-0u.

Backtrace:

```
(gdb) backtrace 
#0  0x00007fb45120abd9 in raise (sig=sig@entry=11) at ../sysdeps/unix/sysv/linux/pt-raise.c:36
#1  0x00007fb443a9afa7 in nsProfileLock::FatalSignalHandler (signo=11, info=0x7ffdda92ceb0, context=0x7ffdda92cd80)
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/toolkit/profile/nsProfileLock.cpp:181
#2  <signal handler called>
#3  mozalloc_abort (
    msg=msg@entry=0x7ffdda92d34c "[4058] ###!!! ABORT: JS_InitWithFailureDiagnostic: u_init() failed: file /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp, line 709")
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/memory/mozalloc/mozalloc_abort.cpp:33
#4  0x00007fb4420b715b in Abort (
    aMsg=0x7ffdda92d34c "[4058] ###!!! ABORT: JS_InitWithFailureDiagnostic: u_init() failed: file /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp, line 709")
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/base/nsDebugImpl.cpp:447
#5  NS_DebugBreak (aSeverity=<optimised out>, aSeverity@entry=3, aStr=0x7fb444a15eb8 "JS_InitWithFailureDiagnostic: u_init() failed", aExpr=aExpr@entry=0x0, 
    aFile=aFile@entry=0x7fb444532cef "/build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp", aLine=aLine@entry=709)
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/base/nsDebugImpl.cpp:434
#6  0x00007fb4421119c1 in NS_InitXPCOM2 (aResult=aResult@entry=0x7fb43f1c5488, aBinDirectory=<optimised out>, aAppFileLocationProvider=<optimised out>)
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp:709
#7  0x00007fb442111bca in NS_InitXPCOM2 (aResult=aResult@entry=0x7fb43f1c5488, aBinDirectory=<optimised out>, aAppFileLocationProvider=<optimised out>)
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp:485
#8  0x00007fb443a9df04 in ScopedXPCOMStartup::Initialize (this=0x7fb43f1c5488) at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/toolkit/xre/nsAppRunner.cpp:1533
#9  0x00007fb443aa2177 in XREMain::XRE_main (this=this@entry=0x7ffdda92d9c8, argc=argc@entry=1, argv=argv@entry=0x7ffdda92eed8, aAppData=aAppData@entry=0x7ffdda92dbc8)
    at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/toolkit/xre/nsAppRunner.cpp:4447
#10 0x00007fb443aa23c6 in XRE_main (argc=1, argv=0x7ffdda92eed8, aAppData=0x7ffdda92dbc8, aFlags=<optimised out>) at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/toolkit/xre/nsAppRunner.cpp:4559
#11 0x0000561a1e05a3a4 in do_main (argc=1, argv=0x7ffdda92eed8, envp=<optimised out>, xreDirectory=0x7fb450174780) at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/browser/app/nsBrowserApp.cpp:220
#12 0x0000561a1e059a5a in main (argc=1, argv=0x7ffdda92eed8, envp=0x7ffdda92eee8) at /build/firefox-S3Cgxu/firefox-48.0~a2~hg20160509r318685/browser/app/nsBrowserApp.cpp:360
```
The output
> ABORT: JS_InitWithFailureDiagnostic: u_init() failed: file /firefox-48.0~a2~hg20160509r318685/xpcom/build/XPCOMInit.cpp, line 709")
is mentioned in bug 1262731, bug 1264836, bug 1272246, bug 1272209. (In case some might be dups.)
[Tracking Requested - why for this release]:
this crash is spiking up in early 48.0b1 data, where it's making up 11.75% of crashes currently, always on startup. it seems to happen on all windows versions but predominantly on windos xp (two thirds of the time)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Keywords: topcrash
I wonder if there's some Antivirus removing the ICU .dat file or something like that?
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #23)
> I wonder if there's some Antivirus removing the ICU .dat file or something
> like that?

I just tried removing that file on my Windows installation of Beta and I got exactly this crash signature (https://crash-stats.mozilla.com/report/index/e662c47f-52d4-4a38-96a1-4e28a2160610). So it's a plausible hypothesis.
I did some digging into ICU init failure modes. I built with UDATA_DEBUG enabled and got this output when icudt56l.dat is present and Firefox starts up normally.
Here's the UDATA_DEBUG output when icudt56l.dat is missing.
So basically, ICU starts by looking for an individual file (cnvalias.icu). That fails, so it looks for the all-in-one file (icudt56l.dat), which fails, which causes u_init() to fail.

I traced through this with a debugger, and the point where the failure is detected is here:
https://dxr.mozilla.org/mozilla-central/rev/f8e3b81a79f45ef8647c98281a9a00d1ddb28b73/intl/icu/source/common/udata.cpp#764

The internal error code is U_FILE_ACCESS_ERROR, which has the value 4.

I could augment my previous diagnostic patch to print the ICU error number. If all the crashes with the extra diagnostic show an error code of 4, then we'll know it's because the icudt56l.dat file is missing, or in some other way broken. (It's possible there are other code paths under u_init() that can cause an U_FILE_ACCESS_ERROR failure.)
sfink: please review the code.

ted: do you think this is a helpful change? It changes the abort message on ICU
init failure to something like this:

> ABORT: JS_InitWithFailureDiagnostic: u_init() failed with U_FILE_ACCESS_ERROR
Attachment #8761914 - Flags: review?(sphink)
Attachment #8761914 - Flags: feedback?(ted)
Attachment #8761914 - Flags: feedback?(ted) → feedback+
I think it can't hurt, certainly. I'm pretty sure we'll find that all of these crashes are just "we couldn't load the ICU data file".
Comment on attachment 8761914 [details] [diff] [review]
Add more detail to JS_InitWithFailureDiagnostic()'s failure message when ICU fails

Review of attachment 8761914 [details] [diff] [review]:
-----------------------------------------------------------------

Pedantically, this is incorrect on OOM. In fact, an OOM would make it appear to succeed. The "proper" way to do this would be to use an infallible allocator.

But this seems good enough for its intended purpose. (If jorendorff were still doing his oom static analysis thing, it probably would flag this, but I think he abandoned it in favor of the Result<> stuff.)
Attachment #8761914 - Flags: review?(sphink) → review+
> Pedantically, this is incorrect on OOM. In fact, an OOM would make it appear
> to succeed. The "proper" way to do this would be to use an infallible
> allocator.

Good point. 
How would I do infallible allocation in SM?

Alternatively, I could separate success/failure indication from the error message, e.g. by returning |bool| and having an outparam for the message.

I'll fix it one way or the other before landing.
This is an updated version that handles OOM properly. But I now think this
isn't the best approach, because:

(a) it gives us a little more information about the ICU failure, but we could
do better;

(b) returning an error code instead of a string would probably be better.

In terms of (a), we could instead do a more specific check, e.g. check if the
icu .dat file is present, and if it is, how big it is, and include that info in
the abort message. That would give a clearer diagnosis.
Attachment #8761914 - Attachment is obsolete: true
> In terms of (a), we could instead do a more specific check, e.g. check if the
> icu .dat file is present, and if it is, how big it is, and include that info
> in
> the abort message. That would give a clearer diagnosis.

I'm trying to work out how to do this. Some pieces:

- The filename can be obtained via ICU_DATA_FILE -- we'll need to pass it in, probably by specifying a -D flag in xpcom/build/moz.build for XPCOMInit.cpp.

- The file path... it's probably the distdir? I'm not sure how to get that from C++ code.

- Use PR_Access and/or PR_GetFileInfo to get info about the file.
Crash Signature: [@ mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2] → [@ mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2] [@ Abort | JS_InitWithFailureDiagnostic: u_init() failed | mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2]
(In reply to Nicholas Nethercote [:njn] (on vacation until July 11) from comment #33)
> - The filename can be obtained via ICU_DATA_FILE -- we'll need to pass it
> in, probably by specifying a -D flag in xpcom/build/moz.build for
> XPCOMInit.cpp.

This is correct. You should just be able to do `DEFINES['ICU_DATA_FILE'] = CONFIG['ICU_DATA_FILE']`.

> - The file path... it's probably the distdir? I'm not sure how to get that
> from C++ code.

It's NS_GRE_DIR:
https://dxr.mozilla.org/mozilla-central/rev/51377a64158941f89ed73f388ae437cfa494c030/xpcom/build/XPCOMInit.cpp#696
Duplicate of this bug: 1281751
Tracking as it is an important crash (startup)
tagging this so that it shows up as regression in firefox 48 (introduced by bug 926980, bug 1239083).
Attachment #8762243 - Attachment is obsolete: true
Attachment #8739252 - Flags: checkin+
I was going to ask ted to review this, but he's not accepting review requests.
glandium, maybe you can take a look?

I tested this on Windows, Mac and Linux. It'll give a clearer abort cause.

But I'm also wondering if we should just switch back to embedding the ICU data
into the binary. AIUI the motivation for switching to a separate file was
that on Mac we included the data twice due to universal binaries, is that
right?
Attachment #8771198 - Flags: review?(mh+mozilla)
That's correct. What's the platform breakdown on these crashes? If they're mostly Windows we could switch back on Windows-only. We're already building a linkable version of the data file (we link it in the JS shell), we'd just have to do some moz.build fiddling to do the right thing.
The other motivation for the change was to be able to commit the pre-built data file to the tree instead of having to run ICU's build to generate it, which is a pretty good thing, but my workaround to link the data into the JS shell is a simple yasm file that includes the binary data.
> That's correct. What's the platform breakdown on these crashes? If they're
> mostly Windows we could switch back on Windows-only

Among crashes in the past week it's 100% Windows. And it's the #10 top crasher on 48 Beta.
Comment on attachment 8771198 [details] [diff] [review]
Abort more informatively when the ICU data file is missing or unreadable

Review of attachment 8771198 [details] [diff] [review]:
-----------------------------------------------------------------

::: xpcom/build/XPCOMInit.cpp
@@ +708,5 @@
> +  // JS_InitWithFailureDiagnostic() will fail anyway. By checking here the
> +  // cause of the abort will be clearer.)
> +  nativeGREPath.Append(FILE_PATH_SEPARATOR);
> +  nativeGREPath.Append(NS_STRINGIFY(ICU_DATA_FILE));
> +  if (PR_Access(nativeGREPath.get(), PR_ACCESS_EXISTS) == PR_FAILURE) {

nsIFile has a `boolean exists()` method. (so, in C++, something like nsresult Exists(bool *)). There are also isFile and isReadable.

@@ +709,5 @@
> +  // cause of the abort will be clearer.)
> +  nativeGREPath.Append(FILE_PATH_SEPARATOR);
> +  nativeGREPath.Append(NS_STRINGIFY(ICU_DATA_FILE));
> +  if (PR_Access(nativeGREPath.get(), PR_ACCESS_EXISTS) == PR_FAILURE) {
> +    NS_RUNTIMEABORT("ICU data file is missing");

Is there a less abrupt way we could fail here? Like, can't we have the js engine be happy with not having the data, but have the intl api degraded? That sounds like a better outcome than crashing at startup with a message that users won't understand (if it's presented to them at all. It most certainly isn't for most people on Linux and Mac. I don't know if that opens a popup on Windows)

What happens if we still go through startup but don't call u_setDataDirectory?
Attachment #8771198 - Flags: review?(mh+mozilla)
> Is there a less abrupt way we could fail here? Like, can't we have the js
> engine be happy with not having the data, but have the intl api degraded?

I don't know. I'm just trying to get a clearer idea of what went wrong by making the crash reports more informative.

> What happens if we still go through startup but don't call
> u_setDataDirectory?

I tried commenting out the u_setDataDirectory() call. The call to u_init() fails, which meands JS_InitWithDiagnosticFailure() fails, which causes us to abort.
I switched to using nsIFile methods.
Attachment #8771198 - Attachment is obsolete: true
Attachment #8771253 - Attachment is obsolete: true
Nick what's the status? Do we want to land this patch for 48? Comment #41 says this is a top crash there.
Flags: needinfo?(n.nethercote)
Jeff can you comment on https://bugzilla.mozilla.org/show_bug.cgi?id=1262731#c42 please?
Flags: needinfo?(jwalden+bmo)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #46)
> Nick what's the status? Do we want to land this patch for 48? Comment #41
> says this is a top crash there.

The status is unclear. My patch would give slightly more clarity about what's going wrong -- i.e. is the file missing, as we suspect? -- but it won't get us any closer to a fix.

The best option, IMO, is to do what Ted suggested in comment 39 and comment 40: on Windows, link the data file back into the binary, which is how it used to be done. Ted would have to be the one to do that, AFAIK.

Another possibility is to try and continue with the Intl API degraded, as suggested in comment 42. But I don't know if that's feasible, and it would result in a small number of users having different behaviour, which doesn't sound good.
Flags: needinfo?(n.nethercote)
(In reply to Nicholas Nethercote [:njn] from comment #48)
> Another possibility is to try and continue with the Intl API degraded, as
> suggested in comment 42. But I don't know if that's feasible, and it would
> result in a small number of users having different behaviour, which doesn't
> sound good.

Doesn't sound good, but it sounds better than users not being able to start Firefox at all.
(In reply to Naveed Ihsanullah [:naveed] from comment #47)
> Jeff can you comment on
> https://bugzilla.mozilla.org/show_bug.cgi?id=1262731#c42 please?

I tried making ICU start-up deliberately fail and then used the Intl API. I got a JS exception instead of the normal behaviour. So, better than a crash, but has the potential to lead to lots of mysterious behaviour.

I'm looking at the "don't have a separate file on Windows" option. IIUC it's not that hard. I'm testing a patch locally right now.
Flags: needinfo?(jwalden+bmo)
I've tested this locally and it seems to work. I'll do a try run shortly.

glandium said on IRC "I'm tempted to say we should keep things similar on all
platforms, whether we do use a separate data file or not". But then we'd go
back to the data duplication on Mac... I can live with Windows being different.
Attachment #8772215 - Flags: review?(ted)
The previous version of this patch broke Windows packaging on try. Hopefully
this will work better; I just started another try run.
Attachment #8772228 - Flags: review?(ted)
Attachment #8772215 - Attachment is obsolete: true
Attachment #8772215 - Flags: review?(ted)
> The previous version of this patch broke Windows packaging on try. Hopefully
> this will work better; I just started another try run.

Looking better:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3fb8d98fbbda (original, Windows is busted)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=235423a0c30e (with Windows package error fixed)
Comment on attachment 8772228 [details] [diff] [review]
Don't use a separate ICU data file on Windows

Review of attachment 8772228 [details] [diff] [review]:
-----------------------------------------------------------------

::: browser/installer/package-manifest.in
@@ +100,5 @@
>  @BINPATH@/api-ms-win-*.dll
>  @BINPATH@/ucrtbase.dll
>  #endif
>  #endif
> +#ifdef MOZ_ICU_DATA_ARCHIVE

This won't work because MOZ_ICU_DATA_ARCHIVE isn't in DEFINES, you'll have to set it here:
https://dxr.mozilla.org/mozilla-central/rev/4c05938a64a7fde3ac2d7f4493aee1c5f2ad8a0a/browser/installer/Makefile.in#121

::: build/autoconf/icu.m4
@@ +85,5 @@
>      dnl JS standalone so that embedders don't have to deal with it.
> +    dnl We also don't do it on Windows because sometimes the file goes
> +    dnl missing -- possibly due to overzealous antivirus software? --
> +    dnl which prevents the browser from starting up :(
> +    if test -z "$JS_STANDALONE" -a -z "$MOZ_SYSTEM_ICU" -a "$OS_TARGET" != WINNT; then

I bet we can move this all to moz.configure now! (You don't have to do that in this patch.)
In my local testing I couldn't see any effect from the additional
browser/installer/Makefile.in change, but I'll take your word that it's
necessary.
Attachment #8773100 - Flags: review?(ted)
Attachment #8772228 - Attachment is obsolete: true
Attachment #8772228 - Flags: review?(ted)
Attachment #8773100 - Flags: review?(ted) → review+
Keywords: leave-open
Attachment #8773100 - Flags: checkin+
Comment on attachment 8773100 [details] [diff] [review]
Don't use a separate ICU data file on Windows

Approval Request Comment

[Feature/regressing bug #]: bug 1239083.

[User impact if declined]: Firefox won't start. A topcrash on Beta.

[Describe test coverage new/current, TreeHerder]: just landed on inbound. All the existing Intl tests should cover this.

[Risks and why]: low? It's a build system change. Any problems are likely to be obvious, e.g. Firefox won't start.

[String/UUID change made/needed]: none.
Attachment #8773100 - Flags: approval-mozilla-beta?
Attachment #8773100 - Flags: approval-mozilla-aurora?
Comment on attachment 8773100 [details] [diff] [review]
Don't use a separate ICU data file on Windows

Review of attachment 8773100 [details] [diff] [review]:
-----------------------------------------------------------------

This patch fixes a topcrash. Let's take it in 48 beta 10 and aurora.
Attachment #8773100 - Flags: approval-mozilla-beta?
Attachment #8773100 - Flags: approval-mozilla-beta+
Attachment #8773100 - Flags: approval-mozilla-aurora?
Attachment #8773100 - Flags: approval-mozilla-aurora+
not on central right now but will uplift as soon as this hits central
Flags: needinfo?(cbook)
https://hg.mozilla.org/mozilla-central/rev/59e9f5b2c246
Status: REOPENED → RESOLVED
Closed: 4 years ago3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Summary: crash in mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2, because JS_Init() fails → startup crash in mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2, because JS_Init() fails
Whiteboard: [startupcrash]
Blocks: 1288651
Depends on: 1289002
I can verify that this has been fixed.

- Nightly: The last Nightly build on which this crash occurred was 20160721030216, which matches the m-c push date.

- Aurora: The last Aurora build on which this crash occurred was 20160721004019, which also matches the m-c push date.

- Beta: In the last week there are only 3 of these crashes in 48.0b10. I'm not sure what caused those (possibly a different kind of ICU data corruption to what the landed patch addresses) but in comparison 48.0b9 had 17,380 crashes.
Status: RESOLVED → VERIFIED
Ted, what do you think about comment #64? Thanks
Flags: needinfo?(ted)
Flags: needinfo?(n.nethercote)
Looking at the comments, many users say that Firefox started crashing after a system restore, after forcibly turning off the computer, after a power outage, after updating to Windows 10.

Also, interestingly, looks like 100% of crashes with this signature occur when e10s is disabled (~87% of crashes occur when e10s is disabled overall in Beta): https://raw.githubusercontent.com/mozilla/stab-crashes/1936b4980cd619b45d0ca82f782bcc1b2542b3bb/plots/beta/Abort%20%7C%20JS_InitWithFailureDiagnostic%3A%20u_init()%20failed%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20NS_InitXPCOM2.png
Do we know if they crash only once or always?
(In reply to Sylvestre Ledru [:sylvestre] from comment #67)
> Do we know if they crash only once or always?

I see a lot of crashes with the same install time, so I guess they're crashing multiple times.

I can reproduce the crash by installing 48.0b9 and removing the ICU data file before starting Firefox. I always crash (I'm not able to launch Firefox).

I'll test 48.0b10 next.
(In reply to Marco Castelluccio [:marco] from comment #68)
> (In reply to Sylvestre Ledru [:sylvestre] from comment #67)
> > Do we know if they crash only once or always?
> 
> I see a lot of crashes with the same install time, so I guess they're
> crashing multiple times.
> 
> I can reproduce the crash by installing 48.0b9 and removing the ICU data
> file before starting Firefox. I always crash (I'm not able to launch
> Firefox).
> 
> I'll test 48.0b10 next.

Well, obviously I can't reproduce with 48.0b10, since the ICU data file is no more (so I
can't simply remove it to trigger the crash).
I don't know offhand, but that's plausible. The comments about system restore etc support my theory from comment 23. This sounds similar to the situation that caused us to rename omni.jar -> omni.ja.

I think the e10s stat is not interesting since this is a startup crash, so we probably don't have the actual e10s enabled/disabled data yet.
Flags: needinfo?(ted)
I agree that e10s is irrelevant here. I don't have anything else to add, sorry.
Flags: needinfo?(n.nethercote)
OK, so, let's try to summarize that:
* we think it might be caused by anti virus removing the dat file
* we had a startup crash spike since July 22
* maybe this spike can be explained by an (incorrect) update of an antivirus removing this file
* Nicholas fixed in beta 10
* we don't have any signature of this crash in beta 10 or the two RC

While we can't really explain the spike, I guess we can say that it is not blocking the 48 release.

Nicholas, do you agree with all that? Thanks
Flags: needinfo?(n.nethercote)
(In reply to Sylvestre Ledru [:sylvestre] from comment #72)
> OK, so, let's try to summarize that:
> * we think it might be caused by anti virus removing the dat file
> * we had a startup crash spike since July 22
> * maybe this spike can be explained by an (incorrect) update of an antivirus
> removing this file
> * Nicholas fixed in beta 10
> * we don't have any signature of this crash in beta 10 or the two RC
> 
> While we can't really explain the spike, I guess we can say that it is not
> blocking the 48 release.

We know it's something going wrong with the loading of the data file -- it must be missing, corrupted, or something else like that. Missing due to AV is our best guess.

> Nicholas, do you agree with all that? Thanks

Yes.
Flags: needinfo?(n.nethercote)
Crash volume for signature 'Abort | JS_InitWithFailureDiagnostic: u_init() failed | mozalloc_abort | NS_DebugBreak | NS_InitXPCOM2':
 - nightly(version 50):457 crashes from 2016-06-06.
 - aurora (version 49):1403 crashes from 2016-06-07.
 - beta   (version 48):59015 crashes from 2016-06-06.
 - release(version 47):0 crashes from 2016-05-31.
 - esr    (version 45):0 crashes from 2016-04-07.

Crash volume on the last weeks:
            W. N-1  W. N-2  W. N-3  W. N-4  W. N-5  W. N-6  W. N-7
 - nightly      99       3       1       4      23       0       0
 - aurora      515      47      64      25      26       0       0
 - beta       7819    1896    1521    1755    1037       0       0
 - release       0       0       0       0       0       0       0
 - esr           0       0       0       0       0       0       0

Affected platform: Windows
Andrei, I was wondering if you could help identifying which anti virus is causing that mess?
This is impacting 48 beta 9.

I would like to have a better understanding of the situation before we release 48
Flags: needinfo?(andrei.vaida)
(In reply to Sylvestre Ledru [:sylvestre] from comment #75)
> I would like to have a better understanding of the situation before we
> release 48

There should be nothing to worry about for the release channel: 47 didn't have the file that was removed by $whatever (AV or otherwise) and caused crashes in 48 betas. And 48 release won't have it either. So $whatever caused the problem by removing the file in the first place is not going to cause problems by removing the file because the file is not there to be removed anymore.

The remaining problem, however, is people who were on the beta channel, had the file removed, and won't get autoupdated because they can't even start Firefox.
(In reply to Mike Hommey [:glandium] from comment #76)
> The remaining problem, however, is people who were on the beta channel, had
> the file removed, and won't get autoupdated because they can't even start
> Firefox.
Is there anything we can do here? I guess they will have to reinstall...
I've found a post on the Google Chrome help forum with a similar problem: https://productforums.google.com/forum/#!topic/chrome/oWe87QVrmZA.
(In reply to Sylvestre Ledru [:sylvestre] from comment #77)
> (In reply to Mike Hommey [:glandium] from comment #76)
> > The remaining problem, however, is people who were on the beta channel, had
> > the file removed, and won't get autoupdated because they can't even start
> > Firefox.
> Is there anything we can do here? I guess they will have to reinstall...

Short of actively developing something that can autoupdate a firefox that crashes on startup, there is nothing that can be done. And even then, that would only save future problems like this one, not this one.
(In reply to Sylvestre Ledru [:sylvestre] from comment #75)
> Andrei, I was wondering if you could help identifying which anti virus is
> causing that mess?
> This is impacting 48 beta 9.
> 
> I would like to have a better understanding of the situation before we
> release 48

I've reproduced the crash on 48.0b9-build1 by manually removing the icudt56l.dat file.

Unfortunately, I was unable to find a scenario in which this happens as a result of an antivirus software's interference.

I've tested on Windows XP Professional x64 Edition (SP2, build 3790.srv03_sp2_rtm.070216-1710), using the following antivirus solutions:
- AVG Antivirus Business Edition
- AVG Internet Security 2016
- Avast Free AntiVirus 2016 (which also installs the Avast Online Security 12.0.88 and Avast SafePrice 10.3.5.39 add-ons)
- Comodo Internet Security 2016 (which also installs GeekBuddy)
- ESET NOD32 Antivirus (Trial, 9.0.386.0)

I've also tried looking for other clues, to see if this file was ever suspected of being malicious, but Virus Total's analysis shows that this file is trusted [1] on all its 57 sources. 

According to Comment 66, this looks like a bug that requires very specific conditions to manifest itself, but at this point it's like a needle in the haystack.


[1] https://www.virustotal.com/en/file/b5df330ddb768687717b552be3699e5c442e9c69085dd8bc3da1abedb8375df4/analysis/
Flags: needinfo?(andrei.vaida)
The System Restore theory from comment 70 seems pretty plausible too. I wonder if icudt56l.dat would be present if you deleted the intact Firefox install dir and then ran a System Restore.
Windows 7 represents ~88% of the crashes, so I got the impacted Windows 7 version:
6.1.7601 Service Pack 1 	32394 	60.98 %
6.1.7600                        20688 	38.95 %
6.1.7600 Service Pack 1 	14 	0.03 %
6.1.7601 Service Pack 2 	10 	0.02 %
6.1.7600 Service Pack 3 	8 	0.02 %
6.1.7601                        4       0.01%

and I found this:
https://support.microsoft.com/en-us/kb/3161647

an update has been made the 2016-06-21 (crashes appeared the 22nd of June)

So maybe we have a correlation here.
I just confirmed that comment 70/81 are in fact a way of getting into this state. In a Windows 7 VM:
* Install Fx48b9, confirm that icudt56l.dat is present.
* Create a new System Restore point.
* Delete the Firefox install directory.
* Restore your earlier Restore Point.
* icudt56l.dat is no longer present in the install dir and "Could load XPCOM" errors trying to launch Firefox.
(In reply to Ryan VanderMeulen [:RyanVM] from comment #83)
> * icudt56l.dat is no longer present in the install dir and "Could load
> XPCOM" errors trying to launch Firefox.

*Couldn't* of course :)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #84)
> (In reply to Ryan VanderMeulen [:RyanVM] from comment #83)
> > * icudt56l.dat is no longer present in the install dir and "Could load
> > XPCOM" errors trying to launch Firefox.
> 
> *Couldn't* of course :)

We have a support article on this that suggests a reinstall. Could probably use some cleanup work though.

https://support.mozilla.org/en-US/questions/943596
You need to log in before you can comment on or make changes to this bug.