Closed Bug 633869 Opened 9 years ago Closed 9 years ago

Investigate 3.6.14 startup crashes in [@ js_Enumerate], possible regression

Categories

(Toolkit :: Application Update, defect, critical)

1.9.2 Branch
x86
All
defect
Not set
critical

RESOLVED FIXED

People

(Reporter: christian, Unassigned)

References

Details

(Keywords: regression)

js_Enumerate is an existing signature that seems to have spiked in 3.6.14. It looks like there may be multiple crashes with the same signature: 1) an existing crash that happens after some uptime and 2) a crash that happens on startup after an update to 3.6.14. This bug tracks investigating #2.

The crash is currently #8 on the 3.6.14 crash list (in beta). You can see crashes here:

https://crash-stats.mozilla.com/report/list?range_value=2&range_unit=weeks%2014%3A00%3A00&signature=js_DestroyScriptsToGC&version=Firefox%3A3.6.14

We believe we introduced an extension incompatibility based on the crash signature. Unfortunately, the crash is too early for the crash reports to have the list of extensions.
CCing some JS folks who fixed bugs in 3.6.14. Here is the list of fixed bugs in 3.6.14 to refresh people's memories:

https://bugzilla.mozilla.org/buglist.cgi?quicksearch=ALL%20status1.9.2:.14-fixed&order=map_products.name%2Cmap_components.name%2Cbugs.bug_id
Nothing on the list sticks out as an obvious cause. This will need a bit of investigation.
Better crash-stats link: https://crash-stats.mozilla.com/report/list?signature=js_Enumerate

It seems this was also in Firefox 3.6.8?
Like I said, it looks like there are a couple crashes with the same signature. The ones to look at have < 10s uptime and have < 3 mins since last installed time.
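A rough sketch of the filter described above, for anyone triaging the report list by hand. The record fields `uptime_s` and `install_age_s` are hypothetical names chosen for illustration and are not the actual crash-stats schema; the sample reports are made up.

```python
def is_startup_after_update(report, max_uptime_s=10, max_install_age_s=180):
    """Return True for reports that match the startup-after-update profile.

    Thresholds follow the comment above: under 10 s of uptime and under
    3 minutes since the last install time. Field names are hypothetical.
    """
    return (report["uptime_s"] < max_uptime_s
            and report["install_age_s"] < max_install_age_s)


# Made-up sample data, only to show the filter in use.
reports = [
    {"id": "a", "uptime_s": 3, "install_age_s": 45},
    {"id": "b", "uptime_s": 5400, "install_age_s": 86400},
    {"id": "c", "uptime_s": 8, "install_age_s": 900},
]

startup_ids = [r["id"] for r in reports if is_startup_after_update(r)]
print(startup_ids)
```

Only report "a" passes both thresholds; "b" is a long-uptime crash and "c" crashed quickly but well after the install.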
The new crash
 1) has only been seen on windows (clue? or just small beta audience?)
 2) always accessing address 0x113a8
    (in Thunderbird 3.1.8 it's always 0x9848)
 3) has the following stack:

   js_Enumerate    js/src/jsobj.cpp:4893
   JS_Enumerate    js/src/jsapi.cpp:3964
   JS_SealObject   js/src/jsapi.cpp:2926
   XPC_SJOW_AttachNewConstructorObject
   nsXPConnect::InitClassesWithNewWrappedGlobal
   mozJSComponentLoader::GlobalForLocation
   mozJSComponentLoader::LoadModule
   nsComponentManagerImpl::AutoRegisterComponent
   nsLocalFile::IsDirectory

(slightly different in Thunderbird, but it's during component registration and ends the same from JS_SealObject up).

As far as I can tell this stack doesn't show up in the low-level js_Enumerate crashes we see in older Firefox versions. Two or three pre-existing js_Enumerate crashes show up in 3.6.14 betas, but it's overwhelmingly the above.
Christian, some of the reports have email addresses we could use to ask for a list of installed add-ons.
I'll work with support to get some lists.
> The new crash
>  1) has only been seen on windows (clue? or just small beta audience?)

In fact, it has only been seen on Windows XP (SP2 and SP3). True on Thunderbird as well. Is it a Firefox difference, or a difference in some 3rd party component that's only installed on WinXP?

   4) it has only been seen in build2, after the back-out of bug 599610
      (due to the new crash bug 631105).

Since our fixes are back-ported from the trunk, is it possible some other JS fix we landed depended on the changes in bug 599610 without knowing it did?
Depends on: 634036
The dll correlation reports are up, no smoking guns (go to link in comment 4, click correlations tab)
Assignee: general → cdleary
Chris and I looked at this today. We came up with two theories:

1. The crash is happening trying to use a native enumerator hash. The hash has the form of a bunch of words that are either ((shape) | 1) OR (nativeEnumeratorPointer | 0). Given that we're consistently trying to access a relatively low address, it's possible that somehow we're losing the tag bit telling us that we're looking at a shape, causing us to treat a shape (a number) as a JSNativeEnumerator *. This is somewhat unlikely, especially because it appears that the cache slot we're looking at is essentially random (each SJOW prototype has its own shape).

2. The patch for bug 600853 changed the layout of JSThread and since JSThreadData is embedded in JSThread, it's possible that an extension that's compiled against 1.9.2.13 that was previously mutating gcMallocBytes is now mutating the hash instead. It would have to be using something other than the public JS API, since we don't expose anything that gives anyone access to the JSThread (or JSThreadData for that matter).

It looks like option 2 is more likely, so having a list of extensions would be really useful here.
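To make theory 2 concrete, here is a small Python `ctypes` sketch of how code built against an old struct layout can end up writing into a different field after the layout changes. Both layouts below are invented for illustration: only the name gcMallocBytes comes from the comment above, while `enumeratorCache`, the field order, and the field sizes are assumptions and do not reflect the real JSThreadData.

```python
import ctypes


class OldThreadData(ctypes.Structure):
    # Hypothetical "1.9.2.13-style" layout (assumed, not the real struct).
    _fields_ = [
        ("gcMallocBytes", ctypes.c_uint32),
        ("enumeratorCache", ctypes.c_uint32 * 4),
    ]


class NewThreadData(ctypes.Structure):
    # Hypothetical "1.9.2.14-style" layout with the fields reordered.
    _fields_ = [
        ("enumeratorCache", ctypes.c_uint32 * 4),
        ("gcMallocBytes", ctypes.c_uint32),
    ]


def stale_write(new_obj, value):
    """Write `value` at the offset gcMallocBytes had in the OLD layout,
    the way a binary compiled against the old headers would."""
    offset = OldThreadData.gcMallocBytes.offset
    address = ctypes.addressof(new_obj) + offset
    ctypes.c_uint32.from_address(address).value = value


data = NewThreadData()
stale_write(data, 4096)

# The write intended for gcMallocBytes landed in the first cache slot.
print(data.gcMallocBytes, data.enumeratorCache[0])
```

In this sketch the counter stays at zero and the cache slot holds the stray value, which is the kind of corruption that could later be dereferenced as a pointer.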
Unfortunately there is only one valid email people have given to us in crash reports. We haven't heard back from the user yet.
> so having a list of extensions would be really useful here.

addon correlations are at

http://people.mozilla.org/crash_analysis/20110214/20110214_Firefox_3.6.14-interesting-addons.txt

search down to js_Enumerate
(In reply to comment #13)

Is it possible to correlate the user IDs that submitted these (startup) crash reports at address 0x113a8 against the binary extensions in their other crash reports?
The most viable option may be to back out bug 600853 (in order to switch the JSThreadData layout back) and see if that avoids the startup crashes. I could make the backout patch now, or we can wait for Igor's daytime in order to get a second opinion.
Let's wait for Igor to weigh in as well. If need be we can backout bug 600853 tomorrow morning PST and kick off a build at that time.
(In reply to comment #14)
> (In reply to comment #13)
> 
> Is it possible to correlate the user IDs that submitted these (startup) crash
> reports at address 0x113a8 against the binary extensions in their other crash
> reports?


We don't have any user_id info to make these correlations due to concerns about privacy. It's possible that we could attempt to do some fingerprinting to try and reverse engineer a user id, but we don't have anything like that set up right now.
Have we stopped pointing 3.6.14pre updates at that build? We don't want to lose
the users having the issue because of inability to update after a startup
crash. (Excuse my ignorance, not totally privy to the release/update process.
:-)
We have not, though I am planning to do so tomorrow. I was hoping to get additional data to help us diagnose (or get someone complaining via the support forums or a bug).
Try the back-out soon to get data, then. Unless you're going to get data on addons with binary components, staying the course won't reveal much more, while backing out the patch for bug 600853 could win.

/be
We're not seeing the crash on 3.6.15pre (likely due to low ADUs), so not sure how much data backing out will give us without releasing all the way to beta. In any case, it's probably best to land the backout on both the relbranch and the default branch to be prepared for a build tomorrow.

Chris, do you want to take care of that or do you want me to? The relbranch is GECKO19214_2011012112_RELBRANCH.
Are we willing to have another weeklong beta to verify that the crashes are gone before we ship this? I know that we have other pressures on us to ship very soon.
(In reply to comment #13)
> > so having a list of extensions would be really useful here.
> 
> addon correlations are at [...]
> search down to js_Enumerate

None of the reports with the stack this bug is tracking have submitted add-on data. The add-on correlations therefore must be for the pre-existing js_Enumerate crash(es) with different stacks and can only be misleading in this case.

(In reply to comment #11)
> 2. The patch for bug 600853 changed the layout of JSThread and since
> JSThreadData is embedded in JSThread, it's possible that an extension that's
> compiled against 1.9.2.13 that was previously mutating gcMallocBytes is now
> mutating the hash instead. It would have to be using something other than the
> public JS API, since we don't expose anything that gives anyone access to the
> JSThread (or JSThreadData for that matter).
> 
> It looks like option 2 is more likely, so having a list of extensions would be
> really useful here.

That would mean a binary component, right? We ought to be able to see that in the module correlation list, but nothing really pops (see comment 10).

Unless maybe our own files aren't getting upgraded properly? Note that firefox.exe and xpcom.dll appear to be a different build than brwsrcmp.dll and browserdirprovider.dll:

    100% (14/14) vs.  75% (1718/2298) brwsrcmp.dll
          0% (0/14) vs.  33% (767/2298) 1.9.2.4038
        100% (14/14) vs.  41% (951/2298) 1.9.2.4055
    100% (14/14) vs.  76% (1754/2298) browserdirprovider.dll
          0% (0/14) vs.  34% (782/2298) 1.9.2.4038
        100% (14/14) vs.  42% (972/2298) 1.9.2.4055
    100% (14/14) vs.  76% (1757/2298) firefox.exe
          0% (0/14) vs.   0% (2/2298) 1.9.2.3615
          0% (0/14) vs.   0% (5/2298) 1.9.2.3989
        100% (14/14) vs.  35% (793/2298) 1.9.2.4038
          0% (0/14) vs.  42% (957/2298) 1.9.2.4055
    100% (14/14) vs.  77% (1760/2298) xpcom.dll
          0% (0/14) vs.   0% (2/2298) 1.9.2.3615
          0% (0/14) vs.   0% (2/2298) 1.9.2.3989
        100% (14/14) vs.  35% (799/2298) 1.9.2.4038
          0% (0/14) vs.  42% (957/2298) 1.9.2.4055

Did we package a mixture of .4038 and .4055 files? Or is that upgrade failure?
I think backing out the bug 600853 would be the best course of action at this point.
(In reply to comment #24)
> I think backing out the bug 600853 would be the best course of action at this
> point.

An alternative would be to change the patch for bug 600853 so it would not change the layout. But that needs some time, so let's do that for the next release.
Thunderbird crashes with this stack show the same mixed-build pattern:

thunderbird.exe   1.9.2.4040   (build1)
xpcom_core.dll    1.9.2.4055
jsd3250.dll       1.9.2.4055
js3250.dll         <blank>

Firefox module lists consistently include

firefox.exe 	1.9.2.4038     (build1)
xpcom.dll 	1.9.2.4038     (build1)
xul.dll 	1.9.2.4038     (build1)
browserdirprovider.dll 	1.9.2.4055
brwsrcmp.dll 	1.9.2.4055
js3250.dll       <blank>


js3250.dll  has a blank version so I can't tell. It does have the same "debug identifier" as build1 ("D1F0AAA4196049DD94E4E3DBA15E34DD2") and different than 3.6.13 crashes so I think that means it's also build1.
bhearsum confirmed that firefox.exe, xul.dll, and xpcom.dll in the installer package are 1.9.2.4055
 --> these victims have frankenbuilds.
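The check done by eye on these module lists could be sketched as below. It assumes the caller passes in only the application's own modules (third-party DLLs legitimately carry other versions) and treats a blank version as unknown rather than as a mismatch. The function names are made up for this example; the sample data is the Firefox module list quoted above.

```python
from collections import defaultdict


def group_by_version(modules):
    """Group module names by version string, skipping blank versions."""
    groups = defaultdict(list)
    for name, version in modules.items():
        if version:  # a blank version tells us nothing either way
            groups[version].append(name)
    return dict(groups)


def is_frankenbuild(modules):
    """True when the application's own modules report more than one version."""
    return len(group_by_version(modules)) > 1


firefox_modules = {
    "firefox.exe": "1.9.2.4038",
    "xpcom.dll": "1.9.2.4038",
    "xul.dll": "1.9.2.4038",
    "browserdirprovider.dll": "1.9.2.4055",
    "brwsrcmp.dll": "1.9.2.4055",
    "js3250.dll": "",
}

for version, names in sorted(group_by_version(firefox_modules).items()):
    print(version, sorted(names))
print(is_frankenbuild(firefox_modules))
```

For the list above this reports two version groups (.4038 and .4055), so the install is flagged as mixed.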
The good news for support is that downloading a fresh install will probably fix these folks. Backing out bug 600853 isn't likely to help unless we never plan to fix that bug -- eventually we'd check it in and whatever update bug causes frankenbuilds will land us right back here.
What happens if we only push a full update for this release?  Will that help pave over some of these cases, and prevent them for the wider audience?
Do we need a bug for the installer to use transactions (expand, then delete old then rename) to avoid this in the future?
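For reference, one way a directory-level "transaction" could look is sketched below. This is not a description of how the Mozilla updater works. It is a variation on the order proposed above: the old directory is moved aside instead of being deleted first, so a failed swap can be rolled back. The function name and the `.old` suffix are assumptions for the example.

```python
import os
import shutil


def swap_in_new_install(live_dir, staged_dir):
    """Replace live_dir with a fully expanded staged_dir, all or nothing.

    Assumes the new build was already expanded into staged_dir on the same
    volume, and that live_dir + ".old" does not exist yet.
    """
    backup_dir = live_dir + ".old"
    # If any file in live_dir is locked, this rename is expected to fail
    # before anything has been mixed.
    os.rename(live_dir, backup_dir)
    try:
        os.rename(staged_dir, live_dir)
    except OSError:
        # Roll back so the user keeps a consistent old install.
        os.rename(backup_dir, live_dir)
        raise
    # Only remove the old files once the new install is in place.
    shutil.rmtree(backup_dir)
```

The point of swapping whole directories is that a failure leaves either the complete old install or the complete new one, never a file-by-file mixture.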
We do the rename thing currently.
Wow, so how did this happen? We clearly either don't delete the old data first or don't check the error code for that delete; otherwise we would have partial but not mixed installs.
(In reply to comment #29)
> What happens if we only push a full update for this release?  Will that help
> pave over some of these cases, and prevent them for the wider audience?

The build1 -> build2 updates were all complete updates. We don't usually do partials on the beta channel (the partial .MARs generated in the build process are from the previous release so that they're ready to ship) and I confirmed with bhearsum that was the case for this set of updates.
We could have introduced a bug in the installer in 3.6.13. Curious why we can't reproduce internally and all QA's BFTs and update tests passed.
Maybe it's showing up in other crashes, but none of the 3.6.14 build1 js_Enumerate crashes show a frankenbuild.
(In reply to comment #34)
> We could have introduced a bug in the installer in 3.6.13. Curious why we can't
> reproduce internally and all QA's BFTs and update tests passed.

Users could still have other Firefox processes open (e.g. under another Windows user) and we fail here? I'm not familiar with all the details, so CC'ing Robert.
Just in case anyone is wondering, the fourth version component in the version numbers is the number of days since January 1, 2000 to the date the binary was built.

i.e., in Python:
>>> import datetime
>>> datetime.datetime(2000, 1, 1, 0, 0) + datetime.timedelta(days=4038)
datetime.datetime(2011, 1, 21, 0, 0)
(In reply to comment #34)
> We could have introduced a bug in the installer in 3.6.13. Curious why we can't
> reproduce internally and all QA's BFTs and update tests passed.

1) marking as blocking bug#624496.

2) (Obvious question, but I don't see it in the comments so far, so asking): Once we know the root cause of this, if it is confirmed to be already released in FF3.6.13, would we consider shipping FF3.6.14 with the same error as already in FF3.6.13, and then do a future FF3.6.x release with this fix?
Blocks: 624496
I'm not sure what we're doing here. Are we doing a build backing out bug 600853, as comment 24 suggested, in the near future?

I'm worried that we don't know how people got into a frankenbuild state. It feels like we're still guessing a lot and don't know that backing out bug 600853 is really going to fix this since we don't know the cause.
We're not backing it out. We're looking in crash stats to see how pervasive this is and if it has happened before or if it is new in 3.6.13 or 3.6.14 build #1. That  would point to where an updater regression might have been introduced (bug 634343)
Windows-specific App Updater bug 466778 landed in 1.9.2.14 dealing with in-use windows executables.
(In reply to comment #41)
> Windows-specific App Updater bug 466778 landed in 1.9.2.14 dealing with in-use
> windows executables.

Sure, but these users are using the 1.9.2.13 updater to get this broken build, no?
(In reply to comment #42)
> (In reply to comment #41)
> > Windows-specific App Updater bug 466778 landed in 1.9.2.14 dealing with in-use
> > windows executables.
> 
> Sure, but these users are using the 1.9.2.13 updater to get this broken build,
> no?

No, not necessarily. They could be going from 1.9.2.14 build 1 to build 2.
And, indeed, I believe we're seeing a mixed build1/build2 franken right?
Correct. That would explain why we didn't see it in build 1 and are now seeing it in build 2 (the 3.6.14 updater is active with the fix from bug 466778). If it is a problem with bug 466778 we really dodged a bullet for 3.6.15.
so do we back out 466778 and ship only full updates onto beta channel for build3? that seems like it would let us ship.
Assignee: cdleary → nobody
Component: JavaScript Engine → Application Update
Product: Core → Toolkit
QA Contact: general → application.update
I have jimm and rs looking into 466778. Right now we don't really have data pointing to it being the culprit, though all signs point in that direction.

The updates on beta are full regardless, as the patch is to 3.6.13. Also, we won't *really* know this is fixed until we do an update on top of builds with the fix/backout as well.
Bug 466778 is a "very nice to have" but it has been around forever so I do think it would be safest to back that out even if it turns out not to be the cause. I'm in the countryside today and would appreciate it if someone else could back that out today.

I'll check tomorrow to see if this could affect trunk in any way since Bug 466778 has been on trunk for quite some time now.
Actually, it is probably best to back the last four changesets out after Josh's Mac OS X patch just to be safe.
along with the removal of tests... here are the changesets
http://hg.mozilla.org/releases/mozilla-1.9.2/pushloghtml?changeset=0b38c0bb138d

The more I think about it the more I think that one of these changes is the cause. I've got a rather slow connection so this is it for me until tomorrow.
FWIW, we turned updates off but they should still be on the "betatest" channel so we can test/repro.
(In reply to comment #48)
> I'll check tomorrow to see if this could affect trunk in anyway since Bug
> 466778 has been on trunk for quite some time now.

We're checking on that next in bug 634343.

(In reply to comment #50)
> The more I think about it the more I think that one of these changes is the
> cause. I've got a rather slow connection so this is it for me until tomorrow.

When people update does the new version's updater do the heavy lifting? We were already seeing problems in the 3.6.13 to .14-build1 update: See bug 634343 comment 12. If that's caused by the 3.6.13 update code then we have to look at a different set of fixes entirely.
(In reply to comment #52)
> When people update does the new version's updater do the heavy lifting? 

No, we use the existing updater. bug 366846 tracks changing that.
In that case, if we're seeing mixed 3.6.13/3.6.14 builds that would imply the bug is in the 3.6.13 updater (or likely before) and won't be helped by the proposed backout in comment 50.

I don't see how a mixed-version build passes the updater's validity checks. Shouldn't it notice the problem and roll-back the install? Do we need to add a start-up sanity check that compares file hashes against an installed list? On failure what do we do, tell users to download a fresh copy?
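A start-up sanity check of the kind asked about here might look roughly like the sketch below. The manifest format (a mapping from relative file name to expected SHA-256 hex digest) and the function names are assumptions made for illustration; nothing in this bug says the updater ships such a list.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(install_dir, manifest):
    """Return the sorted names of files that are missing or whose hash
    differs from the (hypothetical) installed-files manifest."""
    bad = []
    for rel_name, expected in manifest.items():
        path = Path(install_dir) / rel_name
        if not path.is_file() or sha256_of(path) != expected:
            bad.append(rel_name)
    return sorted(bad)
```

A non-empty result would mean the install folder does not match what the last update claimed to write. What to do next (roll back, re-download, or prompt the user) is the open question raised above.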
What's the status of this? When is 3.6.14 going to be released? Next week?
We're assessing the extent of the issue by pulling data over in bug 634343. Day-for-day slip.
So it could be out any day, and it's just a matter of when the issue is resolved?
Noticed there is a build #3 now.
Has this bug been fixed in 3.6.14 build 3?
Yes, we determined it was an existing issue (mismatched firefox.exes and dlls) we happened to have stumbled over. The investigation is done.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Good to hear. So it's probably safe to upgrade from 3.6.13 to 3.6.14 now? And build 3 will be the final version?
So what happens to bug 570058, bug 466778, bug 601518, bug 316890, bug 535942 ? They are marked fixed but the fixes were backed out to investigate this bug.
We use the status field to track the bug's status on the trunk and the statusN.N.N field (e.g. status1.9.2 for the 1.9.2 branch) to track the status of a bug on a particular branch.

Though it would be nice to have those bugs fixed on the 1.9.2 branch for updates to Firefox 4, the reward isn't worth the risk with Firefox 4 coming out in the near future.
More users than normal on SUMO have been reporting installation errors, such as "the partial update could not be applied", apparently caused by firefox.exe not exiting completely.  The affected users are able to get back into 3.6.13 to ask for help, but the update fails every time.

If this behavior is also caused by malformed install folders, it could explain the reduced effect on Firefox vs Thunderbird.
It is possible that something may have changed that is causing Firefox not to exit in a reasonable amount of time. If this is the case the change would have happened during the 3.6.13 cycle.

It might be interesting to check for pre 3.6.13 users reporting problems updating since that would further indicate that something changed during the 3.6.13 cycle.

note: in the case where the update fails the user doesn't have to do anything to get back to 3.6.13 since the update is just rolled back to 3.6.13 and Firefox is relaunched automatically.

What is a malformed install folder?
(In reply to comment #65)
> What is a malformed install folder?

An install folder containing mismatched files from more than one release
What we have seen on 3.6.x are a significant quantity of earlier firefox.exe versions when compared to the dll's and the dll's typically having a newer version that is the same as the other dll's. The firefox.exe mainly just calls into xul.dll and afaik this wouldn't cause this problem. Bug 635161 will hopefully reduce the number of frankenbuilds further.
(In reply to comment #67)
> What we have seen on 3.6.x are a significant quantity of earlier firefox.exe
> versions when compared to the dll's and the dll's typically having a newer
> version that is the same as the other dll's. The firefox.exe mainly just calls
> into xul.dll and afaik this wouldn't cause this problem. Bug 635161 will
> hopefully reduce the number of frankenbuilds further.

Note that Thunderbird is a static build on 1.9.2, so most of it is in the exe, apart from xpcom_core (and other third-party bits).
Thanks Mark, 3.6 should be quite a bit better about handling updates due to bug 525390 and the numbers in bug 635161 appear to confirm that it is. Hopefully the vast majority of the remaining issues will be fixed by bug 635161.
Blocks: 600853
So, backing out the fix in bug 633869 makes it so a 3.6.13 build + 3.6.14 (w/ backout) dlls doesn't crash on startup. Doing the same with 3.6.14 dlls with bug 633869 causes a startup crash.
Err, I mean bug 600853 above
(In reply to comment #70)
> So, backing out the fix in bug 633869 makes it so a 3.6.13 build + 3.6.14 (w/
> backout) dlls doesn't crash on startup. Doing the same with 3.6.14 dlls with
> bug 633869 causes a startup crash.

I've just verified the same with Thunderbird 3.1.7/3.1.8.