Closed Bug 1057868 Opened 5 years ago Closed 5 years ago

crash in nsTArray_Impl<nsComponentManagerImpl::ComponentLocation, nsTArrayInfallibleAllocator>::InsertElementAt(unsigned __int64)

Categories

(Toolkit :: Startup and Profile System, defect, critical)

34 Branch
All
Windows NT
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla34
Tracking Status
firefox34 --- verified

People

(Reporter: lizzard, Assigned: aklotz)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: win64-only? or may have a different signature on win32)

Crash Data

Attachments

(1 file)

This bug was filed from the Socorro interface and is 
report bp-3c7f6aaa-ae71-4049-8ef3-aee3d2140823.
=============================================================

This is the #8 topcrasher for Firefox 31.0a1 with 117/5922 crashes in the past 7 days.    The crashes started with the 2014081703 build. 


Crashing thread:

0 	xul.dll 	nsTArray_Impl<nsComponentManagerImpl::ComponentLocation, nsTArrayInfallibleAllocator>::InsertElementAt(unsigned __int64) 	xpcom/glue/nsTArray.h
1 	xul.dll 	nsComponentManagerImpl::Init() 	xpcom/components/nsComponentManager.cpp
2 	xul.dll 	NS_InitXPCOM2 	xpcom/build/nsXPComInit.cpp
3 	xul.dll 	ScopedXPCOMStartup::Initialize() 	toolkit/xre/nsAppRunner.cpp
4 	xul.dll 	ProfileMissingDialog 	toolkit/xre/nsAppRunner.cpp
5 	xul.dll 	XREMain::XRE_mainStartup(bool*) 	toolkit/xre/nsAppRunner.cpp
6 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp
7 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp
8 	firefox.exe 	do_main 	browser/app/nsBrowserApp.cpp
9 	firefox.exe 	NS_internal_main(int, char**) 	browser/app/nsBrowserApp.cpp
10 	firefox.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp
11 	firefox.exe 	__tmainCRTStartup 	f:/dd/vctools/crt_bld/self_64_amd64/crt/src/crtexe.c:552
12 	kernel32.dll 	BaseThreadInitThunk 	
Ø 13 	ntdll.dll 	ntdll.dll@0x734a4 	
14 	KERNELBASE.dll 	BasepIsDebugPortPresent
ComponentLocation* cl = sModuleLocations->InsertElementAt(0);

I originally suspected bug 286355, but that didn't land in the right timeframe.

Liz, can you provide a regression range link? I checked the reports and this is definitely more than a single user, but I only see reports of this on win64 nightlies, with crash address 0xffffffffffffffff.

dmajor can you check the minidump and see what exactly we're dereferencing here? "rax": "0x5a5a5a5a5a5a5a5a" is suspicious, but I'm not clear on what is supposed to be at rax based just on the crash link http://hg.mozilla.org/mozilla-central/annotate/64c4ec2df3d4/xpcom/glue/nsTArray.h#l1167
Flags: needinfo?(lhenry)
Flags: needinfo?(dmajor)
The failed deref happens while trying to pull the length out of the header. mHdr is 5a5a so the array itself is dead.
Flags: needinfo?(dmajor)
There are a couple stray hits from 0817 but it really started spiking with 20140822030201. Given the world Profile on the stack, possibly related to the unlocker work?
https://bugzilla.mozilla.org/show_bug.cgi?id=1057753#c2 :

I think that it has something to do with HTML5 video - usually happens after a video on youtube gets stuck and doesn't want to play, but i am not 100% sure.
I was hoping that the reports would be enough to tell what was at fault.
The crash here is unrelated to videos: Firefox hanging on shutdown can trigger the conditions necessary to reproduce this bug, but the shutdown hang should be filed and fixed separately from this crash.

Bug 286355 landed for the 0822 nightly, and looking at https://crash-stats.mozilla.com/report/list?product=Firefox&signature=nsTArray_Impl%3CnsComponentManagerImpl%3A%3AComponentLocation%2C+nsTArrayInfallibleAllocator%3E%3A%3AInsertElementAt%28unsigned+__int64%29#tab-graph this seems to have regressed on that day. There are two crashes on a prior day, but not at volume.

I'm pretty sure that we're re-initializing XPCOM from http://hg.mozilla.org/mozilla-central/annotate/64c4ec2df3d4/toolkit/xre/nsAppRunner.cpp#l3644 and that's what's causing the problem. Since the code from bug 286355 runs in SelectProfile() which happens *after* this, it's not clear to me why that patch might be causing this. But since this has fairly clear STR, let's see what aklotz figures out.
Blocks: 286355
Component: XPCOM → Startup and Profile System
Keywords: regression
Product: Core → Toolkit
Whiteboard: win64-only? or may have a different signature on win32
Assignee: nobody → aklotz
Flags: needinfo?(lhenry)
For me this shutdown hang was happening before bug 286355 got fixed. I just wasn't able to report it because the "submit bug report" dialog never appeared before.
The problem is the way that we are handling the error codes from SelectProfile:

http://dxr.mozilla.org/mozilla-central/source/toolkit/xre/nsAppRunner.cpp#3650

Here we only exit with failure when the error code must be either NS_ERROR_LAUNCHED_CHILD_PROCESS or NS_ERROR_ABORT. Other errors fall through into a path that calls ProfileMissingDialog.

The problem is that SelectProfile might or might not invoke ProfileLockedDialog or ShowProfileManager. When it does, it returns their error codes directly. Those error codes might be something other than NS_ERROR_ABORT, triggering the ProfileMissingDialog code path with its second XPCOM initialization.
I don't know whether you're going to like this patch, but here goes :-)

This patch adds a special return type that automatically converts the returned nsresult to NS_ERROR_ABORT when the nsresult is an error. It uses implicit conversions to be convertible to/from nsresult. Essentially we end up annotating the functions that need special treatment by using this return type.

I chose this approach for several reasons:

1) If we modify the return values of every possible path in the UI displaying functions, we introduce the possibility of introducing more bugs if we're not careful. In particular, cases where we are handling the result of a call to another function.

2) Ensuring that this is correct at the callsite could be prone to bugs if we accidentally omit the fix in any locations or (more importantly) if somebody adds a new call in the future and forgets to handle the error.

3) I didn't really like the idea of changing how the return values of SelectProfile work because there are many possible error codes that we might want to handle by showing ProfileMissingDialog.
Attachment #8478445 - Flags: review?(benjamin)
Comment on attachment 8478445 [details] [diff] [review]
Ensure that ProfileLockedDialog and ShowProfileManager return NS_ERROR_ABORT upon failure

Well, it's kinda yucky but not too bad. One other option that would change more lines of code but might be more readable is to just use an enum return type instead of nsresult:

enum SelectProfileResult
{
  SELECT_OK,
  SELECT_FAILED,
  SELECT_RESTARTED
}

I don't have a strong-enough opinion to insist on this style, but if you think it would make the code more readable, I'll be happy.
Attachment #8478445 - Flags: review?(benjamin) → review+
https://hg.mozilla.org/mozilla-central/rev/06363c2a6af1
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
There are no more crashes in Socorro over the past 4 weeks.
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.