Closed Bug 515211 Opened 15 years ago Closed 15 years ago

topcrash:[@arena_chunk_init ]

Categories

(Core :: JavaScript Engine, defect, P2)

1.9.1 Branch
x86
Windows XP
defect

Tracking

()

RESOLVED FIXED
Tracking Status
status1.9.2 --- beta3-fixed
blocking1.9.1 --- .8+
status1.9.1 --- .8-fixed

People

(Reporter: balgus82, Assigned: dmandelin)

References

(Depends on 1 open bug)

Details

(Keywords: crash, hang, topcrash, Whiteboard: [crashkill][crashkill-fix][tb3wants])

Crash Data

Attachments

(2 files, 4 obsolete files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 1.1.4322)
Build Identifier: http://www.mozilla.com/en-US/firefox/ie.html

since the new firefox update (version 3.5.2)  every time i open firefox the browser freezes for around half an hour, and then crashes.  before finishing loading ANY pages.  I've tried at least 7 or 8 times with the same results. 

Reproducible: Always

Steps to Reproduce:
1.open firefox
2.
3.
Actual Results:  
it freezes, then crashes minutes later. 

Expected Results:  
it should have done what it was supposed to do.  let me browse the internet.
Not seeing this with Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.3a1pre) Gecko/20090907 Minefield/3.7a1pre (.NET CLR 3.5.30729) ID:20090907051047 or 3.5.2 on another XP machine I have. 
However, Can you do a couple of things for me:
1. Give a stacktrace from those crashes, https://developer.mozilla.org/En/How_to_get_a_stacktrace_for_a_bug_report
2. Try to start Firefox in Safe Mode, http://support.mozilla.com/en-US/kb/Safe+Mode
3. if safe mode does not work, try creating a test profile, and see if it happens there, http://support.mozilla.com/en-US/kb/Managing+profiles. (make sure you do not make any changes in that new profile except the home pages to the ones that are causing the problem. so don't install extensions, customize, etc.)
Also, do any of the pages the browser is starting up with have flash, quicktime, etc.? Just double check that adobe flash, adobe reader, quicktime, etc. are all updated to the latest version that you can download from the developer's website.
Keywords: crash, hang
Version: unspecified → 3.5 Branch
here is the stacktrace Crash ID: bp-ea707c8b-d12e-42c3-b1c5-18ebb2090908  
 
safemode wouldnt work either.  
 
 
ok it DOES work with a new profile. is there a way to import all my bookmarks into the new profile?
well, maybe some addon is causing this. Can you please list all the ones you have? Also, support.mozilla.com can help you move your settings.
Summary: repetive crashing. → repetive crashing. [@arena_chunk_init ]
Signature	arena_chunk_init
UUID	ea707c8b-d12e-42c3-b1c5-18ebb2090908
Time 	2009-09-08 10:28:17.274022
Uptime	2400
Last Crash	900116 seconds before submission
Product	Firefox
Version	3.5.2
Build ID	20090729225027
Branch	1.9.1
OS	Windows NT
OS Version	5.1.2600 Service Pack 2
CPU	x86
CPU Info	AuthenticAMD family 6 model 10 stepping 0
Crash Reason	EXCEPTION_ACCESS_VIOLATION
Crash Address	0x48b00000
User Comments	
Processor Notes 	
Related Bugs

Crashing Thread
Frame 	Module 	Signature [Expand] 	Source
0 	mozcrt19.dll 	arena_chunk_init 	obj-firefox/memory/jemalloc/src/jemalloc.c:3374
1 	mozcrt19.dll 	arena_run_alloc 	obj-firefox/memory/jemalloc/src/jemalloc.c:3535
2 	mozcrt19.dll 	arena_malloc_large 	obj-firefox/memory/jemalloc/src/jemalloc.c:4100
3 	mozcrt19.dll 	arena_ralloc 	obj-firefox/memory/jemalloc/src/jemalloc.c:4738
4 	mozcrt19.dll 	realloc 	obj-firefox/memory/jemalloc/src/jemalloc.c:6356
5 	js3250.dll 	XMLArrayAddMember 	js/src/jsxml.cpp:1042
6 	js3250.dll 	Append 	js/src/jsxml.cpp:3199
7 	js3250.dll 	GetNamedProperty 	js/src/jsxml.cpp:3930
8 	js3250.dll 	GetProperty 	js/src/jsxml.cpp:4005
9 	js3250.dll 	xml_appendChild 	js/src/jsxml.cpp:5550
10 	js3250.dll 	js_Interpret 	js/src/jsinterp.cpp:5147
11 	js3250.dll 	js_Invoke 	js/src/jsinterp.cpp:1394
12 	xul.dll 	nsXPCWrappedJSClass::CallMethod 	js/src/xpconnect/src/xpcwrappedjsclass.cpp:1697
13 	xul.dll 	nsXPCWrappedJS::CallMethod 	js/src/xpconnect/src/xpcwrappedjs.cpp:569
14 	xul.dll 	PrepareAndDispatch 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:114
15 	xul.dll 	SharedStub 	xpcom/reflect/xptcall/src/md/win32/xptcstubs.cpp:141
16 	xul.dll 	nsTimerImpl::Fire 	xpcom/threads/nsTimerImpl.cpp:465
17 	xul.dll 	nsTimerEvent::Run 	xpcom/threads/nsTimerImpl.cpp:512
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Version: 3.5 Branch → 1.9.1 Branch
Assignee: general → dmandelin
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Summary: repetive crashing. [@arena_chunk_init ] → topcrash:[@arena_chunk_init ]
Blocking 1.9.2+ per CrashKill effort.
Flags: blocking1.9.2+
Keywords: topcrash
Marking all topcrash bugs as P2 (3.6 release blockers, but not 3.6b1 blockers)
Priority: -- → P2
Most of these crashes are on uniprocessor systems (assuming I know how to read that off the crashreport), so it seems not to be MP-related. Also, I find lots of different stacks (even getting both large and small arena_malloc functions as frame 2).

I also looked at a minidump of this today. The crash is in arena_chunk_init, called ultimately from malloc/realloc, here:

static void
arena_chunk_init(arena_t *arena, arena_chunk_t *chunk)
{
	arena_run_t *run;
	size_t i;

	VALGRIND_MALLOCLIKE_BLOCK(chunk, (arena_chunk_header_npages <<
	    pagesize_2pow), 0, false);
#ifdef MALLOC_STATS
	arena->stats.mapped += chunksize;
#endif

	chunk->arena = arena;
     *** crash here

	/*
	 * Claim that no pages are in use, since the header is merely overhead.
	 */
	chunk->ndirty = 0;

	/* Initialize the map to contain one maximal free untouched run. */
	run = (arena_run_t *)((uintptr_t)chunk + (arena_chunk_header_npages <<
	    pagesize_2pow));

It appears that the computation of |run| is moved ahead of the stores to members of |chunk|. In the minidump I took, it appears the values of some locals are these:

    chunk                      0x05000000    (edi)
    arena                      0x008a0040    (esi)
    run                        0x05001000    (ebp)
    pagesize_2pow               12           (ecx)
    arena_chunk_header_npages    1           (inferred from previous vars)

The crash is for attempting to store to 0x05000000 as [edi], thus the crash point I identified above. (This is not the line number in the crash report.)
Continuing the analysis in comment 7, the 'bad' value of |chunk| comes from an argument to |arena_chunk_init|. |arena_chunk_init| is here called from |arena_run_alloc| in a code path where there are no existing usable runs, and a new chunk needs to be created. So, the main possibilities are:

1. |chunk_alloc| returns an invalid chunk
2. A compiler bug causes the wrong value to get passed.

|chunk_alloc| tries |chunk_recycle_reserve| and |chunk_alloc_mmap| in turn. Both of those are moderately complicated, so it's not immediately obvious which is going wrong. Using a skid mark to find out which one returned the chunk could be useful.
Depends on: 521436
Archaeology reveals that the first crash with this signature occurred in 3.1a1pre on July 18, 2008:

http://crash-stats.mozilla.com/report/index/8aa443fa-5388-11dd-b73f-001321b13766

3.1a1pre is also the earliest version for which I could find an instance of the crash. (I would need fully query access to be entirely sure that it wasn't in 3.0. But I'm 99% sure.)

The last change to jemalloc.c before 18 Jul 2008 was:

changeset:   15610:446be0510148
parent:      15608:e6cf93ad1e3a
user:        Jason Evans <jasone@canonware.com>
date:        Tue Jul 01 15:41:14 2008 -0700
files:       memory/jemalloc/Makefile.in memory/jemalloc/jemalloc.c memory/jemalloc/jemalloc.h
description:
Bug 427109: Add memory reserve and xmalloc() API, r=benjamin

Interestingly, this introduced the |chunk_recycle_reserve| function that was mentioned as a possible cause of the problem in comment 8. 

That patch was originally landed a bit earlier:

changeset:   15482:c14ab4f6cec6
user:        Jason Evans <jasone@canonware.com>
date:        Mon Jun 23 07:46:37 2008 -0700
files:       memory/jemalloc/Makefile.in memory/jemalloc/jemalloc.c memory/jemalloc/jemalloc.h memory/jemalloc/ql.h memory/jemalloc/qr.h
description:
Bug 427109: Add memory reserve and xmalloc() API, r=benjamin

Add support for a memory reserve, which is managed via the reserve_*() API.

Add xmalloc() variants of malloc()-like functions.  These functions never
return NULL.

and backed out the next day:

changeset:   15486:2cbe07bd5857
user:        Robert O'Callahan <robert@ocallahan.org>
date:        Tue Jun 24 19:52:40 2008 +1200
files:       memory/jemalloc/Makefile.in memory/jemalloc/jemalloc.c memory/jemalloc/jemalloc.h
description:
Backing out bug 427109 to try to fix Linux bustage

A group of several patches went in around 20 Jun, mostly bugfixes. Another change on that day was replacement of 'tree.h' by 'rb.h', a red-black tree macro library. Red-black trees are notoriously hard to implement correctly.

Prior to that were some patches in May.
Based on the above and reading the history of bug 427109, my current best guess is that this crash is in the reserve/recycling feature, and may be a race condition.

It doesn't look like the reserve feature is being used on trunk, so I recommend disabling it on trunk and seeing what happens. I'm not sure how easy it is to do that. Commenting out the call to |chunk_recycle_reserve| seems reasonable, but that might cause other problems if done by itself. This crash happens 1-2x/week on trunk, so we would not get results back very quickly, but it could still work. If the reserve feature isn't being used yet, then it seems OK to disable it and then slowly re-enable conservative versions, and hopefully it would be ready by the time we really needed it.

If it is a race condition, the most likely culprits would seem to be the temporary lock drops in |chunk_recycle_reserve| and |arena_run_alloc|. Another option is to disable (i.e., don't drop the lock| one or both of those and see what happens. (The one in |arena_run_alloc| is to avoid deadlocks that would otherwise occur if a low memory callback calls into jemalloc. I'm not sure what the lock drop in |chunk_recycle_reserve| is for.)

Another way to go forward on this would be to try to stress test jemalloc in isolation to try to make this failure happen faster than it does in browsing. That may be hard--if it depends on race conditions, we would need to be able to generate the right thread schedules, and we might need special controls to do this.
(In reply to comment #7)
> Most of these crashes are on uniprocessor systems (assuming I know how to read
> that off the crashreport), so it seems not to be MP-related. Also, I find lots

Actually, anomolously many of these are on uniprocessor systems; the bulk of crashes cluster around 40%-60% on uniprocessor systems; this one is at 80%.  I have no idea why that would happen.
I suggest that we rip out the reserve code entirely, since it is unused and buggy.

On a related note I spent some time looking at the reserve code recently to try to fix a memory statistics bug Vlad uncovered.  The code made my head hurt, and I still haven't figured out how it is messing up the statistics.
(In reply to comment #12)
> I suggest that we rip out the reserve code entirely, since it is unused and
> buggy.

I'm happy to take your recommendation on this. :-) What do you think is the best way? I think you've made a few small fixes since then. Should we try backing out the changesets that introduced the reserve feature?
blocking1.9.1: --- → ?
(In reply to comment #13)
> Should we try
> backing out the changesets that introduced the reserve feature?

Yes, this should work fine.  If this change is urgent, I'll be happy to review a patch.  If it can wait a bit, I should be able to put together a patch in the next week or so.
(In reply to comment #14)
> If it can wait a bit, I should be able to put together a patch in the
> next week or so.

That sounds great.
Whiteboard: [crashkill]
blocking1.9.1: ? → ---
Attached patch Remove jemalloc memory reserve. (obsolete) — Splinter Review
Attachment #408104 - Flags: review?(dmandelin)
The patch in comment #16 builds and appears to function correctly on my 32-bit Ubuntu Linux system, but I did not conduct rigorous tests.
(In reply to comment #17)
> The patch in comment #16 builds and appears to function correctly on my 32-bit
> Ubuntu Linux system, but I did not conduct rigorous tests.

Thanks! I'll do the testing.
This passed try server except for a build timeout on Win32. (Exactly the platform that needs to be test most, sigh.) I'm doing a local Win32 build to check that separately.
also top crash for Thunderbird, #21 for both 3.0b4 and 3.0pre
Whiteboard: [crashkill] → [crashkill][tb3wants]
(In reply to comment #11)
> Actually, anomolously many of these are on uniprocessor systems; the bulk of
> crashes cluster around 40%-60% on uniprocessor systems; this one is at 80%.  I
> have no idea why that would happen.

The comments on another crash with the same pattern mention out-of-memory dialogs.  This actually fits with the other analysis in this bug; crashes that show up as disproportionately on uniprocessor systems seem likely to be out-of-memory issues.
Attached patch Patch 2 (fix WinCE build error) (obsolete) — Splinter Review
The original patch failed on the try server with some symbol issues, fixing a .def file seems to take care of that.
Attachment #408104 - Attachment is obsolete: true
Attachment #408104 - Flags: review?(dmandelin)
Attachment #408508 - Attachment is obsolete: true
Jason, I can't complete a Windows build with your patch. It fails trying to run a program 'shlibsign.exe' that we also build. A dialog box appears with title bar "shlibsign.exe - Application Error" and message "The application failed to initialize properly (0xc0000142). Click OK to terminate the application.

Try server builds also timed out without giving much info, but I think this same failure could reasonably show up as a timeout there.

Advice?
(In reply to comment #24)
> Jason, I can't complete a Windows build with your patch. It fails trying to run
> a program 'shlibsign.exe' that we also build. A dialog box appears with title
> bar "shlibsign.exe - Application Error" and message "The application failed to
> initialize properly (0xc0000142). Click OK to terminate the application.

My first guess is that it has something to do with the code that allocates chunk-aligned memory.  Unfortunately, the router for my home network failed yesterday, so I am unable to get a Windows machine online to do any testing.
I'm working on the error 0xc0000142. I put a DebugBreak() at the start of malloc_init_hard and was able to find out that malloc_init_hard is returning false. It's not very easy to debug in an optimized build. If it's not too hard, I'll try to hack the makefile to build jemalloc.c in a debug configuration. Otherwise I'll do something else weird and painful.
Unfortunately we can't build the actual debug CRT currently, because it would require more patching. (Microsoft's CRT sources don't work out of the box, of course.) There's a patch in bug 429745 if you're adventurous.
The bug was pretty weird. It turned out that |malloc_init_hard| has a return type of |unsigned char| in jemalloc.c (actually |bool|, which is #define'd to |jemalloc_bool|, which is a typedef for |unsigned char|). But the CRT has it with return type |BOOL|. As a result, the implementation of |malloc_init_hard| implemented 'return false' (the success case) by |xor al, al| (i.e., just zero the least significant byte of eax). But the calling code in the CRT initialization function required eax to be 0 for success. Changing the return type to |BOOL| in jemalloc.c fixed it.

That makes sense by itself, but then I wonder how it ever worked before. Did the CRT source change? Or did it work by coincidence? Anyway, I'm currently testing a version of the patch that has |jemalloc_bool| typedef'd to |BOOL|.
Are you building with Visual C++ 2005 or 2008? Our build machines use 2005. It's possible the CRT changed from 2005 to 2008.
I'm using 2005.
Removing the reserve stuff should also fix many (if not all) of the about:memory issues in bug 515556.
Attachment #408509 - Attachment is obsolete: true
Comment on attachment 408962 [details] [diff] [review]
Patch 4 (fixes Windows startup/bool type issue)

Well, this seems to work on the try server (there are a few failures but other users seem to have had the same errors at the same time). The browser I built with it also seems to work fine. 

One thing I was wondering about is that before the reserve feature was added there was a single spare chunk that was apparently an optimization for cases where live allocations were about on the boundary of needing one more chunk (to avoid oscillations) (as I'm sure you already know). Do we need to put that back in?
Attachment #408962 - Flags: review?(jasone)
(In reply to comment #33)
> One thing I was wondering about is that before the reserve feature was added
> there was a single spare chunk that was apparently an optimization for cases
> where live allocations were about on the boundary of needing one more chunk (to
> avoid oscillations) (as I'm sure you already know). Do we need to put that back
> in?

Yes, the spare chunk logic is important to performance, so if there are portions of that code missing, we need to add it back.  I remember seeing the "spare" pointer still in the arena data structures, but I didn't think to check for the associated code.  I'll take a closer look when I review the patch later today, and incorporate any necessary changes.
Attachment #408962 - Flags: review?(jasone) → review+
Comment on attachment 408962 [details] [diff] [review]
Patch 4 (fixes Windows startup/bool type issue)

I suspect that this should be removed:

+#else
+#define bool unsigned char

or the #else should be #elif (defined(MOZ_MEMORY_WINDOWS)), because stdbool.h should correctly define bool in all other cases (and I think bool tends to be int or unsigned, rather than a char).  In practice though, I doubt it really matters much.
(In reply to comment #35)
> (From update of attachment 408962 [details] [diff] [review])
> I suspect that this should be removed:
> 
> +#else
> +#define bool unsigned char

Makes sense. I pushed it with that change as 109b74e8e902. We should be able to see if it affects the crashes on trunk over the next few days.
I had to back this out due to Linux build failures on tinderbox. I think it's just some stdbool.h weirdness; I compiled with -save-temps and it shows that stdbool.h is included, but it lists no code from that file. I'll sort it out and try again later.
stdbool.h is nothing but macro definitions, so it's not going to show anything in -save-temps.  You could try -E -dD.
Thanks, Zack. -dD proved that it was in fact including stdbool.h just fine, which told me to look harder, whence I realized the problem with '#undef bool'. I'm testing this fixed version on try now. I'm not sure it needs another review for a small thing like this, but you can tell me at least if you don't like the style I used in fixing it up. I'm currently testing it on try; I plan to land Monday if all goes well.
Attachment #408962 - Attachment is obsolete: true
Attachment #409412 - Flags: review?(jasone)
Attachment #409412 - Flags: review?(jasone) → review+
(In reply to comment #39)
> Created an attachment (id=409412) [details]
> Patch 5 (fix 'bool' definition for linux)

This won't compile on Windows, since;

+#ifdef _MSC_VER
+#  define bool BOOL
+#else
+#  include <stdbool.h>
+#endif

will not be called because its inside of the statement (starting line 291) :

#ifndef MOZ_MEMORY_WINDOWS
(In reply to comment #40)
> (In reply to comment #39)
> > Created an attachment (id=409412) [details] [details]
> > Patch 5 (fix 'bool' definition for linux)
> 
> This won't compile on Windows, since;
> 
> +#ifdef _MSC_VER
> +#  define bool BOOL
> +#else
> +#  include <stdbool.h>
> +#endif
> 
> will not be called because its inside of the statement (starting line 291) :
> 
> #ifndef MOZ_MEMORY_WINDOWS

I found that out the hard way. After looking more closely, I decided the right thing to do is add '#define bool BOOL' to the windows define section (next to true and false, so it seems to be the proper place).
(In reply to comment #41)
> I found that out the hard way. After looking more closely, I decided the right
> thing to do is add '#define bool BOOL' to the windows define section (next to
> true and false, so it seems to be the proper place).

Highlight in Visual Studio is very helpful :-)
Whiteboard: [crashkill][tb3wants] → [crashkill][crashkill-fix][tb3wants]
http://hg.mozilla.org/mozilla-central/rev/b884112e0922

It will take at least a few days to see if this cures the crash on trunk.
I did a little stats to get a handle on how long we need to wait before we consider this a fix. This crash appeared 8 times in the 36 days (semi-arbitrary sample period) before the candidate fix was pushed. That means it appeared .22 times per day. If we assume the crashes are a Poisson process, then that means the number of crashes per day is distributed Po(.22), aka Poisson distribution with mean .22. That also means the wait time between crashes is an exponential distribution with rate parameter .22, which is equivalent to saying crashes occur with a half-life of ln(2)/.22 = 3.1 days.

How long should we wait? It depends on how sure we think we have to be before wanting to land this on 1.9.2 and 1.9.1. If we assume hypothetically the patch *didn't* fix the crash, then there is a 95% probability of seeing a crash within 14 days. (That also assumes my estimate of the mean is correct.)
http://hg.mozilla.org/releases/mozilla-1.9.2/rev/62ca883b3086

Pushed to 1.9.2 to start getting beta coverage per today's CrashKill meeting discussion.
here is the kind of crash volume we saw on this over the life of 3.5.4.

looks like we will need north of a half million 3.6 beta testers to really understand how much this fix helps.  somewhere between 1 and 2 million beta users would get us to a better understanding.

#adu's  #crashes  date    signature   release

 0.50M    1 20091021-crashdata.csv arena_chunk_init 3.5.4
 0.50M    7 20091022-crashdata.csv arena_chunk_init 3.5.4
 0.48M    8 20091023-crashdata.csv arena_chunk_init 3.5.4
 0.42M    8 20091024-crashdata.csv arena_chunk_init 3.5.4
 0.43M    4 20091025-crashdata.csv arena_chunk_init 3.5.4
 0.51M    8 20091026-crashdata.csv arena_chunk_init 3.5.4
 0.51M    9 20091027-crashdata.csv arena_chunk_init 3.5.4
 1.82M  124 20091028-crashdata.csv arena_chunk_init 3.5.4
22.94M  350 20091029-crashdata.csv arena_chunk_init 3.5.4
32.56M  464 20091030-crashdata.csv arena_chunk_init 3.5.4
31.53M  443 20091031-crashdata.csv arena_chunk_init 3.5.4
35.62M  533 20091101-crashdata.csv arena_chunk_init 3.5.4
43.71M  546 20091102-crashdata.csv arena_chunk_init 3.5.4
46.18M  548 20091103-crashdata.csv arena_chunk_init 3.5.4
46.84M  497 20091104-crashdata.csv arena_chunk_init 3.5.4
47.06M  437 20091105-crashdata.csv arena_chunk_init 3.5.4
44.54M  303 20091106-crashdata.csv arena_chunk_init 3.5.4
19.12M  188 20091107-crashdata.csv arena_chunk_init 3.5.4
12.45M  115 20091108-crashdata.csv arena_chunk_init 3.5.4
in the last few days 3.6b1 is running around 198k-213k adu's

this hopefully will grow with the release of beta2
No crashes on trunk in the 9 days since this landed (under dmandelin's poisson assumptions, not exactly cause for a party yet, but < 15% likely) but additionally this seems to have picked up a 6% win on Dromaeo, 6% win on Tp4 Private Bytes and maybe a 2% Ts win on Linux. So... nice work.
Can this be closed?
Comment on attachment 409412 [details] [diff] [review]
Patch 5 (fix 'bool' definition for linux)

(In reply to comment #49)
> Can this be closed?

We just need to land it on 1.9.1.
Attachment #409412 - Flags: approval1.9.1.7?
It won't ever get approved until you mark it FIXED on trunk ;-)
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
blocking1.9.1: --- → ?
blocking1.9.1: ? → .7+
It's unlikely we'll be able to find a test case. We've never even had STR.
Comment on attachment 409412 [details] [diff] [review]
Patch 5 (fix 'bool' definition for linux)

Approved for 1.9.1.7, a=dveditz for release-drivers
Attachment #409412 - Flags: approval1.9.1.7? → approval1.9.1.7+
A little rebasing was needed.
Crash Signature: [@arena_chunk_init ]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: